RAG

Implementation-accurate, engineering-grade documentation of a retrieval-augmented generation system for document Q&A over custom knowledge bases: PDF/web/markdown indexing, dual vector stores (Pinecone and FAISS), and optional LLM generation via LangGraph.

Role:Software Engineer

Year:—

PythonLangChain (community, text_splitters, google_genai, openai, pinecone, core, chat_models)LangGraph (StateGraph)PineconeFAISSGoogle Gemini (text-embedding-004)OpenAI (gpt-4.1-nano)PyPDFLoader / WebBaseLoaderRecursiveCharacterTextSplitterMarkdownHeaderTextSplitterJupyter

Problem

The Challenge

Context

Need to query specific documents (PDFs, web pages, or markdown) with natural language and get answers grounded in that content. No web UI or API; the system is designed for local use via CLI script and Jupyter notebooks.

User Pain Points

Documents must be indexable and searchable by semantic similarity.

Dual workflows: retrieval-only (snippets) vs full RAG (retrieve then generate with LLM).

Why Existing Solutions Failed

Generic search or static docs do not support natural-language Q&A grounded in custom content; retrieval-augmented generation with vector stores and optional LLM meets the need.

Goals & Metrics

What We Set Out to Achieve

Objectives

01Index documents (PDF, web, or markdown) into vector stores (Pinecone or FAISS).
02Answer user questions via similarity search over indexed chunks.
03Optionally generate LLM answers from retrieved context (007_rag.ipynb only).

Success Metrics

01rag_doc.py: PDF indexed to Pinecone, CLI prints top-3 snippets per question.
02rag.ipynb: Markdown indexed to Pinecone, similarity search returns top-k chunks.
03007_rag.ipynb: Web/PDF → FAISS, LangGraph retrieve→generator produces state["answer"] with gpt-4.1-nano.

Loading diagram...

User Flow

User Journey

Indexing: document source → load → split → embed → vector store. Query: user question → similarity search → snippets (rag_doc.py, rag.ipynb) or retrieve→generator→answer (007_rag.ipynb).

start

Start

action

Load documents (PDF path, URL, or markdown)

action

Split and embed (Gemini)

action

Store in Pinecone or FAISS

action

User asks question (CLI or graph.invoke)

action

Similarity search top-k chunks

action

Print snippets or generator → answer

end

End

Loading diagram...

Architecture

System Design

Three entrypoints: rag_doc.py (CLI, Pinecone), rag.ipynb (markdown, Pinecone), 007_rag.ipynb (web/PDF, FAISS, LangGraph RAG). Services: Pinecone, Google Gemini, OpenAI. No frontend; no relational DB.

Backend

rag_doc.py: CLI loop, PyPDFLoader, RecursiveCharacterTextSplitter, Pinecone upsert, similarity_search_with_scorerag.ipynb: MarkdownHeaderTextSplitter, Pinecone add_documents, similarity search007_rag.ipynb: WebBaseLoader/PyPDFLoader, FAISS, StateGraph retrieve→generator with gpt-4.1-nano

Services

Pinecone (vector index, serverless AWS us-east-1)Google Gemini (text-embedding-004, 768 dim)OpenAI (gpt-4.1-nano in 007_rag.ipynb)

Databases

Pinecone vector index (luxdit-paper-index, testing-pinecone-gemini)FAISS local vector store (in-memory or load_local)

Loading diagram...

Data Flow

How Data Moves

User/PDF/URL → loaders → splitters → embedding (Gemini) → Pinecone/FAISS. User question → vector store → top-k chunks → (optional) generator node → state["answer"].

User/PDF/URL → Document loaders

File path or URL; trigger: script run or notebook execution

Document loaders → Text splitters

List of Document objects; trigger: after load

Text splitters → Embedding model

Chunk text; trigger: before vector store write

Embedding model → Pinecone or FAISS

Vectors and metadata; trigger: from_documents / add_documents or FAISS build

User question → Vector store

Query string; trigger: user input in CLI or graph.invoke

Vector store → Retrieve node / caller

Top-k Document chunks (and optional scores)

Retrieve node → Generator node

State with question and context (doc list); trigger: LangGraph edge

Generator node → User

state['answer'] (LLM response); trigger: graph.invoke in 007_rag.ipynb

Loading diagram...

Core Features

while True input loop, similarity_search_with_score(user_query, k=3), print score/page/content; exit on exit/quit/q

Technical Challenges

Problems We Solved

Engineering Excellence

Performance, Security & Resilience

⚡

Performance

Chunk size and overlap tuned (1000/150 or 1000/200); top-k limited (2–4) to bound context size.
Pinecone serverless for managed scale; FAISS for local fast ANN.
No caching of embeddings or LLM responses; single-threaded.

🛡️

Error Handling

Index existence check before Pinecone create (rag_doc.py).
No try/except or explicit error handling in analyzed code.

🔒

Security

API keys from environment (load_dotenv); no secrets in repo.
No input validation or sanitization on user questions; no rate limiting or auth for CLI/notebook.
FAISS load_local with allow_dangerous_deserialization (documented risk).

Loading diagram...

Design Decisions

Visual & UX Choices

CLI

Rationale

rag_doc.py: prompt "User: ", print snippets with score and page, 250-char content preview.

Details

Sequential loop: question → print snippets → repeat; exit on exit/quit/q.

Notebook

Rationale

Cell-by-cell execution; output of documents and search results in notebook.

Details

007_rag.ipynb: graph.invoke({"question": "..."}) returns final state with answer.

Impact

The Result

✓

What We Achieved

Three entrypoints: (1) rag_doc.py indexes a PDF into Pinecone with Google Gemini embeddings and runs an interactive CLI that prints top-3 snippets per question (no LLM answer). (2) rag.ipynb indexes markdown into Pinecone and demonstrates similarity search. (3) 007_rag.ipynb loads web or PDF into FAISS and runs a LangGraph RAG agent (retrieve → generator) with OpenAI gpt-4.1-nano to produce answers. When env and APIs are valid, indexing and retrieval work as designed; LLM-based answers only in 007_rag.ipynb.

👥

Who It Helped

Solo project; Pinecone, Google Gemini, and OpenAI provide vector index, embeddings, and LLM.

⭐

Why It Matters

Implementation-accurate RAG pipelines for PDF, web, and markdown with dual vector-store support (Pinecone, FAISS), single embedding model (Gemini), and optional LLM generation via a two-node LangGraph in one notebook. Design favors clarity and local/exploratory use over production hardening.

Verification

Measurable Outcomes

Each outcome verified against reference implementations or test suites.

rag_doc.py: PDF → Pinecone, CLI prints top-3 snippets per question

rag.ipynb: Markdown → Pinecone, similarity search

007_rag.ipynb: Web/PDF → FAISS, LangGraph retrieve→generator with gpt-4.1-nano

Reflections

Key Learnings

Technical Learnings

Load → split → embed → store → retrieve → (optional) generate pipeline is explicit across scripts and notebook.
Dual vector stores (Pinecone vs FAISS) allow cloud persistence vs local fast ANN per workflow.

Architectural Insights

Two-node LangGraph (retrieve → generator) in 007_rag.ipynb makes RAG flow explicit; state carries question, context, answer.
No web framework or API; CLI and notebook only keeps scope local and exploratory.

What I'd Improve

Add error handling, requirements.txt gaps (pinecone, python-dotenv, langchain_pinecone), optional "index only if empty" upsert, input validation if exposed beyond local use.

Roadmap

Future Enhancements

Add try/except for missing file, network/API errors, empty chunk list, missing env vars; surface clear messages or exit codes.

Add pinecone, python-dotenv, langchain_pinecone to requirements.txt with version pins.

Optionally "index only if empty" or idempotent upsert to avoid redundant re-indexing.

Validate or sanitize user questions before embedding and LLM call; consider rate limiting if exposed beyond local use.

Consider moving LangGraph retrieve→generator flow into rag_doc.py or a shared module so CLI can optionally return an LLM answer.

Document or enforce loading FAISS only from trusted paths; consider alternatives to allow_dangerous_deserialization if loading untrusted indices is ever required.

If the system is ever served (API or web UI), add Dockerfile, env documentation, and deployment config.

Next Project

Crypto Prediction & Market Analysis System

A low-latency crypto analytics and forecasting API built with FastAPI and Redis, optimized for real-time dashboards via WebSockets, scheduled data prefetching, and signal enrichment (technical indicators + LLM sentiment).

RAG

RAG

The Challenge

Context

User Pain Points

Why Existing Solutions Failed

What We Set Out to Achieve

Objectives

Success Metrics

User Journey

System Design

Backend

Services

Databases

How Data Moves

Key Functionality

PDF load and chunk

What it does

Why it matters

Implementation

Web page load

What it does

Why it matters

Implementation

Markdown header split

What it does

Why it matters

Implementation

Google Gemini embeddings

What it does

Why it matters

Implementation

Pinecone index and upsert

What it does

Why it matters

Implementation

FAISS vector store

What it does

Why it matters

Implementation

Similarity search

What it does

Why it matters

Implementation

RAG agent (retrieve + generate)

What it does

Why it matters

Implementation

CLI Q&A loop

What it does

Why it matters

Implementation

Problems We Solved

No explicit error handling

Why This Was Hard

Our Solution

Re-upsert on re-run

Why This Was Hard

Our Solution

requirements.txt gaps

Why This Was Hard

Our Solution

Performance, Security & Resilience

Performance

Error Handling

Security

Visual & UX Choices

CLI

Rationale

Details

Notebook

Rationale

Details

The Result

What We Achieved

Who It Helped

Why It Matters

Measurable Outcomes

Key Learnings

Technical Learnings