
RAG
Implementation-accurate, engineering-grade documentation of a retrieval-augmented generation system for document Q&A over custom knowledge bases: PDF/web/markdown indexing, dual vector stores (Pinecone and FAISS), and optional LLM generation via LangGraph.
Problem
The Challenge
Context
The goal is to query specific documents (PDFs, web pages, or markdown) in natural language and get answers grounded in that content. There is no web UI or API; the system is designed for local use via a CLI script and Jupyter notebooks.
User Pain Points
Documents must be indexable and searchable by semantic similarity.
Dual workflows: retrieval-only (snippets) vs full RAG (retrieve then generate with LLM).
Why Existing Solutions Failed
Generic search or static docs do not support natural-language Q&A grounded in custom content; retrieval-augmented generation with vector stores and optional LLM meets the need.
Goals & Metrics
What We Set Out to Achieve
Objectives
1. Index documents (PDF, web, or markdown) into vector stores (Pinecone or FAISS).
2. Answer user questions via similarity search over indexed chunks.
3. Optionally generate LLM answers from retrieved context (007_rag.ipynb only).
Success Metrics
1. rag_doc.py: PDF indexed to Pinecone; CLI prints top-3 snippets per question.
2. rag.ipynb: Markdown indexed to Pinecone; similarity search returns top-k chunks.
3. 007_rag.ipynb: Web/PDF → FAISS; LangGraph retrieve → generator produces state["answer"] with gpt-4.1-nano.
User Flow
User Journey
Indexing: document source → load → split → embed → vector store. Query: user question → similarity search → snippets (rag_doc.py, rag.ipynb) or retrieve→generator→answer (007_rag.ipynb).
Architecture
System Design
Three entrypoints: rag_doc.py (CLI, Pinecone), rag.ipynb (markdown, Pinecone), 007_rag.ipynb (web/PDF, FAISS, LangGraph RAG). Services: Pinecone, Google Gemini, OpenAI. No frontend; no relational DB.
Data Flow
How Data Moves
User/PDF/URL → loaders → splitters → embedding (Gemini) → Pinecone/FAISS. User question → vector store → top-k chunks → (optional) generator node → state["answer"].
Core Features
Key Functionality
PDF load and chunk
What it does
Loads a PDF and splits it into text chunks with overlap
Why it matters
Used in rag_doc.py and 007_rag.ipynb.
Implementation
PyPDFLoader + RecursiveCharacterTextSplitter (chunk_size 1000, overlap 150 or 200)
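A minimal sketch of this step, assuming a local PDF and the standard LangChain loader and splitter; the file path is a placeholder:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load one Document per PDF page.
loader = PyPDFLoader("docs/example.pdf")  # hypothetical path
pages = loader.load()

# Split pages into ~1000-character chunks with overlap (150 in rag_doc.py, 200 in 007_rag.ipynb).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(pages)
```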
Web page load
What it does
Fetches and parses a URL into document chunks
Why it matters
Used in 007_rag.ipynb.
Implementation
WebBaseLoader with bs4 SoupStrainer (post-title, post-header, post-content)
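A sketch of the web ingestion step; the URL is a placeholder, and the class names mirror the SoupStrainer filter noted above:

```python
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Fetch the page and keep only the title, header, and content blocks.
loader = WebBaseLoader(
    web_paths=["https://example.com/some-post"],  # hypothetical URL
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(
            class_=("post-title", "post-header", "post-content")
        )
    },
)
docs = loader.load()
```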
Markdown header split
What it does
Splits markdown by headers (#, ##, ###) with metadata
Why it matters
Used in rag.ipynb.
Implementation
MarkdownHeaderTextSplitter with headers_to_split_on
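A sketch of the header-based split; the markdown string and header labels are placeholders:

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Each resulting Document carries the matched headers in its metadata.
docs = splitter.split_text("# Title\n\n## Section\n\nBody text under the section.")
```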
Google Gemini embeddings
What it does
Produces 768-dim embeddings for chunks and queries
Why it matters
Used in rag_doc.py, rag.ipynb, and 007_rag.ipynb.
Implementation
GoogleGenerativeAIEmbeddings(model='models/text-embedding-004')
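A sketch of the embedding step, assuming GOOGLE_API_KEY is set in the environment and that `chunks` comes from one of the split steps above:

```python
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

# Embed a query and the chunk texts; vectors are 768-dimensional.
query_vector = embeddings.embed_query("What does the document say about chunking?")
doc_vectors = embeddings.embed_documents([c.page_content for c in chunks])
```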
Pinecone index and upsert
What it does
Creates or reuses Pinecone index and stores vectors
Why it matters
Used in rag_doc.py and rag.ipynb.
Implementation
Pinecone client, create_index (768 dim, cosine, ServerlessSpec), PineconeVectorStore.from_documents or add_documents
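A sketch of index creation and upsert, assuming PINECONE_API_KEY is set; the index name and region are placeholders:

```python
import os
from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index_name = "rag-docs"  # hypothetical name

# Create the index only if it does not already exist (mirrors the check in rag_doc.py).
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,  # matches text-embedding-004
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

# Embed and upsert the chunks into the index.
vector_store = PineconeVectorStore.from_documents(
    chunks, embedding=embeddings, index_name=index_name
)
```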
FAISS vector store
What it does
Local vector index for similarity search; save/load to disk
Why it matters
Used in 007_rag.ipynb.
Implementation
FAISS from langchain_community.vectorstores, save_local/load_local with allow_dangerous_deserialization
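A sketch of the local FAISS round trip; the directory name is a placeholder, and deserialization should only be enabled for indices you created yourself:

```python
from langchain_community.vectorstores import FAISS

# Build the index in memory and persist it to disk.
vector_store = FAISS.from_documents(chunks, embeddings)
vector_store.save_local("faiss_index")  # hypothetical directory

# Reloading requires opting into pickle-based deserialization.
restored = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)
```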
Similarity search
What it does
Returns top-k document chunks for a query
Why it matters
Used in rag_doc.py, rag.ipynb, and 007_rag.ipynb.
Implementation
vector_store.similarity_search or similarity_search_with_score(query, k=2..4)
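A sketch of the retrieval call against either store; the query string is a placeholder:

```python
# Top-k chunks only.
docs = vector_store.similarity_search("What is covered in the introduction?", k=4)

# Same query with relevance scores, as used by the CLI loop below.
docs_with_scores = vector_store.similarity_search_with_score(
    "What is covered in the introduction?", k=3
)
```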
RAG agent (retrieve + generate)
What it does
Retrieves context then generates answer with LLM
Why it matters
Used in 007_rag.ipynb.
Implementation
StateGraph(State) with nodes retrieve and generator; retrieve similarity_search k=4; generator invokes init_chat_model('gpt-4.1-nano', 'openai'), returns answer
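A sketch of the two-node graph described above, assuming `vector_store` is already populated and OPENAI_API_KEY is set; the prompt wording is an assumption, not the notebook's exact prompt:

```python
from typing_extensions import TypedDict
from langchain_core.documents import Document
from langchain.chat_models import init_chat_model
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    context: list[Document]
    answer: str

llm = init_chat_model("gpt-4.1-nano", model_provider="openai")

def retrieve(state: State) -> dict:
    # Pull the top-4 chunks for the question.
    docs = vector_store.similarity_search(state["question"], k=4)
    return {"context": docs}

def generator(state: State) -> dict:
    # Stuff the retrieved chunks into a prompt and ask the LLM.
    context_text = "\n\n".join(d.page_content for d in state["context"])
    prompt = (
        f"Answer the question using only this context:\n{context_text}\n\n"
        f"Question: {state['question']}"
    )
    return {"answer": llm.invoke(prompt).content}

builder = StateGraph(State)
builder.add_node("retrieve", retrieve)
builder.add_node("generator", generator)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generator")
builder.add_edge("generator", END)
graph = builder.compile()

result = graph.invoke({"question": "What is the main topic of the document?"})
print(result["answer"])
```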
CLI Q&A loop
What it does
Interactive terminal: user types question, system prints top snippets (no LLM answer)
Why it matters
Used in rag_doc.py.
Implementation
while True input loop, similarity_search_with_score(user_query, k=3), print score/page/content; exit on exit/quit/q
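A sketch of the loop, assuming `vector_store` is already populated:

```python
while True:
    user_query = input("User: ").strip()
    if user_query.lower() in {"exit", "quit", "q"}:
        break

    results = vector_store.similarity_search_with_score(user_query, k=3)
    for doc, score in results:
        # Print the score, source page, and a 250-character preview of the chunk.
        print(f"score={score:.4f}  page={doc.metadata.get('page')}")
        print(doc.page_content[:250])
        print("-" * 60)
```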
Technical Challenges
Problems We Solved
Why This Was Hard
A missing or invalid PDF path, API errors, empty chunk lists, and missing environment variables all surface as unhandled exceptions.
Our Solution
The only defensive logic in the analyzed code is the index-existence check before Pinecone index creation (rag_doc.py).
Why This Was Hard
Re-running rag_doc.py against the same index re-upserts every chunk; there is no "index only if empty" check or idempotent upsert.
Our Solution
Not addressed in code; a future improvement is a conditional or idempotent upsert, as in the sketch below.
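One possible guard, not present in the analyzed code, using the Pinecone stats API to skip re-upserting a populated index; `pc`, `index_name`, `vector_store`, and `chunks` follow the Pinecone sketch above:

```python
# Hypothetical "index only if empty" check.
index = pc.Index(index_name)
stats = index.describe_index_stats()

if stats.total_vector_count == 0:
    vector_store.add_documents(chunks)
else:
    print(f"Index '{index_name}' already holds {stats.total_vector_count} vectors; skipping upsert.")
```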
Why This Was Hard
requirements.txt omits pinecone, python-dotenv, and langchain_pinecone even though they are used, which complicates reproducible installs.
Our Solution
Not addressed; add missing dependencies with version pins.
Engineering Excellence
Performance, Security & Resilience
Performance
- Chunk size and overlap tuned (1000/150 or 1000/200); top-k limited (2–4) to bound context size.
- Pinecone serverless for managed scale; FAISS for local fast ANN.
- No caching of embeddings or LLM responses; single-threaded.
Error Handling
- Index existence check before Pinecone create (rag_doc.py).
- No try/except or explicit error handling in analyzed code.
Security
- API keys from environment (load_dotenv); no secrets in repo.
- No input validation or sanitization on user questions; no rate limiting or auth for CLI/notebook.
- FAISS load_local with allow_dangerous_deserialization (documented risk).
Design Decisions
Visual & UX Choices
CLI
Rationale
rag_doc.py prompts with "User: " and prints snippets with score, page number, and a 250-character content preview.
Details
Sequential loop: question → print snippets → repeat; exit on exit/quit/q.
Notebook
Rationale
Cell-by-cell execution; documents and search results are displayed inline in the notebook.
Details
007_rag.ipynb: graph.invoke({"question": "..."}) returns final state with answer.
Impact
The Result
What We Achieved
Three entrypoints: (1) rag_doc.py indexes a PDF into Pinecone with Google Gemini embeddings and runs an interactive CLI that prints top-3 snippets per question (no LLM answer). (2) rag.ipynb indexes markdown into Pinecone and demonstrates similarity search. (3) 007_rag.ipynb loads web or PDF into FAISS and runs a LangGraph RAG agent (retrieve → generator) with OpenAI gpt-4.1-nano to produce answers. When env and APIs are valid, indexing and retrieval work as designed; LLM-based answers only in 007_rag.ipynb.
Who It Helped
Solo project; Pinecone, Google Gemini, and OpenAI provide vector index, embeddings, and LLM.
Why It Matters
Implementation-accurate RAG pipelines for PDF, web, and markdown with dual vector-store support (Pinecone, FAISS), single embedding model (Gemini), and optional LLM generation via a two-node LangGraph in one notebook. Design favors clarity and local/exploratory use over production hardening.
Verification
Measurable Outcomes
Each outcome was verified by running the corresponding script or notebook end-to-end.
rag_doc.py: PDF → Pinecone, CLI prints top-3 snippets per question
rag.ipynb: Markdown → Pinecone, similarity search
007_rag.ipynb: Web/PDF → FAISS, LangGraph retrieve→generator with gpt-4.1-nano
Reflections
Key Learnings
Technical Learnings
- Load → split → embed → store → retrieve → (optional) generate pipeline is explicit across scripts and notebook.
- Dual vector stores (Pinecone vs FAISS) allow cloud persistence vs local fast ANN per workflow.
Architectural Insights
- Two-node LangGraph (retrieve → generator) in 007_rag.ipynb makes RAG flow explicit; state carries question, context, answer.
- No web framework or API; keeping the CLI and notebooks as the only interfaces keeps the scope local and exploratory.
What I'd Improve
- Add error handling, fill the requirements.txt gaps (pinecone, python-dotenv, langchain_pinecone), add an optional "index only if empty" upsert, and validate input if the system is ever exposed beyond local use.
Roadmap
Future Enhancements
Add try/except for missing files, network/API errors, empty chunk lists, and missing env vars; surface clear messages or exit codes (see the sketch at the end of this list).
Add pinecone, python-dotenv, langchain_pinecone to requirements.txt with version pins.
Optionally "index only if empty" or idempotent upsert to avoid redundant re-indexing.
Validate or sanitize user questions before embedding and LLM call; consider rate limiting if exposed beyond local use.
Consider moving LangGraph retrieve→generator flow into rag_doc.py or a shared module so CLI can optionally return an LLM answer.
Document or enforce loading FAISS only from trusted paths; consider alternatives to allow_dangerous_deserialization if loading untrusted indices is ever required.
If the system is ever served (API or web UI), add Dockerfile, env documentation, and deployment config.
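As a starting point for the first roadmap item, a hedged sketch of defensive checks around the PDF load; file names and messages are placeholders:

```python
import os
import sys
from langchain_community.document_loaders import PyPDFLoader

pdf_path = "docs/example.pdf"  # hypothetical path

# Fail fast on missing configuration or input before touching any API.
for var in ("PINECONE_API_KEY", "GOOGLE_API_KEY"):
    if not os.environ.get(var):
        sys.exit(f"Missing required environment variable: {var}")

if not os.path.isfile(pdf_path):
    sys.exit(f"PDF not found: {pdf_path}")

try:
    pages = PyPDFLoader(pdf_path).load()
except Exception as exc:  # e.g. corrupt or unreadable PDF
    sys.exit(f"Failed to load {pdf_path}: {exc}")

if not pages:
    sys.exit(f"No pages extracted from {pdf_path}")
```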
