# RAG (Retrieval-Augmented Generation)
## Context

Your customer support bot is hallucinating. It invents features that don’t exist and quotes prices from two years ago. The model isn’t stupid — it simply doesn’t have access to your current product data. Your CTO says: “We need RAG.”
RAG (Retrieval-Augmented Generation) is the primary pattern for giving AI features access to external, up-to-date data. Instead of retraining the model, relevant documents are injected into the prompt at runtime. For PMs, this is the most important architectural decision after model selection — and the most common source of avoidable quality problems.
## Concept

### The RAG Pipeline: Embed, Store, Retrieve, Generate

Step 1 — Embed (Indexing): Documents are split into chunks and converted into vectors — high-dimensional numerical representations of meaning. “Car” and “automobile” have similar vectors because they mean similar things. Common embedding models: OpenAI text-embedding-3-large, Cohere embed-v4, and open-source alternatives (BGE, E5).
Step 2 — Store: Embeddings are stored alongside the original text in a vector database. Major options (2026): Pinecone, Weaviate, Qdrant, Milvus, Chroma, pgvector (PostgreSQL extension). PM tip: pgvector is the simplest starting point for teams already using PostgreSQL.
Step 3 — Retrieve: The user query is embedded using the same model. Vector similarity search finds the most relevant chunks (typically the top 3-10). Best practice in 2026: hybrid search — vector similarity combined with keyword/BM25 search, which catches cases where semantic search misses exact terms (product names, error codes).
Step 4 — Generate: Retrieved chunks are injected as context into the LLM prompt. The model generates a response grounded in the retrieved information. Well-designed RAG systems cite which chunks informed the answer.
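The four steps above can be sketched end to end. This is a minimal illustration, not a production implementation: the `embed` function below is a toy bag-of-words stand-in for a real embedding model, the chunks and query are invented, and it retrieves the top 2 instead of the usual 3-10. A real system would call an embedding API and a vector database instead.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: an L2-normalized bag-of-words vector.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return Counter({word: v / norm for word, v in counts.items()})

def cosine(a: Counter, b: Counter) -> float:
    # Vectors are already normalized, so the dot product is cosine similarity.
    return sum(a[word] * b[word] for word in a)

# Steps 1+2 - Embed and store: index each chunk alongside its vector.
chunks = [
    "Employees receive 25 vacation days per year.",
    "The parental leave policy grants 16 weeks of paid leave.",
    "Remote work is allowed up to three days per week.",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

# Step 3 - Retrieve: embed the query with the SAME model, rank chunks by similarity.
query_vec = embed("How many vacation days do employees get?")
ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
top_chunks = [chunk for chunk, _ in ranked[:2]]

# Step 4 - Generate: inject retrieved chunks into the LLM prompt (model call omitted).
prompt = "Answer using only this context:\n" + "\n".join(top_chunks)
```

The key detail the sketch makes visible: the query must be embedded with the same model as the documents, otherwise the similarity scores are meaningless.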
### Chunking — the Underestimated Quality Lever

Chunking quality is the single biggest determinant of RAG quality. Bad chunks mean bad retrieval, and bad retrieval means bad answers.
| Strategy | How it works | Best for | Complexity |
|---|---|---|---|
| Fixed-size | Split every N tokens (e.g., 512) with overlap | Simple documents, getting started | Low |
| Recursive | Split by paragraphs, then sentences, then tokens | General-purpose default | Low |
| Semantic | Split at topic/meaning boundaries | Long documents with topic shifts | Medium |
| Heading-aware | Split by document structure (H1, H2, sections) | Structured docs, manuals | Medium |
| Contextual | LLM-generated context prepended to each chunk | Highest retrieval quality | High |
Best practice: Start with recursive chunking at 512 tokens and 10-20% overlap. Measure retrieval quality. Only move to semantic or contextual chunking after establishing a baseline.
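The size-and-overlap mechanics of the recommended baseline can be sketched as a simple splitter. This is a simplification: it splits on whitespace words rather than tokens, and a true recursive splitter would try paragraph and sentence boundaries before falling back to fixed windows.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into word-based chunks of ~chunk_size, with `overlap` words
    shared between consecutive chunks. Assumes chunk_size > overlap."""
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already reached the end of the document
    return chunks

# A 1,000-word document yields 3 chunks; neighbors share 64 words of overlap.
doc = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_text(doc, chunk_size=512, overlap=64)
```

The overlap is what prevents a sentence that straddles a chunk boundary from being lost to retrieval: it appears in full in at least one chunk.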
## What PMs Get Wrong

- “RAG eliminates hallucinations.” False. RAG reduces them, but the model can still hallucinate beyond retrieved content or misinterpret irrelevant chunks.
- “More data = better RAG.” False. Indexing irrelevant or low-quality documents increases noise and reduces retrieval precision. Curation matters more than expansion.
- “Vector search is all you need.” False. Hybrid search (vector + keyword) is the 2026 default because pure vector search misses exact matches.
## Framework

RAG architecture decision tree:
| Step | Action | When to escalate |
|---|---|---|
| 1. Baseline | Recursive chunking (512 tokens, 10% overlap) + single vector store | When retrieval precision falls below 80% |
| 2. Hybrid | Combine vector + keyword/BM25 search | When exact-term queries fail |
| 3. Reranking | Cross-encoder re-scores initial 20-50 candidates | When precision matters more than latency |
| 4. GraphRAG | Build knowledge graph from corpus | Only when cross-document reasoning is a core requirement |
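For step 2, one common way to combine the vector and keyword result lists is Reciprocal Rank Fusion (RRF). The sketch below assumes the two ranked lists already exist (the chunk IDs are invented); RRF is one fusion method among several, chosen here because it needs only ranks, not comparable scores.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several ranked lists into one.
    A document's fused score is the sum of 1 / (k + rank) over all lists,
    so items ranked highly by multiple retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk-7", "chunk-2", "chunk-9"]   # semantic similarity ranking
keyword_hits = ["chunk-2", "chunk-4", "chunk-7"]  # BM25 ranking (exact terms)
fused = rrf_fuse([vector_hits, keyword_hits])
```

Note how `chunk-2` wins: it is not first in either list, but it appears near the top of both, which is exactly the behavior hybrid search is meant to reward.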
RAG quality metrics PMs must track:
| Metric | What it measures | Why it matters |
|---|---|---|
| Retrieval Precision | Are retrieved chunks relevant? | Irrelevant chunks degrade answers |
| Retrieval Recall | Are all relevant chunks found? | Missing chunks lead to incomplete answers |
| Answer Faithfulness | Does the answer stick to retrieved content? | Detect hallucination despite RAG |
| Answer Relevance | Does the answer address the question? | Quality from the user’s perspective |
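The two retrieval metrics are straightforward set arithmetic over a labeled evaluation set (the chunk IDs below are hypothetical). Faithfulness and relevance, by contrast, usually require an LLM-as-judge and are not shown here.

```python
def retrieval_precision_recall(
    retrieved: set[str], relevant: set[str]
) -> tuple[float, float]:
    """Precision: share of retrieved chunks that are actually relevant.
    Recall: share of all relevant chunks that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical labeled query: 5 chunks retrieved, 4 relevant chunks in the corpus.
p, r = retrieval_precision_recall(
    retrieved={"c1", "c2", "c3", "c4", "c5"},
    relevant={"c1", "c2", "c6", "c7"},
)
```

Averaging these per-query numbers over a labeled evaluation set gives the baseline that the framework table's escalation thresholds refer to.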
## Scenario

You’re a PM at an HR tech SaaS (B2B, 500 enterprise customers). Your next feature: AI-powered access to each customer’s company-specific knowledge base — policies, handbooks, onboarding docs.
The situation:
- Each customer has 200-5,000 documents (PDF, Word, Confluence)
- 80% of queries are about specific policy details (“How many vacation days do I get after 3 years?”)
- Requirement: source citation with every answer (compliance)
- Budget: $3,000/month for AI infrastructure
- Data privacy: tenant separation is mandatory — Customer A must never see Customer B’s data
Options:
1. Long context: pack all documents into the context window with every query. No vector store needed
2. Basic RAG: pgvector + recursive chunking + simple vector search
3. Production RAG: Pinecone + hybrid search + reranking + tenant-separated namespaces + source attribution
## Decide

How would you decide?
The best decision: Option 3 — Production RAG, but start with Option 2 as MVP.
Why:
- Long context is not a solution: 5,000 documents don’t fit in a context window. Even with 200 documents, the cost per query would be astronomical — and quality degrades with context length
- Source citation is a hard requirement: RAG with source attribution is the only pattern that delivers compliance-grade citations. Long context cannot reliably identify which part of the input informed the answer
- Tenant separation: Pinecone namespaces (or pgvector with row-level security) solve the tenant separation problem architecturally
- Hybrid search for policy documents: Policy queries often contain exact terms (“Section 4.2”, “vacation policy 2025”). Pure vector search misses these; hybrid search catches them
- MVP path: Start with pgvector + recursive chunking (Option 2) in weeks 1-2. Measure retrieval quality. Migrate to Pinecone + hybrid + reranking (Option 3) once the baseline is established
Common mistake: Starting with the most complex RAG architecture without having a quality baseline. You don’t know whether reranking adds 5% or 30% until you’ve measured basic RAG.
## Reflect

- RAG is the primary pattern for AI features that need proprietary or current data. It doesn’t replace fine-tuning (RAG provides knowledge, fine-tuning changes behavior), but it solves the most common problem: “The model doesn’t know our data.”
- Chunking quality determines RAG quality. Start with recursive chunking, measure results, and escalate to more complex strategies only then.
- Hybrid search (vector + keyword) is the 2026 standard — not optional when exact terms matter.
- RAG reduces hallucinations but doesn’t eliminate them. Actively measuring answer faithfulness is mandatory.
Sources: Pinecone RAG Architecture Guide, PMC Comparative Evaluation of Advanced Chunking for RAG (2025), Neo4j Advanced RAG Techniques, Eden AI 2025 Guide to RAG, Morphik RAG Strategies at Scale