
RAG Architecture

Grounding LLM responses in your own verified knowledge base instead of model memory

💡
In Plain English

Think of an LLM as a very well-read generalist. RAG is like giving that generalist access to your company's private filing cabinet just before they answer your question. Instead of relying on what they learned in training (which may be outdated or wrong), they read the relevant documents first, then answer. The result is accurate, current, and source-cited.

📈
Business Value

RAG reduces LLM hallucination rates by 60–80% compared to prompt-only approaches, enables AI systems to use private knowledge (contracts, internal policies, product documentation) that was never in training data, and makes AI answers auditable by showing which source documents were used.

📖 Detailed Explanation

Retrieval-Augmented Generation (RAG) is the dominant architecture pattern for deploying LLMs in enterprise settings. The core insight is that LLM parametric memory — knowledge baked into model weights during training — is static, opaque, and prone to confident confabulation. RAG replaces or supplements parametric memory with dynamic retrieval from a vector store, making the knowledge base auditable, updatable, and domain-specific.

The RAG Pipeline has four phases: Ingestion, Retrieval, Augmentation, and Generation. Ingestion: documents are chunked, embedded into vector representations, and stored in a vector database (Pinecone, Weaviate, pgvector, Qdrant). Retrieval: the user query is embedded using the same model, and nearest-neighbor search finds the most semantically similar document chunks. Augmentation: the retrieved chunks are injected into the LLM prompt as context. Generation: the LLM generates a response grounded in the retrieved context, ideally citing its sources.
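The four phases can be sketched end to end in a few lines. This is a toy illustration, not production code: `embed` here is a bag-of-words counter standing in for a real dense embedding model, the "chunks" are whole documents, and the final LLM call is omitted.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: a bag-of-words vector.
    # Production systems use a learned dense embedding instead.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# 1. Ingestion: chunk (here, one doc = one chunk), embed, store.
docs = [
    "Data retention policy: customer records are kept for seven years.",
    "Password policy: rotate credentials every ninety days.",
]
store = [(doc, embed(doc)) for doc in docs]

# 2. Retrieval: embed the query with the SAME model, rank by similarity.
query = "How long do we keep customer records?"
ranked = sorted(store, key=lambda item: cosine(embed(query), item[1]),
                reverse=True)

# 3. Augmentation: inject the retrieved chunk into the prompt as context.
context = ranked[0][0]
prompt = f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

# 4. Generation: `prompt` would now be sent to the LLM (omitted here).
```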

Chunking Strategy is the most underestimated design decision in RAG. Naive fixed-size chunking (split every 512 tokens) breaks semantic units — a function definition may be split from its explanation, a numbered list may be severed mid-step. Semantic chunking uses sentence boundary detection and paragraph structure. Hierarchical chunking stores multiple granularities: fine-grained chunks for retrieval, parent sections for context window injection. The choice of chunking strategy can swing end-to-end answer quality by 30–40% independently of the LLM choice.
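A minimal sketch of semantic chunking: split on blank-line paragraph boundaries, then pack whole paragraphs into chunks under a word budget, so no paragraph is ever severed mid-thought. Real libraries add sentence-level splitting, overlap, and token-accurate budgets; the `max_words` budget here is a simplification.

```python
def semantic_chunk(text: str, max_words: int = 120) -> list[str]:
    # Split on paragraph boundaries instead of fixed token counts,
    # then greedily pack whole paragraphs up to the word budget.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```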

Embedding Model Selection requires domain-specific evaluation. General-purpose models (OpenAI text-embedding-ada-002, Cohere embed, Google gecko) perform well on broad English text but degrade on technical, legal, or Tagalog/Filipino content. Always evaluate candidate embedding models on a golden question-answer dataset drawn from your actual domain before committing. For Philippine banking deployments, evaluate multilingual embedding models that handle Filipino financial terminology.

Hybrid Search consistently outperforms pure vector search in production. Dense vector search (semantic similarity) excels at conceptual queries: "What is our policy on data retention?" Sparse keyword search (BM25, TF-IDF) excels at specific term lookup: "BSP Circular 982 section 4.3." Reciprocal Rank Fusion (RRF) combines the ranked results from both methods. Cohere's benchmark data shows hybrid search improves precision@5 by 15–25% over pure vector search across enterprise document corpora.
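Reciprocal Rank Fusion itself is only a few lines. Each ranker contributes `1 / (k + rank)` per document; `k = 60` is the constant from the original RRF paper and a common default. The document IDs here are placeholders.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Combine ranked lists (e.g. dense and BM25 results) by summing
    # 1 / (k + rank) for each document across all rankers.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked near the top by both methods ("b" below) beats one that only a single method ranked first.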

Reranking is a second retrieval stage that significantly improves precision. After retrieving the top-20 candidates via hybrid search, a cross-encoder reranker (Cohere Rerank, BGE Reranker, MS MARCO-trained cross-encoders) rescores all 20 by actual relevance to the query — not just vector similarity. The top 3–5 reranked results go into the prompt. This two-stage approach (fast retrieval → precise reranking) achieves near-gold-standard precision at manageable latency.
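The two-stage shape can be expressed generically. Both scoring callables below are stand-ins: in production, `fast_score` would be the hybrid-search score and `rerank_score` a cross-encoder call, which is far more expensive and therefore only applied to the shortlist.

```python
def two_stage_retrieve(query, candidates, fast_score, rerank_score,
                       first_k=20, final_k=5):
    # Stage 1: cheap retrieval score narrows the corpus to first_k.
    shortlist = sorted(candidates, key=lambda d: fast_score(query, d),
                       reverse=True)[:first_k]
    # Stage 2: expensive cross-encoder rescoring on the shortlist only.
    return sorted(shortlist, key=lambda d: rerank_score(query, d),
                  reverse=True)[:final_k]
```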

Evaluation is mandatory and often skipped. Build a golden evaluation dataset of 200–500 question-answer pairs from your knowledge base. Evaluate three metrics independently: Context Recall (did retrieval surface the right documents?), Answer Faithfulness (does the answer only use retrieved content, no hallucination?), and Answer Relevance (does the answer address the question?). The RAGAS framework automates all three. Track these metrics on every pipeline change — a better embedding model may help Context Recall while hurting Answer Faithfulness.
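Context Recall, the retrieval-side metric, is straightforward to compute once the golden dataset labels which chunk answers each question. The `gold_chunk_id` field and the `retrieve` callable are assumed interfaces for illustration; RAGAS provides production-grade versions of all three metrics.

```python
def context_recall(golden: list[dict], retrieve) -> float:
    # Fraction of golden questions for which retrieval surfaces the
    # labelled gold chunk anywhere in its returned results.
    hits = sum(1 for ex in golden
               if ex["gold_chunk_id"] in retrieve(ex["question"]))
    return hits / len(golden)
```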

Production Considerations: metadata filtering prevents date-expired documents from being retrieved (use document date and status fields as filter parameters). Streaming responses improve perceived latency for long answers. Caching of frequent query embeddings and retrieved contexts reduces cost significantly. And critically: log every query, every retrieved chunk, and every response for audit and quality improvement.
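A sketch of retrieval-time metadata filtering, assuming a hypothetical chunk schema with `status`, `effective_from`, and `expires` fields. The filter runs before (or alongside) vector search so superseded documents never reach the prompt.

```python
from datetime import date

def filter_chunks(chunks, as_of=None, allowed_status=("in_force",)):
    # Exclude superseded, not-yet-effective, and expired chunks so
    # vector similarity alone can never surface stale policy text.
    as_of = as_of or date.today()
    return [c for c in chunks
            if c["status"] in allowed_status
            and c["effective_from"] <= as_of
            and (c["expires"] is None or c["expires"] > as_of)]
```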

📈 Architecture Diagram

flowchart LR
    subgraph INGESTION["📥 Ingestion Pipeline"]
        A["Source Documents<br/>PDFs, Confluence, Wikis"] --> B["Chunking<br/>Semantic / Hierarchical"]
        B --> C["Embedding Model<br/>text-embedding-3-large"]
        C --> D[("Vector Store<br/>Pinecone / pgvector")]
        A --> E[("Metadata Store<br/>date, source, classification")]
    end
    subgraph RETRIEVAL["🔍 Retrieval"]
        F["User Query"] --> G["Query Embedding"]
        G --> H["Dense Search<br/>Vector Similarity"]
        G --> I["Sparse Search<br/>BM25 / TF-IDF"]
        H --> J["RRF Fusion<br/>+ Metadata Filter"]
        I --> J
        J --> K["Reranker<br/>Cross-Encoder"]
        K --> L["Top-3 Chunks<br/>+ Source Citations"]
    end
    subgraph GENERATION["✨ Generation"]
        L --> M["Augmented Prompt<br/>System + Context + Query"]
        M --> N["LLM<br/>claude-3-5-sonnet"]
        N --> O["Grounded Response<br/>with Citations"]
    end
    D --> H
    E --> J
    style INGESTION fill:#0f172a,color:#f8fafc
    style RETRIEVAL fill:#1e1b4b,color:#f8fafc
    style GENERATION fill:#052e16,color:#f8fafc

Production RAG architecture showing the full pipeline: document ingestion with semantic chunking, hybrid search retrieval with reranking, and grounded generation with source citations.

🌎 Real-World Examples

Notion AI — Workspace Q&A
San Francisco, USA · Productivity Software · 100M+ pages indexed

Notion AI uses RAG to ground GPT-4 responses in users' actual workspaces. Their chunking splits at Notion block boundaries (the native content unit) rather than arbitrary characters — a semantic boundary perfectly aligned to user-created content. Hybrid search combines their existing BM25 full-text index with dense vector embeddings. For 100,000-page enterprise workspaces, retrieval p99 latency is 120ms including reranking.

✓ Result: Hallucination rate dropped from 22% (LLM-only) to 4.1% (RAG); enterprise NPS for AI features up 31 points

Perplexity AI — Agentic RAG Search
San Francisco, USA · AI Search Engine · 4M+ daily users

Perplexity decomposes complex queries into sub-queries, runs parallel web retrievals, applies cross-encoder reranking across sources, and synthesizes cited responses in < 3 seconds. Multi-hop reasoning: sub-answer 1 informs retrieval query 2. Every response has numbered citations linked to source documents — the production standard for attribution.

✓ Result: 82% accuracy on knowledge-intensive queries vs. 67% for Google Search; 4M+ daily users

HSBC — Regulatory Document Intelligence
London, UK · Global Banking · 2,400+ regulatory pages

HSBC deployed RAG to help compliance teams navigate 2,400+ pages of FCA, PRA, and Basel IV documents. Chunking splits at regulatory paragraph boundaries (e.g., 'Article 5(2)(b)' as a single chunk). Metadata filters restrict retrieval to jurisdiction-relevant and in-force documents — preventing responses based on superseded regulations.

✓ Result: Regulatory query resolution time reduced from 4.5 hours to 22 minutes; compliance analyst capacity 3× without headcount increase

Siemens — Industrial Knowledge Assistant
Munich, Germany · Manufacturing · 80,000+ manual pages

Siemens RAG for field engineers covers 80,000+ pages of technical manuals in German, English, Chinese, and Japanese. BGE-M3 multilingual embeddings handle cross-language retrieval. Hierarchical chunking preserves component → subassembly → procedure context critical for correct service procedures.

✓ Result: First-time fix rate for field repairs: 67% → 89%; 40% reduction in escalations to Germany HQ

🌟 Core Principles

1
Retrieval Quality is the Ceiling on Generation Quality

No LLM can generate a correct answer from incorrect or missing context. Invest disproportionately in retrieval quality — chunking strategy, embedding model selection, hybrid search, and reranking — before tuning the generation step.

2
Chunk Boundaries Must Respect Semantic Units

A chunk should represent one complete thought: a full step, a complete definition, a whole code block. Chunks that split semantic units at arbitrary token boundaries degrade retrieval quality significantly.

3
Evaluate Retrieval and Generation Separately

Conflating retrieval and generation metrics hides root causes. Track Context Recall (retrieval), Answer Faithfulness (no hallucination), and Answer Relevance (addresses the question) as three independent metrics.

4
Hybrid Search Over Pure Vector

Combining dense vector search with sparse keyword search (BM25) consistently outperforms either alone. Deploy Reciprocal Rank Fusion to combine ranked results from both methods.

5
Metadata Filtering Prevents Stale Answers

Vector similarity does not know that a document was superseded six months ago. Add document date, status, and classification as metadata fields. Apply filters at retrieval time to exclude expired or restricted content.

⚙️ Implementation Steps

1

Build the Golden Evaluation Dataset First

Before ingesting a single document, create 200–500 question-answer pairs from your knowledge base. This dataset is your measurement instrument — all pipeline decisions (chunking, embedding model, retrieval strategy) are evaluated against it.

2

Design the Ingestion Pipeline

Source documents → pre-processing (format normalization, table extraction) → chunking (semantic or hierarchical) → embedding → vector store + metadata store. Each stage should be independently replaceable.

3

Implement and Validate Hybrid Search

Deploy both dense (vector) and sparse (BM25) search. Measure precision@5 for each independently on your golden dataset. Implement RRF fusion and measure the combined precision. Expect a 15–25% improvement over vector-only.

4

Add the Reranking Stage

After hybrid search returns top-20 candidates, deploy a cross-encoder reranker to rescore by true relevance. Evaluate the reranked top-5 against your golden dataset. This is typically the highest single-step quality improvement available.

5

Instrument the Full Pipeline

Log every query, retrieved chunks, reranked scores, and generated response. Use an evaluation framework (RAGAS, TruLens, DeepEval) for continuous monitoring. Set quality alerts: if Answer Faithfulness drops below 85%, trigger an investigation.
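The alerting rule reduces to a threshold check over the tracked metrics. The thresholds below are the illustrative values from this section (85% Answer Faithfulness) plus assumed floors for the other two metrics; tune them to your own baseline.

```python
def quality_gate(metrics: dict, thresholds: dict = None) -> list:
    # Return the metrics breaching their alert floors; an empty list
    # means the pipeline change may proceed without investigation.
    thresholds = thresholds or {"faithfulness": 0.85,
                                "context_recall": 0.80,
                                "answer_relevance": 0.80}
    return sorted(m for m, floor in thresholds.items()
                  if metrics.get(m, 0.0) < floor)
```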

✅ Governance Checkpoints

| Checkpoint | Owner | Gate Criteria | Status |
| --- | --- | --- | --- |
| Golden Dataset Created | AI Engineer | 200+ question-answer pairs covering key use cases | Required |
| Chunking Strategy Validated | AI Architect | Retrieval recall benchmarked on golden dataset per chunking strategy | Required |
| Hybrid Search + Reranking Deployed | AI Engineer | Combined pipeline precision@5 exceeds vector-only baseline by >10% | Required |
| RAGAS Evaluation Pipeline Active | MLOps | Automated RAGAS evaluation running on weekly cadence | Required |
| PII Scrubbing in Ingestion | Security Engineer | PII detection and redaction applied before embedding | Required |

◈ Recommended Patterns

✦ Hierarchical Chunking

Store documents at multiple granularities: sentence-level chunks for precise retrieval, paragraph-level for context richness. Retrieve fine-grained chunks; inject their parent sections into the context window for completeness.
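The retrieval-then-expand step is a small lookup, sketched here over an assumed schema where each fine-grained hit carries a `parent_id` and `parents` maps IDs to full section text. Parents are deduplicated in retrieval order so the same section is never injected twice.

```python
def parent_context(child_hits: list, parents: dict) -> list:
    # Map retrieved fine-grained chunks to their parent sections,
    # deduplicated in retrieval order, for context-window injection.
    seen, out = set(), []
    for hit in child_hits:
        pid = hit["parent_id"]
        if pid not in seen:
            seen.add(pid)
            out.append(parents[pid])
    return out
```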

✦ HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer to the query using the LLM, embed the hypothetical answer, then retrieve documents similar to it. Bridges the vocabulary gap between short queries and long documents. Especially effective for technical documentation retrieval.
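The HyDE control flow is just "generate, embed the generation, search". All components below are toy stand-ins (set-overlap similarity, a hard-coded hypothesis); in practice `generate_hypothesis` is an LLM call and `search` hits the vector store.

```python
def hyde_retrieve(query, generate_hypothesis, embed, search, top_k=5):
    # Embed an LLM-drafted hypothetical ANSWER instead of the short
    # query; the draft shares vocabulary with the target documents.
    hypothetical = generate_hypothesis(query)
    return search(embed(hypothetical), top_k)

# Toy demonstration with stand-in components (set-overlap "similarity").
corpus = {
    "d1": {"customer", "records", "kept", "seven", "years", "retention"},
    "d2": {"password", "rotation", "ninety", "days"},
}
result = hyde_retrieve(
    "how long do we keep records?",
    generate_hypothesis=lambda q: "retention policy: records kept seven years",
    embed=lambda text: set(text.replace(":", "").split()),
    search=lambda q_vec, k: sorted(
        corpus, key=lambda d: len(corpus[d] & q_vec), reverse=True)[:k],
    top_k=1,
)
```

The short query shares almost no vocabulary with "d1", but the hypothetical answer does, so retrieval succeeds.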

✦ Self-RAG

The LLM itself decides at each step whether to retrieve additional context, evaluates the relevance of retrieved documents, and critiques its own answer for faithfulness. More accurate than standard RAG but significantly more expensive in token usage.

✦ Agentic RAG

A planning agent breaks complex queries into sub-questions, issues targeted retrievals for each, synthesizes intermediate answers, and composes the final response. Handles multi-hop questions that require combining information from multiple documents.
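The control loop, with multi-hop chaining, can be expressed independently of any one framework. The four callables are assumed interfaces: `decompose` and `answer_one` would be LLM calls, `retrieve` the RAG pipeline above, `synthesize` a final LLM composition step.

```python
def agentic_answer(query, decompose, retrieve, answer_one, synthesize):
    # Multi-hop: each sub-answer is appended to the notes and folded
    # into the NEXT retrieval query (sub-answer 1 informs retrieval 2).
    notes = []
    for sub_q in decompose(query):
        augmented_q = sub_q if not notes else f"{sub_q} (given: {notes[-1]})"
        chunks = retrieve(augmented_q)
        notes.append(answer_one(sub_q, chunks))
    return synthesize(query, notes)
```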

⛔ Anti-Patterns to Avoid

⛔ Top-k Without Reranking

Sending the top-k vector similarity results directly to the LLM without reranking. Vector similarity ≠ relevance. A chunk that contains the same words as the query but in a different context will score highly and mislead the LLM. Always rerank.

⛔ Ignoring Metadata Filtering

Retrieving documents by pure semantic similarity without filtering by date, classification, or status. Expired policies, superseded procedures, and restricted documents get surfaced as valid context. Corrupts answer quality and creates compliance risks.

⛔ Naive Fixed-Size Chunking

Splitting documents at fixed token counts (every 512 tokens) without regard for sentence or paragraph boundaries. Breaks semantic units, splits code examples mid-function, and severs numbered lists mid-step. Use semantic chunking libraries (LangChain's SemanticChunker, LlamaIndex's SentenceSplitter).

🤖 AI Augmentation Extensions

🤖 Continuous RAG Evaluation Pipeline

An automated pipeline runs the golden evaluation dataset against the production RAG system on a scheduled basis. Context Recall, Answer Faithfulness, and Answer Relevance are tracked over time. Quality regressions trigger alerts and block model/pipeline upgrades.

⚡ Expand the golden dataset continuously as new query patterns are observed in production logs. Stale evaluation datasets produce misleading quality signals.
🤖 RAG-Powered Architecture Copilot

This library itself is structured for RAG ingestion. An architecture copilot agent retrieves relevant sections based on architect queries, cites the specific subsection, and generates advice grounded in the documented standards.

⚡ The copilot cites sources so architects can verify advice. It does not replace ARB review or human architectural judgment.
