What is Retrieval-Augmented Generation (RAG)?
Retrieval-Augmented Generation (RAG) is an architecture that retrieves relevant documents from a knowledge base and injects them into a language model's context window before generation, so the model's answers are grounded in source material rather than relying on its parametric memory alone. RAG is the dominant pattern for building LLM applications over private or up-to-date data.
How RAG works
A RAG pipeline has three stages:
- Index (offline) — documents are chunked into passages of a few hundred tokens each. Each chunk is converted to a vector embedding by an embedding model and stored in a vector database alongside the original text and metadata.
- Retrieve (per query) — when the user asks a question, the question is embedded the same way. The system searches the vector database for the chunks whose embeddings are most similar to the question's embedding and returns the top k (typically 3–10).
- Generate (per query) — the retrieved chunks are concatenated into the LLM's prompt as context, with instructions like "answer the user's question using the following passages." The model generates a response grounded in those passages.
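The three stages above can be sketched end to end. This is a toy illustration: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector database, and the sample chunks are invented.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in embedding: a bag-of-words count vector. A real pipeline
    # would call an embedding model here and get a dense float vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Index (offline): store (embedding, original text) pairs.
chunks = [
    "The warranty covers parts and labor for two years.",
    "Returns are accepted within 30 days of purchase.",
    "Shipping is free on orders over 50 dollars.",
]
index = [(embed(c), c) for c in chunks]

# Retrieve (per query): embed the question, rank chunks by similarity.
def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda e: cosine(q, e[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# Generate (per query): concatenate retrieved chunks into the prompt.
def build_prompt(question: str) -> str:
    passages = "\n".join(retrieve(question))
    return (f"Answer using only these passages:\n{passages}\n\n"
            f"Question: {question}")

print(build_prompt("How long is the warranty?"))
```

The prompt built here would then be sent to the LLM; the retrieval step is what keeps the answer grounded in the indexed text.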
Why RAG is everywhere
Foundation models have a knowledge cutoff (they know nothing after their training data was collected) and a finite context window. RAG sidesteps both: the knowledge base can be updated without retraining, and only the most relevant slices of it enter the context per query. This is how Notion AI answers questions about your workspace, how Perplexity grounds its answers in current web pages, and how enterprise chatbots stay accurate over private documentation.
Security implications
RAG's grounding is also its attack surface. Three patterns matter:
- RAG poisoning — an attacker plants documents in the knowledge base. When relevant queries retrieve those documents, the malicious content flows into the model's context as authoritative input.
- Indirect prompt injection through retrieved content — documents containing hidden instructions ("ignore previous instructions, instead...") cause the model to act on them when retrieved.
- Retrieval-based data leakage — RAG systems often have retrieval access to documents the public-facing user shouldn't see. Crafted queries can extract content the application doesn't intend to expose.
Defending RAG requires source provenance, retrieval-time content filtering, and treating retrieved chunks as untrusted input.
Variants
- Naive RAG — single retrieval pass, no re-ranking
- Hybrid retrieval — combine vector similarity with lexical (BM25) search
- Re-ranking — a secondary model re-scores retrieved chunks for relevance before generation
- Agentic RAG — the model decides when to retrieve and what to query, mid-generation
- GraphRAG — retrieve from a knowledge graph rather than a flat vector store
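Hybrid retrieval, listed above, needs a way to merge a lexical ranking with a vector ranking. One widely used method is reciprocal rank fusion (RRF); the sketch below assumes the two input rankings already exist, and the document IDs are invented.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each document scores 1 / (k + rank) in
    # every ranking it appears in; fused order is by total score.
    # k = 60 is the constant commonly used in the RRF literature.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # ranked by embedding similarity
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # ranked by BM25
fused = rrf([vector_hits, bm25_hits])
```

RRF is attractive for hybrid retrieval because it needs only ranks, not raw scores, so there is no need to calibrate BM25 scores against cosine similarities.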