Glossary/RAG (Retrieval-Augmented Generation)

What is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is an architecture that retrieves relevant documents from a knowledge base and injects them into a language model's context window before generation, so the model's answers are grounded in source material rather than relying on its parametric memory alone. RAG is the dominant pattern for building LLM applications over private or up-to-date data.

How RAG works

A RAG pipeline has three stages:

  1. Index (offline) — documents are chunked into passages of a few hundred tokens each. Each chunk is converted to a vector embedding by an embedding model and stored in a vector database alongside the original text and metadata.

  2. Retrieve (per query) — when the user asks a question, the question is also embedded. The system searches the vector database for chunks whose embeddings are most similar to the question's embedding. Top-k chunks (typically 3–10) are returned.

  3. Generate (per query) — the retrieved chunks are concatenated into the LLM's prompt as context, with instructions like "answer the user's question using the following passages." The model generates a response grounded in those passages.

Why RAG is everywhere

Foundation models have a knowledge cutoff (training data) and a finite context window. RAG sidesteps both: the knowledge base can be updated without retraining, and only the most relevant slices of it enter the context per query. This is how Notion AI answers about your workspace, how Perplexity grounds its answers in current web pages, and how enterprise chatbots stay accurate over private documentation.

Security implications

RAG's grounding is also its attack surface. Three patterns matter:

Defending RAG requires source provenance, retrieval-time content filtering, and treating retrieved chunks as untrusted input.

Variants