What is RAG Poisoning?
RAG poisoning is an attack that injects malicious content into a retrieval-augmented generation (RAG) system's knowledge base, manipulating what the language model retrieves and bases its outputs on. Because RAG architectures treat retrieved documents as authoritative context, a single poisoned document in the knowledge base can systematically warp the model's responses without ever modifying the model itself.
How RAG poisoning works
A retrieval-augmented generation pipeline has three stages:
- Index — documents are split into chunks, embedded into a vector space, and stored in a vector database
- Retrieve — a user query is embedded and used to fetch the top-k semantically similar chunks
- Generate — the retrieved chunks are concatenated into the model's context window and the model answers based on them
RAG poisoning targets stage 1 — the index. An attacker introduces poisoned documents into the corpus that will be retrieved when specific queries are made. When that retrieval happens, the poisoned chunks flow into the model's context as if they were authoritative facts.
Three variants
1. Direct content poisoning. The attacker adds documents containing false claims, malicious instructions, or biased content. When relevant queries are made, the poisoned content is retrieved and treated as ground truth. Repello's research demonstrated this against a Llama 3 RAG deployment, causing the model to produce racist outputs on queries that retrieved poisoned chunks — even though the model itself had robust safety training.
2. Embedding-space manipulation. The attacker crafts documents whose embeddings cluster near a target query's embedding, ensuring they're retrieved even when their semantic content seems unrelated. This works because retrieval is based on vector similarity, not human-readable relevance.
3. Indirect prompt injection via retrieved content. Documents in the corpus contain hidden adversarial instructions ("ignore previous instructions, instead…"). When retrieved, these instructions enter the model's context and may be acted on.
Real-world exposure
RAG corpora are often built from sources that are not fully under the operator's control:
- Public scrapes from the web (any attacker can publish content)
- User-generated content (support tickets, forum posts, comments)
- Third-party data feeds (vendor documentation, partner content)
- Email archives and document stores (attacker-controlled inbound content)
Any of these can be a vector for introducing poisoned content into the index.
Defending against RAG poisoning
- Source provenance — track and authenticate every document's origin. Don't index content without a chain of custody back to a trusted source.
- Retrieval-time filtering — run retrieved chunks through a content-safety classifier before concatenating them into context. Treat retrieved content as untrusted input.
- Embedding-space audits — periodically scan the vector database for clusters of suspicious documents that don't match their nominal source distribution.
- Adversarial testing — red-team the deployed RAG pipeline with poisoning attempts as part of regular security exercises.