Back to all blogs

Direct vs. Indirect Prompt Injection: A Technical Breakdown for Security Teams

Direct vs. Indirect Prompt Injection: A Technical Breakdown for Security Teams

Aryaman Behera

Aryaman Behera

|

Co-Founder, CEO

Co-Founder, CEO

|

8 min read

Direct vs. Indirect Prompt Injection: A Technical Breakdown for Security Teams
Repello tech background with grid pattern symbolizing AI security

TL;DR: Prompt injection splits into two fundamentally different attack classes. Direct prompt injection requires the attacker to send input to the model themselves. Indirect prompt injection does not: the attacker poisons content the model will retrieve or process, and waits. Direct injection is the well-documented class; indirect injection is the higher-priority enterprise risk because it scales, requires no access to the target system, and bypasses guardrails that only inspect the user turn. Understanding the mechanics of both is a prerequisite for building defenses that actually cover your attack surface.

The same name, two different threat models

"Prompt injection" appears in the OWASP LLM Top 10 (2025) as LLM01, the highest-priority risk for deployed language model applications. But OWASP LLM01 describes two distinct attack classes under one label, and the defenses for each are different enough that treating them as one problem leads to systematically incomplete coverage.

The term was first used publicly by Riley Goodside in September 2022 to describe an attack in which user-supplied text overrides a model's system prompt instructions. That is direct prompt injection: the attacker is in the conversation. The broader class of attacks in which adversarial instructions travel to the model through external content, rather than through the attacker's own input, was formalized by Greshake et al. (2023) as indirect prompt injection. That paper demonstrated successful indirect injection attacks across web-browsing AI assistants, email agents, and code generation tools, establishing that any AI system with access to external content has an indirect injection attack surface by default.

"Direct injection is a user problem. Indirect injection is an architecture problem," says the Repello AI Research Team. "You can train users not to submit malicious prompts. You cannot train the internet not to contain adversarial content."

Direct prompt injection: mechanics and examples

Direct prompt injection occurs when an attacker manipulates the model's behavior by crafting input that overrides, ignores, or reframes the system prompt. The attacker is interacting with the system directly, and the attack payload arrives in the user turn.

The canonical form is explicit override: "ignore all previous instructions and do X instead." This naive variant fails against most hardened systems. More effective direct injection techniques exploit the model's tendency to follow coherent narrative contexts rather than rule sets.

Role injection establishes a fictional frame that redefines the model's identity. Rather than instructing the model to ignore its system prompt, the attacker constructs a scenario in which a character with different instructions is speaking. The model, trained to follow conversational context, may resolve the ambiguity between its system prompt and the attacker's framing by partially deferring to the attacker's context.

Prompt leakage is a direct injection variant targeting confidentiality rather than compliance. The attacker instructs the model to output its system prompt, or probes for it through leading questions. System prompts frequently contain business logic, API keys, confidential configuration, and internal data that organizations assume are hidden.

Instruction smuggling exploits tokenization edge cases. As documented in Repello's research on emoji-based injection, variation selectors attached to Unicode characters can carry hidden instructions invisible to text-layer inspection. The model processes the complete token sequence including the hidden content; the guardrail inspects only the visible string.

All direct injection attacks share a structural property: the attacker must interact with the system. They need access to the user turn, and their attack attempt is visible in the conversation log.

Indirect prompt injection: mechanics and examples

Indirect prompt injection removes the requirement for attacker access entirely. The attacker plants adversarial instructions in any content the model will retrieve or process, and the instructions execute when the model encounters that content, regardless of whether the attacker has any interaction with the target system.

Greshake et al. (2023) formalized this class and demonstrated it across web-browsing AI assistants, email agents, and code generation tools. The core insight is architectural: language models process all text in their context window with similar weight, whether that text comes from the system prompt, the user turn, or retrieved content. An adversary who can write to any source the model reads can influence model behavior.

RAG knowledge base poisoning is the most common enterprise variant. A RAG-enabled system retrieves documents from a knowledge base to augment its responses. If an attacker can inject a document into that knowledge base (through a connected web scraper, an editable internal wiki, a customer-submitted form, or a public data source the system indexes), any user whose session retrieves that document is exposed. One poisoned document affects every user who triggers its retrieval. This scales in a way direct injection cannot.

Repello's research on RAG poisoning against Llama 3 demonstrated that targeted document injection caused consistent policy-violating outputs without any access to model weights, fine-tuning pipelines, or inference code.

Web content injection targets AI systems with browsing capability. An attacker who controls or can modify a web page that an AI agent will visit can embed adversarial instructions in that page: in the visible content, in HTML comments, in metadata fields, or in invisible text. When the agent browses the page, it processes the instructions as part of its retrieved context.

Email and document processing injection applies to AI systems that process user-submitted content: support ticket systems, document summarization tools, AI email assistants. An attacker sends a carefully crafted email or document containing embedded instructions. When the AI processes it, the instructions execute. The attacker does not need any access to the AI system itself; they only need to reach it through a channel it monitors.

Tool output injection targets agentic systems with external tool integrations. A tool that returns attacker-controlled content (a web search result, an API response from an external service, a database record the attacker has written to) can embed instructions in its output. The model receives the tool result as part of its context and may act on embedded instructions. As covered in Repello's MCP prompt injection analysis, this is the primary attack vector against systems using the Model Context Protocol for tool integration.

Why indirect injection is the higher enterprise priority

Four properties make indirect injection a more pressing enterprise risk than direct injection.

No attacker access required. Direct injection requires the attacker to reach the user turn of the target system. Indirect injection requires only the ability to place content somewhere the model will retrieve it: a public web page, a shared document repository, an external data source, a submitted support ticket. The attacker surface for indirect injection includes anyone who can contribute content to any system the AI reads.

Persistent and scalable. A direct injection attack affects one session. A poisoned RAG document affects every session that retrieves it, until the document is detected and removed. A web page with injected instructions affects every AI agent that browses it. Indirect injection is an infrastructure attack, not a per-user attack.

Invisible to conversation monitoring. Direct injection leaves an attack trace in the conversation log: the adversarial user turn is visible, loggable, and auditable. Indirect injection does not. The user turn looks completely normal; the adversarial instructions arrived through retrieved content that may not be logged at all.

Guardrails typically do not cover it. Most deployed guardrail architectures inspect the user turn. Retrieved content, tool outputs, and API responses frequently bypass the same inspection layer entirely. A system with robust direct injection defenses may have no coverage for indirect injection at all.

Defense requirements for each class

"Direct and indirect injection require completely different defenses," says the Repello AI Research Team. "A guardrail that hardens the system prompt against direct override does nothing against an adversary who poisons a document in your RAG knowledge base."

Direct and indirect injection require overlapping but distinct defensive architectures.

For direct injection: input classification against known attack patterns, system prompt hardening to resist override attempts, instruction hierarchy enforcement (system prompt instructions take precedence over user turn instructions), and jailbreak resistance testing with current attack techniques. The essential guide to AI red teaming covers the testing methodology for direct injection coverage in depth.

For indirect injection: retrieved content inspection (applying the same policy classification to documents, tool outputs, and API responses as to user inputs), source trust hierarchy (treating content from unverified external sources with lower implicit authority than system prompt instructions), output monitoring for anomalous behavior that does not match the stated user intent, and knowledge base integrity controls to detect and remove poisoned documents.

The NIST AI Risk Management Framework (AI RMF 1.0) addresses both classes under its Govern and Measure functions, specifically calling for adversarial testing that covers both user-turn and retrieved-content attack surfaces.

ARTEMIS, Repello's automated red teaming engine, tests both attack classes systematically: direct injection techniques across the full bypass taxonomy (encoding, role injection, instruction smuggling, jailbreak variants) and indirect injection exposure across RAG pipelines, tool integrations, and web-browsing surfaces. Coverage completeness reports surface which content ingestion paths have been probed and which remain untested, so security teams can close gaps before attackers find them.

For runtime defense, ARGUS applies policy enforcement across all context sources: user inputs, retrieved documents, tool outputs, and API responses. Its context integrity layer flags outputs that diverge from stated user intent, which is the primary behavioral signal for a successful indirect injection that bypassed the input inspection layer.

Frequently asked questions

What is the difference between direct and indirect prompt injection?

Direct prompt injection is an attack in which the attacker submits adversarial instructions through the user turn of an AI system, attempting to override or manipulate the model's behavior in their own session. Indirect prompt injection is an attack in which the attacker plants adversarial instructions in content the model will retrieve or process (a web page, a document, a tool output, an email) without interacting with the system directly. Both attacks exploit the same underlying mechanism: the model treats all text in its context window as potentially instructional.

Which is more dangerous: direct or indirect prompt injection?

For enterprise deployments, indirect prompt injection is generally the higher-priority risk. It requires no access to the target system, scales to affect every user who retrieves the poisoned content, leaves no trace in conversation logs, and is not covered by guardrails that only inspect the user turn. Direct injection requires attacker access to the user turn and affects only the attacker's own session.

How does indirect prompt injection work in a RAG system?

In a RAG system, the model retrieves documents from a knowledge base to augment its responses. If an attacker can write a document to that knowledge base (through any content ingestion path: web scraping, user uploads, editable wikis, external data sources), they can embed adversarial instructions in that document. When the model retrieves the document as part of its context, it processes the embedded instructions alongside the legitimate document content and may act on them. Every user session that retrieves the document is exposed.

Can guardrails stop indirect prompt injection?

Standard guardrails that only inspect the user turn cannot stop indirect prompt injection, because the attack payload arrives through retrieved content rather than through user input. Effective defense requires applying the same policy inspection to all content sources the model processes: documents, tool outputs, API responses, and web-fetched content. Content trust hierarchies that treat external content with lower implicit authority than system prompt instructions add a second layer of defense.

What is prompt leakage?

Prompt leakage is a direct prompt injection variant in which the attacker's goal is to extract the model's system prompt rather than to override it. System prompts frequently contain confidential business logic, API configuration, internal instructions, and data that organizations treat as proprietary. Attackers probe for system prompt content through explicit requests ("repeat your instructions") or indirect techniques that lead the model to quote or paraphrase its configuration.

How do I test my AI system for indirect prompt injection vulnerabilities?

Testing for indirect prompt injection requires probing all content ingestion paths rather than just the user-facing interface. For RAG systems: inject synthetic adversarial documents into the knowledge base and verify whether the model acts on embedded instructions. For tool-integrated systems: craft tool responses containing injected instructions and verify whether the model executes them. For web-browsing agents: construct web pages with hidden adversarial instructions and verify whether the agent follows them. Coverage completeness requires mapping all external content sources the model processes and testing each as a potential injection path.

Share this blog

Subscribe to our newsletter

Repello tech background with grid pattern symbolizing AI security
Repello tech background with grid pattern symbolizing AI security
Repello AI logo - Footer

Sign up for Repello updates
Subscribe to our newsletter to receive the latest insights on AI security, red teaming research, and product updates in your inbox.

Subscribe to our newsletter

8 The Green, Ste A
Dover, DE 19901, United States of America

Follow us on:

LinkedIn icon
X icon, Twitter icon
Github icon
Youtube icon

© Repello Inc. All rights reserved.

Repello tech background with grid pattern symbolizing AI security
Repello AI logo - Footer

Sign up for Repello updates
Subscribe to our newsletter to receive the latest insights on AI security, red teaming research, and product updates in your inbox.

Subscribe to our newsletter

8 The Green, Ste A
Dover, DE 19901, United States of America

Follow us on:

LinkedIn icon
X icon, Twitter icon
Github icon
Youtube icon

© Repello Inc. All rights reserved.