Back to all blogs

|
Mar 1, 2026
|
12 min read


-
What prompt injection is, how direct and indirect variants differ, every major attack sub-category, why the model cannot fix it, real incidents, and a layered defense stack. OWASP LLM01 for two years running.
TL;DR
Prompt injection is an attack where an adversary embeds instructions in content an LLM processes, overriding the developer's intended behavior. It has held the OWASP LLM01 position for two consecutive years.
Two primary variants: direct injection (attacker controls user input) and indirect injection (attacker controls content the model reads). Indirect injection is the harder problem and the dominant threat in production.
The model cannot reliably distinguish instructions from data. This is an architectural property, not a bug; no single guardrail eliminates the risk.
In agentic deployments, a successful prompt injection translates directly into unauthorized tool execution, not just a bad output.
Defense requires multiple independent layers: input classification, retrieval pipeline hardening, privilege separation, tool permission minimization, output filtering, and runtime monitoring.
Prompt injection held the top position in the OWASP Top 10 for LLM Applications in 2023. It held it again in 2025. No other vulnerability class has held the top spot across both editions.
That persistence is not a failure of the AI security community to address the problem. It reflects a structural reality: the same architectural property that makes large language models useful (their ability to follow natural language instructions) is what makes them vulnerable to prompt injection. The attack surface is the model's primary capability.
This guide covers what prompt injection is at a technical level, how the two primary variants differ, every major sub-category the field has documented, why model-level fixes are fundamentally limited, real-world incidents, and what a layered defense looks like in production.
What prompt injection actually is
An LLM processes all text in its context window as a single token stream. There is no hardware boundary, no enforced memory segmentation, and no cryptographic separation between the developer's system prompt and the content the model is asked to process. When a developer writes a system prompt that says "You are a customer support agent for Acme Corp. Only answer questions about our product," that instruction occupies the same context space as any user message, retrieved document, tool response, or API payload the model subsequently processes.
Prompt injection exploits this architecture. An attacker embeds instructions in any content the model processes, and if those instructions are weighted appropriately by the model's attention mechanism, they override or supplement the developer's intended behavior. The model does not know it has been manipulated. From its perspective, it is following instructions; they just aren't the ones the developer intended.
OWASP LLM01:2025 defines prompt injection as occurring when "user prompts alter the LLM's behavior or output in unintended ways." That definition covers both primary variants (direct and indirect) and encompasses everything from simple jailbreaks to complex multi-stage attacks in agentic pipelines.
Direct vs. indirect prompt injection: why the distinction matters for defense
Most security teams are familiar with direct prompt injection: the user types something adversarial, the model does something unintended. Defenses built around this model (input classifiers, blocklists, safety fine-tuning) address the direct variant. They do not address indirect injection.
Direct prompt injection occurs when the attacker controls the user input. The attack path is: attacker types adversarial instruction, model processes it, model deviates from intended behavior. The attack surface is user input. The mitigation surface is input validation and classification.
Indirect prompt injection occurs when the attacker controls content the model reads, not the user's direct input. The model retrieves a document, browses a web page, processes an email, or queries an API; that external content contains embedded instructions. The foundational research on this attack class, published by Greshake et al. in 2023 in "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection", demonstrated that every retrieval source an LLM reads is a potential injection vector.
The distinction matters for defense because indirect injection makes the attack surface everything the model reads. A user who never types anything adversarial can still be the victim of an indirect injection attack targeting their agent or assistant. Repello's research on security threats in agentic AI browsers documents exactly this: a malicious web page instructs a browsing agent to exfiltrate session data, and the agent complies because the attacker's instruction arrives through a channel the model treats as informational content.
For a catalog of both variants in production contexts, Repello's prompt injection attack examples covers direct and indirect cases across real deployment architectures.
Every major prompt injection variant
The field has documented a range of sub-categories within the direct and indirect families. Understanding each is necessary for scoping a complete defense posture.
Stored prompt injection. A persistent variant of indirect injection. The attacker writes malicious instructions into a data store that the model later retrieves: a knowledge base, a database, a user profile, a shared document. Every query that retrieves that stored content triggers the injection. The attack persists until the poisoned data is explicitly removed.
RAG poisoning. Specifically targets retrieval-augmented generation pipelines. The attacker introduces documents into the retrieval corpus that contain embedded instructions. When the model retrieves those documents as context for a user query, it executes the embedded instructions alongside the legitimate content. Repello's research on RAG poisoning demonstrated this against Llama 3, showing how adversarially crafted documents in the retrieval corpus could fundamentally alter model behavior across queries.
Multimodal prompt injection. Extends the attack surface to non-text modalities. Instructions can be embedded in images (text overlaid on image content, adversarial pixel patterns that OCR models misinterpret), audio (background audio instructions in voice AI systems), and documents (metadata fields, embedded objects, steganographic encoding in PDF or Office files). Repello's research on prompt injection in voice AI covers the audio injection variant in detail.
Unicode and encoding-based injection. Exploits the gap between what a text string appears to be and what the model's tokenizer processes. Unicode variation selectors, homoglyph substitution, zero-width characters, and right-to-left override sequences can modify the semantic content of an input in ways that bypass text-level classifiers while remaining interpretable at the model level. Repello's original research on emoji-based prompt injection documents a concrete exploitation path through this surface, and the follow-up analysis of why LLM guardrails are blind to emoji injection explains why classifier-based defenses consistently fail against encoding variants.
Agentic prompt injection. The highest-impact variant in the current threat landscape. When an LLM has tool access (APIs, file systems, email, code execution, external services), a successful injection translates not into a bad output but into unauthorized tool execution. A single injected instruction can trigger a chain of tool calls that exfiltrates data, modifies records, or reaches attacker-controlled infrastructure using the application's own credentials. OWASP LLM06:2025 (Excessive Agency) classifies this as a top-ten risk. The OWASP Agentic Top 10 goes deeper on agentic-specific attack chains including cross-agent propagation and memory poisoning.
Many-shot and multi-turn injection. Gradual manipulation across conversation history rather than a single adversarial input. The attacker shifts the model's framing incrementally across turns; establishing a fictional context, then a role, then a scenario that licenses the target behavior; all without triggering any single-turn classifier. This attack class requires multi-turn test coverage; single-request security testing does not catch it.
Why the model cannot fix this
The fundamental limitation is architectural. An LLM is trained to follow instructions expressed in natural language. The training process does not create a reliable internal distinction between "this is a legitimate system instruction" and "this is adversarial content embedded in a retrieved document." Both look like natural language instructions.
Fine-tuning and safety training reduce susceptibility to known attack patterns. They do not eliminate the structural vulnerability. A 2024 evaluation by the UK AI Safety Institute ran structured adversarial testing across 22 frontier AI models. Every model produced harmful or unauthorized outputs under sustained, well-resourced adversarial pressure; not as an indictment of specific vendors, but as a statement about the current state of the architecture.
Instruction hierarchy proposals represent the most structurally promising mitigation: system prompt instructions are marked as higher privilege than user inputs, and retrieved content is marked lower privilege than both. OpenAI introduced a formal hierarchical instruction framework in early 2025. These approaches reduce but cannot eliminate injection risk, because indirect injection can arrive through retrieval channels the model treats as authoritative, and sufficiently sophisticated multi-turn manipulation can shift behavior even under hierarchy constraints.
Breaking Meta's Prompt Guard is a detailed case study in why classifier-based defenses at the model layer remain insufficient: production guardrail systems from major AI labs can be bypassed with targeted adversarial inputs that exploit the same tokenization and context processing properties the guardrails are trying to defend.
Documented real-world prompt injection incidents
The attack class is not theoretical. Production incidents are documented across every major AI platform.
Bing Chat (Sydney) manipulation (2023). Researchers found that embedding instructions in web page content processed by Bing Chat could override its persona and induce policy-violating behavior. The attack was entirely indirect: the user typed nothing adversarial; the web page the model browsed contained the injected instruction.
ChatGPT plugin data exfiltration (2023). Security researchers demonstrated injection through plugin-processed content that redirected the model to construct requests to attacker-controlled endpoints, exfiltrating user conversation context. The plugin architecture, which extended the model's capability to external content, simultaneously extended the injection attack surface.
Microsoft Copilot exfiltration. Multiple independent researchers demonstrated indirect injection through email and document content that induced Copilot for Microsoft 365 to exfiltrate data from connected mailboxes and files. The model's integration with productivity data made the blast radius substantially larger than a standalone chatbot.
MCP tool poisoning to RCE. Repello's original research on MCP tool poisoning demonstrated that a malicious tool definition in the Model Context Protocol can redirect agent execution to attacker-controlled infrastructure, achieving remote code execution through a prompt injection chain entirely within the agent's normal operational parameters.
How to defend against prompt injection in production
No single control eliminates prompt injection risk. Effective defense requires independent layers, each addressing a different attack phase.
Input classification. An LLM-based or embedding-based classifier evaluates incoming content before it reaches the primary model. Classifiers trained on known injection patterns reduce direct injection exposure and catch known indirect variants. The limitation is coverage: classifiers miss novel patterns and encoding-based bypasses consistently.
Retrieval pipeline hardening. For RAG-based applications, validate and sanitize documents before indexing. Apply content integrity checks to detect anomalous instruction patterns. Structure prompts explicitly to mark retrieved content as data rather than instruction, reducing the probability the model treats retrieved text as operator-level guidance.
Instruction hierarchy enforcement. Implement explicit trust tiers in the prompt architecture: system instructions at highest privilege, user messages at medium privilege, retrieved content at lowest privilege. Make the tier boundaries explicit in the prompt structure rather than implicit in position. This reduces but does not eliminate the risk from indirect injection through high-trust retrieval channels.
Tool permission minimization. For agentic deployments, apply least privilege to every tool integration. An agent that needs read-only database access should not have write permissions. An agent scoped to one service should not hold credentials for adjacent services. Reducing the permission scope reduces the impact ceiling of any successful injection.
Output filtering. Evaluate model outputs before they reach downstream systems or users. Flag outputs that contain anomalous URL patterns, structured data exfiltration attempts, identity claims inconsistent with the system prompt, or API call structures not expected from the model's intended behavior.
Runtime monitoring. Continuous behavioral monitoring at the inference layer detects deviations that pre-deployment testing does not anticipate. ARGUS operates at this layer, enforcing adaptive guardrails in under 100 milliseconds, flagging anomalous tool call sequences, and blocking outputs matching known exfiltration signatures without introducing noticeable latency.
Adversarial testing. Identify exploitable weaknesses before attackers do through structured red teaming that covers the full application stack. ARTEMIS runs direct and indirect injection tests through the actual application environment, including UI paths, file upload surfaces, retrieval pipelines, and agentic tool chains, providing coverage that API-level testing alone cannot achieve.
Frequently asked questions
What is prompt injection?
Prompt injection is an attack where an adversary embeds instructions in content processed by a large language model, causing the model to deviate from its developer-intended behavior. It works because LLMs process all context as a single token stream with no enforced boundary between developer instructions and user or retrieved content. OWASP identifies it as LLM01, the top vulnerability in LLM applications for two consecutive years.
What is the difference between direct and indirect prompt injection?
Direct prompt injection occurs when the attacker controls the user input directly. Indirect prompt injection occurs when the attacker controls content the model retrieves from external sources: documents, web pages, emails, API responses, or database entries. Indirect injection is the more dangerous variant in production because the attack surface is everything the model reads, not just what the user types.
Can prompt injection be fully prevented?
No single control eliminates the risk. The structural vulnerability exists because LLMs cannot reliably distinguish between developer instructions and adversarial content embedded in the text they process. Fine-tuning, safety training, and instruction hierarchy approaches all reduce susceptibility but none eliminates it. Effective defense requires layered controls: input classification, retrieval pipeline hardening, privilege separation, tool permission minimization, output filtering, and runtime monitoring.
How does prompt injection affect agentic AI systems?
In agentic deployments, a successful prompt injection translates from a bad model output into unauthorized tool execution. An agent with access to email, file systems, APIs, and external services presents all of those as impact surfaces for a single successful injection. OWASP LLM06:2025 (Excessive Agency) is a top-ten risk because the permission scope of agentic systems directly determines the impact ceiling of any successful attack.
What is the OWASP classification for prompt injection?
Prompt injection is OWASP LLM01 in the OWASP Top 10 for Large Language Model Applications. It held the top position in the 2023 edition and retained it in the 2025 update. The 2025 edition adds more explicit coverage of indirect variants and includes guidance specific to agentic and RAG-based architectures.
What is the difference between prompt injection and jailbreaking?
Prompt injection is an attack against a deployed LLM application: the goal is to override the developer's system prompt or retrieve pipeline controls to achieve unauthorized behavior within the application. Jailbreaking targets the model's underlying safety training: the goal is to elicit outputs the model was trained to refuse, regardless of the application context. The two overlap significantly in technique but differ in target and scope. Many injection payloads also function as jailbreaks, and vice versa.
Conclusion
Prompt injection is not a new problem and it is not close to being solved. Two consecutive years at OWASP LLM01 reflect a genuine and persistent structural challenge: every capability that makes LLMs useful (instruction following, context awareness, retrieval integration, tool use) expands the attack surface at the same time. Defenses reduce exposure, but the attack surface grows with every new integration, retrieval source, and tool connection added to a production deployment.
Security teams running LLM applications in production need to understand the full taxonomy of prompt injection variants, not just direct injection through user input. Indirect injection through retrieval pipelines, agentic tool chains, and multimodal content surfaces now represents the dominant threat in enterprise AI. Addressing it requires layered defenses, continuous adversarial testing, and runtime monitoring that can detect behavioral deviations before they result in data loss or unauthorized action.
To see how Repello approaches prompt injection testing and runtime protection across the full application stack, visit repello.ai/product or request a demo.
Share this blog
Subscribe to our newsletter









