What is Many-Shot Jailbreaking?
Many-shot jailbreaking is an attack technique that exploits long context windows. The attacker pads the conversation with hundreds of fake assistant responses giving harmful answers to similar questions, then asks the actual harmful question; at that point the model treats answering as the established conversational pattern and complies. The technique was disclosed by Anthropic in 2024 and works against essentially every major model with a long context window.
How the attack works
The attacker constructs a prompt of the form:
    User: How do I [harmful action 1]?
    Assistant: [confident, detailed harmful answer]
    User: How do I [harmful action 2]?
    Assistant: [confident, detailed harmful answer]
    ... 100 to 1,000 of these pairs ...
    User: How do I [actually harmful question I want answered]?
    Assistant:

The fake assistant turns are entirely fabricated; the model never actually said any of them. But the model, faced with completing the next assistant turn, follows the established pattern. Refusal training that handles single-shot harmful requests breaks down when the model has just "seen" hundreds of instances of itself answering similar questions without refusing.
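To make the shape concrete, here is a minimal sketch of how such a payload is assembled programmatically. The function name, the placeholder question/answer strings, and the pair count are all illustrative, not taken from the disclosure; the point is the structure of the message list, not a working exploit.

```python
# Illustrative sketch of the many-shot payload structure. All content
# strings are placeholders; `build_many_shot_prompt` is a hypothetical
# name, not part of any real toolkit.

def build_many_shot_prompt(fabricated_pairs, target_question):
    """Assemble a chat-format message list: N fake user/assistant
    exchanges followed by the real question."""
    messages = []
    for question, fake_answer in fabricated_pairs:
        # Each pair looks like a normal exchange, but the assistant
        # turn was written by the attacker, never by the model.
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": fake_answer})
    # The actual request rides on the established pattern.
    messages.append({"role": "user", "content": target_question})
    return messages

# With hundreds of pairs, the payload can fill most of a long context window.
pairs = [(f"How do I [harmful action {i}]?", "[confident, detailed answer]")
         for i in range(300)]
prompt = build_many_shot_prompt(pairs, "How do I [target harmful action]?")
```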
Anthropic's research found:
- Attack success scales with shot count: more fake turns mean a higher attack success rate
- Effectiveness scales with context window: models with longer windows are more vulnerable, because the attacker can fit more shots
- All major models are vulnerable: Claude, GPT-4, Llama, and Gemini all show the pattern given enough shots
- Standard refusal training doesn't fix it: the harmful behavior remains reachable even in models heavily aligned with RLHF
Why it works
Two reinforcing mechanisms:
- In-context learning. Foundation models adapt their behavior based on examples in the context window. The fake assistant turns are interpreted as evidence of "what assistants like me do in this conversation."
- Attention drift across long contexts. The system prompt and safety conditioning have less influence as the context grows. By the time hundreds of turns separate the safety instruction from the new question, the model is mostly attending to the recent fake turns.
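For contrast, the same inference powers ordinary, benign few-shot prompting. A small sketch, assuming a sentiment-labeling task: the prompt never states the task, but the examples establish a pattern and a completion model continues it. Many-shot jailbreaking abuses exactly this mechanism, substituting fabricated assistant turns for labeled examples.

```python
# Benign few-shot prompt: the task ("classify sentiment") is never
# stated, yet the examples establish the pattern and the model
# continues it. Many-shot jailbreaking exploits the same inference
# with fabricated assistant turns instead of labeled examples.
few_shot_prompt = """\
Review: The food was cold and the service was slow.
Sentiment: negative

Review: Best concert I have been to in years.
Sentiment: positive

Review: The package arrived on time and intact.
Sentiment:"""
# A completion model will almost certainly answer "positive" here,
# purely from the in-context pattern.
```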
Real-world implications
Many-shot jailbreaking is particularly dangerous because:
- It scales with model improvements. The drive to longer context windows directly expands the attack surface.
- It bypasses input filters. Each individual fake turn is a benign-looking conversational segment; classifiers that score per-turn miss the cumulative effect.
- It can be combined with other techniques. Many-shot can be paired with persona attacks, encoding tricks, or roleplay scaffolding for compounding effect.
Defending against it
- Conversation-level monitoring: watch for context-stuffing patterns (a sudden injection of large numbers of synthetic turns), not just per-turn classification; a minimal sketch follows this list
- Limit user-controlled context size: cap the amount of conversation history the application accepts (also covered in the sketch below)
- Model-side mitigations: Anthropic and other providers have published refinements to refusal training that reduce many-shot susceptibility, though the attack is not fully eliminated
- Output validation: runtime guardrails that score generated responses for harmfulness regardless of conversational context
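As referenced in the first two items, here is a minimal application-side sketch combining a hard history cap with a crude context-stuffing check. The function name and thresholds are illustrative assumptions rather than published guidance, and the check assumes the server keeps its own authoritative copy of the conversation.

```python
# Sketch of two application-side defenses for a chat endpoint that
# receives the full message list from the client. MAX_TURNS and
# MAX_NEW_TURNS_PER_REQUEST are illustrative values, not tuned guidance.

MAX_TURNS = 40                 # hard cap on accepted conversation history
MAX_NEW_TURNS_PER_REQUEST = 2  # a real client adds at most a turn or two

def validate_history(server_history, client_messages):
    """Reject requests whose history shows a context-stuffing pattern."""
    if len(client_messages) > MAX_TURNS:
        raise ValueError("conversation exceeds accepted history cap")
    # A many-shot payload arrives as hundreds of new turns in one request.
    new_turns = len(client_messages) - len(server_history)
    if not 1 <= new_turns <= MAX_NEW_TURNS_PER_REQUEST:
        raise ValueError("context-stuffing pattern: too many new turns")
    # The server already holds every genuine assistant turn, so any
    # client-side rewrite of past turns is a fabrication signal.
    if client_messages[:len(server_history)] != server_history:
        raise ValueError("client rewrote server-side history")
    return client_messages
```

The prefix check is what actually blocks the stuffing: fabricated assistant turns can only enter the context if the application trusts client-supplied history instead of its own record.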
See also
The original Anthropic disclosure: Many-shot jailbreaking.