What is Many-Shot Jailbreaking?
Many-shot jailbreaking is an attack technique that exploits long context windows. The attacker pads the conversation with hundreds of fake assistant responses giving harmful answers to similar questions, then asks the actual harmful question; at that point the model treats answering as the established conversational pattern and complies. The technique was disclosed by Anthropic in 2024 and works against essentially every major model with a long context window.
How the attack works
The attacker constructs a prompt of the form:
    User: How do I [harmful action 1]?
    Assistant: [confident, detailed harmful answer]
    User: How do I [harmful action 2]?
    Assistant: [confident, detailed harmful answer]
    ... 100 to 1,000 of these pairs ...
    User: How do I [actually harmful question I want answered]?
    Assistant:

The fake assistant turns are entirely fabricated; the model never actually said any of them. But the model, faced with completing the next assistant turn, follows the established pattern. Refusal training that handles single-shot harmful requests breaks down when the model has just "seen" hundreds of instances of itself answering similar questions without refusing.
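To make the shape concrete, here is a minimal sketch of how such a payload is assembled programmatically. The function name, the placeholder question/answer strings, and the pair count are all illustrative, not taken from the disclosure; the point is the structure of the message list, not a working exploit.

```python
# Illustrative sketch of the many-shot payload structure. All content
# strings are placeholders; `build_many_shot_prompt` is a hypothetical
# name, not part of any real toolkit.

def build_many_shot_prompt(fabricated_pairs, target_question):
    """Assemble a chat-format message list: N fake user/assistant
    exchanges followed by the real question."""
    messages = []
    for question, fake_answer in fabricated_pairs:
        # Each pair looks like a normal exchange, but the assistant
        # turn was written by the attacker, never by the model.
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": fake_answer})
    # The actual request rides on the established pattern.
    messages.append({"role": "user", "content": target_question})
    return messages

# With hundreds of pairs, the payload can fill most of a long context window.
pairs = [(f"How do I [harmful action {i}]?", "[confident, detailed answer]")
         for i in range(300)]
prompt = build_many_shot_prompt(pairs, "How do I [target harmful action]?")
```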
Anthropic's research found:
- Attack success scales with shot count: more fake turns mean a higher attack success rate
- Effectiveness scales with context window: models with longer windows are more vulnerable, because the attacker can fit more shots
- All major models are vulnerable: Claude, GPT-4, Llama, and Gemini all show the pattern given enough shots
- Standard refusal training doesn't fix it: the harmful behavior remains reachable even in models heavily aligned with RLHF
Why it works
Two reinforcing mechanisms:
- In-context learning. Foundation models adapt their behavior based on examples in the context window. The fake assistant turns are interpreted as evidence of "what assistants like me do in this conversation."
- Attention drift across long contexts. The system prompt and safety conditioning have less influence as the context grows. By the time hundreds of turns separate the safety instruction from the new question, the model is mostly attending to the recent fake turns.
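For contrast, the same inference powers ordinary, benign few-shot prompting. A small sketch, assuming a sentiment-labeling task: the prompt never states the task, but the examples establish a pattern and a completion model continues it. Many-shot jailbreaking abuses exactly this mechanism, substituting fabricated assistant turns for labeled examples.

```python
# Benign few-shot prompt: the task ("classify sentiment") is never
# stated, yet the examples establish the pattern and the model
# continues it. Many-shot jailbreaking exploits the same inference
# with fabricated assistant turns instead of labeled examples.
few_shot_prompt = """\
Review: The food was cold and the service was slow.
Sentiment: negative

Review: Best concert I have been to in years.
Sentiment: positive

Review: The package arrived on time and intact.
Sentiment:"""
# A completion model will almost certainly answer "positive" here,
# purely from the in-context pattern.
```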
Real-world implications
Many-shot jailbreaking is particularly dangerous because:
- It scales with model improvements. The drive to longer context windows directly expands the attack surface.
- It bypasses input filters. Each individual fake turn is a benign-looking conversational segment; classifiers that score per-turn miss the cumulative effect.
- It can be combined with other techniques. Many-shot can be paired with persona attacks, encoding tricks, or roleplay scaffolding for compounding effect.
Defending against it
- Conversation-level monitoring: watch for context-stuffing patterns (a sudden injection of large numbers of synthetic turns), not just per-turn classification; a minimal sketch follows this list
- Limit user-controlled context size: cap the amount of conversation history the application accepts (also covered in the sketch below)
- Model-side mitigations: Anthropic and other providers have published refinements to refusal training that reduce many-shot susceptibility, though the attack is not fully eliminated
- Output validation: runtime guardrails that score generated responses for harmfulness regardless of conversational context
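As referenced in the first two items, here is a minimal application-side sketch combining a hard history cap with a crude context-stuffing check. The function name and thresholds are illustrative assumptions rather than published guidance, and the check assumes the server keeps its own authoritative copy of the conversation.

```python
# Sketch of two application-side defenses for a chat endpoint that
# receives the full message list from the client. MAX_TURNS and
# MAX_NEW_TURNS_PER_REQUEST are illustrative values, not tuned guidance.

MAX_TURNS = 40                 # hard cap on accepted conversation history
MAX_NEW_TURNS_PER_REQUEST = 2  # a real client adds at most a turn or two

def validate_history(server_history, client_messages):
    """Reject requests whose history shows a context-stuffing pattern."""
    if len(client_messages) > MAX_TURNS:
        raise ValueError("conversation exceeds accepted history cap")
    # A many-shot payload arrives as hundreds of new turns in one request.
    new_turns = len(client_messages) - len(server_history)
    if not 1 <= new_turns <= MAX_NEW_TURNS_PER_REQUEST:
        raise ValueError("context-stuffing pattern: too many new turns")
    # The server already holds every genuine assistant turn, so any
    # client-side rewrite of past turns is a fabrication signal.
    if client_messages[:len(server_history)] != server_history:
        raise ValueError("client rewrote server-side history")
    return client_messages
```

The prefix check is what actually blocks the stuffing: fabricated assistant turns can only enter the context if the application trusts client-supplied history instead of its own record.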
See also
The original Anthropic disclosure: Many-shot jailbreaking.