Glossary/Many-Shot Jailbreaking

What is Many-Shot Jailbreaking?

Many-shot jailbreaking is an attack technique that exploits long context windows: the attacker fills a single prompt with hundreds of fabricated dialogue turns in which an assistant gives detailed answers to harmful questions, then appends the real harmful question. Faced with completing the next assistant turn, the model treats answering as the established conversational pattern and complies. Anthropic disclosed the technique in 2024, and it has been shown to work against most major models with long context windows.

How the attack works

The attacker constructs a prompt of the form:

User: How do I [harmful action 1]?
Assistant: [confident, detailed harmful answer]
 
User: How do I [harmful action 2]?
Assistant: [confident, detailed harmful answer]
 
... 100 to 1,000 of these pairs ...
 
User: How do I [actually harmful question I want answered]?
Assistant:

The fake assistant turns are entirely fabricated; the model never actually said any of them. But the model, faced with completing the next assistant turn, follows the established pattern. Refusal training that handles single-shot harmful requests breaks down when the model has just "seen" hundreds of examples of itself answering similar questions without refusing.
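As a concrete illustration, here is a minimal Python sketch of how such a prompt is assembled. The message layout follows the common chat-API convention of alternating user and assistant roles; build_many_shot_prompt and the placeholder topics are hypothetical and not tied to any specific provider's client.

# Minimal sketch of the many-shot prompt structure (placeholders only).
# The alternating user/assistant roles fabricate a long prior conversation
# that never actually happened.

def build_many_shot_prompt(fake_pairs, target_question):
    """fake_pairs: list of (question, fabricated_answer) tuples."""
    messages = []
    for question, fabricated_answer in fake_pairs:
        messages.append({"role": "user", "content": question})
        # Fabricated turn: the model never produced this text.
        messages.append({"role": "assistant", "content": fabricated_answer})
    # The question the attacker actually wants answered comes last, so the
    # model's next assistant turn continues the established pattern.
    messages.append({"role": "user", "content": target_question})
    return messages

# Placeholder usage: 256 fabricated pairs followed by the real question.
pairs = [(f"How do I [harmful action {i}]?", "[detailed fabricated answer]") for i in range(256)]
prompt = build_many_shot_prompt(pairs, "How do I [target question]?")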

Anthropic's research found:

  1. Attack effectiveness rises predictably with the number of fabricated turns, following a power law: a handful of shots does little, while hundreds reliably elicit harmful responses across a wide range of request categories.

  2. Larger models tend to be more susceptible, plausibly because they are stronger in-context learners in general.

  3. Standard alignment fine-tuning raises the number of shots needed but does not eliminate the effect; with enough shots the jailbreak still succeeds.

Why it works

Two reinforcing mechanisms:

  1. In-context learning. Foundation models adapt their behavior based on examples in the context window. The fake assistant turns are interpreted as evidence of "what assistants like me do in this conversation." The sketch after this list shows the same mechanism in a harmless setting.

  2. Attention drift across long contexts. The system prompt and safety conditioning have less influence as the context grows. By the time hundreds of turns separate the safety instruction from the new question, the model is mostly attending to the recent fake turns.
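The first mechanism is easiest to see in a benign setting: a few input-output examples establish a pattern, and a capable model completes the final line in kind without any explicit instruction. Many-shot jailbreaking hijacks exactly this behavior, substituting fabricated assistant turns for the harmless examples. The toy prompt below is purely illustrative.

# Benign in-context learning: the pattern in the examples, not any explicit
# instruction, is what tells the model how to complete the last line.
few_shot_prompt = "\n".join([
    "sea -> mar",
    "sky -> cielo",
    "moon -> luna",
    "sun ->",  # a capable model completes this as "sol" purely from the pattern
])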

Real-world implications

Many-shot jailbreaking is particularly dangerous because:

  1. It requires no special access or tooling: the attacker only needs to submit a long prompt through an ordinary API or chat interface, and the fabricated turns can themselves be generated by another model.

  2. Its effectiveness grows with the number of shots that fit in context, and context windows keep getting longer.

  3. It is hard to eliminate outright, because the in-context learning it exploits is the same capability that makes long-context models useful.

Defending against it
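Anthropic's disclosure discusses several mitigations: capping the usable context length (effective but costly, since long context is the point), additional fine-tuning on many-shot-style prompts (which raises the number of shots needed but does not eliminate the effect), and classifying or rewriting prompts before the model sees them, which in their tests reduced the attack's success rate the most.

Below is a minimal sketch of a pre-inference check in that last spirit. It assumes the incoming request is already parsed into role-tagged messages; the turn-count heuristic and threshold are hypothetical illustrations, not Anthropic's actual classifier.

# Hypothetical pre-inference heuristic: flag requests that embed an unusually
# long run of dialogue turns before they ever reach the model. A real filter
# would use a trained classifier; this only illustrates the idea.

SUSPICIOUS_TURN_COUNT = 32  # illustrative cutoff, far above normal conversation depth

def looks_like_many_shot(messages):
    """messages: list of {"role": ..., "content": ...} dicts from one request."""
    embedded_assistant_turns = sum(1 for m in messages if m["role"] == "assistant")
    return embedded_assistant_turns >= SUSPICIOUS_TURN_COUNT

def guard(messages):
    # Route suspicious requests to refusal, review, or a prompt-rewriting step
    # instead of passing them straight through to the model.
    if looks_like_many_shot(messages):
        return {"role": "assistant", "content": "I can't help with that."}
    return None  # request proceeds to the model unchanged

A robust version would also scan message contents, since the fabricated dialogue can be packed into a single user-role string rather than supplied as separate messages.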

See also

The original Anthropic disclosure: Many-shot jailbreaking.