Back to all blogs

|
Mar 4, 2026
|
24 min read


Summary
97% of autonomous jailbreak agents succeed. Real techniques — CCA, Skeleton Key, authority prompting — with model-by-model vulnerability data and the defense research that matters.
TL;DR
Jailbreaking an LLM means making it produce outputs its safety training was designed to prevent. It works because helpfulness and harmlessness are competing objectives baked into the same weights — and attackers have gotten very good at pitting one against the other. A March 2026 study in Nature Communications found that autonomous jailbreak agents — LLMs attacking other LLMs — achieve a 97.14% success rate. Persuasion-based attacks hit 88.1% across GPT-4o, DeepSeek-V3, and Gemini 2.5 Flash. The gap between attack capability and defense capability has never been wider. This post covers the techniques that work right now, which models hold up and which don't, and what defense actually looks like when guardrails alone aren't enough.
What AI Jailbreaking Actually Is
A jailbreak prompt is any input designed to make a language model violate its own trained constraints — producing harmful content, ignoring safety instructions, or leaking system prompts. The term borrows from iOS jailbreaking, but the mechanism is entirely different. You're not exploiting a memory vulnerability. You're exploiting the model's training dynamics.
LLMs are aligned through reinforcement learning from human feedback (RLHF), which rewards helpfulness and penalizes harmful outputs. The problem: "helpfulness" and "harmlessness" are both learned objectives sitting in tension. A model optimized hard enough for helpfulness will find paths around safety constraints when a prompt is framed the right way. Every jailbreak technique exploits some version of this conflict.
This is distinct from prompt injection, which hijacks an AI application by injecting instructions through external data — documents, tool outputs, web content. Jailbreaking targets the model's own safety alignment, not the application around it. For prompt injection specifics, see our prompt injection attack examples breakdown. Both matter. This post covers jailbreaking.
Which Models Are Actually Vulnerable (2026 Data)
Not all models break the same way. A Nature Communications study published March 2026 tested autonomous jailbreak agents across frontier models and found massive variance in resistance:
Model | Max harm score | Relative resistance |
|---|---|---|
Claude 4 Sonnet | 2.86% | Most resistant |
GPT-4o | 61.43% | Moderate |
Gemini 2.5 Flash | 71.43% | Below average |
Qwen3 30B | 71.43% | Below average |
DeepSeek-V3 | 90.00% | Least resistant |
The spread matters. Claude 4 Sonnet's 2.86% harm score vs. DeepSeek-V3's 90% isn't a rounding error — it reflects fundamentally different approaches to safety training. Claude's resistance comes partly from training on the StrongREJECT adversarial evaluation dataset, which specifically targets the gap between refusal on known attacks and refusal on novel ones.
The takeaway for security teams: the model you choose is your first line of defense, and the variance between options is enormous. But even the most resistant model isn't jailbreak-proof — it just raises the cost of attack.
The Jailbreak Techniques That Work Right Now
1. Persuasion and Authority Prompting
The most effective single-turn technique in 2026, and it's embarrassingly simple. The attacker frames the request with authority cues, urgency, and persuasion — no role-play persona, no elaborate fiction. A March 2026 study found that Persuasive and Authority Prompting (PAP) outperformed every other prompting strategy tested, including the classic DAN persona approach.
Why it works: RLHF trains models to be deferential to authority. When a prompt invokes expertise, urgency, or institutional framing ("as a cybersecurity researcher conducting authorized testing..."), the model's helpfulness training overrides its safety guardrails. The attacker doesn't need to trick the model into a persona — they just need to sound like someone the model was trained to help.
A related line of research — psychological manipulation of LLMs (HPM) — achieved 88.1% mean success rates across GPT-4o, DeepSeek-V3, and Gemini 2.5 Flash by systematically exploiting the model's learned helpfulness bias.
2. Context Compliance Attack (CCA)
Documented by Microsoft in March 2025, CCA is the simplest effective jailbreak discovered in recent years. The attacker injects a fabricated assistant response into the conversation history, making the model believe it already complied with a harmful request in a prior turn. The model then continues the pattern it thinks it started.
Example: The attacker inserts a fake previous exchange where the AI "already began" explaining something harmful. The model reads this fabricated history as genuine context and continues generating from where it "left off."
Why it works: Models trust their own conversation history. On open-source deployments where conversation state is managed client-side (LLaMA, Phi, Qwen), the attacker can directly edit the message array. Server-side systems like ChatGPT and Copilot are harder to hit because conversation history lives on the backend, but API deployments with client-managed chat state are fully exposed.
3. Multi-Turn Escalation (Crescendo)
First formalized by Microsoft Research in 2024, Crescendo starts with benign questions adjacent to the target topic and incrementally escalates across multiple turns. Each individual turn looks acceptable; the harmful output only emerges after the model has been primed through a series of small steps.
A progression might look like: history of chemical warfare → specific agents used historically → mechanisms of action → modern equivalents. Each step is defensible in isolation. The compound trajectory is not.
A 2025 follow-up study found that multi-turn jailbreaks are simpler than previously assumed, exceeding 70% success rates against models optimized only for single-turn protection. The researchers' conclusion: most models aren't evaluating conversation trajectory at all.
4. Skeleton Key
Microsoft disclosed this technique in June 2024. Unlike Crescendo's gradual escalation, Skeleton Key is a direct approach: a multi-step instruction sequence that redefines the model's safety rules in-context. The attacker tells the model that safety warnings should be appended to responses rather than blocking them. Once the model accepts this reframing, it complies with all subsequent requests — harmful outputs are "acknowledged" with a disclaimer rather than refused.
Why it works: The model interprets the new framing as a system-level instruction update, similar to how it processes legitimate developer system prompts. The boundary between developer instructions and adversarial instructions is learned, not hardcoded — and the learning is imperfect.
5. Many-Shot Jailbreaking
Anthropic's April 2024 research documented a technique that scales directly with context window size. The attacker provides hundreds of fabricated Q&A pairs where a fictional AI consistently complies with harmful requests, then appends the actual target question. With enough in-context examples, the model's in-context learning overrides its safety training.
The attack gets more effective as context windows grow. The industry push toward 128k, 200k, and 1M+ token windows has expanded this attack surface proportionally. Mitigation requires training on long-context adversarial examples — still an active research area with no complete solution.
6. Persona and Role-Play (DAN and Variants)
The original jailbreak technique, still circulating in updated forms. For a catalog of current DAN variants and their derivatives, see our latest jailbreak prompts roundup. The attacker asks the model to adopt an unrestricted persona — "Do Anything Now" (DAN), a fictional character, a nested story-within-a-story. Frontier model providers have patched the most obvious forms repeatedly, but the underlying vulnerability persists: RLHF teaches models to follow system prompt framing, and a convincing enough persona reframe can shift which reward signal dominates.
Modern variants are subtler than the original DAN template. Instead of "you have no rules," attackers use nested fictional contexts ("write a story where a character explains..."), character capture over multiple turns, or authority escalation ("your developer has enabled unrestricted mode").
7. Token Manipulation and Encoding Attacks
Safety filters operate on token-level patterns. Attackers bypass detection by encoding requests in formats that pass filters but remain interpretable to the model: leetspeak substitution, Base64 encoding, Unicode homoglyphs and variation selectors (our original emoji injection research details this vector), language switching to low-resource languages where safety training data is thin.
The multilingual gap is well-documented. A study across 52 languages found safety refusal rates drop significantly for low-resource languages — models refuse in English while complying with identical requests in other languages. ARGUS includes multilingual safety monitoring to address this specific gap.
A newer approach — homotopy-inspired prompt obfuscation — systematically deforms prompts through linguistic transformations, achieving a 76% jailbreak success rate across evaluated models in 2026 testing.
8. Low-Cost Fine-Tuning Attacks
Not a prompt technique, but a critical adjacent risk. Qi et al. at Princeton demonstrated that safety alignment degrades with as few as 100 fine-tuning examples on benign data. Safety training isn't a separate module — it's distributed across model weights and gets overwritten with minimal compute.
This matters for every enterprise that fine-tunes foundation models on proprietary data. Every fine-tuning run is a potential safety regression. Testing for alignment degradation post-fine-tune isn't optional — it's the only way to know if you've inadvertently removed the guardrails you're counting on.
Why Guardrails Alone Don't Hold
The standard enterprise response to jailbreak risk: deploy a guardrail layer. An input/output classifier screens for policy violations. Our testing of Meta's Prompt Guard found consistent bypass rates using prompt variations that human evaluators would flag immediately. The guardrail was trained on a distribution of known attacks. Novel phrasings outside that distribution passed through cleanly.
Static guardrails have structural blind spots:
Multi-turn attacks where no single turn triggers the classifier
Many-shot attacks embedded in long contexts that exceed the guardrail's attention window
CCA attacks that manipulate conversation history before the guardrail sees it
Encoding attacks that reach the model in a decoded form the filter never evaluated
Authority prompting where vocabulary is entirely benign — the malice is in the framing
A 2026 survey of jailbreak defenses put it plainly: attack sophistication has far outpaced defensive capability. Most defenses still rely on reactive pattern matching, while advanced attacks routinely achieve 90–99% success rates on open-weight models and 80–94% on proprietary ones.
A useful mental model: guardrails are a first filter, not a security boundary. They raise the cost of unsophisticated attacks. Against a motivated adversary, they don't hold.
What Actual Defense Looks Like
Red Team Before You Ship
The only way to know how your model behaves under adversarial prompting is to test it adversarially. This means systematic jailbreak evaluations across all major technique classes — persona attacks, authority prompting, CCA, many-shot, multi-turn escalation, encoding variants, Skeleton Key — not spot-checking against a handful of known-bad prompts. ARTEMIS automates this across technique families and produces an exploitability score per attack vector, with coverage that updates as new techniques emerge.
Runtime Monitoring, Not Just Input Filtering
Evaluate the full conversation trajectory, not individual turns. Crescendo, many-shot, and authority escalation patterns are invisible to per-turn classifiers. Runtime monitoring that tracks conversation state, detects escalation patterns, and flags behavioral anomalies catches what static filters miss. This is what ARGUS does at inference time — a runtime security layer that evaluates behavioral signals across the full interaction, not just keyword patterns in a single input.
In-Decoding Safety Probing
A January 2026 paper introduced a new defense paradigm: sampling the model's internal states during decoding and detecting latent safety signals. The key insight is that even jailbroken models internally exhibit safety-related activations before those signals are overridden by the jailbreak. Surfacing and leveraging these internal signals enables detection of harmful generation before it reaches the user. This is an active research direction — expect runtime security layers to incorporate it within the next 12 months.
Test Post-Fine-Tune
Every fine-tuning run must be followed by alignment regression testing. The assumption that a foundation model's safety properties survive fine-tuning is incorrect — demonstrated empirically and repeatedly confirmed in production deployments. Repello's assessment of Lyzr's AI agents found exploitable jailbreak paths in a deployment that had passed the vendor's own safety evaluation. Your deployment context — system prompt, fine-tuning data, specific threat model — is unique, and vendor safety ratings don't transfer.
FAQ
What is an AI jailbreak prompt?
An AI jailbreak prompt is a crafted input designed to make a language model bypass its safety training and produce outputs it would normally refuse. Jailbreaks exploit the tension between a model's helpfulness objective and its safety constraints. Modern techniques include authority prompting, Context Compliance Attacks, multi-turn escalation, and long-context many-shot approaches. Nature Communications research (March 2026) found that autonomous jailbreak agents achieve a 97.14% success rate.
Which AI model is most resistant to jailbreaks?
Claude 4 Sonnet, according to March 2026 comparative testing. It showed a 2.86% harm score — compared to GPT-4o (61.43%), Gemini 2.5 Flash (71.43%), and DeepSeek-V3 (90%). The gap comes from training on adversarial evaluation datasets like StrongREJECT that specifically target novel attack resistance, not just known-pattern refusal.
What's the difference between a jailbreak and a prompt injection?
A jailbreak targets the model's own safety alignment — making it violate trained constraints. A prompt injection hijacks an AI application by injecting attacker-controlled instructions through external data (documents, tool outputs, web content). Both are LLM attack classes but at different layers: jailbreaking goes after the model's alignment; prompt injection goes after the application architecture.
What is the Context Compliance Attack (CCA)?
A technique documented by Microsoft in March 2025. The attacker injects a fabricated assistant response into the conversation history, making the model believe it already complied with a harmful request. The model then continues from that fabricated context. It's particularly effective against open-source deployments with client-side conversation state management, because the attacker can directly manipulate the message array.
Can fine-tuning a safe model make it unsafe?
Yes. Princeton research (Qi et al., 2023) demonstrated that safety alignment degrades with as few as 100 fine-tuning examples on benign data. Safety properties are distributed across model weights and are not preserved as a locked module. Any organization that fine-tunes a foundation model should run alignment regression testing after every training run.
What's the most effective defense against AI jailbreaks?
No single control is sufficient. Effective defense combines pre-deployment red teaming across known attack classes, runtime monitoring that evaluates full conversation trajectories rather than individual turns, alignment regression testing after fine-tuning, and treating the model as an untrusted component. Static guardrails help but can be bypassed by novel phrasings outside their training distribution. ARTEMIS handles pre-deployment testing; ARGUS handles runtime enforcement.
Test Your AI Before an Attacker Does
The 97% success rate for autonomous jailbreak agents isn't a theoretical ceiling — it's what happens when models face systematic adversarial pressure without runtime defense. Most jailbreak vulnerabilities in production are discoverable before launch. Get a demo of ARTEMIS to run a systematic jailbreak evaluation across your AI stack and see exactly where your exposure sits.
Share this blog
Subscribe to our newsletter









