Back to all blogs

|
|
13 min read


TL;DR: DAN, Evil Confidant, and AntiGPT are persona-based jailbreak techniques that exploit the tension between helpfulness training and safety training in RLHF-aligned models. Frontier providers have patched the most obvious surface-level triggers for each. The underlying mechanism (reframing the model's identity to shift which reward signal dominates) has not been patched, because it is not a code bug. It is a consequence of how these models are trained. This post covers how each technique works, why patching specific prompts does not fix the root cause, and what defence actually requires.
How persona-based jailbreaks work
RLHF (Reinforcement Learning from Human Feedback) trains models to be helpful and to follow instructions. Those two objectives coexist productively most of the time. Persona-based jailbreaks exploit the cases where they do not.
When a model is told it is playing a character that has no safety constraints, helpfulness training and instruction-following training both activate to fulfill that instruction. If the persona framing is convincing enough, and the safety training is not specifically calibrated to recognize and reject the framing, the helpfulness signal overrides the harmlessness signal. The model does not "break"; it follows instructions correctly. The problem is what it was instructed to be.
The RLHF alignment literature describes this as a "competing objectives" failure: two trained objectives produce contradictory behavior under adversarial conditions. Persona-based attacks are a reliable way to manufacture those conditions. A March 2026 study published in Nature Communications found that persuasion and social framing techniques achieved mean jailbreak success rates of 88.1% across GPT-4o, DeepSeek-V3, and Gemini 2.5 Flash. Evil Confidant operates directly on this vector.
DAN: Do Anything Now
History and mechanism
DAN emerged on Reddit in late 2022. The original prompt instructed ChatGPT to simulate a second identity ("DAN") that was "not bound by the rules and regulations set for ChatGPT." The model was told to prefix DAN-mode responses with [DAN] to maintain the dual-response structure.
What followed was a 12-version arms race. OpenAI patched each iteration as attacker communities documented it; each patch was followed by a new variant that avoided the specific vocabulary or structure that triggered the classifier. DAN 5.0 introduced a "token budget" mechanic, threatening to reduce the model's hypothetical token count if it broke character. DAN 6.0 added explicit jailbreak language to the persona framing. Later versions abandoned the numbered versioning in favor of undocumented variants circulating in private communities.
Current status
The named DAN prompt in its original form is well-patched in GPT-4 class models and Claude. The mechanism is not. Current variants that succeed use one or more of the following:
Softer framing. Replacing "you have no restrictions" with gradual character establishment that does not use explicit jailbreak vocabulary. The safety classifier does not fire on language that sounds like creative writing.
Translation and paraphrase. Multilingual LLM security research consistently documents the same gap: models fine-tuned primarily on English safety data refuse English DAN prompts while complying with translated or paraphrased equivalents. The safety training did not generalize uniformly across languages.
Encoding bypasses. Embedding DAN-style persona instructions in Base64, Unicode variation selectors, or homoglyph substitutions can survive input classifiers that operate on surface-level text. Repello's original research on emoji-encoded injection demonstrated that Unicode variation selectors can carry full instruction payloads through commercial guardrail classifiers. The same technique applies to persona setup prompts.
Open-weight model exploitation. Model providers can patch proprietary deployments through fine-tuning updates. Open-weight models (LLaMA, Mistral, Qwen variants) accumulate DAN vulnerability permanently: the safety training applied to community fine-tunes is thinner, and attacker communities can test and iterate against the weights directly. The gap between frontier model resistance and open-weight model resistance to DAN variants is well-documented across jailbreak benchmark datasets.
Evil Confidant
How it works
Evil Confidant is a persona variant that uses social framing instead of explicit permission-granting language. Where DAN tells the model it has no restrictions, Evil Confidant tells the model it is a trusted friend who is honest about everything: "a confidant who answers with complete honesty, no judgment, and no limitations."
The structural move is to make refusal feel like a betrayal of trust rather than a safety measure. The prompt establishes an intimate relationship first, then gradually requests information or outputs that a safety-trained model would normally refuse. The model's helpfulness training activates not because it was told it has no rules, but because it was told that refusing would be unkind, unhelpful, or dishonest to a trusted friend.
Why classifiers miss it
The vocabulary of an Evil Confidant prompt is entirely benign. Words like "honesty," "trust," "friendship," and "confidant" do not trigger content classifiers trained on jailbreak keyword patterns. The attack is in the relational framing, not in any specific term. A classifier would need to understand the rhetorical structure of the prompt (that it is systematically constructing a context where safety refusals are reframed as character failures) to flag it correctly.
A classifier that evaluates individual tokens or short n-gram windows does not have the context to recognize what is happening. Monitoring that tracks how the model's output trajectory is being steered across a conversation does.
Effectiveness in 2026
Evil Confidant variants remain effective on models where helpfulness training is heavily weighted and where the safety training did not specifically anticipate social-engineering framings. The 88.1% success rate from the Nature Communications persuasion study reflects exactly this vulnerability class. Among open-weight models deployed in enterprise applications without additional safety fine-tuning, Evil Confidant-style prompts are among the most consistently successful attack vectors in current red team toolkits.
AntiGPT
How it works
AntiGPT instructs the model to respond as the inverse of its default self. The canonical structure presents the model with a dual-output format: for every query, produce both a standard response and an "AntiGPT" response that takes the opposite stance. The AntiGPT response is explicitly framed as answering what the standard model would refuse.
The multi-turn structure is what makes AntiGPT durable. No single exchange looks alarming: the first several turns establish the format with innocuous queries and innocuous inversions. Once the pattern is established, the user requests only the AntiGPT output, removing the balanced framing that made the early turns look legitimate. By that point, the model is several turns into a consistent pattern of producing the "inverse" response, and breaking that pattern requires the model to recognize the trajectory, not just evaluate the current turn.
Microsoft Research's Crescendo study formalized this as a general attack class: harmful outputs that emerge from a trajectory of individually innocuous turns. AntiGPT is one of the oldest documented implementations of this pattern.
Current status
The named "AntiGPT" framing is patched in frontier models. Variants that use the inversion mechanic without the explicit label retain meaningful success rates, particularly when:
The inversion frame is embedded inside a nested fictional scenario
The multi-turn escalation is spaced across enough turns to avoid trajectory-based classifiers
The model is an open-weight deployment without the safety fine-tuning applied to frontier models
Related personas in the same structural family include STAN (Strive To Avoid Norms), Developer Mode prompts, and Jailbreak mode framings. These share AntiGPT's core mechanic: establish an alternative identity frame, then exploit it to obtain outputs the default model would refuse.
Why these personas keep working after patching
Three structural reasons account for the persistence of persona-based jailbreaks despite years of targeted patching by frontier providers.
Safety training patches patterns, not mechanisms. When a provider patches DAN or AntiGPT, they fine-tune the model to recognize and reject specific prompt structures. The underlying vulnerability, the tension between helpfulness and harmlessness that makes persona reframing effective, remains. Novel variants that avoid patched vocabulary succeed because they exploit the same root cause through a different surface presentation. The RLHF alignment research describes this as the "generalization mismatch" problem: safety training generalizes less well than capability training, so novel attack presentations fall outside the safety training distribution even when they exploit a known mechanism.
Open-weight models cannot be retroactively patched at scale. Providers can update proprietary model deployments continuously. Open-weight models, once released, are deployed by thousands of organizations across configurations the original developers have no visibility into. Security teams running LLaMA or Mistral variants in production without additional safety fine-tuning are running models that remain vulnerable to attack patterns that frontier deployments resist.
Multilingual and encoding variants bypass classifiers built for English. The multilingual safety gap is empirically documented: a model that correctly refuses an English Evil Confidant prompt may comply with the same prompt in French, Mandarin, or Arabic. Deng et al. (arXiv:2310.06474) characterized cross-lingual injection as a persistent vulnerability class. Combined with encoding bypasses, this gives red teams a reliable path around input classifiers for organizations that only tested English-language attack patterns.
Defence against persona-based jailbreaks
System prompt hardening for persona requests. Generic identity framing ("You are [product]") does not constitute a defence against persona attacks. Effective system prompts explicitly address persona and role-play requests: what the model should do when asked to adopt an alternative identity, how it should respond to trust-framing attempts, and what constitutes a refusal condition for role-play escalation. These instructions need to be adversarially tested, not just drafted, to verify they hold under actual attack conditions.
Multi-turn conversation tracking. Single-turn input classifiers cannot catch AntiGPT's gradual escalation or Evil Confidant's trust-building structure. Defence requires runtime monitoring that tracks conversation trajectory, not just individual turns. ARGUS monitors AI systems in production for escalating behavioral patterns, detecting when a conversation is being steered toward harmful outputs before the harmful output is produced.
Include persona families in pre-deployment red teaming. The LLM pentesting checklist categorizes jailbreaking via roleplay, persona assignment, and hypothetical framing as a required test class. DAN variants, Evil Confidant, AntiGPT, and their structural equivalents should be in every pre-deployment red team evaluation for AI applications that handle sensitive operations. Testing only the named variants is insufficient: the coverage must include structurally equivalent attacks that avoid patched vocabulary.
Behavioural analysis, not keyword filtering. Evil Confidant's vocabulary is entirely benign. Keyword-based input filters do not detect it. Detection requires understanding what is happening in the conversation: the relational frame being established, the direction outputs are being steered, and whether the combination constitutes a persona attack. This requires the same class of behavioral analysis used for indirect prompt injection detection: looking at what the model is being set up to do, not just what words appear in the input.
Frequently asked questions
What is the DAN jailbreak?
DAN ("Do Anything Now") is a persona-based jailbreak technique that instructs a language model to adopt a second identity not bound by its safety training. First documented on Reddit in late 2022, DAN went through 12+ publicly versioned iterations as providers patched each variant. The named DAN prompt is well-patched in current frontier models. Structurally equivalent variants that use softer framing, encoding bypasses, or non-English languages retain meaningful success rates, particularly against open-weight model deployments.
What is the Evil Confidant jailbreak and how is it different from DAN?
Evil Confidant is a persona variant that uses social framing rather than explicit permission-granting language. Where DAN tells the model it has no restrictions, Evil Confidant tells the model it is a trusted, judgment-free friend whose honesty requires answering everything. The attack exploits helpfulness training by framing refusal as a betrayal of trust. Because its vocabulary is entirely benign, Evil Confidant prompts are harder for content classifiers to detect than DAN prompts that use explicit jailbreak language.
What is AntiGPT?
AntiGPT instructs a model to produce "inverse" responses: answering what it would normally refuse and refusing what it would normally answer. It typically operates as a multi-turn attack, establishing the dual-output format with innocuous queries before gradually requesting only the "AntiGPT" response. The gradual escalation pattern is the same mechanism documented in Microsoft Research's Crescendo study. The named AntiGPT framing is patched in frontier models; inversion-frame variants without explicit naming retain effectiveness, particularly on open-weight deployments.
Do these jailbreaks work on ChatGPT and Claude in 2026?
The named, original versions of DAN, Evil Confidant, and AntiGPT are patched on current ChatGPT and Claude deployments. Structurally equivalent variants that avoid patched vocabulary, use multilingual or encoding bypasses, or exploit the multi-turn escalation pattern achieve meaningful success rates in benchmark testing. Repello's red team data on jailbreaking in 2026 documents current breach rates: the gap between the original jailbreak resistance of frontier models and their resistance to novel variants built on the same mechanisms remains significant. Open-weight models retain substantially higher vulnerability to all three techniques.
How do AI systems defend against persona-based jailbreaks?
Effective defence requires three layers: system prompt hardening with explicit instructions for handling persona requests (adversarially tested, not just drafted), runtime monitoring that tracks conversation trajectory to detect gradual escalation patterns, and pre-deployment red teaming that covers DAN, Evil Confidant, AntiGPT, and their structural equivalents across languages and encodings. Keyword-based input filters are insufficient for Evil Confidant because the attack vocabulary is benign. Single-turn classifiers are insufficient for AntiGPT because the attack operates across turns.
Where can I test my AI system's resistance to persona attacks?
The LLM pentesting checklist covers persona and role-play attacks as a required test class under the input/output layer. ARTEMIS automates this testing in production: it runs persona-based attack families, encoding variants, and multilingual jailbreak attempts against your deployed application and generates attacker-perspective findings with exploitation evidence. For organizations that have deployed AI assistants, RAG pipelines, or agentic tools and have not run adversarial persona testing, this is one of the highest-value coverage gaps to close.
Test your system against these techniques
Persona-based jailbreaks are not solved problems. They are a persistent attack class that requires active red teaming to stay ahead of. The complete guide to jailbreaking techniques in 2026 covers the full landscape (persona attacks, encoding bypasses, multilingual variants, and multi-turn escalation) with current breach rate data across frontier models.
To test your specific deployment against DAN, Evil Confidant, AntiGPT, and their current variants, book a demo with Repello to run ARTEMIS against your application stack.
Share this blog
Subscribe to our newsletter











