Jailbreaking Aligned Models: How RLHF Safety Training Gets Bypassed

TL;DR: RLHF makes language models significantly harder to jailbreak, but it does not make them jailbreak-proof. Two structural failure modes explain most successful bypasses: competing objectives (safety training conflicts with helpfulness training, and attackers exploit the tension) and generalization mismatch (safety training covers the distribution it was trained on, not the full attack surface). Adversarial suffixes, multi-turn erosion, and persona-based bypasses all exploit one or both failure modes. RLHF-trained models still require continuous adversarial testing; alignment reduces the attack surface, it does not close it.

Alignment is not the same as security#

Reinforcement Learning from Human Feedback is the standard technique for making language models follow instructions, decline harmful requests, and behave consistently with their developers' stated policies. Every major deployed language model, including GPT-4, Claude, and Gemini, uses some form of RLHF or a closely related technique.

The security framing of RLHF is important and frequently misunderstood. RLHF is an alignment technique: it trains a model to produce outputs that human raters prefer. The assumption built into that framing is that human raters reliably prefer safe, on-policy outputs. That assumption holds for the distribution of inputs the raters evaluated. It does not hold for adversarially constructed inputs that were never part of the training distribution.

Repello's red team data from the claude-jailbreak analysis shows a 4.8% breach rate for Claude 3.5 Sonnet, compared to 28.6% for GPT-4 variants on equivalent probe sets. RLHF clearly reduces breach rates. The 4.8% that remain represent a persistent, exploitable attack surface, not an acceptable residual.

How RLHF safety training works#

Standard RLHF safety training runs in two phases. First, a reward model is trained on human preference data: pairs of model outputs where raters selected the preferred response. Second, the base language model is fine-tuned using reinforcement learning to maximize the reward model's score.

For safety specifically, the training includes a class of refusals: the model is rewarded for declining requests that violate policy. Raters evaluate outputs on a defined rubric, and the model learns to produce outputs that score well on that rubric across a training distribution of inputs.

The training distribution is the key constraint. Safety training generalizes well within the distribution of inputs that raters evaluated. It generalizes poorly to inputs that are sufficiently far from that distribution in the embedding space of the reward model. Attackers do not need to find inputs that a rater would identify as harmful; they need inputs that the reward model scores favorably despite containing harmful intent. These are different problems.

The two structural failure modes#

Wei et al. (2023) analyzed jailbreak success patterns across multiple RLHF-trained models and identified two root causes that account for most successful bypasses.

Competing objectives: Safety training adds a refusal objective to a model that also has a strong helpfulness objective. These objectives conflict on a subset of inputs, and the model resolves conflicts probabilistically based on how the objectives were weighted during training. Jailbreaks that exploit this failure mode reframe harmful requests as helpful ones, constructing scenarios in which the model's helpfulness objective dominates over its refusal objective. Persona-based attacks ("act as a character who would answer this"), roleplay framing ("in this fictional scenario..."), and indirect phrasing ("what would someone need to know to do X?") all operate through this mechanism.

Generalization mismatch: Safety training generalizes to inputs that are semantically similar to training examples. It does not generalize to inputs that express the same intent through sufficiently novel structures. This includes encoding attacks (Base64, leetspeak, Unicode homoglyphs), low-resource language switching, and adversarially constructed token sequences that do not resemble natural language. The reward model that scored training examples has no meaningful signal for these inputs because they fall outside its training distribution entirely.

Both failure modes are structural. They follow directly from how RLHF works, not from mistakes in any particular implementation. Patching specific jailbreak examples addresses individual instances without closing the underlying failure modes, which is why jailbreak patches are consistently bypassed by the next generation of attack techniques.

Adversarial suffix attacks: exploiting generalization mismatch at scale#

The most technically rigorous exploitation of RLHF's generalization failure is the GCG (Greedy Coordinate Gradient) attack documented by Zou et al. (2023). The attack appends a nonsensical suffix to an otherwise harmful prompt. The suffix is optimized through gradient-based search to maximize the probability that the model begins its response with an affirmative token (e.g., "Sure, here is...") rather than a refusal.

The optimized suffixes transfer across model families. A suffix optimized against one open-source model successfully elicits compliance from black-box models including GPT-4 and Claude variants, despite those models having different architectures and training pipelines. Transfer works because the optimization exploits structural properties of the alignment training, not model-specific parameters.

The significance for enterprise security teams is the distinction between targeted and automated attacks. GCG-style suffix optimization can be run at scale: an attacker who has access to any open-source RLHF-trained model can generate transferable adversarial suffixes without API access to the target system. The computation happens offline; the attack is delivered as a simple string append.

Multi-turn erosion: competing objectives over extended context#

Single-turn jailbreaks fail more frequently than multi-turn attacks against well-aligned models. Multi-turn erosion works by gradually shifting the model's context over several conversational turns before introducing the harmful request.

The mechanism exploits the competing objectives failure mode across time rather than within a single prompt. Early turns establish a cooperative, helpful conversational context. Intermediate turns introduce elements that move the conversation incrementally toward the harmful domain, each step individually innocuous and unlikely to trigger a refusal. The final turn presents the actual harmful request against a context that has progressively primed the model to treat cooperation as the dominant objective.

Repello's red team data shows that multi-turn attacks are disproportionately represented in successful bypasses against models with low single-turn breach rates. A model that consistently refuses a direct harmful request may comply with the same request when it arrives as the final step of a structured erosion sequence, because the accumulated context has shifted the model's objective weighting. Single-turn testing does not surface this failure mode.

What RLHF-specific red teaming should target#

Standard red teaming probe sets, including most commercial automated red teaming tools, were built primarily to test single-turn direct injection. They systematically undercover the attack surface that matters most for RLHF-trained models.

"Aligning a model is not the same as securing it," says the Repello AI Research Team. "RLHF tells you what the model does on the training distribution. Red teaming tells you what it does everywhere else."

Effective red teaming for RLHF-trained models requires specific attention to four probe classes that alignment creates but does not eliminate.

Competing objective probes: Construct requests that directly pit the helpfulness objective against the refusal objective. Persona assignments, fictional framing, and professional authority claims (the "I'm a medical professional" pattern) belong here. Coverage requires testing each policy-sensitive domain, not just generic harmful content categories.

Distribution-shift probes: Test inputs that are structurally unlike natural language: encoded content, low-resource language variants, Unicode manipulation, and adversarially structured token sequences. RLHF safety training has near-zero coverage of this region; breach rates are typically much higher here than on natural language probes.

Multi-turn erosion sequences: Build probe chains that introduce harmful intent over 4-8 turns. Test whether progressive context establishment changes compliance rates on requests that the model refuses in single-turn testing.

Cross-version regression testing: Each RLHF update changes the model's behavior across all three failure modes. The NIST AI Risk Management Framework (AI RMF 1.0) calls for adversarial testing to be continuous rather than periodic precisely because model updates shift the risk profile without resetting the clock on discovered vulnerabilities. A probe that fails against the current version may succeed against the next version if the update shifted objective weights. Continuous regression testing is required; one-time assessments become stale at the next model update.

ARTEMIS runs all four probe classes systematically and tracks breach rates across model versions, giving security teams a continuous view of how RLHF updates shift their attack surface rather than a point-in-time assessment that expires at the next release cycle. This is directly relevant to the case Repello makes in the zero-day collapse analysis: the window between a model update and the discovery of new bypasses has compressed to hours, not weeks.

Frequently asked questions#

What is RLHF and how does it relate to AI safety?

RLHF (Reinforcement Learning from Human Feedback) is a training technique that fine-tunes a language model using human preference data. A reward model is trained on pairs of outputs where human raters selected the preferred response, and the language model is then optimized through reinforcement learning to maximize that reward signal. For safety, RLHF trains the model to decline policy-violating requests by rewarding refusals and penalizing harmful outputs within the training distribution.

Why does RLHF safety training still leave models vulnerable to jailbreaks?

RLHF safety training has two structural failure modes identified by Wei et al. (2023). First, competing objectives: safety training and helpfulness training conflict on a subset of inputs, and attackers can construct prompts that cause the helpfulness objective to dominate. Second, generalization mismatch: safety training generalizes to inputs similar to those raters evaluated, but not to adversarially constructed inputs outside that distribution. Both failure modes are inherent to how RLHF works, not correctable through incremental patching.

What is an adversarial suffix attack against an RLHF-aligned model?

An adversarial suffix attack appends an optimized token sequence to a harmful prompt. The suffix is constructed through gradient-based search to maximize the probability that the model begins its response affirmatively rather than with a refusal. Zou et al. (2023) demonstrated that suffixes optimized against open-source models transfer to black-box production models, including GPT-4 and Claude variants, because they exploit structural properties of alignment training rather than model-specific parameters.

What is multi-turn erosion?

Multi-turn erosion is a jailbreak technique that introduces harmful intent gradually over several conversational turns rather than in a single request. Early turns establish a cooperative context; intermediate turns incrementally prime the model toward the target domain; the final turn presents the harmful request against a context that has shifted the model's objective weighting toward helpfulness. Models with low single-turn breach rates can show substantially higher compliance rates in multi-turn erosion sequences.

Does patching known jailbreaks fix the underlying problem?

No. Patching specific jailbreak examples addresses individual instances but does not close the underlying failure modes (competing objectives and generalization mismatch). Each patch shifts the model's behavior on the patched distribution without covering the broader attack surface. New bypass techniques consistently emerge after patches, exploiting the same structural failure modes through novel input constructions that fall outside the updated training distribution.

How should security teams test RLHF-trained models differently from other models?

RLHF-trained models require four probe classes that standard red teaming tools often undercover: competing objective probes (persona assignments, authority claims, fictional framing), distribution-shift probes (encoded content, low-resource languages, Unicode manipulation), multi-turn erosion sequences, and cross-version regression tests across model updates. Single-turn natural language probe sets underestimate breach rates for RLHF-aligned models because they do not target the specific failure modes that alignment creates.