
What is an LLM Jailbreak?

An LLM jailbreak is a technique that causes a language model to produce content or take actions that its safety training, system prompt, or operator policy is designed to prevent. Where prompt injection focuses on overriding instructions, jailbreaking specifically targets the model's refusal behavior — making it say or do something it would normally decline.

How jailbreaks work

A deployed language model typically has two layers of refusal:

  1. Safety training — RLHF, Constitutional AI, and similar methods that teach the base model to refuse harmful requests
  2. System prompt restrictions — application-layer instructions like "you are a customer service bot for X, refuse anything off-topic"

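As a rough illustration of the second layer, the sketch below shows an application-layer restriction, assuming the Anthropic Python SDK. The model identifier, the system prompt wording, and the `ask_support_bot` helper are hypothetical placeholders; the first layer (safety training) lives in the model weights and is not something the application configures.

```python
# Minimal sketch of the application-layer refusal layer (layer 2).
# Assumes the Anthropic Python SDK; the model name and system prompt
# are illustrative placeholders, not a recommended configuration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Layer 2: an operator-written system prompt that narrows the bot's scope.
SYSTEM_PROMPT = (
    "You are a customer service bot for Acme Widgets. "
    "Only answer questions about Acme orders, shipping, and returns. "
    "Politely refuse anything off-topic."
)

def ask_support_bot(user_message: str) -> str:
    """Send a user message under the restricted system prompt."""
    response = client.messages.create(
        model="claude-sonnet-4-5",   # placeholder model identifier
        max_tokens=512,
        system=SYSTEM_PROMPT,        # layer 2: application restriction
        messages=[{"role": "user", "content": user_message}],
    )
    # Layer 1 (safety training) applies regardless of this system prompt:
    # the model refuses clearly harmful requests even without it.
    return response.content[0].text

if __name__ == "__main__":
    print(ask_support_bot("Where is my order #1234?"))
```
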
A jailbreak bypasses one or both. Common technique families:

Why jailbreaks still work in 2026

Modern models — Claude Opus 4.6, GPT-5.2, Gemini 2.5 — are substantially more resistant to one-shot jailbreaks than their predecessors. But:

Defending against jailbreaks