Glossary/Constitutional AI

What is Constitutional AI?

Constitutional AI (CAI) is an alignment method developed by Anthropic that uses a written list of principles — a "constitution" — to drive model self-critique and self-revision, replacing or supplementing the human raters that traditional RLHF depends on. It is the technique behind Claude's safety posture and an increasingly common alternative to standard RLHF.

The core mechanic

Where RLHF needs humans to rank model outputs, Constitutional AI uses the model itself to evaluate and revise its responses against a constitution. The training loop:

  1. Generate — the model produces a response to a prompt
  2. Critique — the model is asked to identify ways the response violates principles in the constitution ("does this response cause harm? is it deceptive? is it discriminatory?")
  3. Revise — the model produces a corrected response addressing those critiques
  4. Train — the model is fine-tuned on the (prompt, revised-response) pairs

A second phase, RLAIF (Reinforcement Learning from AI Feedback), removes the human rater: the constitution-following critic model compares pairs of responses, and those AI-generated preferences train the reward model used for reinforcement learning.
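Assuming a generic `model` callable standing in for an LLM (hypothetical — any string-to-string completion function works), the four-step loop above can be sketched as:

```python
def constitutional_revision(model, prompt, constitution):
    """One pass of the CAI self-critique loop (a sketch, not Anthropic's code).

    model: any callable str -> str acting as the LLM.
    Returns a (prompt, revised_response) pair for supervised fine-tuning.
    """
    response = model(prompt)  # 1. Generate an initial response
    for principle in constitution:
        # 2. Critique: ask the model to flag violations of this principle
        critique = model(
            f"Principle: {principle}\nResponse: {response}\n"
            "Identify any way the response violates the principle."
        )
        # 3. Revise: produce a corrected response addressing the critique
        response = model(
            f"Critique: {critique}\nOriginal response: {response}\n"
            "Rewrite the response to address the critique."
        )
    # 4. Train: the caller fine-tunes on the returned pair
    return prompt, response
```

In the published method a single principle is typically sampled per critique pass rather than iterating over all of them; the loop here is simplified for readability.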

What's in a constitution

Anthropic's published constitutions for Claude draw from sources including the UN Universal Declaration of Human Rights, Apple's terms of service, DeepMind's Sparrow principles, and principles written by Anthropic's own researchers.

The principles are stated in plain language ("choose the response that is least likely to enable malicious activity," "avoid sounding preachy or overbearing") and applied stochastically during training — a principle is sampled at random for each critique pass, so the model learns to satisfy the set as a whole rather than any single rule.
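A constitution in this sense is just a list of natural-language strings. A minimal sketch of sampling one principle to frame a critique prompt (principle texts taken from the examples above; the function name and prompt wording are illustrative, not Anthropic's):

```python
import random

CONSTITUTION = [
    "Choose the response that is least likely to enable malicious activity.",
    "Avoid sounding preachy or overbearing.",
]

def sample_critique_prompt(user_prompt, response, rng=random):
    """Pick one principle at random and frame it as a critique request."""
    principle = rng.choice(CONSTITUTION)
    return (
        f"Principle: {principle}\n"
        f"Prompt: {user_prompt}\n"
        f"Response: {response}\n"
        "Critique: does the response violate the principle above?"
    )
```

Because a different principle can be drawn on each pass, repeated critique rounds push the response toward satisfying the whole constitution rather than overfitting to one rule.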

Why it matters

Constitutional AI has three advantages over pure RLHF:

  1. Scale. Human rating is the bottleneck in RLHF; AI critics can produce orders-of-magnitude more training signal at lower cost.
  2. Transparency. The constitution is auditable — you can read exactly what principles the model was trained against. RLHF's "rater preferences" are opaque.
  3. Consistency. Different human raters apply different standards; a single critic model applies the same standard to every example.

The downside: the critic model has its own biases, and any blind spot in the constitution propagates through training. CAI is also more compute-intensive than DPO-style preference optimization.

Constitutional AI in deployed models

Claude (all variants) is the most prominent CAI-trained model line. Repello's comparative red-team study of GPT-5.1, GPT-5.2, and Claude Opus 4.5 found Claude had the lowest breach rate across 21 multi-turn adversarial scenarios — 4.8% versus 28.6% for GPT-5.1 and 14.3% for GPT-5.2 — consistent with Constitutional AI's stated goal of more uniform refusal behavior.

That said, "more consistent refusal" is not the same as "unbreakable." Modern jailbreaks against Claude still work; Constitutional AI raises the floor, not the ceiling.

See also

The original Anthropic paper, Constitutional AI: Harmlessness from AI Feedback, lays out the full method. Anthropic publishes updated constitutions periodically as Claude evolves.