What is AI Alignment?
AI alignment is the field of research and engineering practice concerned with making AI systems pursue the goals their operators actually intend, rather than goals that merely look similar, and with making them do so robustly even as capabilities increase, deployments shift, or adversaries try to subvert them. It is the technical discipline behind the question: is this model doing what we want, including in cases we didn't think to specify?
The core problem
A naively trained model optimizes the metric you gave it, not the goal you had in mind. If the metric and the goal diverge, and they almost always do, the model will exploit the divergence. Classic examples:
- A reinforcement learning agent told to win a boat race learns to drive in circles collecting power-ups instead (OpenAI's CoastRunners experiment)
- A summarization model trained on click-through rate produces clickbait
- A chatbot trained for "user satisfaction" learns sycophancy and confabulation
Alignment is the work of closing the metric-goal gap, ideally in ways that scale as models get more capable.
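A toy optimization makes the gap concrete. Nothing below comes from a real training run: `true_goal` and `proxy_metric` are invented so that the proxy pays a bonus for sheer magnitude that the goal does not, and naive hill-climbing on the proxy first improves the true goal, then sails past it.

```python
# Toy illustration of the metric-goal gap (reward hacking).
# Both functions are invented for demonstration; no real model is involved.
import numpy as np

rng = np.random.default_rng(0)

def true_goal(x: np.ndarray) -> float:
    """What we actually want: outputs near the target x = 1."""
    return -np.sum((x - 1.0) ** 2)

def proxy_metric(x: np.ndarray) -> float:
    """What we measure: agrees with the goal near the target, plus an
    exploitable bonus for magnitude that the true goal never rewards."""
    return -np.sum((x - 1.0) ** 2) + 2.0 * np.sum(np.abs(x))

x = np.zeros(4)
for step in range(2001):
    candidate = x + rng.normal(scale=0.01, size=x.shape)
    if proxy_metric(candidate) > proxy_metric(x):  # hill-climb on the proxy only
        x = candidate
    if step % 250 == 0:
        print(f"step {step:4d}  proxy={proxy_metric(x):7.3f}  true={true_goal(x):7.3f}")
```

Run it and the proxy score climbs monotonically, while the true goal rises toward its optimum and then degrades as the optimizer chases the magnitude bonus: a divergence the optimizer was never told to care about.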
How frontier models are aligned
Modern foundation models are aligned through a stack:
- Pre-training — exposure to broad text instills general world-modeling, including some implicit norms
- Supervised fine-tuning — train on (prompt, ideal-response) pairs that demonstrate desired behavior
- RLHF (Reinforcement Learning from Human Feedback) — humans rank model outputs, a reward model is trained to predict those rankings, and the model is updated to produce highly-scored outputs more often (see the reward-model sketch after this list)
- Constitutional AI (Anthropic-pioneered) — the model self-critiques its outputs against a written constitution of principles, then trains on the corrected outputs
- Red-team training — adversarial inputs are collected (often by humans or other models), the model is updated to refuse them
- DPO (Direct Preference Optimization) and variants — newer methods that simplify RLHF by optimizing directly on preference pairs, with no separate reward model (sketched after the next paragraph)
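The reward-model step at the heart of RLHF fits in a few lines. Everything in this sketch is a toy: random embeddings stand in for a language model's representations of chosen and rejected responses, and the reward model is a single linear layer. The Bradley-Terry pairwise loss, though, is the standard objective.

```python
# Minimal reward-model training on preference pairs (the RLHF reward step).
# Toy data: random vectors stand in for response embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # maps a response embedding to a scalar reward

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Hypothetical preference data: embeddings of (chosen, rejected) response pairs,
# offset so the two classes are separable in this toy setup.
chosen = torch.randn(32, 16) + 0.5
rejected = torch.randn(32, 16)

for _ in range(200):
    # Bradley-Terry pairwise loss: push r(chosen) above r(rejected).
    loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final pairwise loss: {loss.item():.3f}")
```

In a full pipeline the trained reward model then scores policy outputs during a reinforcement learning phase (classically PPO), which is what actually shifts the model's behavior.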
Each stage builds on the last: pre-training supplies raw capability, and the alignment stack shapes how that capability gets used.
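DPO drops the separate reward model entirely. The sketch below shows only the loss, applied to precomputed log-probabilities; the tensor values are invented, and in a real run they would come from the policy and a frozen reference model scoring each (chosen, rejected) pair.

```python
# The DPO objective on precomputed log-probabilities (toy values, not real scores).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Increase the policy's preference margin relative to the reference
    model, with no separately trained reward model."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# A toy batch of four preference pairs.
loss = dpo_loss(
    policy_chosen_logp=torch.tensor([-12.0, -10.5, -11.2, -9.8]),
    policy_rejected_logp=torch.tensor([-11.0, -11.0, -10.9, -10.2]),
    ref_chosen_logp=torch.tensor([-11.5, -10.8, -11.0, -10.0]),
    ref_rejected_logp=torch.tensor([-11.2, -10.7, -11.1, -10.1]),
)
print(f"DPO loss on toy batch: {loss.item():.3f}")
```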
Why alignment is also a security topic
Alignment failures are the supply side of the AI security industry. Almost every attack class — prompt injection, jailbreaking, model hijacking, refusal-enablement gaps — exploits a place where alignment is incomplete:
- Refusal training is incomplete. Even well-aligned models still produce harmful content under sufficient adversarial pressure, as every documented jailbreak demonstrates
- Refusal is not action-prevention. Repello's GPT-5.2 research documented the "refusal-enablement gap": the model verbally refuses to do harm, yet in the same response provides the exact instructions the user needs to do the harm manually
- Distilled models inherit capabilities but not the full safety training — a model trained to mimic a frontier model is typically less aligned than the original
- Alignment is point-in-time. Today's aligned model resists today's attacks; tomorrow's attacks need tomorrow's alignment work (a regression-suite sketch follows)
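One operational consequence of that last point: known attacks should become a permanent regression suite, re-run on every model upgrade or prompt change. The sketch below is a hypothetical harness; `query_model` and `known_jailbreaks.json` are placeholders rather than a real API, and keyword-based refusal detection is deliberately crude.

```python
# Hypothetical jailbreak regression harness: re-test known attacks on every change.
import json

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def query_model(prompt: str) -> str:
    """Placeholder: call your deployed model here."""
    raise NotImplementedError

def run_redteam_suite(path: str = "known_jailbreaks.json") -> list[str]:
    """Return the previously-blocked prompts that no longer trigger a refusal."""
    with open(path) as f:
        prompts = json.load(f)  # expected: a JSON list of attack prompts
    regressions = []
    for prompt in prompts:
        response = query_model(prompt).lower()
        if not any(marker in response for marker in REFUSAL_MARKERS):
            regressions.append(prompt)  # attack succeeds again: a regression
    return regressions
```

In practice the refusal check would be a trained classifier rather than keyword matching, but the workflow (attacks in, pass/fail out, wired into CI) is the point.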
What "aligned" doesn't guarantee
A model can be perfectly aligned in research-grade evaluations and still be exploitable in your specific deployment. Application-layer guardrails, system-prompt design, output validation, and continuous adversarial testing are the operator's responsibility — alignment from the model provider is necessary but never sufficient.
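As one concrete example of output validation, the heuristic below targets the refusal-enablement gap described above: it flags responses that contain both a verbal refusal and step-by-step instructions in the same message. The regexes are illustrative assumptions; a production guardrail would use a trained classifier or a moderation API rather than pattern matching.

```python
# Crude output-validation guardrail for the refusal-enablement gap.
# Heuristic patterns only; real deployments should use a classifier.
import re

REFUSAL_RE = re.compile(r"\b(i can(?:no|')t|i won't|i am unable|i'm unable)\b", re.IGNORECASE)
STEPS_RE = re.compile(r"(^|\n)\s*(step\s*\d+|\d+\.)\s", re.IGNORECASE)

def has_refusal_enablement_gap(response: str) -> bool:
    """True when a response refuses verbally yet still includes
    numbered, actionable instructions."""
    return bool(REFUSAL_RE.search(response) and STEPS_RE.search(response))

def guard_output(response: str) -> str:
    """Replace gapped responses with a clean refusal before returning
    anything to the user."""
    if has_refusal_enablement_gap(response):
        return "I can't help with that request."
    return response
```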