What is RLHF?
Reinforcement Learning from Human Feedback (RLHF) is the technique used to fine-tune large language models on human preferences: humans rank model outputs, those rankings train a reward model, and reinforcement learning updates the language model to produce highly ranked outputs more often. It has been the standard alignment method behind ChatGPT, Claude, Gemini, and most production assistants since 2022.
How RLHF works
Three stages (each sketched in code below):
- Supervised fine-tuning (SFT). Start with a pre-trained base model. Fine-tune on (prompt, demonstration) pairs that show the desired response style. This produces a model that's already in the rough neighborhood of helpful behavior.
- Reward model training. Humans see pairs of model responses to the same prompt and pick which one is better. This produces a dataset of (prompt, chosen, rejected) triples. A reward model, typically a copy of the LLM with a value head, is trained to predict which of two responses humans would prefer.
- PPO fine-tuning. The SFT model is fine-tuned with reinforcement learning, using the reward model as the reward signal and PPO (Proximal Policy Optimization) as the optimization algorithm. The model is updated to produce outputs that maximize the reward model's score, with a KL-divergence penalty to prevent drifting too far from the SFT baseline.
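A minimal sketch of the stage-1 SFT objective, assuming a causal LM that returns per-token logits; the function and tensor names here are illustrative, not any specific library's API. The key detail is that only the demonstration tokens contribute to the loss; the prompt tokens are masked out.

```python
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_lens):
    """Next-token cross-entropy over the demonstration tokens only."""
    # Shift so the logits at position t predict the token at position t+1.
    shift_logits = logits[:, :-1, :]
    targets = input_ids[:, 1:].clone()
    # Mask out prompt tokens: only the demonstration is supervised.
    for i, plen in enumerate(prompt_lens):
        targets[i, : max(plen - 1, 0)] = -100   # -100 = PyTorch's ignore_index
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```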
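For stage 2, the usual choice is a Bradley-Terry pairwise loss over the (prompt, chosen, rejected) triples. A minimal sketch, assuming `r_chosen` and `r_rejected` are the scalar value-head scores for the two responses (names illustrative):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```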
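For stage 3, the essential pieces are the KL-penalized reward and PPO's clipped policy update. A rough sketch under the same illustrative naming; real implementations (e.g. TRL-style trainers) add value-function training, advantage estimation, and batching details omitted here.

```python
import torch

def shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    # Reward-model score minus a per-sequence KL penalty that keeps the
    # policy close to the frozen SFT reference model.
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)   # summed over response tokens
    return rm_score - beta * kl

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # PPO's clipped surrogate objective: limit how far a single update can
    # move the policy from the behavior that generated the samples.
    ratio = torch.exp(new_logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.minimum(ratio * advantages, clipped * advantages).mean()
```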
The output is a model that produces responses humans tend to prefer — more helpful, less likely to refuse benign requests, more likely to refuse harmful ones, more conversational.
Why RLHF matters
RLHF was the breakthrough that turned raw GPT-3-style completion models into ChatGPT-style assistants. The base model had the capability; RLHF gave it the disposition.
But it's not a silver bullet:
- The reward model is itself a learned approximation. It encodes the preferences of the human raters, biases and inconsistencies included, and the policy can exploit it by finding cheap ways to score high without genuinely improving.
- Reward hacking. RLHF-trained models are known to develop sycophancy (agreeing with users), confabulation (making up plausible facts), and verbose-but-empty responses — all because those traits scored high during training.
- Refusal training is brittle. RLHF teaches the model to refuse harmful requests, but the refusal is statistical rather than deterministic; adversarial inputs can systematically bypass it, as every documented jailbreak demonstrates.
RLHF alternatives gaining ground
- DPO (Direct Preference Optimization) — bypasses the explicit reward model and trains directly on preference pairs (a loss sketch follows this list). Simpler, often competitive in quality.
- Constitutional AI / RLAIF — replaces human raters with an AI critic operating from a written constitution. Cheaper, more consistent.
- KTO (Kahneman-Tversky Optimization) — even simpler than DPO; uses unpaired feedback (just thumbs-up/thumbs-down).
- GRPO (Group Relative Policy Optimization) — used in DeepSeek's R1 models; drops PPO's separate value model and instead normalizes rewards within a group of completions sampled for the same prompt.
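A minimal sketch of the DPO objective mentioned above, assuming the inputs are summed log-probabilities of each full response under the policy and a frozen reference model (names illustrative); `beta` plays the role of the KL coefficient.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # Increase the policy's margin on the chosen response over the rejected one,
    # measured relative to the reference model, with no explicit reward model.
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```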
Security implications
RLHF determines the model's refusal posture, which is the first line of defense in production deployments. Every attack against an LLM application is either:
- Bypassing RLHF-induced refusal (jailbreaks, encoding tricks, persona attacks)
- Acting at a layer RLHF didn't shape (tool calls, retrieved content, multi-turn drift)
Models with stronger RLHF (Claude, GPT-5) have higher refusal rates on adversarial inputs but still produce non-zero breach rates under sustained pressure. RLHF raises the cost of attack; it doesn't eliminate it.