Glossary/RLHF (Reinforcement Learning from Human Feedback)

What is RLHF?

Reinforcement Learning from Human Feedback (RLHF) is the technique used to fine-tune large language models on human preferences — humans rank model outputs, those rankings train a reward model, and reinforcement learning updates the language model to produce highly ranked outputs more often. It has been the standard alignment method behind ChatGPT, Claude, Gemini, and most production assistants since 2022.

How RLHF works

Three stages:

  1. Supervised fine-tuning (SFT). Start with a pre-trained base model. Fine-tune on (prompt, demonstration) pairs that show the desired response style. This produces a model that's already in the rough neighborhood of helpful behavior.

  2. Reward model training. Humans see pairs of model responses to the same prompt and pick which one is better. This produces a dataset of (prompt, chosen, rejected) triples. A reward model — typically a copy of the LLM with a value head — is trained to predict which of two responses humans would prefer.

  3. PPO fine-tuning. The SFT model is fine-tuned with reinforcement learning, using the reward model as the reward signal and PPO (Proximal Policy Optimization) as the optimization algorithm. The model is updated to produce outputs that maximize the reward model's score, with a KL-divergence penalty to prevent drifting too far from the SFT baseline.
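The heart of stage 2 is a pairwise ranking loss. A minimal numeric sketch, with scalar scores standing in for the reward model's outputs (in a real pipeline these come from a forward pass over full responses):

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style loss used to train the reward model.

    -log sigmoid(r_chosen - r_rejected): minimized when the reward
    model scores the chosen response well above the rejected one.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# When the model can't distinguish the pair, the loss is log(2):
print(round(pairwise_loss(0.0, 0.0), 4))   # 0.6931
# A confident, correct ranking drives the loss toward zero:
print(round(pairwise_loss(4.0, 0.0), 4))   # 0.0181
```

The loss depends only on the score difference, which is why reward-model scores are meaningful as rankings rather than as absolute values.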

The output is a model that produces responses humans tend to prefer — more helpful, less likely to refuse benign requests, more likely to refuse harmful ones, more conversational.
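The KL penalty from stage 3 appears as a correction to the scalar reward PPO optimizes. A sketch with hypothetical numbers; beta is the penalty coefficient (the value here is illustrative, not canonical), and real implementations typically apply the penalty per token:

```python
def ppo_reward(rm_score: float, kl_divergence: float, beta: float = 0.1) -> float:
    # Reward optimized in stage 3: the reward model's score minus a
    # KL penalty that anchors the policy to the SFT baseline.
    return rm_score - beta * kl_divergence

# Same reward-model score, but the second response drifted far from
# the SFT model, so the KL term eats most of its reward:
print(round(ppo_reward(1.0, kl_divergence=0.5), 2))   # 0.95
print(round(ppo_reward(1.0, kl_divergence=8.0), 2))   # 0.2
```

Without this term, the policy can wander into degenerate outputs that score well with the reward model but no longer resemble fluent SFT-style text.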

Why RLHF matters

RLHF was the breakthrough that turned raw GPT-3-style completion models into ChatGPT-style assistants. The base model had the capability; RLHF gave it the disposition.

But it's not a silver bullet:

  1. Reward hacking. The policy learns to exploit flaws in the reward model, producing outputs that score well without actually being better.

  2. Sycophancy. Human raters tend to prefer agreeable, confident answers, so RLHF can train models to flatter and agree rather than correct the user.

  3. Cost and coverage. Preference labels are expensive to collect, and rater preferences on everyday prompts may not transfer to rare or adversarial inputs.

RLHF alternatives gaining ground

Direct Preference Optimization (DPO) trains directly on (prompt, chosen, rejected) triples with a classification-style loss, dropping the separate reward model and the PPO loop; it has become the default for open-weight fine-tuning. RLAIF (Reinforcement Learning from AI Feedback) keeps the RLHF pipeline but replaces human raters with an LLM judge. Constitutional AI, used by Anthropic, steers that AI feedback with an explicit set of written principles.
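One such alternative, Direct Preference Optimization (DPO), folds the reward model and the RL loop into a single loss over preference pairs. A minimal sketch on scalar log-probabilities (in practice these are summed token log-probs from the policy and a frozen reference model; beta is a tunable temperature, and the value here is an assumption):

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    # DPO treats beta * log(pi / pi_ref) as an implicit reward and
    # applies the same pairwise objective used to train reward models,
    # but directly on the policy's log-probs: no reward model, no PPO.
    chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-(chosen - rejected))))
```

The loss falls as the policy raises the probability of the chosen response relative to the reference model, which mirrors what the reward-model-plus-PPO pipeline achieves in two separate stages.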

Security implications

RLHF determines the model's refusal posture, which is the first line of defense in production deployments. Every attack against an LLM application is either:

  1. An attempt to break the refusal behavior RLHF trained in (jailbreaks, prompt injection, multi-turn escalation); or

  2. An attack on the surrounding application that routes around model alignment entirely (insecure tool use, data exfiltration through retrieved content).

Models with stronger RLHF (Claude, GPT-5) have higher refusal rates on adversarial inputs but still produce non-zero breach rates under sustained pressure. RLHF raises the cost of attack; it doesn't eliminate it.