How does RLHF work and what problem does it solve?
RLHF (Reinforcement Learning from Human Feedback) aligns a language model's outputs to human preferences by training a reward model on ranked human comparisons, then using that reward signal to fine-tune the policy with reinforcement learning. It solves the gap between a model that is good at next-token prediction and a model that is genuinely helpful, harmless, and honest.
How to think about it
A pretrained or instruction-tuned model minimises cross-entropy loss over text — it has no direct incentive to be helpful or safe. RLHF provides that incentive by encoding human judgements as a scalar reward signal and optimising the model against it.
The three-stage pipeline
Stage 1: Supervised fine-tuning (SFT) baseline
Human labellers write high-quality demonstration responses to sampled prompts. The model is fine-tuned on these (prompt, response) pairs. This gives a better starting policy than the raw pretrained model.
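As a rough illustration of the Stage 1 training signal, the sketch below computes a mean negative log-likelihood over the tokens of a demonstration response. It assumes the loss is averaged over response tokens only, with prompt tokens masked out; that is a common choice but is not stated above. The function name `sft_loss` and the per-token log-probabilities are hypothetical.

```python
import math


def sft_loss(token_log_probs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log p(token_t | earlier tokens) for each position.
    is_response_token: True where the token belongs to the demonstration
    response, False where it belongs to the prompt (assumed to be masked).
    """
    picked = [-lp for lp, keep in zip(token_log_probs, is_response_token) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return sum(picked) / len(picked)


# Made-up log-probabilities for a 5-token sequence:
# two prompt tokens followed by three response tokens.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True, True]
loss = sft_loss(log_probs, mask)
```

In a real pipeline the log-probabilities would come from the model's forward pass, and the loss would be minimised by gradient descent.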
Stage 2: Reward model training
For each prompt, the SFT model generates several candidate responses. Human raters rank them (e.g., A > C > B), and each ranking is broken down into pairwise preferences (A > C, A > B, C > B). These pairs are used to train a reward model (RM): a transformer that predicts a scalar score for any (prompt, response) pair. The RM is trained with a Bradley-Terry objective: the probability that response A is preferred over B is modelled as sigmoid(RM(A) - RM(B)), and training minimises the negative log-likelihood of the observed preferences.
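The Bradley-Terry objective can be sketched in a few lines. The example below expands a ranking into (preferred, rejected) pairs and computes the mean negative log-likelihood of those preferences under sigmoid(RM(A) - RM(B)). The helper names and the scores are hypothetical; a real RM would produce the scores from (prompt, response) inputs.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranking):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranking, 2))


def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def bradley_terry_loss(scores, pairs):
    """Mean negative log-likelihood of the observed preferences."""
    losses = [neg_log_sigmoid(scores[win] - scores[lose]) for win, lose in pairs]
    return sum(losses) / len(losses)


pairs = ranking_to_pairs(["A", "C", "B"])  # human ranking A > C > B

# Illustrative RM scores: one set agrees with the ranking, one reverses it.
scores_agree = {"A": 2.0, "C": 0.5, "B": -1.0}
scores_reversed = {"A": -1.0, "C": 0.5, "B": 2.0}

loss_agree = bradley_terry_loss(scores_agree, pairs)
loss_reversed = bradley_terry_loss(scores_reversed, pairs)
```

Scores that agree with the human ranking give a lower loss than scores that reverse it, which is the pressure that shapes the RM during training.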
Stage 3: RL optimisation (PPO)
The SFT policy is treated as an RL agent. For each prompt it generates a response, and the RM scores that response. PPO (Proximal Policy Optimisation) updates the policy weights to maximise expected reward. A KL-divergence penalty term keeps the updated policy from straying too far from the SFT baseline, which limits reward hacking and catastrophic forgetting.
Objective = E[RM(response)] - β · KL(policy || SFT_policy)
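To make the objective concrete, the sketch below estimates it from a batch of sampled responses. It assumes a sequence-level Monte Carlo estimate of the KL term, namely the mean of log policy(response) minus log SFT_policy(response) over responses drawn from the current policy; many implementations apply the penalty per token instead. The function name and all numbers are made up for illustration.

```python
def kl_penalised_objective(rewards, policy_logps, ref_logps, beta):
    """Estimate E[RM(response)] - beta * KL(policy || SFT_policy).

    Each list entry corresponds to one response sampled from the current
    policy. policy_logps and ref_logps are the log-probabilities of that
    whole response under the policy and the frozen SFT model.
    """
    n = len(rewards)
    mean_reward = sum(rewards) / n
    # Sample-based KL estimate: average log-ratio under policy samples.
    kl_estimate = sum(p - r for p, r in zip(policy_logps, ref_logps)) / n
    return mean_reward - beta * kl_estimate, mean_reward, kl_estimate


# Illustrative batch of three sampled responses.
rewards = [1.0, 0.5, 2.0]
policy_logps = [-10.0, -12.0, -8.0]
ref_logps = [-11.0, -12.5, -10.0]

objective, mean_reward, kl_estimate = kl_penalised_objective(
    rewards, policy_logps, ref_logps, beta=0.1
)
```

PPO does not optimise this number directly; it uses the penalised reward as the return signal inside its clipped policy-gradient update.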
Why KL regularisation matters
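Without it, the policy can quickly learn to game the reward model, producing responses that score high but are nonsensical or unsafe (reward hacking). The RM is only reliable on responses similar to those it was trained on, so the KL penalty acts as a soft trust region that keeps the policy close to the SFT distribution, where RM scores are still meaningful.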
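The effect of β can be seen on a toy problem with only three possible responses. For a finite response set, the policy that maximises the KL-regularised objective has the known closed form policy(y) proportional to SFT_policy(y) * exp(RM(y) / β). The sketch below evaluates that formula for a weak and a strong penalty; the response labels, probabilities and RM scores are invented, with "exploit" standing in for a response the RM mistakenly scores highly.

```python
import math


def kl_regularised_optimum(ref_probs, rewards, beta):
    """Closed-form maximiser of E[reward] - beta * KL(policy || reference)
    over a finite set of responses."""
    logits = {y: math.log(ref_probs[y]) + rewards[y] / beta for y in ref_probs}
    top = max(logits.values())  # subtract the max for numerical stability
    weights = {y: math.exp(v - top) for y, v in logits.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


# Hypothetical SFT probabilities and RM scores for one prompt.
ref_probs = {"sensible": 0.70, "okay": 0.29, "exploit": 0.01}
rm_scores = {"sensible": 1.0, "okay": 0.5, "exploit": 3.0}

weak_penalty = kl_regularised_optimum(ref_probs, rm_scores, beta=0.1)
strong_penalty = kl_regularised_optimum(ref_probs, rm_scores, beta=5.0)
```

With the weak penalty almost all probability moves to the "exploit" response, while the strong penalty keeps the policy close to the SFT distribution.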
Modern alternatives
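DPO (Direct Preference Optimisation) recasts the RLHF objective analytically. The optimal policy under KL-regularised RL has a closed form in terms of the reward, so the reward can be rewritten in terms of the policy and the policy can be trained directly on preference pairs with a classification-style loss. There is no separate RM and no PPO loop, although a frozen copy of the SFT model is still needed as the reference. DPO is typically simpler to implement and more stable to train than PPO, and it is widely adopted in 2024-2026 pipelines.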
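A minimal sketch of the DPO loss for a single preference pair follows. It assumes sequence-level log-probabilities of the chosen and rejected responses under the trainable policy and under a frozen reference (SFT) model. The function name, argument names and numbers are illustrative, not taken from any particular library.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Before any training the policy equals the reference, so the margin is zero.
loss_at_start = dpo_loss(-13.0, -14.0, -13.0, -14.0)

# After the policy has shifted probability towards the chosen response.
loss_after_shift = dpo_loss(-12.0, -15.0, -13.0, -14.0)
```

The loss starts at log 2 (about 0.693) when the policy matches the reference and falls as the policy raises the chosen response's likelihood relative to the rejected one. Here β plays the same role as the KL coefficient in the PPO objective.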