datarekha
NLP & LLMs · Hard · Asked at OpenAI, Anthropic, Google, Meta

How does RLHF work and what problem does it solve?

The short answer

RLHF (Reinforcement Learning from Human Feedback) aligns a language model's outputs to human preferences by training a reward model on ranked human comparisons, then using that reward signal to fine-tune the policy with reinforcement learning. It solves the gap between a model that is good at next-token prediction and a model that is genuinely helpful, harmless, and honest.

How to think about it

A pretrained model minimises cross-entropy loss over text, and an instruction-tuned model does the same over demonstration data. Neither objective directly rewards being helpful or safe. RLHF provides that incentive by encoding human judgements as a scalar reward signal and optimising the model against it.

The three-stage pipeline

Stage 1: Supervised fine-tuning (SFT) baseline

Human labellers write high-quality demonstration responses to sampled prompts. The model is fine-tuned on these (prompt, response) pairs. This gives a better starting policy than the raw pretrained model.

Stage 2: Reward model training

For each prompt, the SFT model generates several candidate responses. Human raters rank them (e.g., A > C > B), and each ranking is broken down into pairwise preferences (A > C, A > B, C > B). These pairs are used to train a reward model (RM): a transformer that predicts a scalar score for any (prompt, response) pair. The RM is trained with a Bradley-Terry objective: the probability that response A is preferred over B is modelled as sigmoid(RM(A) - RM(B)), and training minimises the negative log-likelihood of the observed preferences.
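The Bradley-Terry objective can be sketched in a few lines. This is a minimal illustration, not a training loop: the scores are made-up stand-ins for RM outputs, and the helper names (`ranking_to_pairs`, `bradley_terry_loss`) are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def ranking_to_pairs(ranking):
    """Expand a ranking (best first) into (preferred, rejected) pairs."""
    return [(ranking[i], ranking[j])
            for i in range(len(ranking))
            for j in range(i + 1, len(ranking))]

def bradley_terry_loss(scores, pairs):
    """Mean negative log-likelihood of the observed preferences."""
    total = 0.0
    for winner, loser in pairs:
        # P(winner preferred) = sigmoid(RM(winner) - RM(loser))
        total += -math.log(sigmoid(scores[winner] - scores[loser]))
    return total / len(pairs)

# Hypothetical RM scores for three responses to one prompt (assumed values).
scores = {"A": 2.0, "B": -1.0, "C": 0.5}
pairs = ranking_to_pairs(["A", "C", "B"])   # the ranking A > C > B
loss = bradley_terry_loss(scores, pairs)
```

The loss shrinks as the RM scores agree more strongly with the human ranking. In a real pipeline the scores would come from the reward model and the loss would be minimised by gradient descent on its weights.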

Stage 3: RL optimisation (PPO)

The SFT policy is treated as an RL agent. For each prompt it generates a response, and the RM scores that response. PPO (Proximal Policy Optimisation) updates the policy weights to maximise expected reward. A KL-divergence penalty term keeps the updated policy from straying too far from the SFT baseline, which limits reward hacking and catastrophic forgetting.

Objective = E[RM(response)] - β · KL(policy || SFT_policy)
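As a rough illustration of how this objective can be estimated from a batch, the sketch below averages over responses sampled from the current policy. The function name and all numbers are assumptions for illustration; the log-probabilities stand for the summed token log-probabilities of each sampled response under the policy and under the frozen SFT model.

```python
def kl_regularised_objective(rewards, logp_policy, logp_sft, beta):
    """Monte Carlo estimate of E[RM(response)] - beta * KL(policy || SFT_policy).

    Each index is one response sampled from the current policy.
    """
    n = len(rewards)
    mean_reward = sum(rewards) / n
    # For samples drawn from the policy, the mean of
    # log policy(y) - log SFT(y) is an estimate of KL(policy || SFT).
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_sft)) / n
    return mean_reward - beta * kl_estimate, mean_reward, kl_estimate

# Assumed values for three sampled responses.
rewards = [1.2, 0.4, 0.9]
logp_policy = [-12.0, -15.5, -11.0]
logp_sft = [-13.0, -15.0, -12.5]

objective, mean_reward, kl_estimate = kl_regularised_objective(
    rewards, logp_policy, logp_sft, beta=0.1
)
```

Raising `beta` makes the same drift from the SFT model cost more, so the policy is pushed to earn its reward while staying close to the baseline.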

Why KL regularisation matters

Without it, the policy quickly learns to game the reward model, producing responses that score high but are nonsensical or unsafe (reward hacking). The KL penalty acts as a soft trust region around the SFT policy, with β controlling how tightly the policy is held to it.
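A toy example makes the effect of β concrete. Over a finite set of responses, the policy that maximises E[reward] - β · KL(policy || SFT_policy) is proportional to SFT_policy(y) · exp(reward(y) / β). The sketch below applies that formula to three invented responses, one of which the reward model overscores by mistake; the probabilities, rewards, and function name are all assumptions.

```python
import math

def optimal_policy(sft_probs, rewards, beta):
    """Maximiser of E[reward] - beta * KL(policy || SFT) over a finite response set."""
    weights = {y: sft_probs[y] * math.exp(rewards[y] / beta) for y in sft_probs}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

# Assumed SFT probabilities and RM scores. The RM has a flaw:
# it gives its highest score to a response the SFT model almost never produces.
sft_probs = {"helpful": 0.60, "terse": 0.39, "gibberish": 0.01}
rewards = {"helpful": 1.0, "terse": 0.2, "gibberish": 3.0}

weak_penalty = optimal_policy(sft_probs, rewards, beta=0.1)
strong_penalty = optimal_policy(sft_probs, rewards, beta=5.0)
```

With the weak penalty, almost all probability mass moves onto the overscored response. With the strong penalty, the policy stays close to the SFT distribution and the flawed score has little effect.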

Modern alternatives

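DPO (Direct Preference Optimisation) starts from the same KL-regularised objective and uses the fact that its optimal policy has a closed form: the reward can be written as β times the log-ratio of the policy to the reference (SFT) policy, up to a term that depends only on the prompt. Substituting that expression into the Bradley-Terry loss gives a classification-style loss on preference pairs that trains the policy directly, with no separate RM and no PPO loop. It is simpler, often more stable to train, and widely adopted in 2024-2026 pipelines.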
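The DPO loss for a single preference pair can be sketched as follows. The function name, default `beta`, and log-probability values are assumptions for illustration; in practice each log-probability is the summed token log-probability of a response under the trainable policy or the frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    The implicit reward of a response is beta * log(policy / reference);
    the loss is the Bradley-Terry negative log-likelihood of the reward margin.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the margin is zero and the loss is log(2).
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy that has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss falls only when the policy increases the chosen response's likelihood relative to the reference more than it does the rejected response's, which is how the KL anchor to the reference model carries over from the RLHF objective.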
