NLP & LLMs | Hard | Asked at OpenAI, Anthropic, Google, Meta

How does RLHF work and what problem does it solve?

For: AI / LLM Engineer, ML Engineer, Data Scientist

The short answer

RLHF (Reinforcement Learning from Human Feedback) aligns a language model's outputs with human preferences in two steps: it trains a reward model on human rankings of candidate responses, then uses that reward signal to fine-tune the policy with reinforcement learning. It closes the gap between a model that is good at next-token prediction and a model that is genuinely helpful, harmless, and honest.

How to think about it

A pretrained or instruction-tuned model is trained to minimise cross-entropy loss on text, so nothing in its objective directly rewards being helpful or safe. RLHF provides that incentive by encoding human judgements as a scalar reward signal and optimising the model against it.

The three-stage pipeline

Stage 1: Supervised fine-tuning (SFT) baseline

Human labellers write high-quality demonstration responses to sampled prompts. The model is fine-tuned on these (prompt, response) pairs. This gives a better starting policy than the raw pretrained model.
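As a rough illustration of the SFT objective, the sketch below computes the mean negative log-likelihood over the response tokens of one (prompt, response) pair. Masking out the prompt tokens is a common convention assumed here, not something the text above specifies, and the per-token log-probabilities are made-up numbers standing in for real model outputs.

```python
import math

def sft_loss(token_logprobs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log p(token_t | earlier tokens) for prompt + response.
    is_response_token: 1 for response tokens, 0 for prompt tokens.
    Masking the prompt is an assumption of this sketch.
    """
    total = 0.0
    count = 0
    for logprob, keep in zip(token_logprobs, is_response_token):
        if keep:
            total -= logprob
            count += 1
    return total / count if count else 0.0

# Hypothetical values: 2 prompt tokens followed by 3 response tokens.
logprobs = [math.log(0.9), math.log(0.8),
            math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [0, 0, 1, 1, 1]
loss = sft_loss(logprobs, mask)
```

Only the three response tokens contribute to the loss, so the model is pushed to imitate the labeller's answer and is not penalised for how it models the prompt.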

Stage 2: Reward model training

For each prompt, the SFT model generates several candidate responses. Human raters rank them (e.g., A > C > B), and each ranking is broken down into pairwise preferences (here A > C, A > B, and C > B). These pairs are used to train a reward model (RM): a transformer that predicts a scalar score for any (prompt, response) pair. The RM is trained with a Bradley-Terry objective: the probability that response A is preferred over B is modelled as sigmoid(RM(A) - RM(B)), and training maximises the likelihood of the observed preferences.
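The Bradley-Terry objective can be sketched in a few lines. The helper names (`ranking_to_pairs`, `bradley_terry_loss`) and the scores are hypothetical; a real reward model would produce the scores from the prompt and response text, and the loss would be minimised by gradient descent over its weights.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranking):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranking, 2))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bradley_terry_loss(scores, pairs):
    """Mean negative log-likelihood of the observed preferences,
    with P(w preferred over l) = sigmoid(score[w] - score[l])."""
    losses = [-math.log(sigmoid(scores[w] - scores[l])) for w, l in pairs]
    return sum(losses) / len(losses)

pairs = ranking_to_pairs(["A", "C", "B"])  # the ranking A > C > B

# Hypothetical reward-model scores for the three responses.
agrees_with_raters = {"A": 2.0, "C": 1.0, "B": 0.0}
disagrees_with_raters = {"A": 0.0, "C": 1.0, "B": 2.0}

loss_good = bradley_terry_loss(agrees_with_raters, pairs)
loss_bad = bradley_terry_loss(disagrees_with_raters, pairs)
```

Scores that order the responses the way the raters did give a lower loss than scores that reverse the order. Only score differences matter, so the absolute scale of the RM output is arbitrary.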

Stage 3: RL optimisation (PPO)

The SFT policy is treated as an RL agent. For each prompt it generates a response, and the RM scores that response. PPO (Proximal Policy Optimisation) updates the policy weights to maximise expected reward. A KL-divergence penalty term keeps the updated policy from straying too far from the SFT baseline, which limits reward hacking and catastrophic forgetting.

Objective = E[RM(response)] - β · KL(policy || SFT_policy)
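One way to read this objective in practice: for a response sampled from the current policy, the sum of per-token log-probability differences between the policy and the SFT model is a single-sample estimate of the KL term. The sketch below applies that estimate to one response. The function name and all numbers are hypothetical, and many implementations apply the penalty per token instead of once per sequence.

```python
def kl_penalised_reward(rm_score, policy_logprobs, sft_logprobs, beta):
    """RM score minus beta times a single-sample KL estimate.

    policy_logprobs / sft_logprobs: log-probability that each model
    assigns to every token of the sampled response.
    """
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl_estimate

# Hypothetical values for a three-token response.
shaped = kl_penalised_reward(
    rm_score=1.5,
    policy_logprobs=[-0.1, -0.2, -0.3],
    sft_logprobs=[-0.5, -0.7, -0.4],
    beta=0.1,
)
```

Here the policy assigns the response more probability than the SFT model does, so the KL estimate is positive and the shaped reward ends up below the raw RM score.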

Why KL regularisation matters

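Without it, the policy quickly learns to game the reward model, producing responses that score high but are nonsensical or unsafe (reward hacking). The RM is only reliable on responses similar to those it was trained on, so the KL penalty acts as a soft trust region anchored at the SFT policy: the further the policy drifts from it, the larger the penalty.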
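The effect of β can be seen on a toy problem. For a fixed prompt with a small set of possible responses, the policy that maximises the KL-regularised objective has the known closed form pi*(y) ∝ SFT_policy(y) · exp(RM(y) / β). The sketch below evaluates it for a strong and a weak penalty. The three responses, their SFT probabilities, and their rewards are invented for illustration, with the third response standing in for one the RM overrates.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """pi*(y) proportional to ref(y) * exp(reward(y) / beta)."""
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy setup: the SFT policy rarely produces response 3,
# but the reward model scores it far above the others.
ref_probs = [0.6, 0.3, 0.1]
rewards = [1.0, 1.2, 3.0]

strong_kl = optimal_policy(ref_probs, rewards, beta=10.0)
weak_kl = optimal_policy(ref_probs, rewards, beta=0.1)
```

With the strong penalty the policy stays close to the SFT distribution. With the weak penalty nearly all probability mass moves to the overrated response, which is the reward-hacking failure mode in miniature.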

Modern alternatives

DPO (Direct Preference Optimisation) starts from the closed-form solution of the KL-regularised RLHF objective and shows that the policy can be trained directly on preference pairs with a classification-style loss: no separate RM, no PPO loop. It still needs a frozen reference policy, typically the SFT model. It is simpler to implement, often more stable to train, and widely adopted in 2024-2026 pipelines.
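A minimal sketch of the DPO loss for a single preference pair follows. The inputs are sequence-level log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model. The function name, the default β, and the numbers are illustrative assumptions.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where the margin is how much more the
    policy favours the chosen response over the rejected one, relative
    to the reference model. All inputs are sequence log-probabilities."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))

# Hypothetical sequence log-probabilities.
at_start = dpo_loss(-12.0, -11.0, -12.0, -11.0)     # policy == reference
after_update = dpo_loss(-10.0, -14.0, -12.0, -11.0) # favours chosen more
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response's probability relative to the rejected one, which matches the Bradley-Terry form with β times the log-probability ratio playing the role of the reward.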