What is the difference between pretraining, fine-tuning, instruction-tuning, and RLHF?
Pretraining teaches a model general language structure by predicting the next token across a massive corpus; fine-tuning adapts the pretrained weights to a narrower task or domain using supervised data; instruction-tuning is supervised fine-tuning specifically on (instruction, response) pairs so the model follows directives; RLHF further aligns the model to human preferences by training a reward model on ranked responses and using it as the reward signal for policy optimisation with PPO or a similar algorithm.
How to think about it
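These four terms describe a rough pipeline, although instruction-tuning is a special case of fine-tuning rather than a fully separate stage. Each step builds on the previous one: RLHF cannot be applied to a model that has not first learned language.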
Pretraining
The model is trained on a large, diverse corpus (web text, books, code) using next-token prediction as the sole objective. No human labels. No explicit task. The model learns grammar, factual associations, reasoning patterns, and latent world structure purely from the statistics of the text. This stage is compute-intensive (training GPT-4-scale models is reported to cost tens of millions of dollars or more), but it is typically run once per base model and the weights are reused downstream.
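To make the next-token objective concrete, here is a minimal sketch in plain Python. It assumes the model has already produced one vector of logits per position; the function names and the toy three-token vocabulary are illustrative only, not taken from any particular library.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (max subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t."""
    targets = token_ids[1:]  # the first token has no preceding context to predict it
    assert len(logits_per_position) == len(targets)
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy sequence over a vocabulary of size 3 (hypothetical ids).
token_ids = [0, 2, 1]
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # model knows nothing
confident = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]  # model favours the true next token

loss_uniform = next_token_loss(uniform, token_ids)
loss_confident = next_token_loss(confident, token_ids)
print(loss_uniform, loss_confident)
```

With uniform logits the loss equals log(3), the cost of guessing among three tokens; pretraining pushes it down by shifting probability mass towards the token that actually comes next.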
Supervised fine-tuning (SFT)
The pretrained weights are updated using a curated dataset of (input, desired output) pairs for a specific task or domain. SFT requires far less compute than pretraining but demands high-quality labelled data. Domain fine-tuning (e.g., medical notes) shifts the model towards that domain's vocabulary and facts; task fine-tuning (e.g., classification) reshapes its output format.
Instruction-tuning
A specialised form of SFT where training examples are (instruction, response) pairs written to make the model follow natural-language directives. FLAN and the supervised stage of InstructGPT are canonical examples. The model learns to interpret the prompt as a command rather than as text to be continued. Without instruction-tuning, pretrained models tend to continue the prompt rather than answer it.
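As a rough illustration of how an (instruction, response) pair might be turned into a training example, the sketch below joins the two with a prompt template and builds a mask so that the loss is computed only on response tokens. The template, the whitespace tokenizer, and the function names are all hypothetical; masking the prompt is a common convention, not a universal one.

```python
def toy_tokenize(text):
    """Stand-in tokenizer (hypothetical): splits on whitespace."""
    return text.split()

def build_sft_example(instruction, response, tokenize):
    """Return (tokens, loss_mask) for one instruction-tuning example."""
    # Assumed template; real datasets each define their own.
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    prompt_tokens = tokenize(prompt)
    response_tokens = tokenize(response)
    tokens = prompt_tokens + response_tokens
    # 0 = ignore in the loss (prompt), 1 = train on this token (response).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

tokens, mask = build_sft_example(
    "Translate 'chat' to English.", "cat", toy_tokenize
)
for tok, m in zip(tokens, mask):
    print(m, tok)
```

The training objective is still next-token prediction; only the data format and the masked positions differ from pretraining.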
RLHF (Reinforcement Learning from Human Feedback)
RLHF adds a preference signal on top of SFT, in three steps:
- Collect comparisons: human annotators rank multiple model responses from best to worst.
- Train a reward model (RM): a separate model learns to predict which response a human would prefer.
- RL optimisation: the policy (the LLM) is updated via PPO (or similar) to maximise expected reward from the RM, subject to a KL penalty that prevents it from drifting too far from the SFT baseline.
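The last two steps can be sketched numerically. The block below assumes a Bradley-Terry style pairwise loss for the reward model (a common choice, though not stated above) and a sequence-level reward shaped by a KL penalty estimated from per-token log-probabilities. The function names and the `beta` value are illustrative assumptions, and the PPO update itself is omitted.

```python
import math

def pairwise_rm_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), in a stable softplus form."""
    diff = score_chosen - score_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def kl_penalised_reward(rm_score, logp_policy, logp_reference, beta=0.1):
    """RM score minus beta times a sampled KL estimate against the SFT model.

    logp_policy / logp_reference: per-token log-probabilities of the SAME
    sampled response under the current policy and the frozen SFT baseline.
    """
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_reference))
    return rm_score - beta * kl_estimate

# Reward model: loss is low when the preferred response scores higher.
print(pairwise_rm_loss(2.0, 0.0), pairwise_rm_loss(0.0, 2.0))

# Policy identical to the baseline: no penalty is applied.
print(kl_penalised_reward(1.0, [-1.0, -1.0], [-1.0, -1.0]))
# Policy has drifted (assigns higher probability than the baseline): reward shrinks.
print(kl_penalised_reward(1.0, [-0.5, -0.5], [-1.0, -1.0]))
```

The penalty makes reward hacking more costly: a response the reward model scores highly is worth less if the policy had to move far from the SFT baseline to produce it.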
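RLHF is expensive, since it needs human comparisons plus a reward model and an RL loop, and it is sensitive to reward-model quality. DPO (Direct Preference Optimisation) is a more recent alternative that skips the separate RM step and the RL loop, training the policy directly on preference pairs.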
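For comparison, here is a minimal sketch of the DPO objective for a single preference pair. It assumes you already have the summed log-probability of each response under the policy being trained and under a frozen reference model; the argument names and the default `beta` are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # policy vs reference, preferred response
    rejected_ratio = logp_rejected - ref_logp_rejected  # policy vs reference, dispreferred response
    margin = beta * (chosen_ratio - rejected_ratio)
    # Stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy equal to the reference: loss is log(2), no preference learned yet.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy raises the chosen response and lowers the rejected one: loss drops.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The reference model plays the role of the KL anchor, so no explicit reward model or sampling loop is needed.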