What is transfer learning and when should you use full fine-tuning vs feature extraction?
The short answer
Transfer learning reuses weights pretrained on a large dataset as a starting point for a new task. Feature extraction freezes the backbone and trains only a new head; full fine-tuning updates all weights. The right choice depends on dataset size and how similar the new task is to the pretraining domain.
How to think about it
A pretrained network has already learned low-level features (edges, textures in vision; morphology and syntax in NLP). Transfer learning asks: can I reuse that knowledge instead of learning from scratch?
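Three regimes
The options differ in how many weights are trained: only a new head (feature extraction), every weight (full fine-tuning), or small low-rank adapter matrices added alongside frozen pretrained weights (LoRA).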
Decision guide
| Scenario | Best approach |
|---|---|
| Small dataset, similar domain | Feature extraction: few weights to train, low risk of overfitting |
| Large dataset, different domain | Full fine-tuning: the backbone needs to adapt to the new domain |
| LLM or other large model, dataset of any size | LoRA: fine-tuning at a fraction of the compute and memory of updating every weight |
| Tiny dataset, very different domain | Collect more data; fine-tuning a big model here is likely to overfit |
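
To make the "fraction of the compute and memory" claim in the LoRA row concrete, here is a rough count of trainable parameters for a single weight matrix. It is a sketch under assumptions not stated in the table: a 768 x 768 projection matrix (768 is the hidden size of bert-base) and a LoRA rank of 8. The function names are illustrative, not part of any library.

```python
def full_trainable_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the d_out x d_in weight W itself is updated."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters when W stays frozen and LoRA learns the low-rank
    update B @ A, with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


# Assumed sizes for illustration only: one 768 x 768 projection, rank 8.
full = full_trainable_params(768, 768)
lora = lora_trainable_params(768, 768, rank=8)
print(full, lora, f"{lora / full:.2%}")  # 589824 12288 2.08%
```

The count covers one matrix and ignores biases. Optimizers such as Adam keep extra state only for trainable parameters, so the memory saving extends beyond the weights themselves.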
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Feature extraction: freeze everything except the classifier head
for name, param in model.named_parameters():
    if "classifier" not in name:
        param.requires_grad = False
```
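
A quick way to check that the freeze did what you intended is to count trainable parameters afterwards. The helper below is a minimal sketch; `count_parameters` is a hypothetical name, and it relies only on the `parameters()`, `numel()`, and `requires_grad` attributes that PyTorch modules and tensors provide.

```python
def count_parameters(model):
    """Return (trainable, total) parameter counts for a PyTorch-style model."""
    trainable, total = 0, 0
    for p in model.parameters():
        n = p.numel()           # number of scalar values in this tensor
        total += n
        if p.requires_grad:     # frozen parameters have requires_grad == False
            trainable += n
    return trainable, total
```

Calling `count_parameters(model)` on the frozen model should report only the classifier head's weight and bias as trainable, a tiny share of the total.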
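For full fine-tuning, use a lower learning rate for the backbone (e.g. 1e-5) than for the head (e.g. 1e-4). This technique, known as discriminative learning rates, reduces the risk of catastrophic forgetting: small backbone updates keep the pretrained features largely intact while the randomly initialized head catches up.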
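
One way to set this up is with optimizer parameter groups. The sketch below splits parameters by name and assigns each group its own learning rate; `build_param_groups` is a hypothetical helper, and the `"classifier"` keyword and the two rates are simply the values used in this answer. The returned list follows the per-group format that `torch.optim` optimizers accept.

```python
def build_param_groups(named_params, head_keyword="classifier",
                       backbone_lr=1e-5, head_lr=1e-4):
    """Split (name, parameter) pairs into backbone and head optimizer groups.

    Returns [{"params": [...], "lr": ...}, ...], the per-group format
    accepted by torch.optim optimizers.
    """
    backbone, head = [], []
    for name, param in named_params:
        # Assumption: parameters whose name contains head_keyword form the head.
        if head_keyword in name:
            head.append(param)
        else:
            backbone.append(param)
    return [
        {"params": backbone, "lr": backbone_lr},  # small steps near pretrained weights
        {"params": head, "lr": head_lr},          # larger steps for the new head
    ]
```

With PyTorch installed, this could be used as `optimizer = torch.optim.AdamW(build_param_groups(model.named_parameters()))`. If the backbone was frozen earlier, set `requires_grad = True` on its parameters again before training, otherwise only the head will update.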