How does LoRA work and why is it preferred over full fine-tuning for large models?
LoRA (Low-Rank Adaptation) freezes the original model weights and injects trainable low-rank decomposition matrices into selected layers, typically the attention projections. This cuts the number of trainable parameters by roughly 100x to 10,000x while matching or approaching full fine-tuning quality, which often makes fine-tuning practical on a single GPU.
How to think about it
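A 7B-parameter model has roughly 28 GB of weights in float32. Full fine-tuning also stores a gradient for every weight plus optimizer state (two more values per weight with Adam), which comes to roughly 112 GB before activations, well beyond a single 80 GB A100. LoRA avoids most of this: gradients and optimizer state are kept only for the small adapter matrices, although the frozen base weights must still fit in memory.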
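The arithmetic can be sketched as a back-of-the-envelope estimate. This assumes float32 throughout and an Adam-style optimizer that keeps two extra values per trainable parameter, and it ignores activations. The helper name `training_memory_gb` and the 8M adapter parameter count are illustrative assumptions, not figures from any library.

```python
BYTES_FP32 = 4

def training_memory_gb(total_params, trainable_params, optimizer_states_per_param=2):
    """Rough memory in GB (1 GB = 1e9 bytes) for weights, gradients and optimizer state."""
    weights = total_params * BYTES_FP32
    grads = trainable_params * BYTES_FP32
    opt_state = trainable_params * optimizer_states_per_param * BYTES_FP32
    return (weights + grads + opt_state) / 1e9

full = training_memory_gb(7e9, 7e9)   # every weight is trainable
lora = training_memory_gb(7e9, 8e6)   # assumed: ~8M adapter parameters

print(f"full fine-tuning: {full:.0f} GB")  # full fine-tuning: 112 GB
print(f"LoRA:             {lora:.1f} GB")  # LoRA:             28.1 GB
```

The frozen weights dominate the LoRA figure; the saving comes almost entirely from the gradients and optimizer state that no longer need to be stored.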
The core idea
For a pretrained weight matrix W (shape d x k), instead of updating W directly, LoRA introduces two small matrices:
- A of shape d x r
- B of shape r x k
where r is the rank, typically 4–64. The adapted forward pass becomes:
output = x @ W + x @ (A @ B) * alpha / r
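W is frozen. Only A and B are trained. Together they hold r * (d + k) parameters versus d * k for the full matrix, and their product A @ B has rank at most r. For d = k = 4096 and r = 16, that is 131,072 parameters instead of about 16.8 million.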
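As a minimal NumPy sketch of that forward pass, under some assumptions: the sizes are arbitrary, and the initialisation (small random A, zero B) follows a common convention for making the adapted model start out identical to the base model, which the text above does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 512, 256, 8, 16   # illustrative sizes, not from a real model

W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((d, r)) * 0.01   # trainable, small random init (assumed)
B = np.zeros((r, k))                     # trainable, zero init so A @ B starts at 0

def lora_forward(x, W, A, B, alpha, r):
    # Computing (x @ A) @ B avoids materialising the full d x k product A @ B.
    return x @ W + (x @ A) @ B * (alpha / r)

x = rng.standard_normal((4, d))
y = lora_forward(x, W, A, B, alpha, r)

full_params = W.size            # d * k
lora_params = A.size + B.size   # r * (d + k)
print(y.shape, full_params, lora_params)  # (4, 256) 131072 6144
```

Because B starts at zero, the output equals `x @ W` until training moves B away from zero.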
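```python
from peft import LoraConfig, TaskType, get_peft_model

# base_model is assumed to be an already-loaded causal LM,
# e.g. from transformers.AutoModelForCausalLM.from_pretrained(...)
config = LoraConfig(
    r=16,                                 # rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# Assuming a Llama-2-7B-style base (32 layers, hidden size 4096), r=16 on q_proj and v_proj gives:
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243
```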
Why it works well
Research on pretrained transformers suggests that the weight updates learned during fine-tuning have low intrinsic rank: most of the useful signal lives in a low-dimensional subspace. LoRA exploits this by constraining the update to rank r from the start.
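A toy illustration of what "low intrinsic rank" means, using purely synthetic data: the `delta` matrix below is constructed to be low-rank plus noise, so this is not evidence about real models, only a sketch of why a rank-r factorisation can be enough when the assumption holds.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, true_rank = 256, 128, 4

# Synthetic "fine-tuning update": a rank-4 signal plus small dense noise.
delta = rng.standard_normal((d, true_rank)) @ rng.standard_normal((true_rank, k))
delta += 0.01 * rng.standard_normal((d, k))

U, S, Vt = np.linalg.svd(delta, full_matrices=False)

def rank_r_error(r):
    """Relative Frobenius error of the best rank-r approximation of delta."""
    approx = (U[:, :r] * S[:r]) @ Vt[:r]
    return np.linalg.norm(delta - approx) / np.linalg.norm(delta)

for r in (1, 4, 16):
    print(r, round(rank_r_error(r), 4))
```

The error drops sharply once r reaches the rank of the underlying signal, and increasing r further only fits the noise.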
LoRA vs full fine-tuning
| | Full fine-tuning | LoRA |
|---|---|---|
| Trainable params | 100 % | 0.01–1 % |
| GPU memory | Very high | Moderate |
| Serving | One model per task | Swap adapters on one base |
| Quality | Ceiling | Close to ceiling |
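The "swap adapters on one base" row can be sketched in NumPy. The task names, sizes, and the `forward` helper are hypothetical; a real deployment would use the adapter-management features of a library such as PEFT rather than a dictionary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 4, 8

W = rng.standard_normal((d, k))  # single shared base weight, never modified

# One small (A, B) pair per task; the task names are placeholders.
adapters = {
    task: (rng.standard_normal((d, r)), rng.standard_normal((r, k)))
    for task in ("summarize", "translate")
}

def forward(x, task=None):
    y = x @ W
    if task is not None:
        A, B = adapters[task]
        y = y + (x @ A) @ B * (alpha / r)
    return y

x = rng.standard_normal((2, d))
base_out = forward(x)
sum_out = forward(x, "summarize")
tr_out = forward(x, "translate")

adapter_size = sum(m.size for m in adapters["summarize"])
print(W.size, adapter_size)  # 2048 384
```

Each additional task costs only the adapter's parameters, while the base weight is stored once.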
After training, A @ B can be merged back into W for zero-overhead inference: W_merged = W + A @ B * alpha / r.
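A quick numerical check of the merge, as a sketch with arbitrary sizes and random stand-ins for trained A and B: the merged weight should reproduce the two-branch forward pass up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 4, 8

W = rng.standard_normal((d, k))   # frozen base weight
A = rng.standard_normal((d, r))   # stand-in for a trained adapter factor
B = rng.standard_normal((r, k))   # stand-in for a trained adapter factor
x = rng.standard_normal((3, d))

scale = alpha / r
unmerged = x @ W + (x @ A) @ B * scale   # base path plus adapter path

W_merged = W + A @ B * scale             # fold the adapter into the weight
merged = x @ W_merged                    # single matmul at inference time

print(np.allclose(unmerged, merged))     # True

# Subtracting the same term recovers the base weight, so a merge can be undone.
W_restored = W_merged - A @ B * scale
```

The merged matrix has the same shape as W, so the served model has the same architecture and latency as the base model.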