datarekha
Deep Learning Hard Asked at MicrosoftAsked at MetaAsked at Hugging FaceAsked at Google

How does LoRA work and why is it preferred over full fine-tuning for large models?

The short answer

LoRA (Low-Rank Adaptation) freezes the original model weights and injects trainable low-rank decomposition matrices into attention layers. This cuts the number of trainable parameters by 100x-1000x while matching or approaching full fine-tuning quality, making it practical on a single GPU.

How to think about it

A 7B parameter model has roughly 28 GB of weights in float32. Full fine-tuning requires storing gradients and optimizer states on top of that — well beyond a single A100. LoRA sidesteps this entirely.

The core idea

For a pretrained weight matrix W (shape d x k), instead of updating W directly, LoRA introduces two small matrices:

  • A of shape d x r
  • B of shape r x k

where r is the rank, typically 4–64. The adapted forward pass becomes:

output = x @ W + x @ (A @ B) * alpha / r

W is frozen. Only A and B are trained. The product A @ B has at most r * (d + k) parameters vs d * k for the full matrix — a massive reduction.

from peft import get_peft_model, LoraConfig, TaskType

config = LoraConfig(
    r=16,                    # rank
    lora_alpha=32,           # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable %: 0.06

Why it works well

Research on pretrained transformers shows that the weight updates during fine-tuning have low intrinsic rank — the important signal lives in a low-dimensional subspace. LoRA exploits this directly.

LoRA vs full fine-tuning

Full fine-tuningLoRA
Trainable params100 %0.01–1 %
GPU memoryVery highModerate
ServingOne model per taskSwap adapters on one base
QualityCeilingClose to ceiling

After training, A @ B can be merged back into W for zero-overhead inference: W_merged = W + A @ B * alpha / r.

Learn it properly LoRA & QLoRA fine-tuning

Keep practising

All Deep Learning questions

Explore further

Skip to content