Deep Learning | Hard | Asked at: Microsoft, Meta, Hugging Face, Google

How does LoRA work and why is it preferred over full fine-tuning for large models?

The short answer

LoRA (Low-Rank Adaptation) freezes the original model weights and injects trainable low-rank decomposition matrices into selected linear layers, most commonly the attention projections. This cuts the number of trainable parameters by roughly 100x–10,000x while matching or approaching full fine-tuning quality on many tasks, which makes fine-tuning a large model practical on a single GPU.

How to think about it

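A 7B-parameter model has roughly 28 GB of weights in float32. Full fine-tuning also stores a gradient for every weight (another 28 GB) and, with an optimizer such as Adam, two moment estimates per weight (another 56 GB). That is about 112 GB before activations, well beyond a single 80 GB A100. LoRA still keeps the frozen weights in memory, but gradients and optimizer states are needed only for the small adapter matrices, so that overhead nearly disappears.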
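
The memory arithmetic can be sketched in a few lines. This is a rough estimate under stated assumptions: float32 everywhere, an Adam-style optimizer with two moment buffers, and activations ignored. The helper name `training_memory_gb` and the 8-million-parameter adapter size are hypothetical, chosen only for illustration.

```python
def training_memory_gb(n_total, n_trainable, bytes_per_value=4):
    """Rough training memory in GB: weights, gradients, and Adam moments.

    Ignores activations and framework overhead; assumes float32 throughout.
    """
    weights = n_total * bytes_per_value
    grads = n_trainable * bytes_per_value            # one gradient per trainable value
    adam_states = 2 * n_trainable * bytes_per_value  # first and second moment estimates
    return (weights + grads + adam_states) / 1e9


n_total = 7_000_000_000   # 7B-parameter base model
n_adapter = 8_000_000     # hypothetical adapter size, for illustration only

full_ft = training_memory_gb(n_total, n_trainable=n_total)
lora = training_memory_gb(n_total, n_trainable=n_adapter)

print(f"full fine-tuning: {full_ft:.1f} GB")
print(f"LoRA:             {lora:.1f} GB")
```

Under these assumptions full fine-tuning needs about 112 GB and LoRA about 28.1 GB. Almost all of the LoRA figure is the frozen base weights, which is why storing them in a lower precision reduces the footprint further.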

The core idea

For a pretrained weight matrix W (shape d x k), instead of updating W directly, LoRA introduces two small matrices:

A of shape d x r
B of shape r x k

where r is the rank, typically 4–64. The adapted forward pass becomes:

output = x @ W + x @ (A @ B) * alpha / r

W is frozen. Only A and B are trained. Together, A and B hold r * (d + k) parameters, versus d * k for the full matrix. For d = k = 4096 and r = 16, that is 131,072 versus 16,777,216, a 128x reduction.
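
The forward pass above can be sketched in NumPy. This is a minimal illustration with toy dimensions, not a training-ready layer. The zero initialization of B (so the adapted model starts out identical to the base model) is common practice but is an assumption here, as are the function name `lora_forward` and the sizes chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 48, 4, 8  # toy sizes, chosen for illustration

W = rng.normal(size=(d, k))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(d, r))  # trainable down-projection
B = np.zeros((r, k))                     # trainable up-projection, zero-initialized


def lora_forward(x, W, A, B, alpha, r):
    """Compute x @ W plus the scaled low-rank update.

    (x @ A) @ B gives the same result as x @ (A @ B) but never
    materializes the full d x k product.
    """
    return x @ W + (x @ A) @ B * (alpha / r)


x = rng.normal(size=(2, d))  # batch of 2 inputs
out = lora_forward(x, W, A, B, alpha, r)

full_params = d * k        # parameters in W
lora_params = r * (d + k)  # parameters in A and B together
print(out.shape, full_params, lora_params)
```

Because B starts at zero, the first forward pass equals `x @ W` exactly; the update only becomes nonzero once training moves B away from zero.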

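from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM

# Assumption: any causal LM whose attention layers are named q_proj / v_proj.
# The checkpoint name is illustrative; substitute your own base model.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,                    # rank of the update
    lora_alpha=32,           # scaling factor; the update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,       # dropout on the adapter input during training
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(base_model, config)  # wraps the model and freezes base weights
model.print_trainable_parameters()
# For a 7B Llama-style model (32 layers, 4096-wide q_proj and v_proj):
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable %: 0.12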
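
The trainable count that `print_trainable_parameters()` reports can be checked by hand. The sketch below assumes a Llama-style 7B architecture (32 layers, with `q_proj` and `v_proj` each 4096 x 4096); the base model is not named in the text, so those dimensions and the helper name `lora_param_count` are assumptions.

```python
def lora_param_count(r, d_in, d_out, n_modules):
    """Adapter parameters: each adapted module adds r * (d_in + d_out)."""
    return n_modules * r * (d_in + d_out)


n_layers, hidden = 32, 4096   # assumed Llama-style 7B dimensions
n_modules = n_layers * 2      # q_proj and v_proj in every layer

counts = {r: lora_param_count(r, hidden, hidden, n_modules) for r in (8, 16)}
for r, n in counts.items():
    print(f"r={r}: {n:,} trainable parameters")
```

The count scales linearly with r: under these assumptions r=8 gives 4,194,304 trainable parameters and r=16 gives 8,388,608.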

Why it works well

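The LoRA paper builds on earlier findings that pretrained models have low intrinsic dimensionality, and hypothesizes that the weight updates during fine-tuning also have low intrinsic rank: most of the useful signal lives in a low-dimensional subspace. LoRA exploits this directly by constraining the update to rank r.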
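
What "low rank" buys can be seen on a synthetic example. The update below is constructed to have rank 4, so this shows the mechanics of the factorization rather than evidence about real models; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, true_rank = 64, 48, 4

# Synthetic weight update built as a product of two thin matrices,
# so its rank is at most 4 by construction.
delta_W = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, k))

# The singular values show how many directions carry signal.
U, S, Vt = np.linalg.svd(delta_W, full_matrices=False)

# Keep only the top-r directions and split them into two factors.
r = 4
A = U[:, :r] * S[:r]  # d x r (each column scaled by its singular value)
B = Vt[:r, :]         # r x k

rel_err = np.linalg.norm(delta_W - A @ B) / np.linalg.norm(delta_W)
print(np.linalg.matrix_rank(delta_W), rel_err)
```

A 64 x 48 update with 3,072 entries is reproduced to numerical precision by two factors holding 448 values. LoRA learns such factors directly by gradient descent instead of computing them from a full update.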

LoRA vs full fine-tuning

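Aspect	Full fine-tuning	LoRA
Trainable params	100%	0.01–1%
GPU memory	Weights plus gradients and optimizer states for every parameter	Frozen weights plus gradients and optimizer states for the adapters only
Serving	One full model copy per task	Swap small adapters on one shared base
Quality	Reference ceiling	Close to the ceiling on many tasks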

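After training, A @ B can be merged back into W for zero-overhead inference: W_merged = W + (A @ B) * alpha / r. The merged matrix has the same shape as W, so inference cost is unchanged. Keep the adapter unmerged if you want to swap adapters on a shared base.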
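
The merge can be checked numerically. The sketch below uses random toy matrices in place of trained ones (an assumption for illustration) and confirms that the merged weight gives the same output as the two-branch forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 48, 4, 8  # toy sizes, chosen for illustration

W = rng.normal(size=(d, k))  # frozen base weight
A = rng.normal(size=(d, r))  # stand-in for a trained down-projection
B = rng.normal(size=(r, k))  # stand-in for a trained up-projection
x = rng.normal(size=(3, d))

scale = alpha / r

# Two-branch forward pass used during training.
unmerged = x @ W + (x @ A) @ B * scale

# Fold the adapter into the base weight once, then use a single matmul.
W_merged = W + (A @ B) * scale
merged = x @ W_merged

# Subtracting the same term recovers the base weight.
W_restored = W_merged - (A @ B) * scale

print(np.allclose(unmerged, merged), np.allclose(W_restored, W))
```

The merge is reversible up to floating-point error, so a deployment can fold an adapter in for speed and subtract it again before loading a different one.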
