What distinguishes QLoRA from LoRA?

For AI / LLM Engineer ML Engineer research-engineer

The short answer

QLoRA quantizes the frozen base model to 4-bit using NF4 and double quantization, then trains LoRA adapters on top, with paged optimizers to handle memory spikes. This dramatically lowers the memory footprint, enabling fine-tuning of large models on a single consumer GPU with little quality loss versus 16-bit LoRA.

How to think about it

Learn it properly LoRA & QLoRA fine-tuning

Keep practising

What is LoRA and how does it make fine-tuning parameter-efficient? How does LoRA work and why is it preferred over full fine-tuning for large models? What is model quantization, and how does it affect quality? What is GGUF, and what does a quantization tier like Q4_K_M mean? How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

All NLP & LLMs questions

Explore further

Fine-tuning: LoRA & QLoRA Quantization Fine-tune vs RAG: the decision

LoRA Quantization Continuous Batching KV Cache