MLOps Medium

What is model quantization, and how does it affect quality?

For: MLOps Engineer, AI / LLM Engineer, ML Engineer

The short answer

Quantization stores a model's weights, and sometimes its activations, in lower-precision number formats to cut memory use and speed up inference. Common targets range from 16-bit (FP16 or BF16) down to INT8 and INT4. Lower precision saves more memory but can degrade accuracy; calibrating on representative data, using methods such as GPTQ and AWQ, and keeping sensitive layers in higher precision all limit the loss.
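To make the trade-off concrete, here is a minimal pure-Python sketch of symmetric per-tensor "absmax" quantization, the simplest baseline rather than GPTQ or AWQ. The function names (`quantize_absmax`, `dequantize`, `weight_memory_gb`), the random Gaussian weights, and the 7-billion-parameter figure are all illustrative assumptions, not part of any library or specific model.

```python
import random

def quantize_absmax(weights, bits):
    """Symmetric per-tensor quantization: floats -> signed integers plus one scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer levels back to approximate float values."""
    return [v * scale for v in q]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def weight_memory_gb(num_params, bits):
    """Raw weight storage only; ignores scales, activations, and KV cache."""
    return num_params * bits / 8 / 1e9

# Hypothetical stand-in for one layer's weights (assumption: small Gaussian values).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

errors = {}
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    errors[bits] = mean_abs_error(weights, dequantize(q, scale))
    print(f"INT{bits}: mean abs reconstruction error = {errors[bits]:.6f}")

# Assumed model size of 7 billion parameters, for illustration only.
for bits in (16, 8, 4):
    print(f"7B params at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

The memory lines come out to 14.0 GB, 7.0 GB, and 3.5 GB, while the INT4 reconstruction error is larger than the INT8 one because the same value range is split into far fewer levels. Production methods typically use per-channel or per-group scales and calibration data to shrink that error.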

Keep practising

What is GGUF, and what does a quantization tier like Q4_K_M mean?
What distinguishes QLoRA from LoRA?
What is mixed precision training and why does it matter?
How would you reduce the cost of serving an ML or LLM model in production without hurting quality?
Why are smaller language models (SLMs) sometimes preferable to larger ones?
