datarekha

Quantization: Shrinking Models to Fit

A 70B model needs 140 GB in FP16 (280 GB in FP32), and no consumer GPU can hold it. Quantize it to 4-bit and the weights shrink to about 35 GB, which fits in 40 GB with room for overhead, at the cost of a few points of accuracy. Here is the trade.

8 min read Advanced Generative AI Lesson 16 of 24

What you'll learn

  • Why model weights stored in fewer bits cost less memory — and the exact math to compute it
  • What FP32, FP16, INT8, and INT4 each give up and why 4-bit is the sweet spot for consumer hardware
  • How GPTQ, AWQ, GGUF, and bitsandbytes differ and when to choose each

The night the model would not load

A team finishes fine-tuning a 7B-parameter model on their A100 (80 GB VRAM). They ship it to a partner running a 24 GB RTX 3090. The partner tries to load it. The process crashes instantly.

The FP32 checkpoint is 28 GB. The GPU has 24 GB. The weights alone do not fit.

This is the most common hard wall in LLM deployment. The fix is quantization — storing each weight in fewer bits so the whole model compresses into the available memory.


The memory math (derive it yourself)

Every weight in a neural network is a floating-point number. The number of bits used to store it is the bit-width (or precision). Here are the four formats you will see:

Format        Bits per value   Bytes per value   Notes
FP32          32               4                 Full precision; default training format
FP16 / BF16   16               2                 Half precision; fast on modern GPUs
INT8          8                1                 8-bit integer; near-lossless for most models
INT4          4                0.5               4-bit integer; fits consumer GPUs

The formula is straightforward:

memory (GB) = (number of parameters) × (bytes per parameter) / 1,000,000,000

For a 7-billion-parameter model:

  • FP32: 7,000,000,000 × 4 / 1e9 = 28 GB
  • FP16: 7,000,000,000 × 2 / 1e9 = 14 GB
  • INT8: 7,000,000,000 × 1 / 1e9 = 7 GB
  • INT4: 7,000,000,000 × 0.5 / 1e9 = 3.5 GB

INT4 cuts weight memory by 8x versus FP32: 28 GB becomes 3.5 GB. A model that needed a data-center A100 (roughly $10,000) now runs on a consumer RTX 3060 (12 GB, roughly $500).


Why weights tolerate low precision (and activations do not)

Quantization works because weights are static values computed once during training and then frozen. They represent smooth, slowly-varying learned features. Rounding them from FP32 to INT4 introduces small errors — but those errors are spread across billions of values and largely cancel out.

Activations (the intermediate results computed on each forward pass) are a different story. They can spike to very large values unpredictably depending on the input. Quantizing activations aggressively causes visible quality drops. This is why most quantization schemes quantize only the weights by default and keep activations in FP16 at runtime.
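
The contrast between weights and activations can be illustrated with a minimal sketch. It assumes plain symmetric round-to-nearest quantization with one scale per tensor, which is simpler than what GPTQ or AWQ do, and it uses synthetic numbers rather than a real model. The names quantize_absmax and mean_abs_error are made up for this example.

```python
import random

def quantize_absmax(values, bits):
    """Symmetric round-to-nearest: map floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                   # 7 for INT4, 127 for INT8
    scale = max(abs(v) for v in values) / qmax   # one scale for the whole tensor
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def mean_abs_error(values, bits):
    """Average distance between each value and its dequantized version."""
    codes, scale = quantize_absmax(values, bits)
    return sum(abs(c * scale - v) for c, v in zip(codes, values)) / len(values)

rng = random.Random(0)

# Synthetic "weights": small values clustered around zero (an assumption).
weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]
codes, scale = quantize_absmax(weights, bits=4)
worst = max(abs(c * scale - w) for c, w in zip(codes, weights))

# Synthetic "activations": the same values with and without a single spike.
calm = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
spiky = calm[:-1] + [60.0]

print(f"INT4 weights: worst error {worst:.5f}, half a step is {scale / 2:.5f}")
print(f"INT8 activations, calm:  {mean_abs_error(calm, 8):.5f}")
print(f"INT8 activations, spiky: {mean_abs_error(spiky, 8):.5f}")
```

No weight lands more than half a quantization step from its original value. The single spike, on the other hand, stretches the scale so far that the ordinary activations collapse onto a handful of codes, and their average error grows by more than an order of magnitude.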


The bar chart: same model, four sizes

[Figure: bar chart, "7B Model: Memory Footprint by Precision" (weights only, activations excluded). Bars: FP32 (4 bytes) 28 GB; FP16 (2 bytes) 14 GB; INT8 (1 byte) 7 GB; INT4 (0.5 bytes) 3.5 GB. A horizontal line marks the 24 GB GPU limit.]

A 7B model at four precisions. Only FP32 (28 GB) exceeds the 24 GB consumer GPU limit. INT4 fits with room to spare.


Compute it for any model
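
A minimal sketch of the formula as a small script. The function names and the 24 GB default are choices made for this example, and the "fits" column counts weights only, so it ignores activations, the KV cache, and framework overhead.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params, bytes_per_param):
    """memory (GB) = parameters x bytes per parameter / 1e9"""
    return num_params * bytes_per_param / 1e9

def print_memory_table(num_params, gpu_gb=24.0):
    """Print weight memory per format and whether it fits the given GPU."""
    print(f"Model size: {num_params / 1e9:.1f}B parameters")
    print(f"{'Format':<9}{'Bytes/param':<15}{'Memory (GB)':<13}Fits {gpu_gb:g} GB GPU?")
    print("-" * 54)
    for name, nbytes in BYTES_PER_PARAM.items():
        gb = weight_memory_gb(num_params, nbytes)
        fits = "YES" if gb <= gpu_gb else "NO"
        print(f"{name:<9}{nbytes:<15}{gb:<13.1f}{fits}")

print_memory_table(7e9)
```

Change the argument to print_memory_table to try other model sizes, for example 13e9 or 70e9.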

For 7B parameters and a 24 GB GPU, the result is:

Model size: 7.0B parameters
Format   Bytes/param    Memory (GB)  Fits 24 GB GPU?
------------------------------------------------------
FP32     4.0            28.0         NO
FP16     2.0            14.0         YES
INT8     1.0            7.0          YES
INT4     0.5            3.5          YES

The four main quantization methods

GPTQ (post-training quantization for generative pre-trained transformers) was one of the first methods to bring INT4 to LLMs without catastrophic quality loss. It quantizes layer by layer, using a small calibration dataset to minimize the error introduced at each layer. The result is 4-bit or 3-bit weights, typically stored as a .safetensors file and loaded by libraries such as AutoGPTQ (the auto-gptq package). Its inference kernels are most mature on NVIDIA GPUs.

AWQ (Activation-Aware Weight Quantization) observes which weights are "salient" (those multiplied by large activations) and protects them from aggressive rounding. In practice AWQ often matches or beats GPTQ quality at INT4 with a simpler calibration step. The AutoAWQ library (the autoawq package, which integrates with Hugging Face Transformers) handles it.
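
The core idea behind protecting salient weights can be sketched in a few lines. This is a toy illustration under stated assumptions, not the AWQ algorithm itself: real AWQ searches for per-channel scales from calibration activations, whereas this sketch hard-codes a scale of 2.0 on one channel that is assumed to be salient. All function names and numbers are made up for the example.

```python
import random

def quantize_int4(weights):
    """Symmetric absmax round-to-nearest INT4; returns dequantized floats."""
    scale = max(abs(w) for w in weights) / 7
    if scale == 0:
        scale = 1.0
    return [max(-8, min(7, round(w / scale))) * scale for w in weights]

def output_error(weights, acts, channel_scales):
    """Error of one dot product when weights are scaled up before
    quantization and activations are scaled down to compensate."""
    exact = sum(w * x for w, x in zip(weights, acts))
    scaled_w = [w * s for w, s in zip(weights, channel_scales)]
    scaled_x = [x / s for x, s in zip(acts, channel_scales)]
    approx = sum(wq * x for wq, x in zip(quantize_int4(scaled_w), scaled_x))
    return abs(exact - approx)

def compare(trials=2000, width=64, seed=0):
    """Mean output error without and with scaling of the salient channel."""
    rng = random.Random(seed)
    plain_total = scaled_total = 0.0
    for _ in range(trials):
        weights = [rng.uniform(-1, 1) for _ in range(width)]
        acts = [rng.uniform(-0.1, 0.1) for _ in range(width)]
        acts[0] = 8.0  # channel 0 always sees a large activation
        plain_total += output_error(weights, acts, [1.0] * width)
        scaled_total += output_error(weights, acts, [2.0] + [1.0] * (width - 1))
    return plain_total / trials, scaled_total / trials

plain_err, scaled_err = compare()
print(f"mean output error, plain INT4:         {plain_err:.4f}")
print(f"mean output error, salient channel x2: {scaled_err:.4f}")
```

Scaling a weight up before rounding makes its rounding error smaller relative to its value, and dividing the matching activation by the same factor keeps the exact product unchanged. Because the salient channel dominates the output, reducing its error lowers the overall error even though the other channels are untouched.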

GGUF is the file format used by llama.cpp and Ollama. It supports a family of quantization levels (Q4_K_M, Q5_K_M, Q8_0, etc.) that let you trade size against quality when you choose which file to download or convert. In a name like Q4_K_M, the K marks the k-quant family and the M marks the medium variant, which mixes bit-widths across tensors to protect sensitive ones. GGUF models run on CPU if needed, which is useful when you have no GPU at all.
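
GGUF file sizes come out slightly larger than the simple bytes-per-parameter formula suggests, because block-based formats store a scale alongside each block of weights. The sketch below assumes the layout of the simpler Q4_0 and Q8_0 types as they are commonly described for llama.cpp: blocks of 32 weights with one 16-bit scale per block. K-quant files such as Q4_K_M use a more elaborate layout that this formula does not model, and the function names are made up for the example.

```python
def block_bits_per_weight(block_size, weight_bits, scale_bits=16):
    """Effective bits per weight when each block carries one shared scale."""
    return (block_size * weight_bits + scale_bits) / block_size

def file_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (metadata and unquantized tensors ignored)."""
    return num_params * bits_per_weight / 8 / 1e9

q4_0 = block_bits_per_weight(block_size=32, weight_bits=4)
q8_0 = block_bits_per_weight(block_size=32, weight_bits=8)

for name, bpw in (("Q4_0", q4_0), ("Q8_0", q8_0)):
    print(f"{name}: {bpw} bits/weight, about {file_size_gb(7e9, bpw):.2f} GB for 7B parameters")
```

The per-block scale is the reason a "4-bit" 7B file is closer to 4 GB than to the 3.5 GB the plain formula gives.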

bitsandbytes (often abbreviated bnb) does on-the-fly quantization in Python, integrated directly into Hugging Face Transformers via load_in_8bit=True or load_in_4bit=True. In recent Transformers versions these flags are set on a BitsAndBytesConfig object that is passed to from_pretrained. No offline quantization step is required: you load the original FP16 checkpoint and bitsandbytes quantizes it during loading. Convenient for experimentation; GPTQ or AWQ are typically faster at inference.
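
A minimal sketch of loading a model in 4-bit with bitsandbytes through Transformers. It assumes torch, transformers, accelerate, and bitsandbytes are installed and a CUDA GPU is available. The wrapper function load_quantized_4bit is made up for this example, and the NF4 and float16 settings are one reasonable choice rather than the only one.

```python
def load_quantized_4bit(model_id):
    """Load a causal LM with weights quantized to 4-bit at load time."""
    # Imported lazily so this file can be read without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # quantize weights while loading
        bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat data type
        bnb_4bit_compute_dtype=torch.float16,  # matmuls run in FP16
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",                     # place layers on available devices
    )
```

Call it with the Hub ID or local path of any causal language model checkpoint you have access to. The weights are stored in 4-bit, while the computation itself still runs in the dtype given by bnb_4bit_compute_dtype.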


Post-training quantization vs quantization-aware training

Post-training quantization (PTQ) takes a fully trained FP16 or FP32 model and quantizes it afterward. GPTQ and AWQ do this with a small calibration dataset (a few hundred representative prompts); bitsandbytes needs no calibration data at all. All three are PTQ methods. Fast to apply; no GPU cluster needed.

Quantization-aware training (QAT) bakes simulated quantization noise into the training loop so the model learns to work around the precision loss. The result is a model that can be quantized more aggressively with less accuracy drop than PTQ. QAT requires retraining or fine-tuning, which means GPU time and data. For most users, PTQ at INT8 or INT4 is sufficient; QAT is reserved for edge deployments where every tenth of a point matters.


The accuracy cost

INT8 is almost always safe. Benchmarks across LLaMA, Mistral, and Qwen families consistently show less than 1 point drop on standard evals (MMLU, HellaSwag). For most production use cases this is indistinguishable from FP16.

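INT4 costs more: typically 2 to 5 points on reasoning and knowledge benchmarks, depending on the model and quantization method. Larger models tolerate INT4 better than smaller ones; a 70B quantized to INT4 often outperforms a 7B in FP16 on quality, although at about 35 GB of weights it still needs more memory than the 7B's 14 GB. The fairer comparison is at equal memory, where a larger INT4 model tends to beat a smaller FP16 one.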

Below INT4 (Q3, Q2) quality degrades rapidly and is not recommended for general-purpose generation.


Quiz

Quick check

Q1. A 13B-parameter model is loaded in FP16. How much VRAM does it require for weights alone?
Q2. You load a GPTQ INT4 model. During the forward pass, what precision are the matrix multiplications actually performed in?
Q3. A startup is deploying a vision-language model on Raspberry Pi 5 (8 GB RAM, no GPU). They have the original FP32 checkpoint. Which approach gives them the best chance of running it on device?

Next

Speculative Decoding — how a small draft model speeds up inference from your quantized large model by 2x to 3x.

Practice this in an interview

Why are smaller language models (SLMs) sometimes preferable to larger ones?

Smaller models win on latency, inference cost, on-device deployment, and fine-tuning feasibility. When trained on high-quality, curated data and aligned for a narrow task, a 7B–13B model can match or exceed a general-purpose 70B+ model on that specific workload while using a fraction of the compute budget.

What is mixed precision training and why does it matter?

Mixed precision training stores weights and activations in float16 (or bfloat16) for forward/backward passes while keeping a float32 master copy of weights for the update step. This roughly halves activation memory and delivers 2–4x throughput on modern tensor cores, with negligible accuracy loss when float16 is paired with loss scaling (bfloat16 usually does not need it).

How does batch size affect training — speed, convergence, and generalisation?

Larger batches give more accurate gradient estimates and enable higher GPU utilisation, but they tend to converge to sharper minima that generalise worse. Smaller batches introduce gradient noise that acts as implicit regularisation, helping the optimiser escape sharp minima and often finding flatter, better-generalising solutions — at the cost of slower wall-clock training per epoch.

How do you optimise GPU utilization for model serving, and what role does dynamic batching play?

GPUs execute tensor operations efficiently only when the batch dimension is large enough to saturate all CUDA cores. Dynamic batching collects individual requests arriving within a short window and fuses them into a single GPU call, dramatically improving throughput and cost efficiency without sacrificing per-request latency beyond the configured wait threshold.
