Quantization: Shrinking Models to Fit
A 70B model needs 140 GB in FP16 (280 GB in FP32), and no consumer GPU can hold it. Quantize it to 4-bit and it fits in about 40 GB, at a cost of a few points of accuracy. Here is the trade.
What you'll learn
- Why model weights stored in fewer bits cost less memory — and the exact math to compute it
- What FP32, FP16, INT8, and INT4 each give up and why 4-bit is the sweet spot for consumer hardware
- How GPTQ, AWQ, GGUF, and bitsandbytes differ and when to choose each
The night the model would not load
A team finishes fine-tuning a 7B-parameter model on their A100 (80 GB VRAM). They ship it to a partner running a 24 GB RTX 3090. The partner tries to load it. The process crashes instantly.
The checkpoint is 28 GB. The GPU has 24 GB. Nothing fits.
This is the most common hard wall in LLM deployment. The fix is quantization — storing each weight in fewer bits so the whole model compresses into the available memory.
The memory math (derive it yourself)
Every weight in a neural network is a number, usually stored as floating point during training. The number of bits used to store it is the bit-width (or precision). Here are the formats you will see most often:
| Format | Bits per value | Bytes per value | Notes |
|---|---|---|---|
| FP32 | 32 | 4 | Full precision; default training format |
| FP16 / BF16 | 16 | 2 | Half precision; fast on modern GPUs |
| INT8 | 8 | 1 | 8-bit integer; near-lossless for most models |
| INT4 | 4 | 0.5 | 4-bit integer; fits consumer GPUs |
The formula is straightforward:
memory (GB) = (number of parameters) × (bytes per parameter) / 1,000,000,000
For a 7-billion-parameter model:
- FP32: 7,000,000,000 × 4 / 1e9 = 28 GB
- FP16: 7,000,000,000 × 2 / 1e9 = 14 GB
- INT8: 7,000,000,000 × 1 / 1e9 = 7 GB
- INT4: 7,000,000,000 × 0.5 / 1e9 = 3.5 GB
INT4 cuts weight memory by 8x versus FP32. A model that needed an A100-class card (roughly $10,000) now fits on a 12 GB RTX 3060 (roughly $500).
Why weights tolerate low precision (and activations do not)
Quantization works because weights are static values computed once during training and then frozen. Their range is known in advance, and they tend to cluster in a narrow band around zero, so a fixed scale can cover them well. Rounding them from FP32 to INT4 introduces small errors, but those errors are spread across billions of values and largely cancel out.
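To make the rounding step concrete, here is a minimal sketch of symmetric round-to-nearest quantization in plain Python. It is an illustration, not the method any particular library uses: the function names, the random stand-in weights, and the single shared scale are all assumptions (real schemes usually keep one scale per small group of weights).

```python
import random


def quantize_symmetric(weights, bits):
    """Round-to-nearest quantization with one scale shared by all weights."""
    qmax = 2 ** (bits - 1) - 1                     # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def max_rounding_error(weights, bits):
    """Largest per-weight error after a quantize/dequantize round trip."""
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored)), scale


random.seed(0)
# Hypothetical stand-ins for one row of weights and one input vector.
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
inputs = [random.gauss(0.0, 1.0) for _ in range(4096)]
exact = sum(w * x for w, x in zip(weights, inputs))

for bits in (8, 4):
    codes, scale = quantize_symmetric(weights, bits)
    approx = sum(r * x for r, x in zip(dequantize(codes, scale), inputs))
    err, _ = max_rounding_error(weights, bits)
    print(f"INT{bits}: max weight error {err:.5f}, "
          f"dot product {approx:.4f} vs exact {exact:.4f}")
```

Each weight is off by at most half a quantization step, and because the individual errors have mixed signs, the error in the summed output tends to be much smaller than the worst-case bound would suggest.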
Activations (the intermediate results computed on each forward pass) are a different story. They can spike to very large values unpredictably depending on the input. Quantizing activations aggressively causes visible quality drops. This is why most quantization schemes quantize only the weights by default and keep activations in FP16 at runtime.
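The effect of a spike can be shown with a toy example. The sketch below is an assumption-laden illustration (hypothetical values, a simple absmax scale, INT8): it measures how well a set of ordinary values survives quantization when the scale is set by those values alone, and again when a single large outlier sets the scale.

```python
def mean_abs_error(values, scale_source, bits=8):
    """Mean round-trip error of `values` when the scale comes from `scale_source`."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in scale_source) / qmax
    restored = [round(v / scale) * scale for v in values]
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


# Hypothetical "typical" activations between -1.0 and 1.0.
typical = [0.1 * i for i in range(-10, 11)]
# The same tensor with one spike, as described above.
with_outlier = typical + [60.0]

err_typical = mean_abs_error(typical, scale_source=typical)
err_outlier = mean_abs_error(typical, scale_source=with_outlier)

print(f"error without outlier: {err_typical:.5f}")
print(f"error with outlier:    {err_outlier:.5f}")
```

The outlier stretches the scale so far that the ordinary values collapse onto two or three integer levels, which is the failure mode that makes aggressive activation quantization risky.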
The bar chart: same model, four sizes
A 7B model at four precisions. Only FP32 (28 GB) exceeds the 24 GB consumer GPU limit. INT4 fits with room to spare.
Compute it for any model
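A minimal sketch of the calculation, applying the formula from the memory math section. The function names, the column layout, and the 24 GB default are assumptions for illustration, and the result counts weights only: it ignores the KV cache, activations, and framework overhead.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def model_memory_gb(num_params, bytes_per_param):
    """Weight memory in GB (1 GB = 1e9 bytes), ignoring runtime overhead."""
    return num_params * bytes_per_param / 1e9


def print_memory_table(num_params, gpu_gb=24.0):
    """Print one row per format and whether the weights fit in gpu_gb."""
    fits_header = f"Fits {gpu_gb:g} GB GPU?"
    print(f"Model size: {num_params / 1e9:.1f}B parameters")
    print(f"{'Format':<8}{'Bytes/param':>12}{'Memory (GB)':>14}{fits_header:>20}")
    print("-" * 54)
    for name, bytes_per_param in BYTES_PER_PARAM.items():
        gb = model_memory_gb(num_params, bytes_per_param)
        fits = "YES" if gb <= gpu_gb else "NO"
        print(f"{name:<8}{bytes_per_param:>12.1f}{gb:>14.1f}{fits:>20}")


print_memory_table(7e9)
```

Change the argument to `print_memory_table` (for example `70e9`) or pass a different `gpu_gb` to check another model or card.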
For 7B parameters the output is:
Model size: 7.0B parameters
| Format | Bytes/param | Memory (GB) | Fits 24 GB GPU? |
|---|---|---|---|
| FP32 | 4.0 | 28.0 | NO |
| FP16 | 2.0 | 14.0 | YES |
| INT8 | 1.0 | 7.0 | YES |
| INT4 | 0.5 | 3.5 | YES |
The four main quantization methods
GPTQ (post-training quantization for generative pre-trained transformers) was one of the first methods to bring INT4 to LLMs without catastrophic quality loss. It quantizes layer by layer, using a small calibration dataset to minimize the error introduced at each layer. The result is 4-bit or 3-bit weights, typically stored as a .safetensors file and loaded by libraries such as auto-gptq. It is built for GPU inference.
AWQ (Activation-Aware Weight Quantization) observes which weights are “salient” (those multiplied by large activations) and protects them from aggressive rounding by rescaling them before quantization. In practice AWQ often matches or beats GPTQ quality at INT4 with a simpler calibration step. The open-source AutoAWQ library (the autoawq package), whose output Hugging Face Transformers can load, handles it.
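The idea of protecting a salient weight can be sketched with a toy example. This is not the AWQ algorithm itself, which searches for per-channel scales using calibration data; the weights, activations, and the scaling factor of 2 below are hand-picked, hypothetical values. The sketch multiplies the salient weight by a factor before rounding and divides the matching activation by the same factor, so the exact product is unchanged while the rounding error on that channel shrinks.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest with one scale shared by the whole group."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical values: channel 1 sees a much larger activation than the rest.
weights = [0.9, 0.2, -0.4, 0.26]
activations = [1.0, 50.0, 1.0, 1.0]
salient, factor = 1, 2.0

exact = dot(weights, activations)

# Plain rounding: the salient weight gets the same treatment as the others.
plain_out = dot(quantize_group(weights), activations)

# Activation-aware variant: scale the salient weight up before rounding and
# scale its activation down by the same factor to keep the math equivalent.
scaled_weights = list(weights)
scaled_weights[salient] *= factor
scaled_activations = list(activations)
scaled_activations[salient] /= factor
aware_out = dot(quantize_group(scaled_weights), scaled_activations)

print(f"exact output:           {exact:.3f}")
print(f"plain INT4 output:      {plain_out:.3f}")
print(f"activation-aware INT4:  {aware_out:.3f}")
```

The trick works here because scaling the salient weight does not raise the group maximum, so the quantization step stays the same while the weight's relative error is cut. With other values the gain can be smaller or absent, which is why the real method tunes the scales on data.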
GGUF is the file format used by llama.cpp and Ollama. It supports a family of quantization levels (Q4_K_M, Q5_K_M, Q8_0, etc.), so you trade size against quality by choosing which file to create or download; the level is fixed in the file, not selected at load time. The _K suffix marks the k-quant family, and _M is its medium variant, which mixes bit-widths across tensor types to protect sensitive ones. GGUF models can also run on CPU, which is useful when you have no GPU at all.
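Block-based formats store a scale alongside each small block of weights, so the real cost per weight is slightly above the nominal bit-width. The sketch below estimates that overhead, assuming the commonly documented Q8_0 layout of 32 eight-bit codes plus one 16-bit scale per block. The function names are hypothetical, and the k-quant levels use more elaborate layouts that this sketch does not model.

```python
def effective_bits_per_weight(code_bits, block_size, scale_bits=16):
    """Bits per weight once the per-block scale is counted."""
    return code_bits + scale_bits / block_size


def file_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumed Q8_0 layout: 8-bit codes, blocks of 32, one 16-bit scale per block.
q8_0_bits = effective_bits_per_weight(code_bits=8, block_size=32)
print(f"Q8_0: {q8_0_bits} bits per weight")
print(f"7B model at Q8_0: about {file_size_gb(7e9, q8_0_bits):.2f} GB")
```

This is one reason a quantized file is usually somewhat larger than the simple parameters × bytes estimate.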
bitsandbytes (often abbreviated bnb) does on-the-fly quantization in Python, integrated directly into Hugging Face Transformers via load_in_8bit=True or load_in_4bit=True (set on a BitsAndBytesConfig in recent versions). No offline quantization step is required: you load the original FP16 checkpoint and bitsandbytes quantizes it during loading. It is convenient for experimentation; GPTQ or AWQ are typically faster at inference.
Post-training quantization vs quantization-aware training
Post-training quantization (PTQ) takes a fully trained FP16 or FP32 model and quantizes it afterward. GPTQ and AWQ use a small calibration dataset (a few hundred representative prompts) to do so; bitsandbytes needs no calibration data. All three are PTQ methods. They are fast to apply, and no GPU cluster is needed.
Quantization-aware training (QAT) bakes simulated quantization noise into the training loop so the model learns to work around the precision loss. The result is a model that can be quantized more aggressively with less accuracy drop than PTQ. QAT requires retraining or fine-tuning, which means GPU time and data. For most users, PTQ at INT8 or INT4 is sufficient; QAT is reserved for edge deployments where every tenth of a point matters.
The accuracy cost
INT8 is almost always safe. Benchmarks across LLaMA, Mistral, and Qwen families consistently show less than 1 point drop on standard evals (MMLU, HellaSwag). For most production use cases this is indistinguishable from FP16.
INT4 costs more: typically 2 to 5 points on reasoning and knowledge benchmarks, depending on the model and quantization method. Larger models tolerate INT4 better than smaller ones; a 70B quantized to INT4 often outperforms a 7B in FP16 on quality, although at about 35 GB of weights it still needs more memory than the 14 GB of the FP16 7B.
Below INT4 (Q3, Q2), quality degrades rapidly, and these levels are not recommended for general-purpose generation.
Next
Speculative Decoding — how a small draft model speeds up inference from your quantized large model by 2x to 3x.