What is model quantization, and how does it affect quality?

Quantization stores weights and sometimes activations in lower-precision formats to cut memory and speed up inference, ranging from 16-bit (FP16 or BF16) down to INT8 and INT4. Lower precision saves more memory but can degrade accuracy; techniques like calibration, GPTQ, AWQ, and keeping sensitive layers higher-precision minimize the loss.

What is GGUF, and what does a quantization tier like Q4_K_M mean?

GGUF is a single-file format for running LLMs locally, used by llama.cpp and Ollama. Unlike training-oriented formats, it packs weights, tokenizer, and metadata into one memory-mappable file optimized for inference on CPU or partial GPU. Q4_K_M describes the quantization: roughly 4 bits per weight (vs 16 for FP16), using the k-quant method, medium variant, which protects the most important tensors at higher precision. It is the community default because it keeps almost all of the model's quality at about a quarter of the FP16 size.

What distinguishes QLoRA from LoRA?

QLoRA quantizes the frozen base model to 4-bit using NF4 and double quantization, then trains LoRA adapters on top, with paged optimizers to handle memory spikes. This dramatically lowers the memory footprint, enabling fine-tuning of large models on a single consumer GPU with little quality loss versus 16-bit LoRA.

Why are smaller language models (SLMs) sometimes preferable to larger ones?

Smaller models win on latency, inference cost, on-device deployment, and fine-tuning feasibility. When trained on high-quality, curated data and aligned for a narrow task, a 7B–13B model can match or exceed a general-purpose 70B+ model on that specific workload while using a fraction of the compute budget.

Quantization: Shrinking Models to Fit — Generative AI

The night the model would not load

A team finishes fine-tuning a 7B-parameter model on their A100 (80 GB VRAM). They ship it to a partner running a 24 GB RTX 3090. The partner tries to load it. The process crashes instantly.

The checkpoint is 28 GB. The GPU has 24 GB. Nothing fits.

This is the most common hard wall in LLM deployment. The fix is quantization — storing each weight in fewer bits so the whole model compresses into the available memory.

The memory math (derive it yourself)

Every weight in a neural network is a floating-point number. The number of bits used to store it is the bit-width (or precision). Here are the four formats you will see:

Format	Bits per value	Bytes per value	Notes
FP32	32	4	Full precision; default training format
FP16 / BF16	16	2	Half precision; fast on modern GPUs
INT8	8	1	8-bit integer; near-lossless for most models
INT4	4	0.5	4-bit integer; fits consumer GPUs

The formula is straightforward:

memory (GB) = (number of parameters) × (bytes per parameter) / 1,000,000,000

For a 7-billion-parameter model:

FP32: 7,000,000,000 × 4 / 1e9 = 28 GB
FP16: 7,000,000,000 × 2 / 1e9 = 14 GB
INT8: 7,000,000,000 × 1 / 1e9 = 7 GB
INT4: 7,000,000,000 × 0.5 / 1e9 = 3.5 GB

INT4 cuts memory by 8x versus FP32. A model that needed a $10,000 A100 now runs on a $500 RTX 3060 (12 GB).

Why weights tolerate low precision (and activations do not)

Quantization works because weights are static values computed once during training and then frozen. They represent smooth, slowly-varying learned features. Rounding them from FP32 to INT4 introduces small errors — but those errors are spread across billions of values and largely cancel out.

Activations (the intermediate results computed on each forward pass) are a different story. They can spike to very large values unpredictably depending on the input. Quantizing activations aggressively causes visible quality drops. This is why most quantization schemes quantize only the weights by default and keep activations in FP16 at runtime.

The bar chart: same model, four sizes

A 7B model at four precisions. Only FP32 (28 GB) exceeds the 24 GB consumer GPU limit. INT4 fits with room to spare.

Compute it for any model

# Change params_b to any model size (in billions)
params_b = 7.0

params = params_b * 1e9

formats = [
    ("FP32",  4.0),
    ("FP16",  2.0),
    ("INT8",  1.0),
    ("INT4",  0.5),
]

print(f"Model size: {params_b}B parameters")
print(f"{'Format':<8} {'Bytes/param':<14} {'Memory (GB)':<12} {'Fits 24 GB GPU?'}")
print("-" * 54)

for name, bpp in formats:
    gb = params * bpp / 1e9
    fits = "YES" if gb <= 24 else "NO"
    print(f"{name:<8} {bpp:<14.1f} {gb:<12.1f} {fits}")

print()
print("Rule of thumb: memory_gb = params_b * bytes_per_param")

For 7B parameters the output is:

Model size: 7.0B parameters
Format   Bytes/param    Memory (GB)  Fits 24 GB GPU?
------------------------------------------------------
FP32     4.0            28.0         NO
FP16     2.0            14.0         YES
INT8     1.0            7.0          YES
INT4     0.5            3.5          YES

Rule of thumb: memory_gb = params_b * bytes_per_param

The four main quantization methods

GPTQ (Generative Pre-Trained Quantization) is the first method that brought INT4 to LLMs without catastrophic quality loss. It quantizes layer by layer, using a small calibration dataset to minimize the error introduced at each layer. Results are 4-bit or 3-bit weights, stored as a .safetensors file and loaded by libraries such as auto-gptq. Works well on any GPU.

AWQ (Activation-Aware Weight Quantization) observes which weights are “salient” — those multiplied by large activations — and protects them from aggressive rounding. In practice AWQ often matches or beats GPTQ quality at INT4 with a simpler calibration step. Hugging Face’s autoawq library handles it.

GGUF is the file format used by llama.cpp and Ollama. It supports a family of quantization levels (Q4_K_M, Q5_K_M, Q8_0, etc.) that let you trade between size and quality at load time. The _K_M suffix means the scheme mixes bit-widths across layer types to protect sensitive layers. GGUF models run on CPU if needed — useful when you have no GPU at all.

bitsandbytes (the bnb library) does on-the-fly quantization in Python, integrated directly into Hugging Face Transformers via load_in_8bit=True or load_in_4bit=True. No offline quantization step required — you load the original FP16 checkpoint and bitsandbytes quantizes it during loading. Convenient for experimentation; GPTQ or AWQ are typically faster at inference.

Post-training quantization vs quantization-aware training

Post-training quantization (PTQ) takes a fully trained FP16 or FP32 model and quantizes it afterward using a small calibration dataset (a few hundred representative prompts). GPTQ, AWQ, and bitsandbytes are all PTQ methods. Fast to apply; no GPU cluster needed.

Quantization-aware training (QAT) bakes simulated quantization noise into the training loop so the model learns to work around the precision loss. The result is a model that can be quantized more aggressively with less accuracy drop than PTQ. QAT requires retraining or fine-tuning, which means GPU time and data. For most users, PTQ at INT8 or INT4 is sufficient; QAT is reserved for edge deployments where every tenth of a point matters.

The accuracy cost

INT8 is almost always safe. Benchmarks across LLaMA, Mistral, and Qwen families consistently show less than 1 point drop on standard evals (MMLU, HellaSwag). For most production use cases this is indistinguishable from FP16.

INT4 costs more — typically 2 to 5 points on reasoning and knowledge benchmarks, depending on the model and quantization method. Larger models tolerate INT4 better than smaller ones; a 70B quantized to INT4 often outperforms a 7B in FP16 both in quality and memory efficiency.

Below INT4 (Q3, Q2) quality degrades rapidly and is not recommended for general-purpose generation.

In one breath

Quantization stores each weight in fewer bits so the model fits the memory you have: memory_gb ≈ params_b × bytes_per_param.
FP32→4, FP16→2, INT8→1, INT4→0.5 bytes/param, so INT4 is an 8× cut vs FP32 — a 7B model drops from 28 GB to 3.5 GB.
It works because weights are static and rounding errors average out; activations spike, so they stay in FP16 — and INT4 weights are dequantized to FP16 before the matmul (you save bandwidth, not compute).
Pick by target: AWQ/GPTQ for GPUs, GGUF for CPU/Mac/edge, bitsandbytes for quick experiments; PTQ suffices for most, QAT only when every point matters.
INT8 is near-lossless (under 1 pt); INT4 costs ~2–5 pts and larger models tolerate it better — a 70B in INT4 often beats a 7B in FP16.

Quiz

Quick check

0/3

Q1A 13B-parameter model is loaded in FP16. How much VRAM does it require for weights alone?

Q2You load a GPTQ INT4 model. During the forward pass, what precision are the matrix multiplications actually performed in?

Q3A startup is deploying a vision-language model on Raspberry Pi 5 (8 GB RAM, no GPU). They have the original FP32 checkpoint. Which approach gives them the best chance of running it on device?

Speculative Decoding — how a small draft model speeds up inference from your quantized large model by 2x to 3x.

Quantization: Shrinking Models to Fit

What you'll learn

Before you start