datarekha
Infrastructure April 27, 2026

Quantization in production: GPTQ, AWQ, GGUF, FP8 — what to ship

16-bit serving is dead for most production workloads. Here's how to pick between weight-only post-training quantization, activation-aware quantization, and the FP8 native path — without measuring it wrong.

12 min read · by datarekha · quantization, gptq, awq, fp8, gguf

Two years ago, “serve in 16-bit” was the polite default. You’d pick BF16 because the model card said so, you’d buy enough H100s to fit the weights, and you’d move on to the next problem. In mid-2026, that’s a quiet sign your unit economics are broken. Every serious self-hosted deployment — Anyscale, Together, Modal, the hyperscaler endpoints — runs quantized by default, and the question isn’t whether to quantize but which flavour to ship.

This post is the working comparison. Four families of quantization actually matter in production: weight-only post-training (GPTQ, AWQ), activation-aware (AWQ again, SmoothQuant), low-bit native (FP8 on H100/H200, INT4 on consumer), and GGUF for the llama.cpp ecosystem. They all promise “near-lossless” on the model card. They behave very differently when you actually plot accuracy against throughput against memory footprint on your workload.

The four lanes, and what each one is actually for

Figure: Four lanes of production quantization
  • Lane 1, FP8 native: H100 / H200 / B200 tensor cores; ~2x memory; 1.4-1.8x throughput; near-lossless; default in 2026.
  • Lane 2, AWQ INT4: weight-only, activation-aware; ~4x memory; 2-3x throughput; ~1% MMLU loss; A100 / L40 / 3090.
  • Lane 3, GPTQ INT4: layer-by-layer, Hessian-based; ~4x memory; 2-3x throughput; 1-3% MMLU loss; older / cheaper GPUs.
  • Lane 4, GGUF: llama.cpp, CPU + Metal; Q4_K_M sweet spot; imatrix-calibrated; ~75% smaller; laptop / edge.
Four lanes of production quantization in 2026. FP8 has quietly become the default for new H100-class deployments; AWQ owns the older-hardware retrofit; GPTQ remains the fallback for everything that doesn’t fit cleanly; GGUF rules the edge.

The trap is treating these as interchangeable. They aren’t. Each was designed for a different deployment shape. The decision is mostly downstream of two questions: what hardware are you targeting, and how tolerant is your eval set to the failure modes each method introduces.

Lane 1 — FP8, the quiet new default

The most underappreciated story of the last 18 months is that FP8 has won the production default wherever the hardware supports it. The H100’s Transformer Engine is rated at 3,958 TFLOPS of FP8 tensor-core compute, exactly 2x the BF16 number of 1,979 TFLOPS (both are the with-sparsity datasheet figures; the dense numbers are half that, and the 2x ratio is unchanged). The H200 keeps the compute identical, raises memory bandwidth by roughly 1.4x (4.8 TB/s versus 3.35 TB/s), and nearly doubles capacity to 141 GB. The B200 roughly doubles FP8 again. Native FP8 isn’t a quantization trick: it’s a hardware datapath that the GPU executes faster than FP16, with half the memory traffic.
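To make the memory side of that concrete, here is a back-of-envelope sketch of weight footprint at each precision. The 70B parameter count, the helper name `weight_gb`, and the INT4 overhead (one 16-bit scale plus one 4-bit zero-point per group of 128 weights) are illustrative assumptions; KV cache and activations are ignored.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 70e9  # assumed 70B-class model

# Assumed INT4 layout: 4-bit weights plus a 16-bit scale and a 4-bit
# zero-point shared by each group of 128 weights.
INT4_G128_BITS = 4 + (16 + 4) / 128

footprints = {
    "bf16": weight_gb(N_PARAMS, 16),
    "fp8": weight_gb(N_PARAMS, 8),
    "int4_g128": weight_gb(N_PARAMS, INT4_G128_BITS),
}

for name, gb in footprints.items():
    ratio = footprints["bf16"] / gb
    print(f"{name:>10}: {gb:6.1f} GB  ({ratio:.2f}x smaller than BF16)")
```

Under these assumptions the group metadata is why INT4 lands slightly short of a clean 4x saving.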

The accuracy story is the part most people don’t believe until they measure it. The Red Hat / Neural Magic study Give Me BF16 or Give Me Death? ran controlled benchmarks across the Llama-3.1 family and concluded that FP8 W8A8 is “essentially lossless”: full-precision parity within noise on every benchmark in the suite. The Llama 3.1 405B case is even more striking: the larger the model, the more redundancy in the weights, the more the FP8 rounding error gets absorbed.

In throughput terms, vLLM’s own FP8 docs report 1.4-1.8x tokens-per-second uplift on Llama-3-70B on H100s, with the caveat that the win is biggest at large batch sizes where memory bandwidth is the binding constraint. At batch size 1, the win shrinks; at batch size 256, it’s a clean 1.6x.

The operational story is simpler than people expect:

  • vLLM, SGLang, and TGI all support FP8 natively. In vLLM and SGLang you add --quantization fp8 to the launch command (TGI spells the flag --quantize fp8) and you’re done. A BF16 checkpoint is quantized to FP8 at load time; the scheduler treats the result like any other dtype.
  • Dynamic per-token activation quantization is the default and the right answer. Static activation calibration buys you maybe 3% more throughput at the cost of a calibration step. Skip it unless you’re truly tokens-bound.
  • NVIDIA’s TensorRT Model Optimizer or AMD Quark are the right tools if you want pre-quantized weights. The resulting models are typically 30% faster at startup and identical at steady-state.
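As a rough illustration of what “dynamic per-token activation quantization” means, the sketch below simulates it in plain Python: each token’s activation row gets its own scale, chosen so the row’s largest magnitude maps to the FP8 E4M3 maximum of 448. This is a teaching simulation, not how any serving stack implements it (real kernels do this on the GPU); the helper names are made up for the example.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    exponent = max(math.frexp(mag)[1] - 1, -6)  # below 2**-6 values are subnormal
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(mag / step) * step, E4M3_MAX)

def quantize_per_token(activations):
    """Quantize each row (token) with its own scale; return (fp8_values, scales)."""
    quantized, scales = [], []
    for row in activations:
        amax = max(abs(v) for v in row)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        quantized.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales

def dequantize(quantized, scales):
    """Map FP8 values back to the original range, one scale per row."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]

# Two tokens with very different dynamic ranges (toy numbers).
tokens = [[0.02, -0.5, 0.11, 0.3], [40.0, -3.0, 125.0, 0.7]]
q, s = quantize_per_token(tokens)
restored = dequantize(q, s)
```

Because the scale is recomputed per token at run time, no calibration set is needed, which is the trade the bullet above describes against static calibration.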

If you’re on H100-class or newer hardware in 2026 and you’re still serving in BF16, you’re paying roughly 1.6x your serving bill for no accuracy benefit.

The one operational caveat worth flagging: FP8 calibration interacts oddly with some fine-tuning workflows. If you LoRA-tune a model and then try to serve at FP8, the activations from the LoRA adapters can sit outside the calibrated range of the base model, producing accuracy regressions that don’t show up until you measure on your fine-tuned eval set. The fix is to re-calibrate (or merge the LoRA into the base weights before FP8 conversion) — both vLLM and TensorRT-LLM document this, but it catches teams that ship LoRA-tuned models without re-validating.
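The “merge first” fix can be sketched in a few lines. The example below folds a LoRA update into the base weight using the usual W + (alpha / r) · B · A form and then recomputes a per-tensor FP8 scale from the merged tensor. The matrices, `alpha`, and the helper names are toy assumptions; real conversion tools also handle activation scales, which this sketch ignores.

```python
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(base, lora_b, lora_a, alpha, rank):
    """Return base + (alpha / rank) * (lora_b @ lora_a)."""
    scaling = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]

def per_tensor_scale(weights):
    """FP8 scale that maps the largest |weight| to E4M3_MAX."""
    amax = max(abs(v) for row in weights for v in row)
    return amax / E4M3_MAX

base = [[1.0, -2.0], [0.5, 0.25]]   # toy base weight (out=2, in=2)
lora_b = [[1.0], [0.0]]             # toy adapter, rank 1
lora_a = [[0.0, -4.0]]

merged = merge_lora(base, lora_b, lora_a, alpha=2.0, rank=1)
stale_scale = per_tensor_scale(base)    # scale computed before the adapter
fresh_scale = per_tensor_scale(merged)  # scale computed after merging
```

In this toy case the adapter pushes the largest weight from 2 to 10, so a scale computed on the base model alone would saturate the merged tensor.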

Lane 2 — AWQ, the cheapest path on older GPUs

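Not everyone has H100s. If you’re on Ampere-generation hardware such as A100s or consumer 3090s, FP8 native doesn’t exist as a tensor-core datapath. You can emulate it, but you don’t get the speedup, and at that point INT4 weight-only quantization is the better deal. (Ada-generation cards such as the L40 and 4090 do have FP8 tensor cores, but with 24-48 GB of memory the 4x weight saving from INT4 is often what decides whether the model fits at all.)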

AWQ (Activation-aware Weight Quantization), the MLSys 2024 best paper from MIT’s Han Lab, is the production winner here. The insight: weights are not equally important. Roughly 1% of weight channels are salient, and they are identified by the magnitude of the activations flowing through them, not by the weights themselves. Protect those by scaling the salient channels before rounding the rest, and 4-bit quantization holds up remarkably well. The paper shows AWQ outperforming GPTQ across Llama, Mistral, and Mixtral families, with ~1% MMLU degradation versus FP16 on 70B-class models and a 3x+ inference speedup over the HuggingFace FP16 baseline.
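A toy version of that idea, as a sketch rather than the paper’s implementation: scale each input channel’s weight up by s and its activation down by s, where s is the activation magnitude raised to a power alpha, then pick the alpha that minimises output error. The symmetric INT4 rounding, the single four-weight group, and the single calibration vector are simplifying assumptions.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest INT4 over one group; returns dequantized values."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    return [max(-8, min(7, round(w / scale))) * scale for w in weights]

def output_error(weights, activations, channel_scales):
    """|y_fp - y_quant| when weights are scaled up and activations scaled down."""
    scaled_w = [w * s for w, s in zip(weights, channel_scales)]
    scaled_x = [x / s for x, s in zip(activations, channel_scales)]
    y_ref = sum(w * x for w, x in zip(weights, activations))
    y_q = sum(w * x for w, x in zip(quantize_int4(scaled_w), scaled_x))
    return abs(y_ref - y_q)

def search_scales(weights, activations, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid-search s = |x|**alpha per channel; alpha = 0 is plain rounding."""
    best = None
    for alpha in alphas:
        scales = [abs(x) ** alpha for x in activations]
        err = output_error(weights, activations, scales)
        if best is None or err < best[0]:
            best = (err, alpha, scales)
    return best

# Toy data: channel 0 carries a 10x larger activation than the others.
weights = [0.30, 0.875, -0.50, 0.10]
activations = [10.0, 1.0, 1.0, 1.0]

rtn_error = output_error(weights, activations, [1.0] * 4)
awq_error, best_alpha, best_scales = search_scales(weights, activations)
print(f"round-to-nearest error: {rtn_error:.4f}")
print(f"scaled error: {awq_error:.4f} at alpha={best_alpha}")
```

Scaling the salient channel too aggressively inflates the group’s quantization step and hurts the other channels, which is why the search over alpha exists at all.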

What makes AWQ the production choice:

  • No backpropagation, no per-layer Hessian. Calibration runs in 10-20 minutes on a single GPU. GPTQ takes hours.
  • Better accuracy at the same bit width. Side-by-side on Llama-3-70B, AWQ at 4 bits typically lands within 0.5% of the FP16 baseline on reasoning benchmarks, and noticeably ahead of GPTQ on instruction-following.
  • Hardware-friendly. AWQ-quantized weights use a simple per-group scale and zero-point encoding that translates directly to fused INT4 kernels in vLLM, TGI, and TensorRT-LLM.

The standard recipe in production: pick AWQ with group size 128, calibrate on 512 samples of your domain text, ship. The 2025 ACL paper compiling quantization results across model sizes confirms that AWQ at INT4 retains 98.9% of FP16 accuracy on HumanEval+, with the largest losses concentrated in long-context tasks (which is a general INT4 issue, not an AWQ one).
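For readers who have not seen the storage format, here is a minimal sketch of per-group INT4 encoding with a scale and zero-point at group size 128. It shows the encoding only, not AWQ’s channel scaling, and the function names and the synthetic Gaussian weights are assumptions made for the example.

```python
import random

def quantize_group(group, bits=4):
    """Asymmetric quantization of one group: returns (codes, scale, zero_point)."""
    qmax = 2 ** bits - 1
    lo = min(min(group), 0.0)  # include 0 so the zero-point fits in [0, qmax]
    hi = max(max(group), 0.0)
    scale = (hi - lo) / qmax or 1.0
    zero = round(-lo / scale)
    codes = [max(0, min(qmax, round(w / scale) + zero)) for w in group]
    return codes, scale, zero

def dequantize_group(codes, scale, zero):
    """Recover approximate weights from 4-bit codes."""
    return [(c - zero) * scale for c in codes]

def quantize_tensor(weights, group_size=128):
    """Split a flat weight list into groups and quantize each independently."""
    return [quantize_group(weights[i:i + group_size])
            for i in range(0, len(weights), group_size)]

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(512)]  # synthetic weights
groups = quantize_tensor(weights, group_size=128)

worst = 0.0  # worst-case error in units of the group's quantization step
for g, (codes, scale, zero) in enumerate(groups):
    restored = dequantize_group(codes, scale, zero)
    original = weights[g * 128:(g + 1) * 128]
    worst = max(worst, max(abs(r - w) for r, w in zip(restored, original)) / scale)
print(f"worst error: {worst:.3f} quantization steps")
```

A smaller group size tightens each group’s range, and so its rounding error, at the cost of storing more scales and zero-points.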

Lane 3 — GPTQ, the workhorse with a longer tail

GPTQ predates AWQ and is, in 2026, the second choice for new deployments. It’s still worth understanding because the existing weight zoo on HuggingFace skews GPTQ — a lot of the community-contributed quantizations are 4-bit GPTQ artefacts that were never re-quantized to AWQ.

The mechanism is different: GPTQ minimises layer-wise reconstruction error using approximate second-order information (Hessian inverse). It’s more computationally expensive than AWQ to quantize, but produces high-quality weights — and at the same bit width on the same hardware, the runtime characteristics are similar. Where GPTQ loses to AWQ is on the hardest evals: code generation, math, multi-hop reasoning. The 2025 IJCAI quantization survey ranks GPTQ at INT4 at 1-3% MMLU loss on Llama-3-70B, versus AWQ’s ~1%, with the gap widening as model size shrinks.
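The error-compensation idea can be shown on a two-weight toy problem. The sketch below uses the OBQ-style update that GPTQ builds on: quantize one weight, then shift the not-yet-quantized weights by an amount read from the inverse Hessian so the layer output moves as little as possible. It is a didactic sketch under simplifying assumptions (one weight row, an integer grid, no damping, no Cholesky or blocking), not the GPTQ implementation.

```python
def invert(matrix):
    """Gauss-Jordan inverse of a small, well-conditioned square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def round_int4(w, scale):
    """Round to the nearest INT4 grid point and return the dequantized value."""
    return max(-8, min(7, round(w / scale))) * scale

def layer_error(x_rows, w, q):
    """Squared output error over calibration inputs: sum_n (x_n . (w - q))**2."""
    return sum(sum(xi * (wi - qi) for xi, wi, qi in zip(x, w, q)) ** 2
               for x in x_rows)

def gptq_row(x_rows, w, scale):
    """Quantize one weight row left to right with Hessian-based compensation."""
    n = len(w)
    hessian = [[sum(x[i] * x[j] for x in x_rows) for j in range(n)]
               for i in range(n)]
    h_inv = invert(hessian)
    w, q = list(w), [0.0] * n
    for i in range(n):
        q[i] = round_int4(w[i], scale)
        err = (w[i] - q[i]) / h_inv[i][i]
        for j in range(i + 1, n):  # push the error onto remaining weights
            w[j] -= err * h_inv[i][j]
        d, row_i = h_inv[i][i], list(h_inv[i])
        # Remove weight i from the inverse Hessian before the next step.
        h_inv = [[h_inv[r][c] - h_inv[r][i] * row_i[c] / d for c in range(n)]
                 for r in range(n)]
    return q

x_rows = [[1.0, 1.0], [1.0, 0.5]]  # toy calibration inputs
w = [0.4, 0.4]                     # toy weight row

rtn_q = [round_int4(v, 1.0) for v in w]
gptq_q = gptq_row(x_rows, w, 1.0)
rtn_err = layer_error(x_rows, w, rtn_q)
gptq_err = layer_error(x_rows, w, gptq_q)
print(rtn_q, rtn_err, gptq_q, gptq_err)
```

Here plain rounding sends both weights to 0, while the compensated pass rounds the second weight up to 1 to offset the first weight’s error. Building and inverting the Hessian per layer is also where the extra quantization time goes.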

Use GPTQ when:

  • The weights you want already exist on HuggingFace as a GPTQ artefact and re-quantizing would mean spending calibration compute and then re-validating the new weights.
  • Your eval set is dominated by easy tasks (classification, intent routing, simple QA) where the AWQ/GPTQ gap is noise.
  • You’re integrating with a stack that has more mature GPTQ kernels than AWQ kernels (some embedded/edge inference runtimes still favour GPTQ).

For everything else in 2026, AWQ is the cleaner default.

Lane 4 — GGUF and the laptop deployment

GGUF lives in a different universe. It’s the file format and quantization scheme for llama.cpp, and through llama.cpp it powers Ollama, LM Studio, LocalAI, and most of the “model on my MacBook” stack. The K-quants family (Q4_K_M, Q5_K_M, Q6_K) plus the newer i-quants (IQ3_XS, IQ4_XS) are aggressive low-bit weight encodings that are typically paired with importance-matrix (imatrix) calibration.

A few things matter when you’re targeting this lane:

  • Q4_K_M is the production sweet spot. It hits roughly 70-75% size reduction versus FP16, depending on how many tensors the mix keeps at higher precision, with a perplexity bump of under 5%. Below Q4, the imatrix calibration becomes load-bearing: without a good importance matrix, IQ3 quantizations hallucinate badly.
  • CPU and Apple Silicon are first-class. GGUF is the only mainstream format where Apple’s Metal backend gets first-tier kernel support. A Q4_K_M Llama-3-8B runs at ~25 tokens/sec on an M3 Max — fast enough for interactive use.
  • The accuracy story is not the same as AWQ INT4. Q4_K_M is a mixed-precision scheme (some layers stay at higher precision based on sensitivity heuristics), not pure INT4. It’s typically closer to AWQ in quality than GPTQ-INT4, particularly on chat workloads.
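The size figure and the mixed-precision point can both be checked with block arithmetic. The sketch below computes bits per weight for the Q4_K and Q6_K block layouts as I understand them from llama.cpp’s source (256-weight super-blocks), then blends the two. The 17% share of weights kept in Q6_K is a hypothetical number chosen for illustration, not a measured property of any model.

```python
QK_K = 256  # weights per K-quant super-block

def q4_k_bits_per_weight() -> float:
    """Q4_K: two fp16 super-block scales, 12 bytes of packed 6-bit
    sub-block scales and mins, and 4 bits per weight."""
    return (2 * 16 + 12 * 8 + QK_K * 4) / QK_K

def q6_k_bits_per_weight() -> float:
    """Q6_K: one fp16 scale, an 8-bit scale per 16 weights, 6 bits per weight."""
    return (16 + (QK_K // 16) * 8 + QK_K * 6) / QK_K

def mixed_bits_per_weight(fraction_q6k: float) -> float:
    """Effective bits per weight when a fraction of weights use Q6_K."""
    return ((1 - fraction_q6k) * q4_k_bits_per_weight()
            + fraction_q6k * q6_k_bits_per_weight())

def reduction_vs_fp16(bits_per_weight: float) -> float:
    """Fractional size reduction relative to 16-bit weights."""
    return 1 - bits_per_weight / 16

HYPOTHETICAL_Q6K_SHARE = 0.17  # assumption for illustration only

for label, bpw in [
    ("pure Q4_K", q4_k_bits_per_weight()),
    ("Q4_K/Q6_K mix", mixed_bits_per_weight(HYPOTHETICAL_Q6K_SHARE)),
]:
    print(f"{label}: {bpw:.2f} bits/weight, "
          f"{reduction_vs_fp16(bpw):.1%} smaller than FP16")
```

The block metadata and the higher-precision tensors are why a “4-bit” GGUF file is noticeably larger than a quarter of the FP16 size.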

The mistake is using GGUF on server hardware. The kernels weren’t designed for batched GPU inference — they’re optimised for the single-stream, memory-bandwidth-bound regime of a laptop or a small server. On an H100, a vLLM AWQ deployment will beat a llama.cpp deployment by 4-6x on throughput. GGUF wins exactly when GPUs aren’t the answer.

Where INT4 quietly falls apart

Both AWQ-INT4 and GPTQ-INT4 ship the “near-lossless on MMLU” headline. The headline lies on three specific axes:

Figure: Where INT4 breaks that MMLU doesn’t show
  • Long context: drops up to 59% at 32k+ context; attention precision errors compound across tokens. Fix: stay at FP8 / W8 for context > 16k.
  • Code generation: syntax errors spike on rarer tokens; structured generation is sensitive to tail probabilities. Fix: AWQ + constrained decoding, not GPTQ.
  • Math & logic: multi-step reasoning degrades non-linearly; small per-token errors become wrong answers. Fix: FP8 minimum, measure GSM8K not MMLU.
Three workloads where INT4’s “near-lossless” claim quietly stops being true. Pick an eval that exercises your shape, not the model card’s.

The 2025 long-context quantization paper documented up to 59% accuracy collapse on needle-in-a-haystack and multi-document QA tasks at INT4, while the corresponding MMLU drop was under 2%. The reason is mechanical: attention is computed in a higher-precision intermediate, but the errors in the quantized weight projections compound as the number of tokens being attended over grows. By context length 32k, you’re no longer measuring the same model.

The defensive default in 2026:

  • Stay at FP8 or W8A8 for production code generation, long-context QA, and math-heavy agents. The throughput delta versus INT4 is smaller than people fear (often 20-30%), and the accuracy retention is meaningfully better.
  • Use AWQ-INT4 for chat, summarisation, classification, routing. The workloads where most tokens are easy.
  • Always measure on your own eval set with your own context length. The “1% MMLU drop” number is from MMLU, which is mostly short-context multiple-choice. Your workload is probably not that.
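One way to act on that last point is to report accuracy per context-length bucket instead of a single average. The sketch below assumes you already have per-example correctness for the baseline and the quantized model; the bucket edges and the sample records are placeholders made up to exercise the code, not measured results.

```python
from collections import defaultdict

def accuracy_by_context(records, bucket_edges=(4_000, 16_000, 32_000)):
    """records: iterable of (context_tokens, baseline_correct, quantized_correct).

    Returns {bucket: {"n", "baseline", "quantized", "drop"}} with accuracies
    as fractions, so a regression confined to long inputs stays visible.
    """
    def bucket(n_tokens):
        for edge in bucket_edges:
            if n_tokens <= edge:
                return f"<={edge}"
        return f">{bucket_edges[-1]}"

    totals = defaultdict(lambda: [0, 0, 0])  # count, baseline hits, quantized hits
    for ctx, base_ok, quant_ok in records:
        t = totals[bucket(ctx)]
        t[0] += 1
        t[1] += bool(base_ok)
        t[2] += bool(quant_ok)

    return {
        name: {"n": n, "baseline": b / n, "quantized": q / n, "drop": (b - q) / n}
        for name, (n, b, q) in totals.items()
    }

# Placeholder records, invented for the example.
records = [
    (2_000, True, True), (3_000, True, True), (3_500, False, False),
    (12_000, True, True), (15_000, True, False),
    (40_000, True, False), (48_000, True, False), (64_000, True, True),
]
report = accuracy_by_context(records)
for name, row in report.items():
    print(name, row)
```

A single blended accuracy number over these records would hide that the entire drop sits in the longer buckets.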

Picking your lane

A working decision rule, derived from watching a lot of teams over- and under-quantize:

  1. H100/H200/B200 hardware and a workload that includes any code, math, or long context → FP8 native, full stop. The hardware is built for it, the accuracy is preserved, and the throughput is free.
  2. A100/L40/older GPUs and chat-shaped workload → AWQ-INT4 with group size 128, calibrated on your domain. The cheapest path to a 2-3x throughput gain.
  3. Existing GPTQ artefact already on HuggingFace, eval set tolerates ~2% accuracy hit → ship the GPTQ weights, skip the re-quantization burn.
  4. Laptop, edge, or cost-constrained CPU deployment → GGUF Q4_K_M with imatrix calibration. Nothing else competes on this hardware.
  5. Anything you’re not sure about → ship FP8 first, measure, then explore whether INT4 buys you enough to justify the eval re-validation cost.
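The five rules above can be written down as a small function, which is handy as a checklist in a deployment review. The hardware and workload labels, the function name, and the choice to evaluate the rules in list order are assumptions of this sketch.

```python
FP8_GPUS = {"H100", "H200", "B200"}
OLDER_GPUS = {"A100", "L40", "3090"}
EDGE_TARGETS = {"laptop", "edge", "cpu"}
HARD_WORKLOADS = {"code", "math", "long_context"}

def pick_lane(hardware, workloads, has_gptq_artifact=False,
              tolerates_2pct_loss=False):
    """Return a quantization lane by checking the rules in order.

    hardware: label such as "H100" or "laptop" (labels are illustrative).
    workloads: iterable of labels such as "chat", "code", "math".
    """
    workloads = set(workloads)
    # Rule 1: FP8-capable datacenter GPU with any hard workload.
    if hardware in FP8_GPUS and workloads & HARD_WORKLOADS:
        return "FP8 native"
    # Rule 2: older GPU serving a purely chat-shaped workload.
    if hardware in OLDER_GPUS and workloads == {"chat"}:
        return "AWQ-INT4, group size 128"
    # Rule 3: reuse an existing GPTQ artefact if the eval set can absorb it.
    if has_gptq_artifact and tolerates_2pct_loss:
        return "existing GPTQ weights"
    # Rule 4: laptop, edge, or CPU-only targets.
    if hardware in EDGE_TARGETS:
        return "GGUF Q4_K_M with imatrix"
    # Rule 5: when unsure, start from the conservative option and measure.
    return "FP8 first, then measure"

print(pick_lane("H100", ["code", "chat"]))
print(pick_lane("A100", ["chat"]))
print(pick_lane("laptop", ["chat"]))
```

Whatever the function returns is a starting point for measurement on your own eval set, not a substitute for it.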

The mistake to avoid is the reverse: starting from the most aggressive quantization that “looks fine on MMLU” and then discovering at month two that your code-generation evals quietly regressed. Quantization is a deployment-shape decision, not a Twitter heuristic.

What’s changing in the next 12 months

Two trends to watch through the rest of 2026:

  • FP4 native on Blackwell. B100/B200 introduce FP4 tensor cores, and early NVIDIA NeMo benchmarks are showing 2x throughput over FP8 with minimal accuracy loss on chat-shaped workloads. The serving stacks (vLLM and SGLang both) are racing to add first-class FP4 support through the second half of the year.
  • Quantization-aware fine-tuning is going mainstream. QLoRA established that fine-tuning quantized models is viable; the newer EfficientQAT, GPTQ-LoRA, and BitsAndBytes 4-bit Adam variants are making it production-quality. By Q4 2026, expect “fine-tune your INT4 model directly” to be a normal workflow rather than a research artefact.

The high-order trend is unchanged: precision is no longer the free variable in serving design. The bet you’re making when you pick a quantization lane is a bet on your hardware fleet, your accuracy budget, and your willingness to re-validate when the model updates. Treat it that way and the choice gets easier; treat it as a “just pick the smallest number that works on MMLU” exercise and you’ll regret it in a quarter.


Further reading: the AWQ paper, the GPTQ paper, Give Me BF16 or Give Me Death? on FP8/INT4/INT8 accuracy trade-offs, vLLM’s FP8 W8A8 docs, and the llama.cpp quantization README for the GGUF zoo.
