datarekha
MLOps Medium

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

The short answer

Work top-down: start at the model layer with quantization, distillation, or routing cheaper models for easy requests, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

How to think about it

The short answer

Optimize top-down, because earlier layers dominate cost: model → runtime → infrastructure, with FinOps measurement running continuously. The order matters — a model-layer change (smaller/quantized model, routing) affects every dollar spent downstream.

The layers

1. Model layer (biggest lever).

  • Quantization: FP8/INT8/INT4 shrinks memory and boosts throughput. FP8 on H100 gives ~1.3–2x throughput at <2% quality loss on instruction-tuned models.
  • Distillation to a smaller student model.
  • Model routing: send easy requests to a cheap model and only escalate hard ones to the premium model — commonly cuts inference cost 40–70%.

2. Runtime layer.

  • Batching to raise GPU utilization.
  • Prompt caching for LLMs with long shared system prompts/context — can save 50–90% on the cached portion.
  • KV-cache reuse, speculative decoding.

3. Infrastructure layer.

  • Autoscale on queue depth or batch size, not raw GPU utilization, so capacity tracks real load and you don’t overprovision.
  • Use spot/batch capacity for non-latency-critical workloads.

Measure so you don’t degrade quality

Track cost per million tokens (or per prediction) as the headline efficiency metric, alongside latency p90/p99, GPU utilization, and accuracy. The last one is the guardrail: every optimization must confirm quality didn’t silently drop.

Concrete example

An LLM app pays for premium-model calls on every request. You add prompt caching for the shared system prompt (−60% on input tokens), route trivial queries to a small model (−50% of calls), and autoscale on queue depth. Cost per million tokens drops sharply while p99 latency and eval scores stay flat.

Common follow-up / trap

The trap is jumping straight to “buy cheaper GPUs” (infra layer) before fixing the model and runtime, where the real savings are. Another trap: optimizing cost without a quality guardrail — quantize too aggressively and you ship a worse product. Lead with the model layer and always pair cost metrics with an accuracy check.

Learn it properly Cost & FinOps for ML/GPUs

Keep practising

All MLOps questions

Explore further

Skip to content