datarekha

Cost & FinOps for ML/GPUs

GPU spend is now a board-level number, and MLOps owns it. This lesson covers tracking cost-per-run and cost-per-inference, raising utilization, and the spot, reserved, and right-sizing levers that cut the bill.

7 min read Intermediate MLOps Lesson 26 of 28

What you'll learn

  • Why GPU utilization, not price, is where most ML cost leaks
  • The cost levers — spot/reserved pricing, right-sizing, autoscaling
  • Tracking cost-per-training-run and cost-per-1k-inferences

Before you start

GPU spend went from a footnote to one of the fastest-growing lines on enterprise cloud bills, and in 2026 it’s an MLOps responsibility. The FinOps Foundation now runs a dedicated “FinOps for AI” track for exactly this reason. The core skill isn’t haggling on price — it’s understanding where the money actually goes.

The meter runs whether you use it or not

A GPU bills you for every hour it’s allocated, not every hour it’s used. A fleet sitting at 40% utilization burns 60% of its cost on idle silicon. This is the single biggest leak in ML infrastructure.
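The arithmetic behind that claim can be sketched in a few lines. This is a minimal illustration, not a billing tool: the fleet size, the $4.00 hourly rate, and the 730-hour month are assumptions chosen for the example, and `idle_cost` is a hypothetical helper name.

```python
def idle_cost(gpus: int, hourly_rate: float, utilization: float,
              hours: float = 730.0) -> dict:
    """Split a fleet's bill into busy and idle dollars.

    hourly_rate is an assumed $/GPU-hour; hours defaults to roughly one month.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    total = gpus * hourly_rate * hours   # billed for every allocated hour
    busy = total * utilization           # the share that did useful work
    return {"total": total, "busy": busy, "idle": total - busy}


# Hypothetical fleet: 8 GPUs at $4.00/hour, 40% utilized for a month.
bill = idle_cost(gpus=8, hourly_rate=4.0, utilization=0.40)
print(f"total ${bill['total']:,.0f}, idle ${bill['idle']:,.0f}")
```

With these assumed inputs, about $14,000 of a roughly $23,000 monthly bill pays for GPUs that were allocated but not working.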

That idle slice is where FinOps starts. Before negotiating cheaper GPUs, the question is always: why aren’t the ones we have busy?

The cost levers

  • Raise utilization — the highest-leverage lever. Batch inference requests, share GPUs across jobs (time-slicing / MIG partitioning), use a scheduler (Run:ai, Kueue) to pack work, and autoscale to zero so idle endpoints cost nothing.
  • Spot / preemptible instances — 60–90% cheaper for interruption-tolerant work (most training with checkpointing, batch jobs). Not for latency-critical serving.
  • Reserved / committed-use — commit to a 1–3 year baseline for ~40% off on-demand, for capacity you know you’ll always need.
  • Right-size the accelerator — don’t run a small model on an H100. Match the GPU (and its memory) to the job; a fractional or smaller GPU is often plenty.
  • Compress the model: quantization and distillation cut both the GPU needed and the per-inference cost.
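To compare the levers on one scale, the sketch below computes a blended hourly rate for a mixed fleet and the cost of one useful GPU-hour at different utilization levels. The $4.00 on-demand rate is a made-up placeholder, the 70% spot discount is an assumed point inside the 60–90% range above, and the function names are hypothetical.

```python
ON_DEMAND_RATE = 4.00     # hypothetical $/GPU-hour, not a real price
SPOT_DISCOUNT = 0.70      # assumed; the lesson cites 60-90% cheaper
RESERVED_DISCOUNT = 0.40  # the lesson's ~40% off on-demand


def blended_rate(reserved_share: float, spot_share: float) -> float:
    """Average $/GPU-hour for a fleet mixing reserved, spot, and on-demand."""
    on_demand_share = 1.0 - reserved_share - spot_share
    if min(reserved_share, spot_share) < 0 or on_demand_share < -1e-9:
        raise ValueError("shares must be non-negative and sum to at most 1")
    on_demand_share = max(on_demand_share, 0.0)  # absorb float rounding
    return ON_DEMAND_RATE * (
        reserved_share * (1 - RESERVED_DISCOUNT)
        + spot_share * (1 - SPOT_DISCOUNT)
        + on_demand_share
    )


def rate_per_useful_hour(hourly_rate: float, utilization: float) -> float:
    """Cost of one hour of actual GPU work once idle time is paid for."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


mixed = blended_rate(reserved_share=0.5, spot_share=0.3)
low_util = rate_per_useful_hour(ON_DEMAND_RATE, 0.40)
high_util = rate_per_useful_hour(ON_DEMAND_RATE, 0.80)
print(f"blended ${mixed:.2f}/h, useful hour ${low_util:.2f} vs ${high_util:.2f}")
```

Under these assumptions, doubling utilization from 40% to 80% halves the cost of a useful GPU-hour, which is in the same range as the pricing discounts and stacks with them.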

Track two numbers

You can’t optimize what you don’t measure. The two metrics that turn cost into a decision:

  • Cost per training run — so “should we retrain nightly?” has a dollar answer (ties directly to retraining cadence).
  • Cost per 1,000 inferences — the unit economics of serving. If it exceeds the revenue per 1,000 predictions, the model loses money at scale, no matter how accurate.
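Both metrics reduce to simple arithmetic once the inputs are measured. The sketch below is one way to compute them; the rates, durations, throughput, and revenue figure are illustrative assumptions, and the function names are hypothetical.

```python
def cost_per_training_run(gpus: int, hours: float, hourly_rate: float) -> float:
    """Dollars for one training run: GPUs x wall-clock hours x $/GPU-hour."""
    return gpus * hours * hourly_rate


def cost_per_1k_inferences(hourly_rate: float, peak_rps: float,
                           utilization: float) -> float:
    """Serving cost per 1,000 requests for one GPU endpoint.

    peak_rps is the throughput when busy; utilization is the fraction
    of the billed hour the endpoint actually spends serving.
    """
    served_per_hour = peak_rps * 3600 * utilization
    if served_per_hour <= 0:
        raise ValueError("endpoint must serve at least some traffic")
    return hourly_rate / served_per_hour * 1000


# Hypothetical inputs: 8 GPUs for 6 hours at $4.00/GPU-hour.
run_cost = cost_per_training_run(gpus=8, hours=6, hourly_rate=4.0)
nightly_per_month = run_cost * 30   # the dollar answer to "retrain nightly?"

# Hypothetical endpoint: $4.00/hour, 50 requests/s when busy, 40% utilized.
serve_cost = cost_per_1k_inferences(hourly_rate=4.0, peak_rps=50, utilization=0.40)
revenue_per_1k = 0.25               # assumed revenue per 1,000 predictions
print(f"run ${run_cost:.0f}, nightly ${nightly_per_month:.0f}/month, "
      f"serving ${serve_cost:.4f} per 1k, profitable: {serve_cost < revenue_per_1k}")
```

Note that utilization appears in the serving formula: the same endpoint at lower utilization has a higher cost per 1,000 inferences, because the idle time is still billed.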

Quick check

Q1. Where does most ML GPU cost typically leak?
Q2. Which workload is the best fit for cheap spot/preemptible GPUs?
Q3. Why track cost-per-1,000-inferences?

Next

Cost sits alongside the other platform concerns: responsible-AI ops and ML security.


Practice this in an interview

How do you attribute and control ML spend across teams and models (FinOps for ML)?

Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.
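As a rough illustration of tag-based attribution, the sketch below rolls up cost rows by any tag and surfaces untagged spend. The row shape, team and model names, and dollar amounts are invented for the example; real billing exports differ by cloud provider.

```python
from collections import defaultdict

# Hypothetical billing rows; real exports have provider-specific schemas.
usage = [
    {"cost": 1200.0, "tags": {"team": "search", "model": "ranker-v3", "env": "prod"}},
    {"cost": 300.0, "tags": {"team": "search", "model": "ranker-v3", "env": "dev"}},
    {"cost": 800.0, "tags": {"team": "fraud", "model": "scorer", "env": "prod"}},
    {"cost": 150.0, "tags": {}},  # untagged spend nobody owns yet
]


def cost_by_tag(rows: list, tag: str) -> dict:
    """Sum cost per value of one tag; rows missing the tag go to 'untagged'."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["tags"].get(tag, "untagged")] += row["cost"]
    return dict(totals)


by_team = cost_by_tag(usage, "team")
by_env = cost_by_tag(usage, "env")
print(by_team)
print(by_env)
```

Keeping an explicit "untagged" bucket makes the attribution gap visible, so it can be driven toward zero rather than silently spread across teams.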

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing cheaper models for easy requests, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

How do you optimise GPU utilization for model serving, and what role does dynamic batching play?

GPUs execute tensor operations efficiently only when the batch dimension is large enough to saturate all CUDA cores. Dynamic batching collects individual requests arriving within a short window and fuses them into a single GPU call, dramatically improving throughput and cost efficiency without sacrificing per-request latency beyond the configured wait threshold.
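The grouping logic of dynamic batching can be sketched offline as a small simulation. This is a simplified model, not how any particular serving framework implements it: a batch opens when its first request arrives and closes when it is full or when the wait window since that first request has passed. The function name, parameters, and arrival times are illustrative assumptions.

```python
def form_batches(arrivals_ms: list, max_batch_size: int, max_wait_ms: float) -> list:
    """Group request arrival times into batches.

    A batch is flushed when it reaches max_batch_size, or when a new
    request arrives more than max_wait_ms after the batch's first request.
    """
    batches = []
    current = []
    for t in sorted(arrivals_ms):
        full = len(current) >= max_batch_size
        expired = bool(current) and t - current[0] > max_wait_ms
        if full or expired:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)  # flush whatever is left at the end
    return batches


# Hypothetical arrival times in milliseconds.
arrivals = [0, 1, 2, 3, 12, 13, 30]
batches = form_batches(arrivals, max_batch_size=4, max_wait_ms=5)
print(batches)
```

For this trace the seven requests are served in three GPU calls instead of seven, and no request waits longer than the 5 ms window for its batch to close.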

How does LLMOps differ from classical MLOps, and what new operational challenges do LLMs introduce?

LLMOps extends classical MLOps to handle foundation model scale, prompt-based configuration, non-deterministic outputs, and evaluation without a scalar ground truth. Key new concerns include prompt versioning, output quality evaluation via LLM judges or human review, hallucination monitoring, cost management, and RAG pipeline observability.
