Cost & FinOps for ML/GPUs
GPU spend is now a board-level number, and MLOps owns it. This lesson covers tracking cost per run and cost per inference, raising utilization, and the spot, reserved, and right-sizing levers that cut the bill.
What you'll learn
- Why GPU utilization, not price, is where most ML cost leaks
- The cost levers — spot/reserved pricing, right-sizing, autoscaling
- Tracking cost-per-training-run and cost-per-1k-inferences
Before you start
GPU spend went from a footnote to one of the fastest-growing lines on enterprise cloud bills, and in 2026 it’s an MLOps responsibility. The FinOps Foundation now runs a dedicated “FinOps for AI” track for exactly this reason. The core skill isn’t haggling on price — it’s understanding where the money actually goes.
The meter runs whether you use it or not
A GPU bills you for every hour it’s allocated, not every hour it’s used. A fleet sitting at 40% utilization is burning 60% of its cost on idle silicon. This is the single biggest leak in ML infrastructure.
That idle slice is where FinOps starts. Before negotiating cheaper GPUs, the question is always: why aren’t the ones we have busy?
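The arithmetic behind that idle slice is simple enough to sketch. The fleet size, hourly rate, and 730-hour month below are made-up inputs for illustration, not real prices, and `idle_cost` is a hypothetical helper name.

```python
def idle_cost(gpu_count: int, hourly_rate: float, utilization: float,
              hours: float = 730.0) -> dict:
    """Split a fleet's bill into useful and idle spend.

    utilization is the fraction of allocated GPU time doing real work (0 to 1).
    hours defaults to roughly one month of wall-clock time.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    total = gpu_count * hourly_rate * hours  # billed for every allocated hour
    useful = total * utilization             # the part that did real work
    return {"total": total, "useful": useful, "idle": total - useful}


# Hypothetical fleet: 8 GPUs at $4.00/hour, 40% utilized, for one month.
bill = idle_cost(gpu_count=8, hourly_rate=4.00, utilization=0.40)
idle_share = bill["idle"] / bill["total"]
print(f"total ${bill['total']:,.0f}, idle ${bill['idle']:,.0f} ({idle_share:.0%})")
```

With these assumed inputs the fleet bills $23,360 for the month, and $14,016 of that pays for GPUs doing nothing.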
The cost levers
- Raise utilization — the highest-leverage lever. Batch inference requests, share GPUs across jobs (time-slicing / MIG partitioning), use a scheduler (Run:ai, Kueue) to pack work, and autoscale to zero so idle endpoints cost nothing.
- Spot / preemptible instances — 60–90% cheaper for interruption-tolerant work (most training with checkpointing, batch jobs). Not for latency-critical serving.
- Reserved / committed-use: commit to a 1- or 3-year baseline for roughly 40% off on-demand (the exact discount varies by provider and term), for capacity you know you’ll always need.
- Right-size the accelerator — don’t run a small model on an H100. Match the GPU (and its memory) to the job; a fractional or smaller GPU is often plenty.
- Compress the model — quantization and distillation cut both the GPU needed and the per-inference cost.
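To see how the levers above interact, one rough way to compare them is cost per hour of useful GPU work. This is a simplified model under stated assumptions: the $4.00 on-demand rate and the discount figures are placeholders, and `rework_fraction` is a hypothetical parameter standing in for work repeated after a spot interruption (progress lost since the last checkpoint).

```python
def cost_per_useful_hour(on_demand_rate: float, discount: float = 0.0,
                         utilization: float = 1.0,
                         rework_fraction: float = 0.0) -> float:
    """Dollars paid per hour of useful GPU work.

    discount: fraction off the on-demand rate (spot or committed-use).
    utilization: fraction of billed hours spent on real work.
    rework_fraction: extra work redone after interruptions, as a fraction
        of useful hours (an assumption of this sketch, not a provider metric).
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    if rework_fraction < 0.0:
        raise ValueError("rework_fraction must be non-negative")
    billed_rate = on_demand_rate * (1.0 - discount)
    return billed_rate * (1.0 + rework_fraction) / utilization


ON_DEMAND = 4.00  # hypothetical $/GPU-hour

scenarios = {
    "on-demand, 40% utilized": cost_per_useful_hour(ON_DEMAND, utilization=0.40),
    "on-demand, 80% utilized": cost_per_useful_hour(ON_DEMAND, utilization=0.80),
    "reserved (40% off), 80% utilized": cost_per_useful_hour(
        ON_DEMAND, discount=0.40, utilization=0.80),
    "spot (70% off), 80% utilized, 10% rework": cost_per_useful_hour(
        ON_DEMAND, discount=0.70, utilization=0.80, rework_fraction=0.10),
}
for name, cost in scenarios.items():
    print(f"{name:<42} ${cost:.2f} per useful GPU-hour")
```

Under these made-up numbers, doubling utilization halves the cost per useful hour before any discount is applied, which is why utilization comes first in the list.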
Track two numbers
You can’t optimize what you don’t measure. The two metrics that turn cost into a decision:
- Cost per training run — so “should we retrain nightly?” has a dollar answer (ties directly to retraining cadence).
- Cost per 1,000 inferences — the unit economics of serving. If it exceeds the revenue per 1,000 predictions, the model loses money at scale, no matter how accurate.
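Both metrics reduce to a few multiplications. The sketch below assumes GPU cost dominates and ignores storage, networking, and engineering time. The function names and every input figure are hypothetical.

```python
def cost_per_training_run(gpu_count: int, hours: float, hourly_rate: float) -> float:
    """GPU cost of one training run."""
    return gpu_count * hours * hourly_rate


def cost_per_1k_inferences(gpu_count: int, hourly_rate: float,
                           requests_per_hour: float) -> float:
    """GPU cost of serving 1,000 requests at a steady request rate."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return gpu_count * hourly_rate / requests_per_hour * 1000


# Hypothetical training job: 4 GPUs for 6 hours at $4.00/GPU-hour.
run_cost = cost_per_training_run(gpu_count=4, hours=6, hourly_rate=4.00)
nightly_monthly = run_cost * 30  # retraining every night for 30 days

# Hypothetical endpoint: 2 GPUs at $4.00/GPU-hour serving 50,000 requests/hour.
serve_cost = cost_per_1k_inferences(gpu_count=2, hourly_rate=4.00,
                                    requests_per_hour=50_000)
REVENUE_PER_1K = 0.50  # assumed revenue per 1,000 predictions

print(f"per run ${run_cost:.2f}, nightly for a month ${nightly_monthly:,.2f}")
print(f"per 1k inferences ${serve_cost:.2f}, margin ${REVENUE_PER_1K - serve_cost:.2f}")
```

With these assumed inputs a run costs $96, nightly retraining costs about $2,880 a month, and serving costs $0.16 per 1,000 inferences. Note that the serving figure moves with traffic: the same two GPUs at half the request rate cost twice as much per 1,000 inferences.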
Next
Cost sits alongside the other platform concerns: responsible-AI ops and ML security.
Practice this in an interview
Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.
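As a minimal sketch of what tag-based attribution looks like, the snippet below rolls up billing line items by a chosen tag. The record shape (`cost` plus a `tags` dictionary) and the sample values are assumptions for illustration; real billing exports differ by provider.

```python
from collections import defaultdict


def cost_by_tag(records: list, tag: str) -> dict:
    """Sum cost per value of one tag; untagged spend gets its own bucket."""
    totals = defaultdict(float)
    for record in records:
        key = record.get("tags", {}).get(tag, "untagged")
        totals[key] += record["cost"]
    return dict(totals)


# Hypothetical billing line items.
line_items = [
    {"cost": 120.0, "tags": {"team": "search", "model": "ranker", "env": "prod"}},
    {"cost": 80.0, "tags": {"team": "search", "model": "ranker", "env": "dev"}},
    {"cost": 200.0, "tags": {"team": "ads", "model": "ctr", "env": "prod"}},
    {"cost": 50.0, "tags": {}},  # untagged spend nobody can be asked about
]

print(cost_by_tag(line_items, "team"))
print(cost_by_tag(line_items, "env"))
```

Keeping an explicit "untagged" bucket is a deliberate choice here: the size of that bucket is a quick measure of how much spend is still unattributable.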
Work top-down: start at the model layer with quantization, distillation, or routing cheaper models for easy requests, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.
GPUs run tensor operations efficiently only when each call carries enough work to keep their parallel compute units busy, and a single request rarely does. Dynamic batching collects individual requests arriving within a short window and fuses them into a single GPU call, which raises throughput and lowers cost per request while capping the added latency at the configured wait threshold.
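The grouping rule behind dynamic batching can be sketched as a small offline simulation. This is not a serving implementation: `form_batches`, `max_batch_size`, and `max_wait` are hypothetical names, and a real server would apply the same rule to a live queue with a timer.

```python
def form_batches(arrival_times: list, max_batch_size: int, max_wait: float) -> list:
    """Group request arrival times (seconds) into batches.

    A batch is closed when it reaches max_batch_size, or when a new request
    arrives more than max_wait after the batch's first request.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batches = []
    current = []
    for t in sorted(arrival_times):
        # The window opened by the first request in the batch has expired.
        if current and t - current[0] > max_wait:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full, so dispatch without waiting for the window.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # flush whatever is left at the end
    return batches


# Hypothetical traffic: a trickle of three requests, then a burst of five.
arrivals = [0.000, 0.002, 0.004, 0.030, 0.031, 0.032, 0.033, 0.034]
batches = form_batches(arrivals, max_batch_size=4, max_wait=0.010)
print([len(b) for b in batches])
```

Eight requests become three GPU calls instead of eight. The two parameters are the trade-off knobs: a larger `max_wait` fills batches more often but adds latency to the first request in each batch.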
LLMOps extends classical MLOps to handle foundation model scale, prompt-based configuration, non-deterministic outputs, and evaluation without a scalar ground truth. Key new concerns include prompt versioning, output quality evaluation via LLM judges or human review, hallucination monitoring, cost management, and RAG pipeline observability.