How do you attribute and control ML spend across teams and models (FinOps for ML)?

Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.
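A minimal sketch of the attribution step, assuming billing rows have already been exported with their tags attached. The row layout, tag keys, and dollar figures here are hypothetical and not taken from any specific cloud provider's billing export.

```python
from collections import defaultdict

# Hypothetical billing rows; a real pipeline would read a billing export.
BILLING_ROWS = [
    {"cost_usd": 1200.0, "tags": {"team": "search", "model": "ranker", "env": "prod"}},
    {"cost_usd": 300.0, "tags": {"team": "search", "model": "ranker", "env": "dev"}},
    {"cost_usd": 800.0, "tags": {"team": "ads", "model": "ctr", "env": "prod"}},
    {"cost_usd": 150.0, "tags": {}},  # untagged spend cannot be attributed
]


def cost_by_tag(rows, key):
    """Sum cost per value of one tag; rows missing the tag land in 'untagged'."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["tags"].get(key, "untagged")] += row["cost_usd"]
    return dict(totals)


def cost_per_prediction(total_cost_usd, predictions):
    """Unit economics: dollars spent per prediction served."""
    if predictions <= 0:
        raise ValueError("predictions must be positive")
    return total_cost_usd / predictions


by_team = cost_by_tag(BILLING_ROWS, "team")
by_env = cost_by_tag(BILLING_ROWS, "env")
print(by_team)
print(by_env)
```

The size of the "untagged" bucket is itself a useful signal: the larger it is, the less of the bill anyone can be held accountable for.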

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing easy requests to cheaper models, since model choices drive every downstream cost. Then optimize the runtime with batching and caching (including prompt caching for LLMs), and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so that an optimization never silently degrades quality.
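One way to picture the routing idea is a two-tier cascade: try the cheap model first and escalate only when it is not confident. The sketch below uses stand-in functions rather than real models; `cheap_model`, `expensive_model`, the confidence rule, and the per-call costs are all hypothetical.

```python
# Assumed per-call costs in dollars, for illustration only.
COST_PER_CALL = {"cheap": 0.0002, "large": 0.0040}


def cheap_model(text):
    """Stand-in for a small model: returns (answer, confidence)."""
    is_easy = len(text.split()) <= 5  # toy rule: short requests count as easy
    return f"cheap:{text}", (0.95 if is_easy else 0.40)


def expensive_model(text):
    """Stand-in for a large model that is assumed to handle hard requests."""
    return f"large:{text}"


def route(text, threshold=0.8):
    """Answer with the cheap model when it is confident, else escalate."""
    answer, confidence = cheap_model(text)
    cost = COST_PER_CALL["cheap"]
    if confidence >= threshold:
        return answer, cost
    # Escalation pays for both calls, so the threshold needs tuning.
    return expensive_model(text), cost + COST_PER_CALL["large"]


easy_answer, easy_cost = route("reset my password")
hard_answer, hard_cost = route("compare these three contracts and list every conflicting clause")
print(easy_answer, easy_cost)
print(hard_answer, hard_cost)
```

Because an escalated request costs more than calling the large model directly, a cascade like this only pays off when most traffic stays on the cheap tier, which is why quality metrics have to be tracked next to cost.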

How do you optimize GPU utilization for model serving, and what role does dynamic batching play?

GPUs execute tensor operations efficiently only when the batch dimension is large enough to keep their compute units saturated. Dynamic batching collects individual requests that arrive within a short window and fuses them into a single GPU call. This improves throughput and cost efficiency, and the added queueing delay per request is bounded by the configured wait threshold.
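The collection window can be sketched with the standard library alone. This is a simplified illustration, not how any particular serving framework implements it; `collect_batch`, `run_model`, and the default sizes are hypothetical.

```python
import queue
import time


def collect_batch(requests, max_batch_size=8, max_wait_s=0.05):
    """Block for the first request, then gather more until the batch is
    full or the wait window closes, whichever comes first."""
    batch = [requests.get()]  # wait for at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # window closed with a partial batch
    return batch


def run_model(batch):
    """Stand-in for a single GPU call that processes the whole batch."""
    return [x * 2 for x in batch]


pending = queue.Queue()
for x in range(11):
    pending.put(x)

first = collect_batch(pending)   # fills up to max_batch_size
second = collect_batch(pending)  # partial batch, released when the window closes
print(run_model(first))
print(run_model(second))
```

The two parameters express the trade-off directly: a larger `max_batch_size` raises throughput per call, while `max_wait_s` caps how long an early request waits for company.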

How does LLMOps differ from classical MLOps, and what new operational challenges do LLMs introduce?

LLMOps extends classical MLOps to handle foundation model scale, prompt-based configuration, non-deterministic outputs, and evaluation without a scalar ground truth. Key new concerns include prompt versioning, output quality evaluation via LLM judges or human review, hallucination monitoring, cost management, and RAG pipeline observability.

Cost & FinOps for ML/GPUs — MLOps

We just finished assembling a serious platform — cluster, pipelines, a feature store with its online Redis and materialize job — and ended by naming the thing we’d treated as free all along: every piece of it runs on rented hardware billed by the second, much of it on the most expensive silicon you can rent. We asked what it actually costs and where the money leaks. This lesson answers, and the short version is that the leak is almost never where you’d guess.

GPU spend went from a footnote to one of the fastest-growing lines on enterprise cloud bills, and in 2026 it’s an MLOps responsibility. The FinOps Foundation now runs a dedicated “FinOps for AI” track for exactly this reason. The core skill isn’t haggling on price — it’s understanding where the money actually goes.

The meter runs whether you use it or not

A GPU bills you for every hour it’s allocated, not every hour it’s used. A fleet sitting at 40% utilization is burning more than half its cost on idle silicon. This is the single biggest leak in ML infrastructure. Size a fleet and watch the waste:

Worked example: a fleet of 8 GPUs at 45% utilization

Monthly burn: $21,024
Per year: $252,288
Useful: $9,461/mo
Idle waste: $11,563/mo

Switch to spot and you'd save ~$13,035/mo (with interruption risk).

At 45% utilization you could right-size to ~6 GPUs and save another $5,256/mo.

The lesson FinOps drives home: a GPU bills you for every hour it's allocated, not used. At 45% utilization, $11,563 a month is pure waste. The big levers are raising utilization (batching, sharing, autoscaling to zero), spot/reserved pricing, and right-sizing the accelerator to the job. Track cost-per-training-run and cost-per-thousand-inferences, and these become decisions, not surprises.
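The figures in the 8-GPU example can be reproduced with a few lines of arithmetic. The lesson does not state the GPU type or price, so the hourly rate, the hours per month, and the spot discount below are assumptions back-solved from the quoted totals.

```python
HOURS_PER_MONTH = 730   # assumption: average hours in a month
HOURLY_RATE_USD = 3.60  # assumption: back-solved from the $21,024 monthly figure
SPOT_DISCOUNT = 0.62    # assumption: back-solved from the ~$13,035 spot saving


def monthly_burn(gpu_count, hourly_rate=HOURLY_RATE_USD):
    """Allocated cost per month: billed for every hour, busy or idle."""
    return gpu_count * hourly_rate * HOURS_PER_MONTH


def split_useful_idle(monthly_cost, utilization):
    """Split a monthly bill into the busy share and the idle share."""
    useful = monthly_cost * utilization
    return useful, monthly_cost - useful


burn = monthly_burn(8)
useful, idle = split_useful_idle(burn, 0.45)
spot_saving = burn * SPOT_DISCOUNT
right_size_saving = burn - monthly_burn(6)  # 8 GPUs down to 6

print(f"monthly burn       ${burn:,.0f}")
print(f"per year           ${burn * 12:,.0f}")
print(f"useful / idle      ${useful:,.0f} / ${idle:,.0f}")
print(f"spot saving        ${spot_saving:,.0f}")
print(f"right-size saving  ${right_size_saving:,.0f}")
```

Note that `monthly_burn` never looks at utilization: the bill depends only on what is allocated, which is the whole point of the example.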

That idle slice is where FinOps starts. Before negotiating cheaper GPUs, the question is always: why aren’t the ones we have busy?

The cost levers

Raise utilization — the highest-leverage lever. Batch inference requests, share GPUs across jobs (time-slicing / MIG partitioning), use a scheduler (Run:ai, Kueue) to pack work, and autoscale to zero so idle endpoints cost nothing.
Spot / preemptible instances — 60–90% cheaper for interruption-tolerant work (most training with checkpointing, batch jobs). Not for latency-critical serving.
Reserved / committed-use — commit to a 1–3 year baseline for ~40% off on-demand, for capacity you know you’ll always need.
Right-size the accelerator — don’t run a small model on an H100. Match the GPU (and its memory) to the job; a fractional or smaller GPU is often plenty.
Compress the model — quantization and distillation cut both the GPU needed and the per-inference cost.
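To compare levers on one scale, it helps to look at cost per useful GPU-hour, which is the hourly rate divided by utilization. The sketch below assumes an illustrative on-demand rate of $3.60 per GPU-hour (not a figure from any price list) and uses the ~40% reserved discount quoted above.

```python
def cost_per_useful_hour(hourly_rate, utilization):
    """Dollars paid for each hour of actual work done by the GPU."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


RATE = 3.60  # assumption: illustrative on-demand $/GPU-hour

baseline = cost_per_useful_hour(RATE, 0.40)
reserved_only = cost_per_useful_hour(RATE * (1 - 0.40), 0.40)  # ~40% off, same utilization
busier_only = cost_per_useful_hour(RATE, 0.80)                 # same price, utilization doubled
both = cost_per_useful_hour(RATE * (1 - 0.40), 0.80)           # the levers multiply

for label, value in [
    ("baseline, 40% utilized", baseline),
    ("reserved price, 40% utilized", reserved_only),
    ("on-demand price, 80% utilized", busier_only),
    ("reserved price, 80% utilized", both),
]:
    print(f"{label:32s} ${value:.2f} per useful GPU-hour")
```

Under these assumptions, doubling utilization halves the cost of useful work, which is more than the 40% discount achieves on its own, and the two stack when applied together. Deeper spot discounts can shift that ranking for workloads that tolerate interruption.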

Track two numbers

You can’t optimize what you don’t measure. The two metrics that turn cost into a decision:

Cost per training run — so “should we retrain nightly?” has a dollar answer (ties directly to retraining cadence).
Cost per 1,000 inferences — the unit economics of serving. If it exceeds the revenue per 1,000 predictions, the model loses money at scale, no matter how accurate.
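Both numbers reduce to short formulas. The sketch below is one possible way to compute them; the GPU counts, durations, throughput, hourly rate, and revenue figure are hypothetical inputs chosen for illustration.

```python
def cost_per_training_run(gpu_count, hours, hourly_rate):
    """Allocated GPU-hours for one run times the hourly price."""
    return gpu_count * hours * hourly_rate


def cost_per_1k_inferences(hourly_rate, inferences_per_hour):
    """Serving cost of an endpoint spread over the requests it handles."""
    if inferences_per_hour <= 0:
        raise ValueError("inferences_per_hour must be positive")
    return hourly_rate / inferences_per_hour * 1000


def is_profitable(cost_per_1k, revenue_per_1k):
    """Unit-economics gate: serving must cost less than it earns."""
    return cost_per_1k < revenue_per_1k


RATE = 3.60  # assumption: illustrative $/GPU-hour

run_cost = cost_per_training_run(gpu_count=4, hours=3, hourly_rate=RATE)
nightly_per_month = run_cost * 30  # "should we retrain nightly?" in dollars

slow = cost_per_1k_inferences(RATE, inferences_per_hour=300_000)
fast = cost_per_1k_inferences(RATE, inferences_per_hour=450_000)
REVENUE_PER_1K = 0.009  # assumption: revenue earned per 1,000 predictions

print(f"one run ${run_cost:.2f}, nightly for a month ${nightly_per_month:,.2f}")
print(f"300k/h: ${slow:.4f} per 1k, profitable: {is_profitable(slow, REVENUE_PER_1K)}")
print(f"450k/h: ${fast:.4f} per 1k, profitable: {is_profitable(fast, REVENUE_PER_1K)}")
```

The same endpoint flips from losing money to making it purely by raising throughput, which is why this metric ties back to batching and utilization rather than to model accuracy.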

In one breath

GPU cost is now MLOps’s job, and the skill isn’t haggling on price — it’s knowing the meter bills for every allocated hour, not every used one, so the biggest leak is low utilization (a fleet at 40% wastes over half its spend on idle silicon); the levers, in order of leverage, are raise utilization (batch, GPU-share, autoscale to zero), then spot/preemptible instances for interruption-tolerant training (60–90% off), reserved capacity for known baselines (~40% off), right-sizing the accelerator, and compressing the model — all of it made visible by tracking two numbers, cost per training run and cost per 1,000 inferences, the latter being the serving unit economic that can sink a model more accurate than it is profitable.

Practice

Before the quiz, reason about the counterintuitive core. A teammate proposes cutting the bill by negotiating a cheaper per-hour GPU rate. Using the allocated-not-used idea, explain why raising utilization on the GPUs you already have almost always beats a price discount — and which lever takes idle endpoint cost all the way to zero. Then the unit-economics gate: a model is 4% more accurate but costs $0.012 per 1,000 inferences against $0.009 of revenue — should it ship, and why is “but it’s more accurate” not the deciding factor?

Quick check

Q1. Where does most ML GPU cost typically leak?

Q2. Which workload is the best fit for cheap spot/preemptible GPUs?

Q3. Why track cost-per-1,000-inferences?

A question to carry forward

Notice what kind of accountability this lesson was about: a dollar one. We learned to make the platform cheap — to tag spend, watch utilization, gate a model on cost-per-inference. And cost is real accountability, the kind a finance team enforces with a budget alert.

But it is not the only bill a production model runs up, and the others don’t show up on a cloud invoice. The same model that’s cheap, fast, and accurate can quietly deny loans to one group at twice the rate of another, can’t explain a single decision when a regulator asks, and may already be out of compliance with a law that took effect while you were optimizing GPU hours. Those costs are paid by people, not by your AWS account — and increasingly, by your legal team. So the question to carry forward is the accountability that doesn’t fit on a dashboard of dollars: once a model is performant and affordable, how do you make it fair, explainable, and compliant — and operate those properties the way we operate everything else? That is responsible-AI ops, and it is the next lesson.
