datarekha
MLOps Medium

How do you attribute and control ML spend across teams and models (FinOps for ML)?

The short answer

Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.

How to think about it

Why attribution first

You can’t control what you can’t see. A single untagged GPU cluster shared by five teams produces a bill nobody owns, so nobody optimizes it. Tagging turns the bill into per-team, per-model line items, which creates accountability and surfaces the biggest offenders. The FinOps Foundation’s FinOps for AI guidance frames this as a continuous, cross-functional cost discipline rather than a one-off cleanup.
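A minimal sketch of what tag-based attribution can look like, assuming billing line items have already been exported as dictionaries. The field names (`cost_usd`, `tags`) and the tag keys (`team`, `model`, `env`) are hypothetical and will differ per cloud provider. Anything missing a required tag is rolled into an explicit "untagged" bucket so the unowned spend stays visible.

```python
from collections import defaultdict

# Hypothetical tag keys; use whatever your tagging policy mandates.
REQUIRED_TAGS = ("team", "model", "env")

def attribute_costs(line_items):
    """Sum cost by (team, model, env); items missing a tag go to 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        tags = item.get("tags", {})
        if all(tags.get(k) for k in REQUIRED_TAGS):
            key = tuple(tags[k] for k in REQUIRED_TAGS)
        else:
            key = ("untagged",) * len(REQUIRED_TAGS)
        totals[key] += item["cost_usd"]
    return dict(totals)

# Illustrative data only, not real billing figures.
line_items = [
    {"resource": "train-job-1", "cost_usd": 120.0,
     "tags": {"team": "search", "model": "ranker", "env": "prod"}},
    {"resource": "endpoint-7", "cost_usd": 80.0,
     "tags": {"team": "search", "model": "ranker", "env": "prod"}},
    {"resource": "gpu-pool-a", "cost_usd": 300.0,
     "tags": {"team": "ads"}},  # incomplete tags, so nobody owns this spend
]

report = attribute_costs(line_items)
# Print the largest buckets first so the biggest offenders surface at the top.
for key, cost in sorted(report.items(), key=lambda kv: -kv[1]):
    print(key, round(cost, 2))
```

In this made-up data the untagged GPU pool is the largest line, which is exactly the situation the paragraph above describes.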

What to track and control

  • Unit metrics, not just totals: cost per prediction/token/training run normalizes across scale and makes regressions visible.
  • Idle and overprovisioned resources: idle GPUs and always-on endpoints sized for peak are the classic waste; right-size and autoscale on queue/batch depth.
  • Budgets + alerts per team, with anomaly detection on sudden spikes (e.g., a runaway training sweep).
  • Policy guardrails: instance-type allowlists, spot for non-critical jobs, auto-shutdown of idle dev endpoints, quotas on GPU hours.
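The budget and anomaly bullet can be sketched as a small check over one team's daily spend. This is an illustration under stated assumptions, not a production detector: the 80% warning level, the 7-day median baseline, and the 3x spike factor are arbitrary choices, and `check_spend` is a hypothetical helper.

```python
from statistics import median

def check_spend(daily_costs, monthly_budget, spike_factor=3.0, window=7):
    """Return alert labels for one team's daily spend (USD, oldest first)."""
    alerts = []
    month_to_date = sum(daily_costs)
    if month_to_date > monthly_budget:
        alerts.append("over_budget")
    elif month_to_date > 0.8 * monthly_budget:
        alerts.append("budget_80_percent")
    # Compare the latest day against the median of the preceding `window` days.
    if len(daily_costs) > window:
        baseline = median(daily_costs[-window - 1:-1])
        if baseline > 0 and daily_costs[-1] > spike_factor * baseline:
            alerts.append("spike")
    return alerts

# Nine ordinary days, then a runaway training sweep on day ten.
sweep_month = [100.0] * 9 + [900.0]
print(check_spend(sweep_month, monthly_budget=3000.0))

# Steady spend that is approaching its budget.
steady_month = [100.0] * 10
print(check_spend(steady_month, monthly_budget=1100.0))
```

A median baseline is used here so that a single earlier spike does not inflate the threshold. Real deployments usually rely on the cloud provider's budget and anomaly tooling instead of hand-rolled checks.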

Concrete example

Tagging reveals that one team’s dev endpoints run 24/7 at 4% utilization. You add auto-shutdown after an idle timeout and move them to spot instances, so costs drop without touching prod. A dashboard of cost per 1k predictions per model then flags when a new model version doubles serving cost, prompting a quantization pass before it ships.
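The two checks in this example could be sketched as follows. The record fields (`env`, `avg_gpu_util`, `cost_usd`, `predictions`), the 5% utilization cutoff, and the 1.5x regression threshold are all assumptions made for illustration.

```python
def idle_endpoints(endpoints, max_util=0.05, envs=("dev",)):
    """Names of non-prod endpoints whose average GPU utilization is below max_util."""
    return [e["name"] for e in endpoints
            if e["env"] in envs and e["avg_gpu_util"] < max_util]

def cost_per_1k(cost_usd, predictions):
    """Serving cost in USD per 1,000 predictions."""
    return 1000.0 * cost_usd / predictions

def regressed(old, new, threshold=1.5):
    """True if the new version's unit cost exceeds threshold times the old one's."""
    return cost_per_1k(**new) > threshold * cost_per_1k(**old)

# Illustrative data only.
endpoints = [
    {"name": "ranker-dev", "env": "dev", "avg_gpu_util": 0.04},
    {"name": "ranker-prod", "env": "prod", "avg_gpu_util": 0.04},  # prod is left alone
    {"name": "embed-dev", "env": "dev", "avg_gpu_util": 0.60},
]
print(idle_endpoints(endpoints))

v1 = {"cost_usd": 50.0, "predictions": 1_000_000}
v2 = {"cost_usd": 100.0, "predictions": 1_000_000}  # same traffic, double the cost
print(regressed(v1, v2))
```

Restricting the idle check to dev environments mirrors the example: auto-shutdown is applied where an interruption is cheap, and prod is handled separately through right-sizing and autoscaling.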

Common follow-up / trap

A frequent probe: “How do you cut cost without slowing the data scientists down?” The answer is guardrails over gates — defaults like auto-shutdown, spot, and budgets give freedom within limits. The trap is optimizing only total spend; without unit economics you can’t tell a model that’s expensive because it’s heavily used (fine) from one that’s inefficient per call (fix it).
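A small worked example of that trap, using made-up numbers: two models with identical monthly spend look the same on a total-spend dashboard, but differ by 30x per call. The model names, the figures, and the unit-cost target are hypothetical.

```python
def triage(models, unit_cost_target):
    """Label each model by unit cost (USD per 1k predictions), ignoring total spend."""
    result = {}
    for name, m in models.items():
        unit = 1000.0 * m["cost_usd"] / m["predictions"]
        if unit <= unit_cost_target:
            label = "ok: cost driven by volume"
        else:
            label = "investigate: expensive per call"
        result[name] = (round(unit, 4), label)
    return result

# Same total spend, very different traffic.
models = {
    "ranker": {"cost_usd": 9000.0, "predictions": 90_000_000},
    "summarizer": {"cost_usd": 9000.0, "predictions": 3_000_000},
}
verdicts = triage(models, unit_cost_target=0.5)
for name, (unit, label) in verdicts.items():
    print(f"{name}: ${unit} per 1k predictions, {label}")
```

Here the ranker costs $0.10 per 1k predictions and the summarizer $3.00, so only the second is a candidate for optimization even though both bills read $9,000.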
