How do you attribute and control ML spend across teams and models (FinOps for ML)?

Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.
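As a rough sketch of the attribution step, assume billing line items already carry `team`, `model`, and `env` tags. The records, tag names, and prediction counts below are invented for illustration and do not follow any provider's billing export format. Rolling spend up to cost per 1,000 predictions might look like this:

```python
from collections import defaultdict

# Hypothetical tagged billing line items; real exports differ by provider.
line_items = [
    {"team": "search", "model": "ranker-v2", "env": "prod", "cost_usd": 1200.0},
    {"team": "search", "model": "ranker-v2", "env": "dev", "cost_usd": 300.0},
    {"team": "ads", "model": "ctr-v5", "env": "prod", "cost_usd": 4500.0},
    {"team": None, "model": None, "env": "prod", "cost_usd": 800.0},  # untagged
]
# Hypothetical monthly prediction counts per model, e.g. from serving logs.
predictions = {"ranker-v2": 3_000_000, "ctr-v5": 50_000_000}


def cost_by_model(items):
    """Sum cost per model; untagged spend stays visible instead of vanishing."""
    totals = defaultdict(float)
    for item in items:
        totals[item["model"] or "UNTAGGED"] += item["cost_usd"]
    return dict(totals)


def cost_per_1k_predictions(totals, counts):
    """Unit economics: dollars per 1,000 predictions for each served model."""
    return {m: totals[m] / counts[m] * 1000 for m in counts if m in totals}


totals = cost_by_model(line_items)
unit_costs = cost_per_1k_predictions(totals, predictions)
for model, usd in sorted(unit_costs.items()):
    print(f"{model}: ${usd:.3f} per 1k predictions")
print(f"untagged spend: ${totals.get('UNTAGGED', 0.0):,.2f}")
```

Keeping an explicit UNTAGGED bucket is a deliberate choice here: spend nobody owns is usually the first thing a tagging policy needs to surface.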

When would you choose AWS Lambda instead of ECS, and when would you choose ECS?

Choose Lambda for short, event-driven, bursty work that fits its runtime and packaging constraints and keeps durable state external. Choose ECS for long-running APIs or workers, custom containers, stable processes, and workloads needing explicit CPU, memory, networking, or runtime control. ECS can host stateless or stateful software; critical state should still be externally durable.
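One way to make that rule of thumb concrete is a small chooser function. This is a simplified sketch, not AWS guidance: the `Workload` fields and `suggest_runtime` are hypothetical names, and the only hard limit encoded is Lambda's 15-minute maximum function timeout.

```python
from dataclasses import dataclass

LAMBDA_MAX_SECONDS = 900  # Lambda's maximum function timeout (15 minutes)


@dataclass
class Workload:
    max_run_seconds: int        # longest single invocation or task
    event_driven: bool          # triggered by events rather than always listening
    needs_custom_runtime: bool  # needs explicit CPU, networking, or process control


def suggest_runtime(w: Workload) -> str:
    """Rule-of-thumb chooser; a simplification of the trade-off described above."""
    if w.max_run_seconds > LAMBDA_MAX_SECONDS:
        return "ECS"  # cannot fit inside a single Lambda invocation
    if w.needs_custom_runtime or not w.event_driven:
        return "ECS"  # long-lived or tightly controlled processes
    return "Lambda"


print(suggest_runtime(Workload(30, True, False)))     # short, event-triggered task
print(suggest_runtime(Workload(3600, False, False)))  # hour-long worker
```

Real decisions also weigh packaging size, cold starts, and cost at steady load, none of which this sketch models.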

What metrics should you monitor for a production ML model, and at what layer?

Production ML monitoring spans four layers: data quality (schema, distributions, null rates), model behaviour (prediction drift, confidence calibration), operational health (latency, error rate, throughput), and business KPIs (conversion, revenue impact). Each layer has different owners and different alert thresholds.
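To give two of those layers something tangible, here is a minimal sketch of one data-quality check (null rate) and one model-behaviour check (population stability index over a binned prediction histogram). The sample values are made up, and the point at which either number should trigger an alert is left to the owning team.

```python
import math


def null_rate(values):
    """Fraction of missing values in a column (data-quality layer)."""
    return sum(v is None for v in values) / len(values)


def psi(expected, actual, eps=1e-6):
    """Population stability index between two binned distributions.

    Both inputs are lists of bin proportions that each sum to 1;
    eps guards against log(0) when a bin is empty.
    """
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )


# Hypothetical numbers for illustration only.
training_bins = [0.25, 0.25, 0.25, 0.25]  # prediction-score histogram at training time
live_bins = [0.10, 0.20, 0.30, 0.40]      # same bins, measured in production
feature_column = [3.1, None, 2.7, 4.0, None, 3.3, 2.9, 3.8, 3.0, 3.5]

print(f"null rate: {null_rate(feature_column):.0%}")
print(f"prediction PSI: {psi(training_bins, live_bins):.3f}")
```

A PSI of zero means the two histograms match; larger values mean the live predictions have moved further from what the model produced at training time.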

Walk me through the full ML lifecycle from problem definition to model retirement.

The ML lifecycle spans eight phases: problem framing, data collection and validation, feature engineering, training and experimentation, offline evaluation, deployment, production monitoring, and retirement or retraining. Each phase has distinct owners, artefacts, and failure modes that an MLOps practice must systematise.

The cloud — AWS, Azure & GCP for ML — MLOps

We just closed two chapters on getting models to behave in production, and ended by noticing what every one of those lessons quietly assumed: the gateways, vector stores, eval runners, and GPUs all run somewhere, on computers we never once asked about. Almost certainly not in your office — in the cloud, on rented machines billed by the second. This chapter is that ground, and it opens with the bill that teaches everyone the same lesson once.

A junior engineer spun up a GPU instance to fine-tune a model on Friday, got it working, closed the laptop, and went home. The instance kept running all weekend — nobody told it to stop. Monday’s surprise was a $900 line item for two days of doing nothing. Nobody was malicious or careless; they just hadn’t internalised the one fact that governs the cloud: you are renting, and the meter runs whether you’re using it or not.

The cloud feels impossibly large — AWS alone lists hundreds of services — but the part a data or ML person actually touches is small, and it’s the same small part on all three big providers. This lesson is the map.

What “the cloud” actually is

Strip away the branding and the cloud is one idea: instead of buying computers, you rent someone else’s by the second. That swap changes everything downstream.

Capex → opex. No up-front purchase of servers (capital expenditure); you pay as you go (operating expenditure). A startup can rent a $30,000 GPU box for an afternoon for a few dollars.
Elastic. Need 100 machines for an hour, then zero? You can have exactly that. Some services even scale to zero — you pay nothing when no request is in flight.
Managed. The provider runs the hard parts — replacing dead disks, patching the OS, replicating your data across buildings — so a tiny team can run infrastructure that used to need a department.
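The capex-to-opex swap is easiest to see as a break-even calculation. The sketch below assumes the box rents at this lesson's representative rate of $3.06 per hour, which is an assumption rather than a quoted price for a $30,000 machine, and it ignores power, hosting, and depreciation on the purchase side.

```python
# Assumed figures for illustration; neither is a quoted price.
PURCHASE_PRICE = 30_000.0  # buying the GPU box outright, $
RENT_PER_HOUR = 3.06       # assumed on-demand rate for a comparable box, $/hour


def rental_cost(hours: float) -> float:
    """Cost of renting for a given number of hours at the assumed rate."""
    return hours * RENT_PER_HOUR


break_even_hours = PURCHASE_PRICE / RENT_PER_HOUR
print(f"one 4-hour afternoon : ${rental_cost(4):,.2f}")
print(f"break-even at        : {break_even_hours:,.0f} hours "
      f"(~{break_even_hours / 24:,.0f} days of nonstop use)")
```

Under these assumptions renting wins until usage approaches round-the-clock for more than a year, which is why occasional training runs rarely justify buying hardware.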

The flip side is the rental trap: the meter never sleeps. A forgotten GPU, or a job that quietly copies a terabyte across regions, bills you all the same. Cost awareness is a cloud skill, not an afterthought.

Three providers, one mental model

There are three providers you’ll meet: AWS (Amazon — the biggest, the most services, the default for startups), Azure (Microsoft — the enterprise default, and the home of the hosted OpenAI models), and GCP (Google — strongest in data and ML, home of BigQuery and Vertex AI). They compete hard, which means they mostly mirror each other.

The trap beginners fall into is trying to memorise the catalog. Don’t. Learn the categories; the names are just translations. Here’s the Rosetta stone for the services you’ll actually use:

What it is	AWS	Azure	GCP
Rent a server (a VM)	EC2	Virtual Machines	Compute Engine
Run code, no server (serverless)	Lambda	Azure Functions	Cloud Functions / Cloud Run
Object storage (the “bucket”)	S3	Blob Storage	Cloud Storage (GCS)
Managed Kubernetes	EKS	AKS	GKE
Managed ML platform	SageMaker	Azure ML	Vertex AI
Data warehouse	Redshift	Synapse	BigQuery
Hosted LLM API	Bedrock	Azure OpenAI	Vertex AI (Gemini)

Read a row, not a column. “Where do I put my files?” is object storage — S3, Blob, or GCS depending on which house you’re in, but the same idea: a near-infinite, cheap, durable key→blob store you reach over HTTP.
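Reading a row can be expressed as a lookup. The sketch below encodes the table as a dictionary keyed by category; `ROSETTA` and `translate` are illustrative names, and the service names follow the table, lightly spelled out.

```python
# Category -> provider -> service name, following the table above.
ROSETTA = {
    "virtual machine": {"aws": "EC2", "azure": "Virtual Machines", "gcp": "Compute Engine"},
    "serverless": {"aws": "Lambda", "azure": "Azure Functions", "gcp": "Cloud Functions / Cloud Run"},
    "object storage": {"aws": "S3", "azure": "Blob Storage", "gcp": "Cloud Storage"},
    "managed Kubernetes": {"aws": "EKS", "azure": "AKS", "gcp": "GKE"},
    "managed ML platform": {"aws": "SageMaker", "azure": "Azure ML", "gcp": "Vertex AI"},
    "data warehouse": {"aws": "Redshift", "azure": "Synapse", "gcp": "BigQuery"},
    "hosted LLM API": {"aws": "Bedrock", "azure": "Azure OpenAI", "gcp": "Vertex AI (Gemini)"},
}


def translate(service: str, to_provider: str) -> tuple[str, str]:
    """Find the category a service belongs to and its name on another provider."""
    wanted = service.strip().lower()
    for category, names in ROSETTA.items():
        if wanted in (name.lower() for name in names.values()):
            return category, names[to_provider]
    raise KeyError(f"unknown service: {service!r}")


print(translate("S3", "gcp"))   # category first, then the GCP name
print(translate("AKS", "aws"))
```

The function returns the category first on purpose: the category is the thing worth remembering, and the service name is just its local spelling.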

The four things you’ll actually touch

Out of the hundreds of services, four cover the vast majority of data/ML work:

Compute — a machine to run code on. Rent a whole VM (EC2 / Azure VM / Compute Engine) when you want control, or go serverless (Lambda / Functions / Cloud Run) to just hand over a function and let the platform run and scale it.
Object storage — the bucket (S3 / Blob / GCS). Your datasets, model artifacts, and logs live here. Cheap, durable, accessed by key. This is the backbone; almost everything reads from and writes to it.
Managed ML — SageMaker / Azure ML / Vertex AI. Training jobs, notebooks, a model registry, and one-click endpoints, without you running Kubernetes by hand (it’s there, just hidden — see Just enough Kubernetes for an ML engineer).
Data warehouse — Redshift / Synapse / BigQuery. Where analytics SQL runs over billions of rows. (Databricks runs across all three as a cloud-neutral option.)

A spectrum, not a switch

The deeper intuition: cloud services sit on a spectrum of how much the provider runs for you. More managed means less operations work and less control — and usually a higher price per unit of compute, traded for not needing an ops team.

The same job can run anywhere on a line that goes from raw VMs on the left to fully managed and serverless services on the right. Move right to delete operations work; move left when you need control or cheaper bulk compute.

There’s no single right answer. A research team training big models lives on the left (raw VMs with GPUs, maximum control). A two-person startup serving a model lives on the right (a managed endpoint or serverless, zero ops). Most teams sit in the middle on the managed ML platforms.

It bills by the second — and for leaving

Two cost facts cause most surprise bills. First, compute is billed by the second it’s running, so an idle-but-on machine is pure waste. Second, storing data is cheap, but moving data out of the cloud (egress) is not — providers charge per gigabyte you download, which is the fee people forget until the invoice. Run the numbers:

# Representative 2026 on-demand list prices (yours will vary by region/provider).
GPU_PER_HOUR = 3.06            # one mid-range training GPU instance, $/hour
STORAGE_PER_GB_MONTH = 0.023   # object storage (S3 / Blob / GCS), $/GB/month
EGRESS_PER_GB = 0.09           # downloading data OUT of the cloud, $/GB

# 1) The Friday mistake: a GPU box left on 24/7 vs only when you use it.
always_on  = GPU_PER_HOUR * 24 * 30      # every hour, every day
when_used  = GPU_PER_HOUR * 6 * 22       # 6 h/day, 22 working days
print(f"GPU left on 24/7     : ${always_on:,.0f} / month")
print(f"GPU on 6h x 22 days  : ${when_used:,.0f} / month")
print(f"auto-stop saves      : ${always_on - when_used:,.0f} / month")
print()

# 2) Storage is cheap; egress (moving data out) is the line item people miss.
dataset_gb = 500
store  = dataset_gb * STORAGE_PER_GB_MONTH
egress = dataset_gb * EGRESS_PER_GB
print(f"Store 500 GB / month : ${store:,.2f}")
print(f"Download 500 GB once : ${egress:,.2f}   (egress)")
print(f"=> moving it out costs {egress / store:.0f}x the monthly storage.")

GPU left on 24/7     : $2,203 / month
GPU on 6h x 22 days  : $404 / month
auto-stop saves      : $1,799 / month

Store 500 GB / month : $11.50
Download 500 GB once : $45.00   (egress)
=> moving it out costs 4x the monthly storage.

Two numbers indict two everyday habits. The GPU left running bills $2,203 a month; the same box stopped when idle bills $404. An auto-stop policy alone pockets roughly $1,800, for changing nothing but a setting. And the storage-vs-egress line is the one that ambushes people: parking half a terabyte costs a trivial $11.50 a month, yet downloading it once costs $45, about 4× a full month of storage. “The cloud charges you to leave” is not a quip; it’s a design constraint. Keep compute in the same region as the data it reads, and turn things off.
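What an auto-stop policy actually decides can be sketched in a few lines. The threshold, the grace window, and the `should_stop` helper below are all assumptions for illustration; a real setup would read utilisation from the provider's monitoring service and call its stop API, neither of which is shown here.

```python
from datetime import datetime, timedelta

IDLE_THRESHOLD_PCT = 5.0            # assumed: GPU utilisation below this counts as idle
IDLE_GRACE = timedelta(minutes=30)  # assumed: how long idleness is tolerated


def should_stop(samples, now):
    """Return True if every utilisation sample in the grace window is idle.

    samples: list of (timestamp, gpu_utilisation_percent) tuples, in any order.
    """
    window_start = now - IDLE_GRACE
    recent = [util for ts, util in samples if ts >= window_start]
    if not recent:
        return False  # no data in the window: do not stop blindly
    return all(util < IDLE_THRESHOLD_PCT for util in recent)


now = datetime(2026, 1, 9, 19, 0)  # a hypothetical evening
busy = [(now - timedelta(minutes=m), 80.0) for m in (25, 15, 5)]
idle = [(now - timedelta(minutes=m), 1.0) for m in (25, 15, 5)]

print(should_stop(busy, now))  # training still running
print(should_stop(idle, now))  # nobody home
```

Refusing to stop when there is no data is a conservative choice: a monitoring gap should raise a question, not kill a job that may still be running.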

In one breath

The cloud is one idea, renting computers by the second instead of buying them, which flips capex to opex, makes infrastructure elastic and managed, and bills you while machines sit idle; out of hundreds of services a data/ML person touches just four (compute, object storage, managed ML, a warehouse), and the three providers mostly mirror each other, so you learn categories not catalogs (S3 = Blob = GCS), place each workload on the self-managed-to-serverless spectrum, and respect the cost model that surprises everyone: per-second compute plus egress charges for moving data out.

Practice

Before the quiz, translate one workload across the Rosetta stone. You’re an AWS shop and a teammate references “Vertex AI” and “BigQuery” from a GCP tutorial — name the AWS equivalent of each and the category it belongs to. Then the cost instinct: the demo showed a forgotten GPU costs ~$1,800/month more than an auto-stopped one, and egress costs 4× storage. Given those two facts, what are the first two things you’d configure before launching anything, and which one converts a $900 weekend into a Saturday-morning email?

Quick check

Q1. You need to store 2 TB of training images that your jobs read from repeatedly. Which primitive is that?

Q2. A teammate says 'we should use Vertex AI but our company is an AWS shop.' What's the equivalent service on AWS?

Q3. Your monthly cloud bill jumped, but your compute usage is flat. Your app serves model files to users worldwide from a bucket in one region. What's the most likely culprit?

A question to carry forward

You now have the four primitives and the spectrum they live on — and notice where the easy answer sat. For most teams it was “use the managed ML platform” (SageMaker, Vertex, Azure ML): one-click endpoints, the provider runs the infrastructure, you never see a server. That works beautifully right up until it doesn’t.

Because the managed platform is a walled garden. The moment you need a multi-step training pipeline with your own containers, a custom scheduler, components shared across teams, or simply to not be locked to one provider’s opinions, you stop consuming a managed endpoint and start building on the layer underneath it, the one the managed platform was hiding all along. As we mentioned in passing, every one of those services has Kubernetes quietly running inside it. So the question to carry forward is: when one managed endpoint stops being enough and you need to orchestrate real ML pipelines on your own cluster, what is that underlying layer, and how do you run ML on it without hand-writing a thousand lines of YAML? That is Kubeflow (ML-native pipelines on Kubernetes), and it is the next lesson.
