ML platform build vs buy: a decision framework for 2026

The most expensive technical decision a growing ML organisation makes is not which model to train. It’s the platform layer underneath it: do you build your own MLOps stack on Kubernetes, buy Databricks, Snowflake, or SageMaker, or piece together open source (MLflow, Kubeflow, Ray, KServe) into something that’s “yours” even though you didn’t really build it? The decision compounds for years. Get it wrong at headcount fifteen and you’ll be re-platforming at headcount fifty, by which point the cost of switching is the cost of a medium-sized acquisition.

This post is a working decision framework. It’s built around three factors (team size, workload variety, governance pressure) and case studies of companies that picked each lane and either thrived or re-platformed.

The three lanes, honestly drawn

Figure: three lanes with three economic shapes. The right answer depends mostly on where you sit on the team-size and workload-variance curves, not on which lane has the best Twitter thread this week.

The three lanes aren’t a feature comparison — they’re three different cost curves. Managed platforms have low fixed cost and a hard ceiling on optimisation. Open-source assembly has medium fixed cost and lower marginal cost. Custom build has the highest fixed cost and the lowest floor on unit economics. The “right” answer is whichever curve crosses your trajectory at the lowest total cost over a three-year horizon.
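To make the cost-curve framing concrete, here is a minimal sketch that models each lane as a fixed yearly cost plus a multiplier on raw compute, then picks the cheapest lane over a multi-year trajectory. Every number in it is a hypothetical placeholder, not a quote from any vendor, and the `Lane` and `cheapest_lane` names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Lane:
    name: str
    fixed_cost_per_year: float      # platform engineers, licences (hypothetical USD)
    cost_per_compute_dollar: float  # what you pay per $1 of raw compute (hypothetical)

    def total_cost(self, raw_compute_by_year: Sequence[float]) -> float:
        """Fixed cost plus marked-up compute, summed over the horizon."""
        return sum(
            self.fixed_cost_per_year + self.cost_per_compute_dollar * compute
            for compute in raw_compute_by_year
        )


# Illustrative parameters only; substitute your own quotes and salaries.
LANES = (
    Lane("managed", fixed_cost_per_year=100_000, cost_per_compute_dollar=2.0),
    Lane("assembly", fixed_cost_per_year=600_000, cost_per_compute_dollar=1.3),
    Lane("build", fixed_cost_per_year=1_500_000, cost_per_compute_dollar=1.05),
)


def cheapest_lane(raw_compute_by_year: Sequence[float]) -> str:
    """Name of the lane with the lowest total cost over the given years."""
    return min(LANES, key=lambda lane: lane.total_cost(raw_compute_by_year)).name
```

With these made-up parameters, a three-year raw-compute trajectory of $200k, $400k, $800k lands on managed; $1M, $2M, $4M lands on assembly; and $5M, $10M, $20M lands on build. The crossover points move a lot with the inputs, which is the point: the answer comes from your trajectory, not from the lane.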

The teams that get this wrong make one of two mistakes. The first is optimising for the wrong axis — picking SageMaker because “AWS owns everything else” when their real bottleneck is custom GPU scheduling across multi-region inference. The second is the heroic mistake: a small team with no platform engineering background deciding to “just build it ourselves on Kubernetes” because they read the Uber Michelangelo blog post.

Why most teams should default to managed

For teams under fifteen ML engineers, the math almost always favours managed. Even when the listed pricing of SageMaker or Databricks looks absurd compared to “just rent some EC2 nodes,” the all-in cost — including the platform engineer you’d need to hire — favours the managed option.

The reason is that platform engineering is a real specialty. Building a reliable training pipeline, a robust model registry, an inference layer with autoscaling and circuit breakers, and the observability glue to make it all debuggable is two to four FTE-years of senior platform engineering work. At $300k loaded cost per senior platform engineer, that’s $600k to $1.2M before you ship a single model. Databricks’s per-DBU cost looks a lot more reasonable next to that.
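The arithmetic behind that range, as a small sketch you can re-run with your own numbers. The FTE-year estimate and the $300k loaded cost come from the paragraph above; the function name is illustrative.

```python
LOADED_COST_PER_ENGINEER_YEAR = 300_000  # USD, loaded cost per senior platform engineer


def build_cost_range(
    min_fte_years: float = 2,
    max_fte_years: float = 4,
    loaded_cost: float = LOADED_COST_PER_ENGINEER_YEAR,
) -> tuple:
    """Low and high estimates of up-front platform engineering cost in USD."""
    return (min_fte_years * loaded_cost, max_fte_years * loaded_cost)


low, high = build_cost_range()
print(f"${low:,.0f} to ${high:,.0f}")  # $600,000 to $1,200,000
```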

The pattern shows up consistently in the data:

The MLOps Community survey has tracked the same finding for four years now: median ML team size is 5-10, and the platforms they use most are managed (SageMaker, Vertex, Databricks) followed by partially-assembled open source (MLflow plus cloud notebooks).
A reasonable working heuristic that’s held up: if your team is under fifteen people, you should not be building your own MLOps platform. You should be on a managed offering, possibly extended with light open-source glue (MLflow for tracking, Feast for features) where the managed option is genuinely deficient.

Where managed loses is on three specific failure modes:

Workload diversity beyond what the platform was designed for. SageMaker is excellent for tabular training and pre-canned deep learning; it’s awkward for graph neural networks, reinforcement learning loops, custom CUDA kernels, or anything that doesn’t fit the “fit/predict” shape.
Latency budgets below what the managed inference layer can deliver. Between Vertex AI’s autoscaling cold starts, SageMaker’s endpoint provisioning latency, and even Databricks Model Serving, the practical latency floor sits in the 200-500ms range. If your product needs sub-50ms inference at p99, you’ll end up rolling your own serving layer.
Unit economics at scale. Past a certain spend (roughly $5M/year on compute, in my experience), the markup the managed platform charges over raw cloud compute starts to dominate. That’s when teams start the painful conversation about whether the platform is paying for itself.
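For the unit-economics failure mode, a rough way to see when the markup starts to dominate is to convert it into dollars and then into platform engineers. This is a sketch under stated assumptions: the markup multiplier is something you would have to estimate for your own contract (the 2.0 below is a placeholder), and the $300k loaded cost is the figure used earlier in this post.

```python
def managed_premium(managed_spend: float, markup: float) -> float:
    """Annual amount paid above raw cloud compute for the same workload."""
    if markup < 1:
        raise ValueError("markup must be >= 1 (1.0 means no premium)")
    raw_compute = managed_spend / markup
    return managed_spend - raw_compute


def engineers_funded(premium: float, loaded_cost: float = 300_000) -> float:
    """How many loaded platform engineers the premium would pay for."""
    return premium / loaded_cost


# Hypothetical: $5M/year on a managed platform at an assumed 2x markup.
premium = managed_premium(5_000_000, markup=2.0)
print(f"premium: ${premium:,.0f}/year, about {engineers_funded(premium):.1f} engineers")
```

When the premium would fund a full platform team with room to spare, the conversation about whether the platform is paying for itself becomes hard to avoid.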

Why some teams should genuinely build

The teams that should build their own MLOps platform have a recognisable signature. They have at least thirty ML engineers, they have workloads that don’t fit the managed mould (real-time ranking, low-latency inference, huge-scale recommendation systems, RL with custom simulators), and they have a multi-year horizon. Uber, LinkedIn, Spotify, Pinterest, Netflix, Airbnb — all of these built their own.

Uber’s Michelangelo, introduced in 2017, is the canonical case. Uber’s ML workloads are an absurd shape — ride-pricing models that need to refresh every five minutes, ETA predictions with sub-100ms inference, fraud detection running over streaming Kafka topics, all on a fleet of hundreds of models that must share feature stores and training pipelines. There was no SageMaker equivalent that could have served that. Michelangelo today is the de facto serving system at Uber, running 100% of their business-critical ML use cases including GenAI, and the team that built it spans dozens of platform engineers across multiple sub-teams (training infra, feature store, online serving, model registry, observability).

Spotify’s evolution is instructive in a different way — they didn’t build a “platform” from nothing. They composed Kubeflow, Ray, and a custom feature store on top of GKE, and built an opinionated developer experience (ML Home) on top of those primitives. The result was a 7x increase in experiments per unit time and a faster path to production. The lesson: “build” doesn’t mean “from scratch.” It means “owning the abstraction your team interacts with, even if you didn’t write every component underneath.”

What separates the build-it-and-thrive teams from the build-it-and-regret teams is almost never technical skill. It’s three preconditions:

Sustained executive sponsorship. Platform work is a multi-year investment with no quarterly demo. Teams that build successfully have a VP-level champion who absorbs the “but we shipped no features this quarter” pressure.
A dedicated platform team, not a 20%-time rotation. Platforms built by “everyone helps out” tend to collapse into half-finished abstractions that no one owns. Successful build orgs treat platform as a permanent product team.
Real internal customers from day one. The platform exists to serve the model-builders. Teams that build a platform without an internal user keeping them honest produce gorgeous abstractions no one wants to adopt.

The middle path — open source as glue

The interesting middle ground that’s matured in the last three years is the opinionated open-source assembly — a small team picks a stack of best-of-breed open-source components, ties them together with light custom code, and runs it themselves on Kubernetes or directly on cloud primitives. The canonical example is the “MLflow + Ray + KServe + Argo + Feast” stack, with a thin internal CLI on top.

Figure: a working decision tree. Start with team size and workload shape, then check the disqualifiers. If any disqualifier hits, escalate one lane to the right.

The assembly path is the right answer for teams in the 15-30 engineer range, where the managed cost has started to bite but the in-house platform team isn’t yet large enough to support a full custom build. The trick is to be conservative about which abstractions you write yourself. A team that adopts MLflow instead of re-implementing experiment tracking will save itself years of toil. A team that adopts KServe instead of writing its own model server will avoid an entire category of production incidents.

The failure mode of the middle path is abstraction sprawl — each component has its own concept of “experiment” or “model” or “deployment,” and gluing them together produces an internal mental model that no new engineer can hold. Teams that succeed in the assembly path invest heavily in the opinions layer on top — a single internal CLI or web UI that hides the underlying tools and presents a unified model lifecycle.
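One way to picture the opinions layer is a thin facade that owns the single concept of a model lifecycle and delegates to whatever tools sit underneath. The sketch below is purely illustrative: `Tracker`, `Server`, and `ModelLifecycle` are hypothetical interfaces, the in-memory classes stand in for real adapters, and none of this reflects the actual APIs of MLflow, KServe, or any other tool.

```python
from typing import Protocol


class Tracker(Protocol):
    def register(self, name: str, artifact_uri: str) -> str: ...


class Server(Protocol):
    def deploy(self, name: str, version: str) -> str: ...


class ModelLifecycle:
    """The one abstraction model-builders see; the tools stay hidden behind it."""

    def __init__(self, tracker: Tracker, server: Server) -> None:
        self._tracker = tracker
        self._server = server

    def ship(self, name: str, artifact_uri: str) -> str:
        """Register a new version, deploy it, and return the endpoint URL."""
        version = self._tracker.register(name, artifact_uri)
        return self._server.deploy(name, version)


class InMemoryTracker:
    """Stand-in adapter; a real one would wrap your tracking tool."""

    def __init__(self) -> None:
        self._versions = {}

    def register(self, name: str, artifact_uri: str) -> str:
        self._versions[name] = self._versions.get(name, 0) + 1
        return f"v{self._versions[name]}"


class InMemoryServer:
    """Stand-in adapter; a real one would wrap your serving tool."""

    def deploy(self, name: str, version: str) -> str:
        return f"https://models.internal.example/{name}/{version}"


lifecycle = ModelLifecycle(InMemoryTracker(), InMemoryServer())
print(lifecycle.ship("ranker", "s3://example-bucket/ranker"))
```

The design choice is that engineers only ever learn `ship`, so swapping a component later means rewriting one adapter rather than retraining the whole team.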

The build-then-regret pattern

The most common painful trajectory I’ve watched: a 10-engineer team in 2024 decides Databricks is too expensive, spends 18 months building their own platform on Kubernetes with MLflow plus custom glue, and by mid-2026 has accumulated enough operational debt that they end up re-platforming back onto Databricks anyway. The math, in retrospect: they spent $1.5M of platform engineering time to save $400k/year of Databricks cost, which is a payback period of almost four years even before ongoing maintenance. The savings were real; the time-to-value was disastrous.
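The payback arithmetic for that trajectory is worth writing down, because it is the check the team skipped. The sketch below uses the figures from the story ($1.5M up front, $400k/year saved, 18 months to build) and makes one simplifying assumption that is mine: savings only start once the build is finished, and ongoing maintenance cost is ignored.

```python
def payback_years(upfront_cost: float, annual_savings: float) -> float:
    """Years of savings needed to recover the up-front spend."""
    if annual_savings <= 0:
        raise ValueError("annual_savings must be positive")
    return upfront_cost / annual_savings


def net_position(
    upfront_cost: float,
    annual_savings: float,
    build_months: int,
    horizon_months: int,
) -> float:
    """Cumulative savings minus up-front cost at the end of the horizon.

    Assumes savings begin only after the build completes.
    """
    live_months = max(0, horizon_months - build_months)
    return annual_savings * live_months / 12 - upfront_cost


print(payback_years(1_500_000, 400_000))  # 3.75
print(net_position(1_500_000, 400_000, build_months=18, horizon_months=36))  # -900000.0
```

At the three-year mark the team is still $900k behind under these assumptions, and that is before counting the operational debt that pushed them back to Databricks.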

The mirror image is rarer but happens at the other end: a 100-engineer ML org locked into SageMaker, paying a 3-4x markup on raw compute, forced into the SageMaker abstractions for every model. When the unit economics finally break, the migration off SageMaker is a multi-year project with significant disruption — but the savings on the other side justify it.

The common error in both cases is picking based on the technology rather than the trajectory. The right question is not “which platform has the best feature set today” — it’s “which platform’s cost curve matches my organisation’s growth shape over the next 36 months.”

The decision framework

A working version, distilled from watching dozens of ML organisations make this choice:

Honest team-size count. Not “we plan to grow to 20” — current ML-engineer headcount, today. Under 15? Default to managed.
Workload diversity audit. If 80% of your work fits the “supervised tabular training” shape, managed handles you. If you have RL, GNNs, custom CUDA kernels, or real-time ranking, escalate.
Latency budget honesty. What’s your p99 SLO? Under 50ms means custom serving regardless of where the rest of your stack lives.
Three-year compute trajectory. If you’re projecting $10M+ compute spend by year three, the managed markup is no longer negligible. Start planning your assembly or build path now.
Governance pressure. Regulated industries — finance, healthcare, government — have data residency and audit requirements that sometimes make managed unworkable. Map your governance constraints before, not after, the contract.
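The five checks above can be sketched as a single function. The thresholds (15 and 30 engineers, 80% workload fit, 50ms p99, $10M year-three compute) come from this post; the function and field names are hypothetical, and the rule that any disqualifier escalates exactly one lane is my reading of the decision tree rather than a precise specification.

```python
from dataclasses import dataclass, field

LANES = ("managed", "assembly", "build")  # ordered left to right


@dataclass
class Recommendation:
    lane: str
    custom_serving: bool
    reasons: list = field(default_factory=list)


def recommend(
    ml_engineers: int,
    managed_fit_fraction: float,
    p99_slo_ms: float,
    year3_compute_usd: float,
    governance_blocks_managed: bool = False,
) -> Recommendation:
    # Step 1: current headcount picks the starting lane.
    if ml_engineers < 15:
        idx = 0
    elif ml_engineers < 30:
        idx = 1
    else:
        idx = 2
    reasons = [f"{ml_engineers} ML engineers: start at {LANES[idx]}"]

    # Steps 2, 4, 5: collect disqualifiers for the starting lane.
    disqualifiers = []
    if managed_fit_fraction < 0.8:
        disqualifiers.append("under 80% of workloads fit the managed shape")
    if year3_compute_usd >= 10_000_000:
        disqualifiers.append("projected year-three compute is $10M or more")
    if governance_blocks_managed and idx == 0:
        disqualifiers.append("governance constraints rule out managed")

    # Any disqualifier escalates one lane to the right, at most once.
    if disqualifiers and idx < len(LANES) - 1:
        idx += 1
        reasons.extend(disqualifiers)

    # Step 3: a p99 SLO under 50ms means custom serving in every lane.
    custom_serving = p99_slo_ms < 50
    if custom_serving:
        reasons.append("p99 SLO under 50ms: custom serving layer")

    return Recommendation(LANES[idx], custom_serving, reasons)
```

For example, `recommend(8, 0.5, 30, 500_000)` returns the assembly lane with `custom_serving=True`: a small team whose workloads do not fit the managed shape and whose latency budget forces its own serving layer. The `reasons` list is there so the output can be argued with, which is most of the value of writing the framework down.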

What’s changing in 2026

Three shifts worth knowing about for the rest of the year:

Managed platforms are absorbing more LLM workflows. Databricks Model Serving, SageMaker JumpStart, and Vertex AI Model Garden have all added first-class LLM serving with fine-tuning hooks. For teams whose LLM needs are “host a Llama-3-70B with some PEFT adapters,” managed is now genuinely viable in a way it wasn’t two years ago.
The assembly stack has consolidated. The “right” components for 2026 are noticeably more obvious than they were in 2023: MLflow for tracking, Ray for distributed compute, KServe or vLLM (self-run or through Anyscale) for serving, Argo Workflows for orchestration, and Feast for features, with Tecton as the commercial alternative. The decision-paralysis tax on the middle path has dropped.
“Bring your own cloud” is a real option. Hosted ML platforms (Modal, Anyscale, Together) that run inside the customer’s cloud account are taking real share from both managed-SaaS and custom-build. They’re the right answer for a surprising number of teams that previously would have built.

The high-order question hasn’t changed: how much of your engineering budget can you justify spending on infrastructure that doesn’t ship features? For most teams, the answer is “less than they think,” and the right move is managed. For a handful, the answer is “essentially all of it, eventually,” and the right move is building.

The mistake is picking the romantic option instead of the math.

Further reading: Chip Huyen’s MLOps guide, the Uber Michelangelo evolution post, Spotify’s ML Home post, and the recent Databricks vs SageMaker comparison thread.