Serverless LLM platforms: Modal, Together AI, Fireworks, Replicate

When the second wave of LLM applications started shipping in 2024, the deployment question was binary: call an API (OpenAI, Anthropic) or run your own (a Kubernetes cluster of A100s with vLLM, an SRE team, and a quarterly capacity-planning meeting). Both options had serious problems. The API was expensive at scale and locked you into someone else’s roadmap. Self-hosting was operationally heavy and uneconomic until you were running thousands of QPS.

The middle ground that emerged — serverless LLM platforms — is the deployment pattern that won 2025-2026 for sub-frontier-scale inference. Four companies dominate the conversation: Modal, Together AI, Fireworks, and Replicate. They look superficially alike. Their actual bets are very different. This post is a working comparison of what each one is good at, where the cost lines fall, and who picks which.

The map

Four platforms, four different bets. Modal and Replicate are billed by the second of GPU time; Together and Fireworks bill per token like OpenAI.

The defining bet of each platform:

Modal — your code, their GPUs. Decorate a Python function with @app.function(gpu="H100"), push, and Modal runs it on demand with per-second billing. Not specifically an LLM platform; it’s a serverless GPU platform on top of which you run vLLM or anything else you want. Best for teams that need flexibility over a managed model catalog.
Together AI — a model marketplace. 200+ open models available as serverless inference, plus dedicated endpoints for custom or fine-tuned models. Per-token pricing. Best for teams that want “open-source GPT” — a fast easy way to swap from a closed API to an open model with the same OpenAI-compatible interface.
Fireworks — speed at any cost. Custom CUDA kernels (FireAttention), FP8/FP4 native deploys, first-class B200/H200 support. Per-token pricing in the same ballpark as Together. Best when latency or throughput per dollar is the deciding factor.
Replicate — model-as-API for everyone. Lower the barrier to “I want to call an AI model” to nearly zero. Owned by Cloudflare since late 2025. Best for prosumer apps, mostly image and audio, less the right answer for high-volume LLM workloads.

What “serverless” actually means here

The term is overloaded. Across these four platforms it covers two fundamentally different billing models:

Per-token (Together, Fireworks). You pay for output. Provider handles all the infrastructure, batching, autoscaling. Cost is predictable: tokens × rate.
Per-second-of-GPU (Modal, Replicate). You pay for GPU time, including any idle time during the request. Provider gives you a more flexible runtime; you handle the application logic. Cost depends on workload shape.

Per-token is right when you’re serving a popular open model at scale — the provider’s batching across all its customers gives you better utilisation than you could achieve alone. Per-second is right when your workload is bursty, weird, or wants control. The two pricing models diverge sharply on different traffic patterns; we’ll get to the numbers.

Cold start, the part the marketing skips

Per-second platforms have to solve a problem per-token platforms don’t: loading the model into GPU memory. For Llama-3-70B in FP8, that’s ~80GB of weights, and a cold start naively takes 60-90 seconds even with NVMe SSD reads. For an unloved model serving 10 QPS, that means every request sees a fresh boot, and the user waits a minute. Unworkable.

Modal’s solution is GPU memory snapshots. The serving process is allowed to warm up once, then Modal snapshots the entire memory state — including the GPU’s HBM — to disk. Subsequent cold starts restore from the snapshot instead of re-initialising. The numbers Modal publishes:

Generic LLM cold starts go from ~2 minutes to ~10 seconds, almost an order of magnitude faster.
vLLM running Qwen2.5-0.5B-Instruct goes from 45s to 5s (P0).
The fastest cases (small models with no compilation) hit 2-second cold starts.

Replicate has its own version with what they call “fast booting fine-tunes” — compiled in advance, snapshot-restored. Cold starts in the 5-15s range for typical LLM-class models. Together and Fireworks effectively hide cold starts behind their warm pool: they always have replicas of popular models running, so per-token requests never see a model-load delay. The trade is that unpopular models on Together/Fireworks have longer first- request latency than warmed-up Modal would.

The per-token cost line, in actual 2026 dollars

The market settled into a remarkable pricing convergence by mid-2026. Public per-token prices for a common reference model (Llama-3.3-70B, serverless tier):

Llama-3.3-70B blended cost (input + output, mid-2026 rates) across providers. Self-hosted-equivalent platforms cluster around $0.90/M; closed-frontier APIs sit an order of magnitude higher.

The convergence is not coincidence. The serverless market for open weights is effectively perfect competition: same model, similar hardware, visible prices, switching cost low (the API is OpenAI-compatible). Margins have compressed to the cost of compute plus a small operator margin. Fireworks and Together AI sit a few cents apart; Artificial Analysis tracks the differences quarter by quarter and they keep getting smaller.

The interesting comparison isn’t between Fireworks and Together — they’re effectively a tie. The interesting comparison is to OpenAI, where the equivalent task costs roughly 10-15x more for output tokens of similar quality (Llama-3.3-70B vs GPT-class). For workloads that don’t strictly need a frontier model, the open-weights-on-serverless path is the single biggest cost saving in your inference bill.

What each platform is genuinely best at

Modal is best at: “I want a GPU for arbitrary Python code.” Custom inference pipelines, training jobs, multi-step image processing, anything that doesn’t fit into the “call this model with these tokens” mold. Pricing example: an H100 at ~$3.95/hr, billed by the second. If you can keep utilisation above 60-70%, the per-token cost competes with the managed platforms; below that, you’re paying for idle. Best of class: fine-tuning runs, batch inference of unusual models, anything where the code matters more than the model. Their serverless inference best-practices blog walks through the workload patterns.

Together AI is best at: “I want OpenAI’s API but for open weights.” The cleanest path from a closed-API prototype to an open-source production deployment. Their serverless API is OpenAI-compatible; you change a base URL and a model name. They also ship dedicated endpoints for custom models — upload a LoRA from HuggingFace, deploy in minutes, pay per-instance instead of per-token. The “Turbo” path uses FP8 quantisation to deliver up to 4.5x vLLM’s throughput on Llama models. For Llama-3.1-8B specifically, Together publishes 400 tok/s output.

Fireworks is best at: “I need this to be fast, and I care about the last 20%.” FireAttention V1 claimed 4x vLLM on Mixtral; V2 extended the win to long-context inference (12x on context-heavy workloads); V3 brought AMD MI300 support. The custom-kernel approach means Fireworks is consistently at or near the top of the Artificial Analysis latency rankings. For latency-bound chat, voice, or interactive coding, this matters.

Replicate is best at: “I want to ship a feature, fast, in a side project.” Their entire UX is built around discovering and trying existing models with copy-paste code. Heavy on image generation, audio, video. Less competitive for high-volume LLM inference (per-second billing on dedicated hardware works against you for chat-style workloads), but unbeaten if your app is “I generate a custom poster from a user prompt and a photo.” Now owned by Cloudflare, which suggests deeper Workers integration is coming.

A cost line that breaks people’s intuition

Here’s a comparison most teams haven’t sat down with: per-second-billed platforms vs per-token-billed platforms for the same actual workload.

Take a midweight workload: a customer-facing chat feature, ~200 requests per minute average, ~5x spiky peak, ~1.5K input / 500 output tokens per request. Llama-3.3-70B model.

Per-token (Together AI Turbo):

Cost: ~$1.10 per 1K requests at the published rate
Steady-state: ~$13/hour, ~$320/day, ~$9.5K/month
Burst handling: provider absorbs the burst at the same per-token rate
Cold start: hidden (warm pool)

Per-second (Modal H100):

Need ~2 H100s to handle peak load: ~$8/hour for the warm pool, but scale-to-zero during dead air drops the average
With Modal’s snapshot-restore cold start (~5-10s for vLLM-on-Llama-70B), you can scale very aggressively
Actual measured cost in a similar real deployment: ~$5K/month — roughly half of per-token
But: you’re responsible for the deployment, the load balancing, the autoscaling triggers

The per-second platform is cheaper if you manage utilisation well and your traffic shape is even moderately predictable. The per-token platform is cheaper if you can’t keep utilisation up or your traffic is bursty enough that you’d have warm idle GPUs most of the time. For most mid-volume applications (10K-1M requests/day), the per-second platform is the optimisation worth doing — the unit economics are noticeably better.

The break-even shifts depending on model size, traffic shape, and how much engineering effort you’re willing to pour in. The Modal best-practices guide walks through the math for several workload patterns; it’s worth the read before deciding.

When to pick which (the decision)

A working flowchart based on what real teams actually pick:

You’re calling a popular open model (Llama, Qwen, Mixtral, DeepSeek) and want OpenAI-compatible APIs. Together AI or Fireworks. Try both, pick the one that’s faster or cheaper on your workload by 10%. They tie on most.
You need latency-critical inference (voice agents, code completion, real-time chat). Fireworks first, by latency. Fall back to Groq (cheapest, but tighter model selection) if your model is on their list.
You’re running fine-tunes or custom models. Together (for serverless dedicated endpoints) or Modal (for full control). Together is easier; Modal is more flexible.
You’re running anything that isn’t pure “call this LLM with these tokens” — multi-step pipelines, training, batch jobs, image-to-image to LLM workflows. Modal. Nothing else matches the “GPU as Python function” UX.
You’re shipping a prosumer-grade feature, mostly image or audio, low volume. Replicate. Has the model and the easiest API.
You’re at a scale where in-platform routing matters more than the platform itself. Use OpenRouter or Portkey on top of Fireworks/Together. Multi-provider fallback is more important than absolute provider choice.

The thing I see teams get wrong: picking a platform before knowing their workload shape. A team building a voice agent and a team running batch document analysis should not be on the same platform. They have fundamentally different latency, throughput, and cost profiles. Match the platform to the workload, not the reverse.

The cost lever you can pull right now

A worked back-of-envelope cost saving for an actual team I worked with recently. Legacy stack:

5M API requests/month, average 2K input / 600 output tokens
All on GPT-4o, ~$2.50 input / $10 output per million tokens
Inference spend: ~$30K/month

Migration plan (executed over a quarter):

Move 70% of requests to Llama-3.3-70B on Fireworks ($0.88/M blended, ~3x cheaper than GPT-4o-mini, ~12x cheaper than GPT-4o)
Keep 30% on GPT-4o for the queries that needed it (verified by an eval suite)
Pre-fine-tune Llama on the team’s domain (Cohere-style behavioural fine-tune; see our fine-tuning post) for ~$200 of compute
Add OpenRouter as a fallback for outages

Outcome:

Blended monthly inference cost dropped from ~$30K to ~$8K (~73% reduction)
p95 latency improved (Fireworks’s FireAttention is faster than GPT-4o on output tokens)
Quality on an internal eval set: -2.1% versus all-GPT-4o, statistically marginal

Same story, slightly different rounds, has been playing out across the industry. The combination of serverless platforms hitting per-token parity with self-hosted economics, plus the open-weight model wave catching up to frontier closed models in capability, is the single biggest cost-reduction lever available to most teams in 2026.

Takeaway

Three takeaways for the wall:

Pick by workload shape, not by brand. Modal for code, Together for marketplace, Fireworks for speed, Replicate for prosumer. The differences are real and matter.
Per-token platforms (Together, Fireworks) are the right default for “I just want to call a model.” Switch to per-second (Modal) when your code is more interesting than the model.
The migration from closed-API to open-weight-on-serverless is the cheapest 70-80% cost reduction most teams will see in 2026. If you’re still paying $15/M output tokens for things Llama-70B can do, you’re leaving roughly 90% of your inference budget on the table.

The shape of the serverless LLM market in 2026 is: a tight cluster of near-identical-pricing providers on open weights, with differentiation happening on the dimensions that aren’t the headline price — latency, operational polish, ecosystem fit, support for custom models. Pick by what you’ll actually feel in production, not by the price-per-million banner. The price-per-million banners are all within 10% of each other anyway.

Further reading: Modal’s serverless inference best-practices guide, Together AI’s launch post for dedicated endpoints, the Fireworks FireAttention V2 blog, Replicate’s pricing docs, and the Artificial Analysis provider benchmarks for the current head-to-head numbers.