datarekha
Infrastructure April 15, 2026

NVIDIA Dynamo, vLLM, SGLang: serving stacks at scale

NVIDIA's Dynamo (GTC 2025) reframes LLM serving around prefill/decode disaggregation. vLLM, SGLang and TensorRT-LLM all sit underneath. Here's how the four-layer stack actually works, what the throughput numbers really mean, and who picks which.

13 min read · by datarekha · tags: dynamo, vllm, sglang, serving

When NVIDIA announced Dynamo at GTC 2025, the first reaction from people who’d been running vLLM in production was reasonable: is this their answer to vLLM? Because frankly, NVIDIA had been losing the open-source serving battle for two years. TensorRT-LLM was fastest single-node but operationally painful. vLLM had won the developer mindshare. SGLang had won the workload niches that vLLM didn’t cover. Dynamo looked, at a glance, like a “fight back” play.

A year on, the framing is different and more interesting. Dynamo isn’t a replacement for vLLM or SGLang. It’s a layer above them. Dynamo orchestrates across nodes; vLLM, SGLang, and TensorRT-LLM are the engines that run on each node. The Dynamo blog says it explicitly: it supports all three as backends. The architectural unit of work has moved up the stack.

This post is a tour of the four layers as they actually compose in mid-2026 production, the published throughput numbers behind the comparisons, and a working answer to “who picks which.”

The four layers, in order

Serving stack, layers numbered bottom to top (top layer listed first):

  • Layer 4 - NVIDIA Dynamo: multi-node orchestrator, disaggregated prefill/decode, NIXL KV transfer
  • Layer 3 - serving engine (vLLM, SGLang, TensorRT-LLM, TGI): paged KV cache, continuous batching, speculative decoding, scheduler
  • Layer 2 - kernel library (FlashAttention, FireAttention, DeepGEMM, cuDNN): fused attention, matmul, FP8 / FP4 kernels
  • Layer 1 - hardware (H100, H200, B200, GB200 NVL72): HBM bandwidth, NVLink, NVSwitch, fabric Ethernet

The stack as it composes in 2026. Dynamo is the new top layer; vLLM and friends moved down a notch but didn’t disappear.

The orientation point is the second-from-top box. Layer 3 — the serving engine — is where most teams’ decisions live today. You pick vLLM, or SGLang, or TGI, or TensorRT-LLM, deploy it as a stateless service, scale out behind a load balancer, and call it done. That works fine up to maybe 50-100 GPUs. Past that, the question shifts: how do you orchestrate hundreds of engines, with prefill and decode having fundamentally different compute profiles, and KV caches needing to flow between them? That’s where Dynamo enters.

What Dynamo actually does

The pitch in one sentence: Dynamo is a distributed runtime that lets you build inference deployments where prefill happens on one pool of GPUs, decode happens on a different pool, and the KV cache transfers between them at wire speed. Everything else (scheduling, routing, autoscaling) is in service of that core idea.

The architectural pieces:

  • Disaggregated serving: separate worker pools for prefill (compute-bound, parallel matmul over input tokens) and decode (memory-bound, sequential one-token-at-a-time). The Dynamo design doc walks through the rationale: running both on the same GPU means oscillating between two regimes, neither of which is the GPU’s happy path.
  • NIXL: an open-source point-to-point KV transfer library, also announced at GTC 2025. NIXL supports five backends: RDMA/InfiniBand, RoCE via UCX, TCP fallback, NVMe-oF, and S3-compatible object storage. On the RDMA paths, the KV cache moves directly from prefill VRAM to decode VRAM without being staged through CPU memory.
  • KV-aware routing: incoming requests are routed to the worker pool that already has the relevant prefix cache loaded. This is the same idea as prefix caching, but at multi-node scope.
  • Multi-engine support: Dynamo can drive vLLM, SGLang, or TensorRT-LLM as the underlying engine. The Dynamo vLLM backend docs describe the integration: Dynamo provides the distributed runtime, vLLM provides the per-node engine, and the two speak a wire protocol that handles KV transfer.
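To make the KV-aware routing idea concrete, here is a minimal sketch of a prefix-aware router. It is not Dynamo's router: the `Worker` class, the `route` function, the 16-token block size, and the load penalty are all hypothetical, chosen only to show how "send the request where its prefix already lives" can be scored.

```python
from dataclasses import dataclass, field
from hashlib import sha256

BLOCK_SIZE = 16  # tokens per cache block (assumed; real engines vary)


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        prev = sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


@dataclass
class Worker:  # hypothetical stand-in for one engine instance
    name: str
    cached: set = field(default_factory=set)  # block hashes resident in its KV cache
    active_requests: int = 0


def cached_prefix_blocks(worker, hashes):
    """Count leading blocks already cached on this worker; stop at the first miss."""
    n = 0
    for h in hashes:
        if h not in worker.cached:
            break
        n += 1
    return n


def route(token_ids, workers, load_penalty=0.5):
    """Pick the worker with the best trade-off between cache hits and current load."""
    hashes = block_hashes(token_ids)

    def score(worker):
        return cached_prefix_blocks(worker, hashes) - load_penalty * worker.active_requests

    best = max(workers, key=score)
    best.cached.update(hashes)  # the chosen worker will hold this prefix afterwards
    best.active_requests += 1
    return best


a, b = Worker("a"), Worker("b")
system_prompt = list(range(64))  # four full blocks shared by both requests
first = route(system_prompt + list(range(1000, 1016)), [a, b])
second = route(system_prompt + list(range(2000, 2016)), [a, b])
print(first.name, second.name)  # the second request follows the cached system prompt
```

The load penalty is the interesting knob: set it too low and one worker becomes a hotspot for a popular prefix, set it too high and the router degenerates into least-loaded balancing.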

The headline number from NVIDIA’s Dynamo announcement is up to 30x more requests served on DeepSeek-R1 on Blackwell versus non-disaggregated baselines. Read carefully: that’s specifically R1 (a massive MoE that benefits hugely from disaggregation) on the newest hardware. On more typical workloads (Llama-3-70B, Qwen-72B) the gains are 3-7x in goodput under tight tail-latency targets, consistent with what the DistServe paper reported in 2024.
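Since "goodput" carries the comparison, a small sketch of one common reading of the term may help: requests per second that meet both a time-to-first-token (TTFT) target and a time-per-output-token (TPOT) target. The `RequestTrace` type, the function names, and the SLO values below are illustrative assumptions, not anything NVIDIA or the DistServe authors publish as code.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:  # hypothetical per-request measurement
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # mean time per output token after the first, in seconds


def meets_slo(trace, ttft_slo_s, tpot_slo_s):
    return trace.ttft_s <= ttft_slo_s and trace.tpot_s <= tpot_slo_s


def throughput(traces, window_s):
    """Raw requests per second, regardless of latency."""
    return len(traces) / window_s


def goodput(traces, window_s, ttft_slo_s, tpot_slo_s):
    """Requests per second that met both latency targets."""
    ok = sum(1 for t in traces if meets_slo(t, ttft_slo_s, tpot_slo_s))
    return ok / window_s


traces = [
    RequestTrace(ttft_s=0.2, tpot_s=0.03),  # meets both targets
    RequestTrace(ttft_s=0.9, tpot_s=0.03),  # slow first token
    RequestTrace(ttft_s=0.3, tpot_s=0.08),  # slow decode
    RequestTrace(ttft_s=0.4, tpot_s=0.04),  # meets both targets
]
raw = throughput(traces, window_s=2.0)
good = goodput(traces, window_s=2.0, ttft_slo_s=0.5, tpot_slo_s=0.05)
print(raw, good)
```

Two systems with identical raw throughput can differ widely on this measure, which is why a disaggregation win quoted in goodput is not directly comparable to a tokens-per-second chart.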

The shorter version: Dynamo is real, and it’s the right architecture for hyperscale serving of large reasoning models. But its wins are concentrated on the workloads where prefill cost dominates and tail latency matters most. For “serve Llama-3-8B at 1K QPS,” it’s overkill.

Why disaggregation became the architectural answer

The deepest reason Dynamo exists, and the deepest reason vLLM and SGLang both shipped their own disaggregated-serving modes in 2025, is that modern LLM workloads have a bimodal compute profile. Prefill (processing the input prompt) is compute-bound: thousands of input tokens are pushed through the weights in parallel matrix multiplies, saturating the GPU’s tensor cores. Decode (generating output tokens) is memory-bound: it produces one token at a time, and each step has to read the full model weights and the KV cache from HBM, so arithmetic intensity stays in the single digits.
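The compute-bound versus memory-bound split can be checked with back-of-envelope roofline arithmetic. The sketch below is a weights-only approximation: it assumes about 2 FLOPs per parameter per token, 2 bytes per parameter (FP16/BF16), and ignores KV-cache reads and attention FLOPs. The H100 figures are approximate datasheet values used as assumptions, not measurements.

```python
# Approximate H100 SXM peak figures (assumptions for illustration).
H100_BF16_FLOPS = 989e12        # dense BF16 FLOP/s
H100_HBM_BYTES_PER_S = 3.35e12  # HBM3 bandwidth, bytes/s

# Ridge point of the roofline: FLOPs per byte needed to become compute-bound.
RIDGE = H100_BF16_FLOPS / H100_HBM_BYTES_PER_S


def arithmetic_intensity(tokens_in_flight, bytes_per_param=2):
    """FLOPs per byte of weights read in one forward pass.

    Each parameter is read once per pass and does roughly one multiply and
    one add for every token processed in that pass.
    """
    flops_per_param = 2 * tokens_in_flight
    return flops_per_param / bytes_per_param


def regime(tokens_in_flight):
    if arithmetic_intensity(tokens_in_flight) >= RIDGE:
        return "compute-bound"
    return "memory-bound"


print(round(RIDGE))                      # roughly 295 FLOPs per byte
print(regime(1), regime(8))              # decode at batch 1 and batch 8
print(regime(8192))                      # an 8K-token prefill
```

Under these assumptions decode only crosses the ridge once a few hundred sequences are batched together, while a single long prompt crosses it on its own. That gap is what the two dedicated pools exploit.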

Running both on the same GPU means oscillating between the two regimes, neither of which is the GPU’s happy path. Worse, a long prefill request causes head-of-line blocking: every other request’s decode latency spikes because the GPU is busy doing 8K-token-prompt prefill for someone else. On a 70B reasoning model with 32K-token prompts, that head-of-line blocking is brutal — p99 TTFT goes up by an order of magnitude under load.

The fix is to physically separate the two pools. Dedicated prefill nodes do nothing but prefill, sized for compute density. Dedicated decode nodes do nothing but decode, sized for memory bandwidth. The KV cache produced by prefill ships to the decode pool over RDMA or NVLink, and decode picks up from where prefill left off. Pioneered in academic work by Splitwise (late 2023) and DistServe (early 2024), this is now the consensus architecture.
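How much data actually has to ship between the pools? A rough sizing sketch, assuming Llama-3-70B's published shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128), an FP16 cache, and a 400 Gb/s link taken here as 50 GB/s. The link rate is an illustrative assumption and the helper names are made up for this example.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache per token: one K and one V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def transfer_seconds(prompt_tokens, link_bytes_per_s, **model_shape):
    """Time to move one request's KV cache over a link, ignoring protocol overhead."""
    return prompt_tokens * kv_bytes_per_token(**model_shape) / link_bytes_per_s


LLAMA3_70B = dict(layers=80, kv_heads=8, head_dim=128)

per_token = kv_bytes_per_token(**LLAMA3_70B)
total_gib = 32768 * per_token / 2**30
seconds = transfer_seconds(32768, 50e9, **LLAMA3_70B)

print(per_token)          # 327680 bytes, i.e. 320 KiB per token
print(total_gib)          # 10.0 GiB for a 32K-token prompt
print(round(seconds, 3))  # about 0.215 s at 50 GB/s
```

A fifth of a second per long request is why the fabric matters: at that size the transfer is a visible slice of time-to-first-token unless it overlaps with compute or runs over something faster than a single network link.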

What Dynamo brings that the engines individually don’t: a coherent multi-node orchestration of disaggregation. vLLM’s disaggregated mode and SGLang’s disaggregated mode both work, but they’re per-engine. Dynamo sits a layer up and treats the prefill pool and the decode pool as first-class scheduling entities, with autoscaling, KV-cache-aware routing, and failure handling at the orchestration layer.

vLLM, SGLang, TensorRT-LLM — what each still owns

With Dynamo sitting above them, the engine-level comparison still matters, just at a different level of zoom. Here’s the working summary from the 2026 benchmarks:

vLLM. Still the open-source default. The Berkeley Sky Computing Lab project that introduced PagedAttention and popularised continuous batching, now at the V1 architecture and 37k+ stars. It loses some pure-throughput races to TensorRT-LLM (which has more aggressive kernel fusion), but it wins on model coverage, ease of deployment, and ecosystem. Most cloud-native ML platforms (Anyscale, Modal, Baseten) run vLLM by default. If you don’t know what to pick, pick vLLM.

SGLang. Born from the LMSYS team that runs Chatbot Arena. RadixAttention prefix-cache is the distinguishing feature: a token-level radix tree instead of vLLM’s block-level hash, which catches more nested and branching reuse patterns. The published numbers — 29% throughput edge over vLLM on H100 on certain workloads, up to 6.4x on prefix-heavy workloads like RAG and multi-turn chat — explain why it’s become the default for DeepSeek’s open models. Tool-use specifically: SGLang ships first-class tool-call parsers (--tool-call-parser deepseekv3, and similar for Llama, Qwen, etc.) and chat templates that produce more consistent tool-calling behaviour than the vLLM defaults.
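The difference between token-level and block-level prefix matching is easy to see in a toy model. The sketch below is neither SGLang's nor vLLM's implementation: `TokenTrie` is a plain (uncompressed) trie standing in for a radix tree, `BlockPrefixCache` keys on whole-prefix tuples instead of real hashes, and the 16-token block size is an assumption.

```python
BLOCK = 16  # tokens per block (assumed)


class TokenTrie:
    """Token-granular prefix index: matches up to the exact divergence point."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens):
        node, n = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            n += 1
        return n


class BlockPrefixCache:
    """Block-granular prefix index: only whole blocks can be reused."""

    def __init__(self, block=BLOCK):
        self.block = block
        self.seen = set()

    def _keys(self, tokens):
        full = len(tokens) - len(tokens) % self.block
        # One key per full block, covering the entire prefix up to that block.
        return [tuple(tokens[:end]) for end in range(self.block, full + 1, self.block)]

    def insert(self, tokens):
        self.seen.update(self._keys(tokens))

    def match_len(self, tokens):
        n = 0
        for key in self._keys(tokens):
            if key not in self.seen:
                break
            n = len(key)
        return n


cached = list(range(40))               # a previously served prompt
query = list(range(36)) + [999, 998]   # shares the first 36 tokens, then branches

trie, blocks = TokenTrie(), BlockPrefixCache()
trie.insert(cached)
blocks.insert(cached)
print(trie.match_len(query), blocks.match_len(query))  # 36 vs 32
```

The four-token difference looks small here, but on branching workloads (many tool-call continuations off one conversation) the partial-block misses add up across every branch.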

TensorRT-LLM. NVIDIA’s in-house engine. Faster single-node throughput than vLLM by 15-30% on H100, according to independent benchmarks, because of aggressive kernel fusion and tightly-tuned attention implementations. The operational cost: a compilation step per model configuration, less flexibility on model architectures, and the assumption that you’re running on NVIDIA hardware (which, in practice, you are). Inside Dynamo, TensorRT-LLM is the highest-performance backend for inference at the cost of some operational rigidity.

TGI. HuggingFace’s serving stack. Still around, still the path of least resistance if you’re already on HuggingFace tooling. Less raw throughput than the others but excellent operational story. Has been moving toward multi-LoRA serving and the HF Inference Endpoints managed product. We covered the full vLLM/TGI/SGLang comparison in a separate post.

The benchmark numbers, in one table

Approximate published throughput on Llama-3-70B at H100 with continuous batching enabled, normalised:

Relative throughput, normalised to vLLM = 1.0

  Engine      Relative throughput
  vLLM        1.00
  SGLang      1.29
  TRT-LLM     1.30
  TGI         0.70
  Dynamo      2.00+

Approximate Llama-3-70B throughput, normalised. Dynamo’s number assumes disaggregated multi-node; on a single node it sits roughly at its underlying engine’s performance.

Caveats worth saying loud: these numbers compress a 3D surface (prompt length, request rate, model size) into a single bar chart, which means they’re directionally useful and tactically misleading. Always benchmark your workload. The vLLM team and the SGLang team trade leadership several times a year. TensorRT-LLM’s lead is real but comes with operational cost. Dynamo’s lead is real but only available at multi-node scale.

Who picks which, in practice

A working decision tree based on what production teams actually deploy:

  • Single-node deployment, mixed open-weight models. vLLM. The mature default. Don’t overthink it.
  • Single-node deployment, NVIDIA stack end-to-end, throughput is everything. TensorRT-LLM. Eat the compilation step, get 15-30% more tokens per dollar.
  • Agent platform, lots of tool-calling, lots of shared system prompts. SGLang. RadixAttention plus first-class tool-call parsers is a clear win that vLLM still doesn’t fully match.
  • HuggingFace tooling already, want to ship in a quarter. TGI. The operational simplicity is worth the 30% throughput gap.
  • Multi-node, hyperscale, MoE or reasoning models, prefill cost is huge. Dynamo. This is the workload it was built for. Wrap vLLM, SGLang, or TensorRT-LLM as the engine.

A useful sanity check before reaching for Dynamo: are you running more than 16 GPUs of the same model, with prompts averaging over 4K tokens, and do you care about p99 TTFT? If you said yes to all three, you’re in the zone where disaggregation pays off. Below that, you’re paying complexity cost for an architecture that solves a problem you don’t have.
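The three-question check is simple enough to write down as code, which is handy for a capacity-planning script. The function names are made up for this sketch, and reading "4K tokens" as 4096 is an assumption; the thresholds are the ones from the paragraph above, exposed as parameters so they can be tuned.

```python
def disaggregation_checks(gpus_for_model, avg_prompt_tokens, cares_about_p99_ttft,
                          min_gpus=16, min_prompt_tokens=4096):
    """Return each of the three questions with its yes/no answer."""
    return {
        f"more than {min_gpus} GPUs serving the same model": gpus_for_model > min_gpus,
        f"prompts averaging over {min_prompt_tokens} tokens": avg_prompt_tokens > min_prompt_tokens,
        "p99 TTFT matters": bool(cares_about_p99_ttft),
    }


def in_disaggregation_zone(gpus_for_model, avg_prompt_tokens, cares_about_p99_ttft):
    """True only when all three answers are yes."""
    checks = disaggregation_checks(gpus_for_model, avg_prompt_tokens, cares_about_p99_ttft)
    return all(checks.values())


print(in_disaggregation_zone(64, 12000, True))   # large fleet, long prompts
print(in_disaggregation_zone(8, 12000, True))    # too few GPUs
print(in_disaggregation_zone(64, 800, True))     # short prompts
```

Returning the individual answers, rather than a single boolean, makes it obvious which condition is holding a deployment back from the zone where disaggregation pays off.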

The interesting case is the last bullet in the decision tree: multi-node Dynamo. Until late 2025 there was no production-ready way to do disaggregated prefill/decode across nodes without rolling your own orchestration. Teams that wanted it built custom systems on top of vLLM (Anyscale’s Ray Serve, Together AI’s internal stack, ByteDance’s published work). Dynamo is the first open, NVIDIA-blessed implementation of the architecture, which means it’s becoming the reference. Microsoft Azure has published a deployment guide for Dynamo on AKS with GB200 NVL72; AWS shipped EKS support shortly after.

A concrete deployment, end-to-end

What does a Dynamo-powered deployment actually look like in production? The canonical reference now is Microsoft’s deployment of Dynamo on AKS with GB200 NVL72. The shape:

  • 72 Blackwell GPUs inside a single NVL72 rack, plus a few outside the rack for prefill burst.
  • Prefill pool: a smaller set of GPUs running TensorRT-LLM with paged context attention, sized to handle the compute spikes of long-prompt prefill.
  • Decode pool: the bulk of the GPUs running vLLM workers, sized for memory bandwidth and concurrent request count.
  • NIXL transfers KV between the two pools over the NVL72’s NVLink fabric — bandwidth in the multi-TB/s range, latency in microseconds.
  • A KV-aware router (built into Dynamo) hashes incoming requests’ prefixes and routes them to the decode worker that already has the prefix loaded.
  • Dynamo’s autoscaler scales each pool independently based on its specific load metric (compute saturation for prefill, queue depth for decode).
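The last bullet, scaling each pool on its own metric, can be sketched with the proportional rule that Kubernetes' Horizontal Pod Autoscaler also uses: desired = ceil(current × metric / target). This is not Dynamo's autoscaler; the function, the metric choices, the targets, and the replica limits are all hypothetical.

```python
import math


def desired_replicas(current, metric, target, min_replicas=1, max_replicas=64):
    """Proportional scaling rule, clamped to [min_replicas, max_replicas]."""
    wanted = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, wanted))


# Prefill pool: scale on compute saturation (percent of peak, target 60%).
prefill = desired_replicas(current=4, metric=90, target=60)

# Decode pool: scale on queue depth per worker (target 8 waiting requests).
decode = desired_replicas(current=10, metric=12, target=8)

print(prefill, decode)  # 6 prefill workers, 15 decode workers
```

The point of keeping the two calls separate is that a burst of long prompts grows only the prefill pool, while a burst of long generations grows only the decode pool. A single shared policy would over-provision one side to satisfy the other.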

The result, on DeepSeek-R1 specifically, is the 30x number NVIDIA quotes. On Llama-3-70B and similar non-MoE models the gains are 3-7x in goodput at high request rates: still dramatic, but bounded by the more modest gap between prefill and decode compute profiles on those models. The MoE benefit is largest because MoE prefill is especially compute-heavy and MoE decode benefits especially from being on a memory-optimised pool.

The operational cost is real and worth naming. You’re running two different worker pools, each with its own scaling policy. You’re depending on a high-bandwidth fabric between them (NVLink inside a rack, RDMA or RoCE between racks). You’re trusting Dynamo’s scheduler with a workload that used to fit on a single load balancer. None of this is trivial, and the failure modes are new — a NIXL transfer failure, a prefill-pool OOM under burst, a router miscount. Teams adopting Dynamo report 6-12 weeks of operational shake-out before the architecture genuinely earns its disaggregation premium.

What this means if you’re building serving infrastructure

A few things worth internalising:

  • The engine choice still matters, but less than it used to. When the major engines all treat PagedAttention, continuous batching, FP8, speculative decoding, and disaggregated prefill/decode as table-stakes features, the differences are at the edges. Pick by workload shape (agents → SGLang, raw throughput → TensorRT-LLM, default → vLLM) and don’t agonise.
  • The orchestrator choice is the new big decision. If you’re running 50+ GPUs, the question “do we adopt Dynamo, or stay on raw engines behind a load balancer?” is the architectural choice that will matter most over the next two years.
  • Prefill/decode disaggregation is the real architectural shift. Whether it ships in your stack via Dynamo or via vLLM’s or SGLang’s own disaggregated modes, running prefill and decode on the same GPUs is no longer the right answer at scale.
  • The open source / proprietary line is blurring. Dynamo is open source (Apache 2.0). NIXL is open source. NVIDIA’s stack is becoming more open at the runtime layer even as the hardware stays proprietary. That’s a meaningful shift for teams that have been wary of NVIDIA-specific tooling.

Takeaway

The simplest summary: vLLM is still the right default. Dynamo is the new ceiling. If you’re running a small-to-medium fleet on open weights, nothing has changed — vLLM continues to be the right answer. If you’re running hyperscale, you now have a real production-grade orchestration layer, and the answer to “how do we scale past a single node” has gone from “build it yourself” to “deploy Dynamo with vLLM workers.”

SGLang is the dark horse of the comparison. It punches well above its star count for any workload that looks like agents — tool-calling, shared prefixes, JSON-constrained generation. For the production AI applications that look most like the future (orchestrators dispatching tool calls in loops), SGLang is the engine the workload was designed for.

TensorRT-LLM is the quietly-correct answer when you’re already deep on NVIDIA’s hardware and your unit economics depend on tokens-per-dollar. The compilation step is annoying; the throughput delta is real. For teams running fixed model architectures at hyperscale, the operational trade-off has already been made — they’re on TRT-LLM and they’re not moving.

The interesting part, and the part most “which is fastest” headlines miss, is that these aren’t substitutes anymore. Dynamo orchestrates vLLM. TensorRT-LLM runs inside Dynamo. SGLang plugs in as a Dynamo backend. The stack is composing, and the right question in 2026 isn’t which one to pick, but how they fit together for your workload.


Further reading: the NVIDIA Dynamo announcement blog, the Dynamo design docs, the vLLM project docs, the SGLang docs, and the DistServe paper that motivated disaggregation together cover the whole landscape.
