datarekha
Infrastructure May 12, 2026

vLLM vs TGI vs SGLang: choosing your inference server in 2026

Three open-source serving stacks, three different bets. vLLM optimises raw throughput. TGI optimises ease and HuggingFace integration. SGLang optimises structured generation and prefix-cache reuse. Here's how to pick.

13 min read · by datarekha · vllmtgisglanginferenceserving

When teams come off “let’s just use the OpenAI API” and start self-hosting, the first decision is almost never the model. It’s the serving stack. Three open-source projects dominate that conversation in 2026: vLLM, TGI (HuggingFace’s Text Generation Inference), and SGLang. They all serve the same Llama-class models. They all support continuous batching. They all expose an OpenAI-compatible endpoint. And in production they behave very differently.

This post is a working comparison — what each project optimises for, the benchmarks that actually matter, and the workloads where each one is the right answer. Vendor benchmarks lie; workload shape doesn’t. We’ll keep coming back to that.

A snapshot, end of Q2 2026

THREE STACKS, THREE BETSvLLMThroughput kingPagedAttentionContinuous batchingFP8 / AWQ / GPTQSpec decodingDisaggregated P/DBerkeley Sky Lab37k starsTGIEase + ecosystemFirst-class HF HubRust + Python splitBuilt-in OTel tracesBattle-tested opsInference EndpointsHuggingFace10k starsSGLangStructured + agentsRadixAttentionNative JSON / regexFrontend DSLDisaggregated P/DDeepSeek defaultLMSYS / Berkeley13k stars
Three projects, overlapping in capabilities but each making a different primary bet. Star counts are approximate as of late Q2 2026.

The headline distinctions:

  • vLLM — the Berkeley Sky Lab project that introduced PagedAttention and continuous batching to the open source world. Optimised for raw throughput on high-QPS chat workloads. The default for production deployments at Anyscale, Together AI, Modal, Baseten and most enterprises self-hosting in 2026.
  • TGI — HuggingFace’s serving stack. The path of least resistance if you’re already on the HF ecosystem. Rust core, Python adapters. Tightly integrated with HF’s Inference Endpoints product. Less raw throughput than vLLM on hot paths, but noticeably less operational pain.
  • SGLang — born from the LMSYS team that runs Chatbot Arena, originally as a frontend DSL for “structured generation programs.” Has since grown into a full serving stack with RadixAttention prefix caching, native JSON-schema constrained decoding, and (as of mid-2025) disaggregated prefill/decode. The serving stack DeepSeek ships their open models on.

What “throughput king” actually means

Most public benchmarks compare two stacks at one batch size on one model and declare a winner. That’s wrong. Throughput in modern LLM serving is a surface in 3D — model size, request rate, and prompt-length distribution all move the answer.

The vLLM team publishes continuously-updated benchmarks on Llama-3-70B at standard request mixes, and on average vLLM beats TGI by 1.3-1.8x on output tokens per second per H100. Independent reproductions (notably the LMSYS engine comparison) confirm the rough ranking — vLLM and SGLang trade leadership at the top depending on workload, with TGI sitting 20-40% behind on raw throughput but ahead on operational simplicity.

The interesting cases are at the edges:

  • Short prompts, short completions, high QPS (think: classification endpoints, intent routers) — vLLM and SGLang both pull away from TGI by ~2x because their schedulers are more aggressive about packing.
  • Long shared prefix, many completions (think: chat with a fixed multi-thousand-token system prompt) — SGLang’s RadixAttention wins, sometimes by 3-5x, because it hashes and reuses KV blocks across all requests that share the prefix.
  • Highly variable prompt lengths — disaggregated prefill/decode (vLLM and SGLang both ship it now) flattens tail latency dramatically; TGI’s unified scheduler hurts more here.

The operational story: TGI’s quiet superpower

If you spend a quarter actually running these in production, the part you end up caring about is not the peak throughput number — it’s how the system behaves at 3am during an incident. TGI’s heritage is HuggingFace’s Inference Endpoints product, which has been quietly serving production traffic for HuggingFace customers since 2022. That shows up as:

  • Sane defaults. TGI ships with reasonable concurrency limits, queue depth caps, and OOM-prevention heuristics that vLLM expects you to tune.
  • First-class OpenTelemetry traces. Every request has timing breakdowns for queuing, prefill, decode, and detokenisation, surfaced to Jaeger/Grafana without extra plumbing.
  • Sharded Rust router. The HTTP layer is in Rust and a single instance can fan out to dozens of model shards. The router is not the bottleneck even at thousands of QPS.
  • Predictable upgrade story. TGI versions are pinned to HuggingFace Transformers minor versions; you upgrade together and break things in predictable ways.

vLLM has closed the operational gap a lot since 2024 — the V1 architecture released in late 2025 cleaned up the scheduler and the API server, and the Prometheus metrics are now first-class. But there’s still a tier of polish that HuggingFace ships by default and vLLM still asks you to assemble.

SGLang’s structured generation, and why agentic workloads love it

The thing SGLang got right early — before vLLM or TGI took it seriously — is that half of production LLM traffic in 2026 is not free-form chat. It’s agentic tool-calling, JSON-schema-constrained output, regex-bounded generation, and “fill this template” workflows.

For all of these, the model is sampling from a constrained distribution. Naively that means rejection sampling: generate a token, check if it’s allowed, retry on rejection. That’s expensive. SGLang’s approach (and now XGrammar’s, which vLLM has also adopted) is to build a finite state machine for the constraint, then mask the logits before sampling so only allowed tokens have non-zero probability.

The wins are large:

  • SGLang’s structured-generation benchmark shows JSON-schema constrained decoding running at 80-95% the speed of free-form generation, versus ~30% for the naive rejection-sampling approach.
  • The hit-rate on RadixAttention prefix-cache is much higher than vLLM’s block-level cache in agentic workloads, because tool-calling agents re-issue the same system prompt + tool definitions on every turn.
  • Constrained decoding interacts well with speculative decoding — the draft model proposes tokens, the constraint mask filters them, and the target model verifies only constraint-satisfying tokens.

If you’re building an agent platform — orchestrator + tools + JSON tool calls — SGLang is the stack that was designed for your workload.

The decision tree

What dominates your traffic?be honest about the 80% casefree-form chat,high QPSHF-native stackagents, JSON,shared prefixesvLLMthroughput, mature schedulerTGIleast operational painSGLangstructured + RadixAttentionFOLLOW-UP QUESTIONSNeed fast iteration on custom kernels?vLLMLatency-bound single-stream chat?vLLM + spec decoding, or SGLangTooling fluency around HF transformers?TGIDeepSeek-R1 or DeepSeek-V3 in prod?SGLang (it’s their reference)
A working decision tree. Skip the benchmarks until you’ve answered the workload question.

A working decision rule for teams I’ve watched pick:

  1. Start with TGI if your team is already deep on HuggingFace tooling. The operational simplicity is worth the 30% throughput gap on most workloads. You can always migrate later.
  2. Reach for vLLM when throughput is the bottleneck. Self-hosting Llama-3-70B for thousands of QPS, or your unit economics depend on tokens-per-dollar — that’s vLLM territory.
  3. Pick SGLang if your workload is shaped like agents. Long shared system prompts, structured JSON outputs, tool-calling loops. The RadixAttention + constrained decoding combination is a clear win that the others haven’t caught up on.

What’s converging, what isn’t

In 2026 the three stacks are converging on a shared set of features — PagedAttention, continuous batching, FP8, speculative decoding, disaggregated prefill/decode are now table stakes in all three. The differences are narrowing on the baseline and widening on the specialised features:

  • TGI is doubling down on managed serving — multi-LoRA serving, fine-tuning hooks, the Inference Endpoints product polish. They’re betting “make hosting easy” beats “wring out the last 20%.”
  • vLLM is betting on hyperscale primitives — multi-node disaggregation, the V1 scheduler refactor, NVIDIA partnership for TensorRT-LLM integration. They’re optimising for fleet operators.
  • SGLang is betting on the agent stack — first-class tool-calling endpoints, multi-step program execution, OpenAI-compatible Responses API support. They’re optimising for the agentic future.

The convergence on baseline means switching cost is lower than people fear — all three expose the same OpenAI-compatible HTTP API. The model weights are the same. You can swap engines under a service mesh.

Takeaway

The vLLM-vs-TGI-vs-SGLang choice in 2026 is no longer “which one is fastest” — they’re all fast enough. It’s which workload shape are you optimising for, and which trade-off can you absorb? Pick by workload, not by benchmark headline.

If you remember one thing: most teams should start with whichever stack their team already knows, ship to production, then benchmark on their actual traffic. The 30% throughput delta will pay for itself in a quarter; the lost two weeks of “we’re rewriting our serving stack again” never will.


Further reading: vLLM project docs, TGI documentation, SGLang docs, the LMSYS engine comparison post, and the SGLang RadixAttention paper.

Skip to content