Distributed training: FSDP vs DeepSpeed vs Megatron in production
For serious pretraining or fine-tuning, you pick from three: FSDP2 (PyTorch native), DeepSpeed (ZeRO stages), or Megatron-LM (NVIDIA, 3D parallel). The frontier labs have made their bets — Llama 3 went Megatron, FSDP2 is the open-source default under 70B, and DeepSpeed survives where ZeRO offload is necessary. Here's how to choose.
There’s a question that every team scaling past a single GPU asks eventually: should we use FSDP, DeepSpeed, or Megatron-LM? The answer in 2026 isn’t a single winner — it’s a sorted hierarchy by use case. The frontier labs have made their bets and the receipts are public.
Llama 3 405B trained on a custom stack centered on Megatron-LM with 4D parallelism, achieving 400 TFLOPs/GPU on 16,000 H100s. Mistral’s training stack is Megatron-based. DeepSeek-V3’s training combined Megatron-style tensor parallelism with their own MoE-aware sharding. Below the frontier labs, the open-source community lives in two camps: FSDP2 is the default for 7B-30B fine-tuning on a handful of nodes, and DeepSpeed survives where ZeRO offload to CPU or NVMe is necessary because the GPU memory budget is tight.
This post is the practical map. What each framework actually does under the hood, what the production trade-offs are, and the decision tree that fits real GPU budgets.
What “distributed training” actually means
Before the framework comparison, the model-parallelism dimensions need to be on the table:
- Data parallelism (DP) — every GPU has a full model copy; each gets a different microbatch; gradients are all-reduced. Simplest, works until the model exceeds GPU memory.
- Tensor parallelism (TP) — individual matrix multiplications are sharded across GPUs. A single layer’s compute happens on multiple GPUs cooperatively. Needs fast interconnect (NVLink). Limited to within-node for performance reasons (usually 2-8 GPUs).
- Pipeline parallelism (PP) — different layers live on different GPUs. Microbatches flow through the pipeline. Adds bubble overhead but scales across nodes well.
- Expert parallelism (EP) — for MoE models, different experts live on different GPUs. The current frontier for sparse models.
- Sequence parallelism (SP) — within a TP group, activations are sharded along the sequence dimension. Saves activation memory.
- ZeRO sharding — partition optimizer states, gradients, and parameters across DP ranks. ZeRO-1 shards optimizer states only; ZeRO-2 adds gradients; ZeRO-3 adds parameters (this is what FSDP and DeepSpeed-3 both do).
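To make the ZeRO stages concrete, here is a minimal sketch of per-GPU memory for model state under each stage. It assumes mixed-precision Adam with 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer state (fp32 master weights plus two moments), which is the accounting used in the ZeRO paper. The function name is illustrative, and activations and communication buffers are ignored.

```python
def zero_state_gb_per_gpu(num_params: float, dp_degree: int, stage: int) -> float:
    """Approximate model-state memory per GPU (decimal GB) for a ZeRO stage.

    Assumes mixed-precision Adam: 2 bytes/param for weights, 2 for gradients,
    12 for optimizer state. Activations and buffers are not counted.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if dp_degree < 1:
        raise ValueError("dp_degree must be at least 1")

    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter, unsharded
    if stage >= 1:
        optim /= dp_degree    # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= dp_degree    # ZeRO-2: also shard gradients
    if stage >= 3:
        weights /= dp_degree  # ZeRO-3: also shard parameters
    return num_params * (weights + grads + optim) / 1e9


if __name__ == "__main__":
    # Example: a 7B model across 8 data-parallel ranks.
    for stage in range(4):
        gb = zero_state_gb_per_gpu(7e9, dp_degree=8, stage=stage)
        print(f"ZeRO-{stage}: {gb:.1f} GB per GPU")
```

Stage 0 here is plain data parallelism, included as the baseline the other stages improve on.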
The framework comparison is essentially: which of these dimensions does each framework let you compose easily, and what are the ergonomics like?
FSDP2: the modern open-source default
FSDP started as PyTorch’s answer to ZeRO-3, and for a long time it was “DeepSpeed but worse on stability.” That changed in 2024 with the release of FSDP2, which rewrote the internals on top of PyTorch’s new DTensor abstraction. The visible-from-outside difference is that parameters are sharded per-parameter rather than flattened and concatenated together, which sounds like a small detail and isn’t.
Why per-parameter sharding matters in practice: it composes cleanly with other PyTorch primitives. You can mix and match dtypes per layer, freeze individual parameters for LoRA without rewriting the sharding logic, and use torch.compile end-to-end (which gives meaningful speedups in 2026 that earlier sharded implementations couldn’t capture). The PyTorch FSDP2 tutorial shows a 70B Llama fine-tune fitting on 8 H100s with LoRA, a realistic single-node setup that didn’t work cleanly with FSDP1.
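A toy illustration of the two sharding layouts, in plain Python rather than PyTorch. It is a simplified model, not how either implementation is written: the flat scheme concatenates every parameter into one buffer and cuts it into equal chunks per rank, while the per-parameter scheme splits each parameter along its first dimension. The parameter names and shapes are made up.

```python
from math import prod

# Hypothetical parameter shapes for a tiny model (names are illustrative).
PARAMS = {
    "embed.weight": (1000, 64),
    "layer0.attn.weight": (64, 64),
    "layer0.lora_A": (8, 64),
}


def flat_shard(params, world_size):
    """FSDP1-style toy: flatten everything, give each rank one contiguous chunk.

    Returns, per rank, a dict of parameter name -> number of elements of that
    parameter that fall inside the rank's chunk.
    """
    total = sum(prod(shape) for shape in params.values())
    chunk = -(-total // world_size)  # ceiling division; last chunk may be padded
    ranks = [dict() for _ in range(world_size)]
    offset = 0
    for name, shape in params.items():
        remaining = prod(shape)
        while remaining > 0:
            rank = offset // chunk
            take = min(remaining, (rank + 1) * chunk - offset)
            ranks[rank][name] = ranks[rank].get(name, 0) + take
            offset += take
            remaining -= take
    return ranks


def per_param_shard(params, world_size):
    """FSDP2-style toy: split each parameter along dim 0 across all ranks.

    Returns, per rank, a dict of parameter name -> local shard shape.
    """
    ranks = [dict() for _ in range(world_size)]
    for name, shape in params.items():
        rows = -(-shape[0] // world_size)  # rows per rank, rounded up
        for rank in range(world_size):
            start = min(rank * rows, shape[0])
            stop = min(start + rows, shape[0])
            ranks[rank][name] = (stop - start,) + tuple(shape[1:])
    return ranks


if __name__ == "__main__":
    print(flat_shard(PARAMS, 4))
    print(per_param_shard(PARAMS, 4))
```

In the flat layout a rank’s chunk can mix pieces of unrelated parameters, so anything that varies per parameter (dtype, whether it is frozen) has to be reconciled at the level of the flat buffer. In the per-parameter layout every local shard is still a slice of exactly one named parameter.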
The benchmarks that matter: Hugging Face’s comparison of FSDP and DeepSpeed shows FSDP2 running with up to ~5x higher per-iteration throughput than DeepSpeed ZeRO-3 in the regime where both fit in memory, mainly because of the cleaner integration with torch.compile and fewer copy steps in the gradient reduce-scatter path.
What FSDP2 still gives up: there’s no integrated NVMe offload, the pipeline-parallel story is still maturing (you reach for PiPPy, which is good but separate, and has since been upstreamed into PyTorch as torch.distributed.pipelining), and tensor parallelism means composing FSDP2 yourself with the DTensor-based torch.distributed.tensor.parallel APIs rather than setting a flag. For training a 7B-30B model on a few nodes, none of these matter; for training a 200B model on a thousand nodes, all of them do.
FSDP2 is what most people writing new training code in 2026 should reach for first. The bar to switch away from it is “FSDP2 measurably doesn’t fit my memory budget” or “I need 3D parallelism.”
The one caveat to the FSDP2 default: there’s still a body of training code in the wild that pre-dates FSDP2 and uses FSDP1. The migration from FSDP1 to FSDP2 is mostly mechanical (the new fully_shard API replaces FullyShardedDataParallel), but it’s not free, and the bugs that show up during migration tend to involve mixed-precision and optimizer state handling. The PyTorch team has been steadily improving the migration story; by mid-2026 most internal Meta code is on FSDP2.
DeepSpeed: still the offload king
DeepSpeed introduced ZeRO before PyTorch had FSDP, and for a window of about three years it was the only way to train models meaningfully larger than your GPU memory. The framework has matured, accumulated features (Mixture of Experts, MII inference, etc.), and now sits in an interesting position: it’s no longer the fastest at the things FSDP2 also does, but it’s still the only framework with first-class CPU and NVMe offload for parameters, gradients, and optimizer states.
Why offload still matters: a 70B model in fp16 needs 140GB for parameters alone, plus another 140GB for gradients, plus 840GB for Adam optimizer states (fp32 master weights, first moments, and second moments, at 4 bytes per parameter each). That’s 1,120GB of GPU memory just for the static state, before activations. On 8 H100s (640GB total), the model doesn’t fit. With ZeRO-3 + CPU offload, the optimizer states live in CPU RAM (which is cheap and plentiful) and you can fit the training step. With NVMe offload, you can go further: ZeRO-Infinity, DeepSpeed’s NVMe offload engine, demonstrated training a 175B model on a single GPU by paging optimizer state to NVMe.
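The arithmetic is easy to rerun for other model sizes. The sketch below assumes the usual mixed-precision Adam accounting (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 master copy plus two fp32 moments) and 80GB per H100. The class and function names are illustrative, and activations are ignored.

```python
from dataclasses import dataclass

GB = 1e9  # decimal gigabytes, matching the round numbers used in the text


@dataclass
class StaticState:
    weights_gb: float
    grads_gb: float
    optimizer_gb: float

    @property
    def total_gb(self) -> float:
        return self.weights_gb + self.grads_gb + self.optimizer_gb


def static_state(num_params: float,
                 weight_bytes: int = 2,
                 grad_bytes: int = 2,
                 optimizer_bytes: int = 12) -> StaticState:
    """Memory for weights, gradients, and Adam state, before activations."""
    return StaticState(
        weights_gb=num_params * weight_bytes / GB,
        grads_gb=num_params * grad_bytes / GB,
        optimizer_gb=num_params * optimizer_bytes / GB,
    )


def gpu_gb_needed(state: StaticState, offload_optimizer: bool) -> float:
    """GPU-resident static state; offloaded optimizer state lives in CPU RAM."""
    return state.total_gb - (state.optimizer_gb if offload_optimizer else 0.0)


if __name__ == "__main__":
    state = static_state(70e9)
    cluster_gb = 8 * 80  # eight 80GB H100s
    for offload in (False, True):
        need = gpu_gb_needed(state, offload)
        verdict = "fits" if need <= cluster_gb else "does not fit"
        print(f"offload_optimizer={offload}: {need:.0f} GB on GPU, "
              f"{verdict} in {cluster_gb} GB")
```

Passing optimizer_bytes=8 counts only the two Adam moments and leaves out the fp32 master copy, which is the other accounting you will see quoted.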
The cost is throughput. CPU offload roughly halves training speed. NVMe offload makes it 10x slower. But for teams that need to fine-tune a 70B model on the GPU budget they actually have (rather than the one they wish they had), DeepSpeed’s offload is the difference between the project being possible and not.
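For reference, a minimal sketch of what ZeRO-3 with CPU offload looks like in a DeepSpeed config, written here as a Python dict and dumped to JSON. The keys shown (zero_optimization, offload_optimizer, offload_param, bf16) are standard DeepSpeed config fields. The batch sizes, the output filename, and the choice to offload parameters as well as optimizer state are placeholder assumptions to adapt to your hardware.

```python
import json

# Sketch of a DeepSpeed config for ZeRO-3 with CPU offload.
# Numeric values are placeholders, not tuned recommendations.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Keep optimizer state in pinned CPU memory instead of on the GPU.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # Optionally page parameters out to CPU as well.
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
    },
}


def write_config(path: str = "ds_zero3_offload.json") -> str:
    """Write the config to disk as JSON and return the path."""
    with open(path, "w") as f:
        json.dump(ds_config, f, indent=2)
    return path


if __name__ == "__main__":
    print(write_config())
```

NVMe offload uses the same two offload blocks with "device" set to "nvme" plus an "nvme_path" pointing at a fast local disk. Which blocks to enable depends on how much CPU RAM and PCIe bandwidth the nodes have, so treat this as a starting point.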
The other reason DeepSpeed survives: it’s the easiest path to Microsoft’s training-and-serving stack. DeepSpeed-Chat, DeepSpeed-MII, and DeepSpeed-FastGen are tightly integrated, and if your stack is on Azure ML, DeepSpeed is the path-of-least-resistance.
What DeepSpeed gives up vs FSDP2: ergonomics. The config-file approach that made DeepSpeed accessible in 2021 now looks dated next to FSDP2’s “just wrap your model” Python API. Stability is the other cost: it has improved, but on long pretraining runs DeepSpeed’s track record is still worse than its users tend to acknowledge. The “FSDP is less stable” lore traces back to Meta’s documented FSDP experience on Llama models, and FSDP2 has closed most of that gap.
Megatron-LM: the frontier-lab choice
NVIDIA’s Megatron-LM is the framework every published frontier model uses for the pretraining portion of its lifecycle. Llama 3 405B trained on a Megatron-derived stack. Mistral’s models train on Megatron. NVIDIA’s own Nemotron trained on Megatron. The reason isn’t ergonomics — it’s that Megatron-LM is the only widely-available framework with production-quality tensor + pipeline + data + sequence parallelism composed together (sometimes called 3D or 4D parallelism).
For a 405B model, you need all four dimensions to fit on any realistic cluster. The Llama 3 paper reports using:
- 8-way tensor parallelism (within a node, over NVLink)
- 16-way pipeline parallelism (across nodes)
- 128-way data parallelism
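- Context parallelism as the fourth dimension (the paper’s name for splitting each sequence across GPUs), set to 1 at 8K sequence length, so it adds no factor here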
Composed, that’s 8 × 16 × 128 = 16,384 GPUs, which is the size of the cluster. The 4D parallelism is what makes the math work: no single dimension would have scaled that far on its own.
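How those three degrees tile a cluster can be sketched in a few lines. The layout below assumes tensor parallelism is the fastest-varying dimension, so each 8-GPU tensor-parallel group occupies consecutive ranks (one node, if nodes hold 8 GPUs). That ordering is a common convention but an assumption here, and real frameworks make it configurable. The names are illustrative.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    dp: int  # data-parallel replica index
    pp: int  # pipeline stage index
    tp: int  # tensor-parallel shard index


def world_size(tp: int, pp: int, dp: int) -> int:
    """Total GPUs needed when the three degrees are composed."""
    return tp * pp * dp


def rank_to_coord(rank: int, tp: int, pp: int, dp: int) -> Coord:
    """Map a global rank to (dp, pp, tp), with tp varying fastest."""
    if not 0 <= rank < world_size(tp, pp, dp):
        raise ValueError(f"rank {rank} out of range")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return Coord(dp=dp_idx, pp=pp_idx, tp=tp_idx)


if __name__ == "__main__":
    TP, PP, DP = 8, 16, 128  # degrees quoted in the text
    print(world_size(TP, PP, DP))
    # Ranks 0-7 share one pipeline stage and one replica: a single TP group.
    for r in (0, 7, 8, 128):
        print(r, rank_to_coord(r, TP, PP, DP))
```

Keeping the tensor-parallel index innermost is what lets the most communication-heavy dimension stay on NVLink while pipeline and data parallelism cross the slower inter-node network.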
Megatron also wins on raw throughput within its sweet spot. The Llama 3 ISCA paper reports 400 TFLOPs/GPU for the 405B at 8K sequence length and 380 TFLOPs/GPU at 131K sequence length. These are strong numbers: roughly 40% of the H100’s dense BF16 peak of about 989 TFLOPs, which is high utilization to sustain across 16K GPUs. FSDP2 alone on the same hardware would likely top out at 150-200 TFLOPs because it can’t compose tensor parallelism on the critical path.
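A quick way to put per-GPU throughput figures in context is model FLOPs utilization (MFU), the ratio of achieved to peak FLOPs. The sketch below assumes a dense BF16 peak of 989 TFLOPs for an H100 SXM (NVIDIA’s published figure without sparsity) and uses the common approximation of 6 FLOPs per parameter per token for training, which ignores attention FLOPs and so overstates token throughput at long sequence lengths. The constant and function names are illustrative.

```python
H100_BF16_DENSE_TFLOPS = 989.0  # assumed peak: H100 SXM, dense BF16


def mfu(achieved_tflops: float,
        peak_tflops: float = H100_BF16_DENSE_TFLOPS) -> float:
    """Model FLOPs utilization as a fraction of peak."""
    return achieved_tflops / peak_tflops


def tokens_per_sec_per_gpu(achieved_tflops: float, num_params: float) -> float:
    """Rough training throughput using ~6 FLOPs per parameter per token."""
    flops_per_token = 6.0 * num_params
    return achieved_tflops * 1e12 / flops_per_token


if __name__ == "__main__":
    for label, tflops in (("8K sequence", 400.0), ("131K sequence", 380.0)):
        print(f"{label}: MFU {mfu(tflops):.1%}, "
              f"~{tokens_per_sec_per_gpu(tflops, 405e9):.0f} tokens/s/GPU")
```

Under these assumptions both reported figures land at roughly 40% of peak, which is the useful way to compare them against other training stacks on the same hardware.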
What Megatron gives up: ergonomics, and how. Setting up a Megatron training job is a multi-day exercise in YAML files, partition strategies, and --tensor-model-parallel-size flags. The codebase assumes you have a dedicated infrastructure team. There’s a reason no one fine-tunes a 7B model on Megatron: it’s like using a Boeing factory to build a bicycle.
The honest take: if you’re training at the frontier, you use Megatron (or you fork it, like DeepSeek did). If you’re not, you don’t.
A footnote on the alternatives. The frontier-lab segment isn’t a Megatron monoculture: there are a handful of internal frameworks (Google’s Pathways, Anthropic’s training stack, etc.) that don’t show up in the public discussion because they’re not open-sourced. The common thread is that they all solve the same problem Megatron solves (composable tensor + pipeline + data parallelism at 10K-GPU scale), just with different bets on which abstractions matter most. For the open-source community, Megatron is the only realistic option in this class.
Hugging Face Accelerate: the wrapper that lets you swap
The interesting development in 2024-2025 was that Hugging Face’s Accelerate library became a viable wrapper over all three. You write your training loop once, then accelerate config lets you switch between FSDP, DeepSpeed, and Megatron-LM with config changes only, with no code rewrite.
This matters for prototyping: you can start with FSDP2 on a single node, hit a memory wall, switch to DeepSpeed ZeRO-3 with offload, and later move to Megatron if you scale further. The Accelerate docs cover all three backends with Llama-70B SFT examples on 8×H100.
The trade is that Accelerate’s coverage of the most advanced features of each backend is imperfect. If you want to use Megatron’s most aggressive sequence-parallel optimizations, you’ll be back in Megatron’s own launcher. If you want DeepSpeed’s ZeRO-Infinity NVMe offload, you’ll need DeepSpeed-specific configuration. Accelerate is the 80% solution, and 80% is enough for most teams.
The actual decision
What I’ve watched work across teams shipping real training jobs in 2025-2026:
What to take away
After watching three years of the FSDP-vs-DeepSpeed-vs-Megatron debate, the conclusions that hold up are:
- FSDP2 is the modern default for 7B-30B fine-tuning. PyTorch-native, fast, and the ergonomics dominate. Reach for it first.
- DeepSpeed survives where ZeRO offload is necessary — usually 30B-70B fine-tuning on a tight memory budget. CPU and NVMe offload remain the unique feature.
- Megatron-LM is the only realistic choice at frontier scale. Llama 3, Mistral, Mixtral, DeepSeek, Nemotron — every published frontier model trained on Megatron or a fork of it. Below 70B you don’t need it; above 100B you don’t have a real alternative.
- Accelerate is the wrapper that lets you swap. Useful, has its limits, but a sensible default for code you want to be backend-agnostic.
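The takeaways above can be condensed into a rule-of-thumb helper. This is a sketch of the post’s own heuristics, not a benchmark-derived policy: the function name and arguments are hypothetical, the 100B threshold is the rough boundary quoted in the list, and the behavior between the quoted size bands is filled in by assumption.

```python
def suggest_framework(params_billion: float,
                      fits_in_gpu_memory: bool,
                      needs_3d_parallelism: bool = False) -> str:
    """Rule-of-thumb backend choice following the takeaways in this post.

    params_billion: model size in billions of parameters.
    fits_in_gpu_memory: whether sharded training state fits without offload.
    needs_3d_parallelism: whether tensor + pipeline + data parallelism
        must be composed to reach the target scale.
    """
    if params_billion <= 0:
        raise ValueError("params_billion must be positive")
    # Frontier scale, or an explicit need for composed parallelism.
    if needs_3d_parallelism or params_billion > 100:
        return "Megatron-LM"
    # Memory-bound runs: offload to CPU or NVMe is the deciding feature.
    if not fits_in_gpu_memory:
        return "DeepSpeed (ZeRO-3 + offload)"
    # Default: PyTorch-native sharding.
    return "FSDP2"


if __name__ == "__main__":
    print(suggest_framework(13, fits_in_gpu_memory=True))
    print(suggest_framework(70, fits_in_gpu_memory=False))
    print(suggest_framework(405, fits_in_gpu_memory=True))
```

The helper deliberately leaves out the factor the closing paragraph argues matters most, which is how fluent the team already is in a given framework.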
The honest take that nobody likes to say out loud: the framework choice matters less than the engineering team’s familiarity with it. A team that’s deeply fluent in DeepSpeed will produce a better training run on DeepSpeed than they would on a marginally better framework they’ve never used. The frontier labs use Megatron because they have the engineering depth to run Megatron; the open-source community uses FSDP2 because PyTorch ergonomics are what most teams have. The right framework is the one your team can actually operate.
Further reading: the PyTorch FSDP2 tutorial, DeepSpeed’s documentation, the Megatron-LM repo, and Hugging Face’s FSDP vs DeepSpeed concept guide. For the Llama 3 training details, the Scaling Llama 3 ISCA paper is the best published reference.