What is the difference between DDP and FSDP for distributed training?

Distributed Data Parallel (DDP) replicates the full model on every GPU and synchronizes gradients each step, which is simple but requires the whole model to fit on one GPU. Fully Sharded Data Parallel (FSDP) shards parameters, gradients, and optimizer states across GPUs and gathers them on demand, drastically cutting per-GPU memory so you can train much larger models at the cost of extra communication.

How do you optimise GPU utilization for model serving, and what role does dynamic batching play?

GPUs execute tensor operations efficiently only when the batch dimension is large enough to saturate all CUDA cores. Dynamic batching collects individual requests arriving within a short window and fuses them into a single GPU call, dramatically improving throughput and cost efficiency without sacrificing per-request latency beyond the configured wait threshold.

Why are GPUs used for deep learning instead of CPUs?

Neural network training is dominated by large matrix multiplications that are embarrassingly parallel. GPUs have thousands of small cores optimised for this exact operation, whereas CPUs have tens of powerful cores optimised for low-latency sequential logic. The throughput difference is 10–100x for typical DL workloads.

What is mixed precision training and why does it matter?

Mixed precision training stores weights and activations in float16 (or bfloat16) for forward/backward passes while keeping a float32 master copy of weights for the update step. This halves memory usage and delivers 2–4x throughput on modern tensor cores, with negligible accuracy loss when used with loss scaling.

Multi-GPU: DDP & FSDP — Deep Learning

At some point one GPU isn’t enough — either training is too slow, or the model simply doesn’t fit in memory. Multi-GPU training is now a baseline skill, and the whole subject comes down to one question you can answer with arithmetic: does the model’s training state fit on a single GPU? The answer picks your strategy.

First, the memory math

People underestimate training memory because they only count the weights. Training a model with the Adam optimizer in mixed precision costs roughly 16 bytes per parameter:

bf16 weights        2 bytes
bf16 gradients      2 bytes
fp32 master weights 4 bytes   ← the "master copy" for stable updates
Adam moment  m      4 bytes
Adam moment  v      4 bytes
                   --------
                   16 bytes / parameter

So a 7B-parameter model needs about 7e9 × 16 = 112 GB just for model state — already over an 80GB GPU, before activations. That single number drives the whole decision:

DDP — replicate the model, split the batch

Distributed Data Parallel is the workhorse when the model does fit. Every GPU holds a full copy of the model; you split each batch across the GPUs, each computes gradients on its shard, and then an all-reduce averages the gradients so every replica stays in sync before the optimizer step.

DDP: full model on every GPU, batch split, gradients averaged by all-reduce so replicas stay identical.

DDP gives near-linear speedup but no memory savings — each GPU still holds the entire model state. It’s the right tool when the model fits and you just want to train faster on a bigger effective batch.

# torchrun --nproc_per_node=4 train.py
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
model = DDP(model.to(local_rank), device_ids=[local_rank])
# ... the normal training loop; DDP all-reduces gradients during backward()

FSDP — shard the model when it won’t fit

When the full state exceeds one GPU, Fully Sharded Data Parallel splits the parameters, gradients, and optimizer states across the GPUs. Each GPU stores only its 1/N shard and gathers the full weights for a layer only momentarily, during that layer’s forward/backward, then frees them. Per-GPU memory drops roughly linearly with the number of GPUs — which is exactly what lets you train a 70B model that could never fit on one card.

Tensor parallelism — when one layer won’t fit

Data parallelism assumes a single GPU can hold one copy of the model. At frontier scale even one layer’s weight matrices are too big. Tensor parallelism (TP) splits an individual matrix multiply across GPUs.

Take an FFN’s Y = X · W where W is huge. Slice W by columns into W₁, W₂ on two GPUs; each computes X · Wᵢ to get half the output columns, then you concatenate. The next matrix is sliced by rows to match, so its partial outputs are summed with a single all-reduce. Attention shards naturally too: put different heads on different GPUs.

The catch is communication. TP does an all-reduce inside every layer, on every step, so the GPUs must be joined by very fast interconnect (NVLink within a node). That’s why TP is kept within a server (say 8 GPUs) while pipeline and data parallelism span across servers, where the network is slower and you can tolerate less frequent syncs.

So the 3D recipe at frontier scale: TP inside a node, PP across a few nodes to fit the model’s depth, DP/FSDP on top to scale throughput. The same idea reappears at inference — vLLM’s tensor_parallel_size shards a model that won’t fit on one GPU across several, with an all-reduce per layer.

In one breath

Multi-GPU is one arithmetic question: does the model’s training state fit on one GPU? Mixed-precision Adam costs ~16 bytes/parameter (2 weights + 2 grads + 4 fp32 master + 4 + 4 Adam), so 7B ≈ 112 GB.
If it fits: DDP — replicate the full model on every GPU, split the batch, all-reduce the gradients. Near-linear speedup, no memory savings.
If it doesn’t: FSDP — shard params/grads/optimizer states across GPUs, gathering a layer’s weights only momentarily. Per-GPU memory drops ~1/N, which is what lets a 70B model fit.
Beyond data parallelism: tensor parallelism splits one matmul across GPUs (all-reduce every layer → keep within an NVLink node); pipeline parallelism puts different layers on different GPUs.
Frontier scale combines all three (3D parallelism: TP in a node, PP across nodes, DP/FSDP on top) — but DDP or FSDP covers almost everything you’ll build.

Quick check

0/3

Q1Roughly how much memory does training a 7B-parameter model need for model state with mixed-precision Adam?

Q2What's the key difference between DDP and FSDP?

Q3Your model's training state is 200 GB and you have 8× 80GB GPUs. What should you do?

That completes the training half of the track. Next we move to the architecture that everything modern is built on — starting with how text becomes tensors in tokenization, then self-attention and the transformer block.

Multi-GPU: DDP & FSDP

What you'll learn

Before you start

First, the memory math

DDP — replicate the model, split the batch

FSDP — shard the model when it won’t fit

Tensor parallelism — when one layer won’t fit

In one breath

Quick check

Quick check

Next

Sign in to track your progress

Practice this in an interview

Related lessons

Explore further