What is the difference between DDP and FSDP for distributed training?

For MLOps Engineer ML Engineer research-engineer

The short answer

Distributed Data Parallel (DDP) replicates the full model on every GPU and synchronizes gradients each step, which is simple but requires the whole model to fit on one GPU. Fully Sharded Data Parallel (FSDP) shards parameters, gradients, and optimizer states across GPUs and gathers them on demand, drastically cutting per-GPU memory so you can train much larger models at the cost of extra communication.

How to think about it

Distributed Data Parallel (DDP) replicates the full model on every GPU and synchronizes gradients each step, which is simple but requires the whole model to fit on one GPU. Fully Sharded Data Parallel (FSDP) shards parameters, gradients, and optimizer states across GPUs and gathers them on demand, drastically cutting per-GPU memory so you can train much larger models at the cost of extra communication.

Learn it properly Multi-GPU: DDP & FSDP

Keep practising

How does DVC differ from a feature store, and when would you reach for each? Why are GPUs used for deep learning instead of CPUs? How do you optimise GPU utilization for model serving, and what role does dynamic batching play? When and how should you trigger model retraining — scheduled vs. event-driven? What is mixed precision training and why does it matter?

All Deep Learning questions

Explore further

Pipeline parallelism Mixed precision Direct Preference Optimization (DPO)

FSDP (Fully Sharded Data Parallel) RDD (Resilient Distributed Dataset) Shuffle Dropout