datarekha

What is the difference between DDP and FSDP for distributed training?

The short answer

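Distributed Data Parallel (DDP) keeps a full replica of the model on every GPU and averages gradients across GPUs with an all-reduce on each step. It is simple to set up, but the parameters, gradients, and optimizer state must all fit in a single GPU's memory. Fully Sharded Data Parallel (FSDP) instead shards parameters, gradients, and optimizer state across GPUs and all-gathers the parameters it needs just before they are used in the forward or backward pass. With N GPUs, the memory for those model states drops to roughly 1/N per GPU, so you can train much larger models, at the cost of extra communication.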
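To make the memory difference concrete, here is a rough back-of-the-envelope sketch. It assumes plain fp32 training with Adam (4 bytes per parameter, 4 per gradient, 8 for Adam's two moment estimates) and counts only model states, ignoring activations, communication buffers, and the parameters FSDP gathers temporarily. The function names are hypothetical helpers written for this illustration, not part of any library.

```python
# Rough per-GPU memory estimate for model states only (no activations).
# Assumption: fp32 parameters and gradients, Adam optimizer.
BYTES_PARAM = 4   # fp32 parameter
BYTES_GRAD = 4    # fp32 gradient
BYTES_OPTIM = 8   # Adam: two fp32 moment estimates per parameter


def model_state_bytes(num_params):
    """Total bytes for parameters + gradients + optimizer state."""
    return num_params * (BYTES_PARAM + BYTES_GRAD + BYTES_OPTIM)


def per_gpu_bytes_ddp(num_params, world_size):
    """DDP: every GPU holds a full replica, so world_size does not help."""
    return model_state_bytes(num_params)


def per_gpu_bytes_fsdp(num_params, world_size):
    """FSDP with full sharding: each GPU holds about 1/world_size of the states."""
    return model_state_bytes(num_params) / world_size


if __name__ == "__main__":
    n_params, n_gpus = 1_000_000_000, 8  # illustrative: 1B parameters, 8 GPUs
    print(f"DDP : {per_gpu_bytes_ddp(n_params, n_gpus) / 1e9:.1f} GB per GPU")
    print(f"FSDP: {per_gpu_bytes_fsdp(n_params, n_gpus) / 1e9:.1f} GB per GPU")
```

Under these assumptions, a 1B-parameter model needs about 16 GB of model state per GPU with DDP regardless of GPU count, and about 2 GB per GPU with FSDP on 8 GPUs. Real usage will be higher in both cases once activations and temporary buffers are included.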


