What are the high-level differences between GANs, VAEs, and diffusion models?
GANs train a generator and a discriminator adversarially to produce sharp samples, but suffer from unstable training and mode collapse. VAEs optimise a tractable evidence lower bound, which gives a principled probabilistic model, but tend to generate blurry samples. Diffusion models iteratively denoise from Gaussian noise, achieving state-of-the-art sample quality and diversity at the cost of slow sampling.
How to think about it
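All three are deep generative models: they learn to model the data distribution p(x) so they can generate new samples from it. They differ in how they learn it. GANs learn it implicitly through an adversarial game, VAEs maximise a lower bound on the data likelihood, and diffusion models learn to reverse a gradual noising process.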
GANs (Generative Adversarial Networks)
Two networks in competition: a generator G that maps noise to samples, and a discriminator D that classifies real vs fake. G is trained to fool D; D is trained to catch G.
- Sharp, high-frequency details because D penalises any statistical artifact.
- Mode collapse: G can win by generating a narrow subset of the distribution.
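- Training instability: the balance between G and D is fragile; when D becomes too confident, its output saturates and the gradient reaching G vanishes.
- Evaluation: GANs give no explicit likelihood, so sample quality is usually measured with the Fréchet Inception Distance (FID), which compares feature statistics of generated and real images.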
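The adversarial objective and the vanishing-gradient problem can be sketched numerically. The following is a minimal illustration, not a training loop: it works on a single discriminator logit, and the function names are hypothetical. It compares the gradient of the original minimax generator loss, log(1 - D(G(z))), with the commonly used non-saturating alternative, -log D(G(z)), which this document does not mention but which is a standard remedy.

```python
import math


def sigmoid(a: float) -> float:
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def softplus(a: float) -> float:
    """Stable log(1 + exp(a)); note -log(sigmoid(a)) == softplus(-a)."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def discriminator_loss(real_logit: float, fake_logit: float) -> float:
    """Binary cross-entropy: D wants D(real) -> 1 and D(fake) -> 0."""
    return softplus(-real_logit) + softplus(fake_logit)


def generator_grad_saturating(fake_logit: float) -> float:
    """Gradient w.r.t. the logit of log(1 - D(G(z))) (minimax generator loss)."""
    return -sigmoid(fake_logit)


def generator_grad_non_saturating(fake_logit: float) -> float:
    """Gradient w.r.t. the logit of -log D(G(z)) (non-saturating loss)."""
    return sigmoid(fake_logit) - 1.0


# A very negative logit means D is confident the sample is fake.
confident_fake = -8.0
print(generator_grad_saturating(confident_fake))      # tiny magnitude: G barely learns
print(generator_grad_non_saturating(confident_fake))  # magnitude near 1: G still learns
```

When D easily rejects G's samples, the minimax gradient shrinks towards zero, while the non-saturating form keeps a usable signal. This addresses only one source of instability and does nothing for mode collapse.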
VAEs (Variational Autoencoders)
An encoder maps each input to a distribution over a latent space, q(z|x), and a decoder maps samples from that distribution back to data. The loss is a reconstruction term plus a KL-divergence penalty that keeps q(z|x) close to a standard Gaussian prior.
- Principled probability model with a tractable objective.
- Smooth, interpolable latent space — great for controllable generation.
- Generated samples tend to be blurry because the reconstruction loss averages over uncertainty.
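The two loss terms described above can be written out directly. This is a minimal sketch in plain Python under stated assumptions: the encoder outputs a diagonal Gaussian parameterised by `mu` and `log_var`, the decoder is Gaussian with unit variance (so the reconstruction term reduces to half the squared error, up to a constant), and all function names are hypothetical. A real VAE would compute these on tensors produced by neural networks.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def vae_loss(x, x_recon, mu, log_var):
    """Negative ELBO up to a constant: reconstruction term plus KL penalty."""
    reconstruction = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return reconstruction + kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
mu, log_var = [0.5, -0.5], [0.0, 0.0]
z = reparameterize(mu, log_var, rng)      # latent sample fed to the decoder
print(kl_to_standard_normal(mu, log_var))  # 0.25: penalty for mu away from 0
```

The squared-error reconstruction term is also where the blurriness comes from: when several sharp outputs are plausible, the prediction that minimises expected squared error is their average.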
Diffusion models (DDPMs, score-based models)
Gradually add Gaussian noise to data (forward process) and train a neural network to reverse this process step by step (denoising). At inference, start from pure noise and iteratively denoise.
- Best sample quality and diversity of the three (as of 2024).
- Slow sampling: 50–1000 denoising steps, though DDIM and consistency models reduce this.
- No mode collapse; stable training via a simple MSE objective.
- Underpins Stable Diffusion, DALL-E 3, Sora.
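The forward process and the MSE objective can be sketched without a neural network. The example below is illustrative only and assumes the DDPM formulation: a linear noise schedule from 1e-4 to 0.02 over 1000 steps, the closed-form sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, and a network trained to predict eps. The function names are hypothetical, and the "prediction" here is a placeholder for what a trained model would output.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed DDPM-style defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def noise_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; returns (x_t, eps)."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


def denoising_mse(eps_true, eps_pred):
    """Training objective: mean squared error between true and predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [1.0, -1.0, 0.5]
x_t, eps = noise_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)
eps_pred = [0.0] * len(x0)  # placeholder for a network's output given (x_t, t)
loss = denoising_mse(eps, eps_pred)
```

Because alpha_bar_t shrinks towards zero as t grows, the final x_t is almost pure Gaussian noise, which is why sampling can start from noise and step backwards. Each backward step needs one network evaluation, which is the source of the slow sampling noted above.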
Side-by-side
| | GAN | VAE | Diffusion |
|---|---|---|---|
| Training stability | Fragile | Stable | Stable |
| Sample sharpness | Excellent | Blurry | Excellent |
| Latent space control | Weak | Strong | Moderate |
| Sampling speed | Fast (1 pass) | Fast (1 pass) | Slow (N steps) |
| Mode coverage | Poor (collapse risk) | Good | Excellent |
| State of the art in 2024 | Audio (GAN-TTS) | Rarely | Images, video, audio |