datarekha
Deep Learning · Hard · Asked at OpenAI, Google, Stability AI, NVIDIA, Meta

What are the high-level differences between GANs, VAEs, and diffusion models?

The short answer

GANs train a generator and discriminator adversarially to produce sharp samples but suffer from unstable training and mode collapse. VAEs optimise a tractable evidence lower bound for principled probability modelling but generate blurry samples. Diffusion models iteratively denoise from Gaussian noise, achieving state-of-the-art sample quality and diversity at the cost of slow sampling.

How to think about it

All three are deep generative models: they learn the data distribution p(x), either explicitly (VAEs, diffusion models) or implicitly through a sampler (GANs), so they can generate new samples from it. They differ in how they do it.

GANs (Generative Adversarial Networks)

Two networks in competition: a generator G that maps noise to samples, and a discriminator D that classifies real vs fake. G is trained to fool D; D is trained to catch G.

  • Sharp, high-frequency details, because D learns to penalise artifacts (such as blur) that distinguish generated samples from real ones.
  • Mode collapse: G can win by generating a narrow subset of the distribution.
  • Training instability: the balance between G and D is fragile; if D becomes too accurate, its output saturates and the gradient passed back to G vanishes.
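  • Evaluation: GANs give no tractable likelihood, so samples are scored with metrics such as FID (Fréchet Inception Distance), which compares feature statistics of generated and real images.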
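To make the vanishing-gradient point concrete, here is a minimal sketch in plain Python. It assumes D outputs sigmoid(logit) for a generated sample, and it compares the gradient the generator receives under the original minimax loss, log(1 - D(G(z))), with the commonly used non-saturating alternative, -log D(G(z)). The function names are illustrative and do not come from any library.

```python
import math


def sigmoid(a):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for D: it wants d_real -> 1 and d_fake -> 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_grad_minimax(logit):
    """d/d(logit) of log(1 - sigmoid(logit)), the original minimax G loss."""
    return -sigmoid(logit)


def generator_grad_non_saturating(logit):
    """d/d(logit) of -log(sigmoid(logit)), the non-saturating G loss."""
    return -(1.0 - sigmoid(logit))


# Assumed scenario: D is confident the sample is fake (logit = -8).
confident_fake_logit = -8.0
grad_minimax = generator_grad_minimax(confident_fake_logit)
grad_non_saturating = generator_grad_non_saturating(confident_fake_logit)
```

In this scenario the minimax gradient has magnitude of roughly 3e-4, while the non-saturating gradient is close to 1. That gap is one reason the non-saturating form is usually preferred in practice.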

VAEs (Variational Autoencoders)

An encoder maps each input to a distribution over a latent space, q(z|x), and a decoder maps samples from that distribution back to data. The loss is a reconstruction term plus a KL-divergence penalty that keeps q(z|x) close to a standard Gaussian prior.

  • Principled probability model with a tractable objective.
  • Smooth, interpolable latent space — great for controllable generation.
  • Generated samples tend to be blurry because a pixel-wise reconstruction loss (for example MSE) is minimised by the average of all plausible outputs.
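The two-term loss described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a full VAE: it assumes a diagonal Gaussian encoder parameterised by mu and log_var, a standard Gaussian prior, and a squared-error reconstruction term (which corresponds to a fixed-variance Gaussian decoder, up to constants). All function names are hypothetical.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


def reparameterise(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    Writing the sample this way keeps it differentiable in mu and log_var.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def negative_elbo(x, x_recon, mu, log_var):
    """Reconstruction error plus KL penalty (lower is better)."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
z = reparameterise([0.0, 1.0], [0.0, -2.0], rng)  # one latent sample
```

The KL term is zero only when the encoder outputs exactly the prior (mu = 0, log_var = 0), so it pulls every q(z|x) towards the same Gaussian. That pull is what makes sampling z from the prior at generation time reasonable.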

Diffusion models (DDPMs, score-based models)

Gradually add Gaussian noise to data (forward process) and train a neural network to reverse this process step by step (denoising). At inference, start from pure noise and iteratively denoise.

  • Best sample quality and diversity of the three (as of 2024).
  • Slow sampling: 50–1000 denoising steps, though DDIM and consistency models reduce this.
  • Far less prone to mode collapse than GANs; stable training via a simple MSE objective (typically predicting the noise that was added).
  • Underpins Stable Diffusion, DALL-E 3, Sora.
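The forward process and the MSE objective can be sketched as follows. This is a minimal illustration in plain Python, assuming the DDPM-style closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps and a linear beta schedule from 1e-4 to 0.02. The function names are hypothetical, and eps_pred stands in for the output of a denoising network that is not shown.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        frac = t / max(num_steps - 1, 1)  # guard against num_steps == 1
        beta = beta_start + (beta_end - beta_start) * frac
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, alpha_bar_t, rng):
    """Jump straight to step t of the forward (noising) process."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [
        math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
        for x, e in zip(x0, eps)
    ]
    return xt, eps


def denoising_mse(eps, eps_pred):
    """Training loss: mean squared error between true and predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred)) / len(eps)


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)
x_t, eps = q_sample([0.5, -0.3, 1.2], alpha_bars[500], rng)
```

With 1000 steps the final alpha_bar is close to zero, so x_T is almost pure Gaussian noise. Sampling reverses this chain one step at a time, which is where the cost of N network evaluations comes from.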

Side-by-side

                           GAN                     VAE             Diffusion
Training stability         Fragile                 Stable          Stable
Sample sharpness           Excellent               Blurry          Excellent
Latent space control       Weak                    Strong          Moderate
Sampling speed             Fast (1 pass)           Fast (1 pass)   Slow (N steps)
Mode coverage              Poor (collapse risk)    Good            Excellent
State of the art in 2024   Audio (GAN-TTS)         Rarely          Images, video, audio