datarekha

Generative Models: GANs, VAEs & Diffusion

Discriminative models tell cats from dogs; generative models can draw a new cat. Three families do it three different ways — and one of them now powers image AI.

9 min read · Advanced Deep Learning · Lesson 17 of 17

What you'll learn

  • Why generative models learn p(x) instead of p(y|x), and why that is harder
  • How GANs pit a generator against a discriminator and what mode collapse looks like
  • How VAEs encode data into a smooth latent cloud and why samples come out blurry
  • How diffusion models reverse a noising process step by step — the engine behind Stable Diffusion
  • Which family to reach for given your quality, speed, and stability requirements

Before you start

In 2014 a research group tried to train a neural network to generate realistic bedroom photos. After days of training the network had learned one trick: output a single blurry average of every bedroom it had seen. Ask for 100 samples and you got 100 nearly identical blobs. The model had learned something — but not how to generate the variety that exists in real bedrooms.

That failure has a precise name. The network was modeling p(x) — the full probability distribution over every possible image — and it had collapsed that infinite distribution down to a single point. Fixing it required a completely new family of architectures. Three of those families are now responsible for almost every AI-generated image, voice, and molecule you encounter.

Discriminative vs. generative: the key split

A discriminative model learns p(y | x) — the probability of a label y given an input x. Given a photo, predict “cat” or “dog.” It draws a decision boundary through the data. It never needs to know what a cat looks like from scratch; it only needs to know which side of the boundary a given photo falls on.

A generative model learns p(x) — the probability of the data itself. It must understand the full structure of what a cat image can be so that it can draw a new one by sampling from that distribution. That is a strictly harder problem: you can classify cats without being able to paint one.

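Once you have a generative model, you can do two things a classifier cannot:

  1. Sampling: draw brand-new examples from the distribution.
  2. Density estimation: score how likely a given input is (useful for anomaly detection). Not every family exposes this score directly; GANs, for example, can sample but do not give an explicit density.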
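Both abilities can be seen in the simplest possible generative model. The sketch below is a toy illustration, not something from a real image pipeline: it assumes the data is one-dimensional and fits a single Gaussian as p(x). The names `fit_gaussian` and `log_density` are made up for this example.

```python
import math
import random
import statistics

def fit_gaussian(data):
    """Fit p(x) = N(mean, std^2) by maximum likelihood."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)  # population std is the maximum-likelihood estimate
    return mean, std

def log_density(x, mean, std):
    """log p(x) under the fitted Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

rng = random.Random(0)
data = [rng.gauss(5.0, 1.0) for _ in range(2000)]  # toy "training set"
mean, std = fit_gaussian(data)

# Ability 1: sampling. Draw new points from the learned distribution.
new_samples = [rng.gauss(mean, std) for _ in range(3)]

# Ability 2: density estimation. A far-away input gets a much lower score.
typical = log_density(5.0, mean, std)
outlier = log_density(12.0, mean, std)
print(new_samples, typical, outlier)
```

Real images need far richer models than one Gaussian, which is what the three families that follow provide.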

Three families have emerged as the dominant approaches, each with a different answer to the question: how do you approximate p(x) with a neural network?


Family 1 — GANs (Generative Adversarial Networks)

Ian Goodfellow’s 2014 insight: instead of writing down a mathematical loss for “looks realistic,” learn the loss from data by training a second network.

A GAN (Generative Adversarial Network) has two networks in opposition:

  • Generator G takes a random noise vector z (sampled from a simple distribution like a standard normal) and outputs a synthetic image.
  • Discriminator D takes an image — real or fake — and outputs a score between 0 and 1: “probability this is real.”

They train together in a minimax game. The generator tries to fool D; the discriminator tries not to be fooled. Formally the generator minimises and the discriminator maximises the same objective, so they are playing a zero-sum game. When training works, G gets so good that D can do no better than random guessing (score = 0.5 for everything).
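The shared objective can be written as V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]. The snippet below is a minimal sketch that estimates it from a handful of discriminator scores; the function name `gan_value` and the score lists are invented for illustration, and no networks are trained.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Discriminator is winning: high scores on real images, low scores on fakes.
confident = gan_value(d_real=[0.9, 0.95, 0.85], d_fake=[0.1, 0.05, 0.2])

# Generator is winning: D outputs 0.5 for everything.
fooled = gan_value(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])

print(confident, fooled)
```

D pushes this value up and G pushes it down. When D is reduced to guessing, the value settles at -log 4, the equilibrium of the game.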

What can go wrong. GANs are famously hard to train. Two failure modes matter:

  • Mode collapse — the generator finds one or a few outputs that fool the discriminator perfectly and stops exploring. You ask for 1000 diverse faces and get 1000 near-copies of the same face.
  • Training instability — if the discriminator gets too strong too fast, the generator receives near-zero gradient and stops learning.
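The vanishing-gradient problem can be checked with a little calculus. Write D(G(z)) = sigmoid(a), where a is the discriminator's logit on a fake image. The sketch below compares the gradient of the minimax generator term, log(1 - D(G(z))), with the alternative "non-saturating" loss, -log D(G(z)), suggested in the original GAN paper. The second loss is not discussed above and is included here only for contrast.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(a):
    """d/da of log(1 - sigmoid(a)): the generator's term in the minimax loss."""
    return -sigmoid(a)

def non_saturating_grad(a):
    """d/da of -log(sigmoid(a)): the alternative generator loss."""
    return sigmoid(a) - 1.0

# A more negative logit means a discriminator more certain the image is fake.
for logit in (0.0, -5.0, -10.0):
    print(logit, saturating_grad(logit), non_saturating_grad(logit))
```

As the discriminator grows confident, the minimax gradient shrinks toward zero while the alternative stays close to -1, which is one reason many implementations use it.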

Despite these issues, well-tuned GANs (StyleGAN2 for faces, BigGAN for ImageNet) produce very sharp samples, and because sampling is a single forward pass through G, they generate at interactive speed once trained.


Family 2 — VAEs (Variational Autoencoders)

The 2013 VAE takes a different route: encode data into a probability distribution in a low-dimensional latent space (the compact mathematical space where the model stores its compressed representation of the data), then decode samples from that distribution back into data.

The architecture has two parts:

  • Encoder — maps an input image x to two vectors: a mean mu and a log-variance log_var, both of dimension d (e.g., d = 128). Together they define a Gaussian distribution N(mu, exp(log_var)) in latent space.
  • Decoder — takes a point z sampled from that distribution and reconstructs the image.

The training loss has two terms:

  1. Reconstruction loss: how well does the decoder reproduce the original input? (Pixel-wise MSE or cross-entropy.)
  2. KL divergence: how close is the encoder’s latent distribution to a standard normal N(0, I)? The Kullback-Leibler (KL) divergence measures how much one probability distribution differs from another. As a regularisation term it forces the latent space to be smooth and well-organised; keeping it small stops the encoder from cheating by memorising each input as a tiny isolated spike.
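For a diagonal Gaussian encoder the KL term has a closed form, and the sampling step is usually written with the reparameterisation trick so that gradients can flow through mu and log_var. The following is a minimal sketch under those assumptions; the function names and the example encoder outputs are hypothetical, and the trick itself is standard practice rather than something spelled out above.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterised sample: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]  # hypothetical encoder outputs, d = 2
z = sample_latent(mu, log_var, rng)

kl_example = kl_to_standard_normal(mu, log_var)            # positive penalty
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # no penalty
print(z, kl_example, kl_at_prior)
```

The penalty vanishes only when the encoder outputs exactly the standard normal, and it grows as the mean drifts from zero or the variance shrinks toward a spike.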

To generate a new image, sample z from N(0, I) and pass it through the decoder. Because the latent space is smooth, interpolating between two z vectors gives a meaningful transition rather than noise.
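Generation and interpolation both happen purely in latent space, so they can be sketched without a trained network. The helper names `sample_prior` and `interpolate` are made up for this example, and the decoder call is left out because it depends on the trained model.

```python
import random

def sample_prior(d, rng):
    """Draw z from the standard normal prior in d dimensions."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def interpolate(z_a, z_b, steps):
    """Evenly spaced points on the straight line from z_a to z_b (inclusive)."""
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path

rng = random.Random(0)
z_a, z_b = sample_prior(4, rng), sample_prior(4, rng)
path = interpolate(z_a, z_b, steps=5)
# Each point in `path` would be passed through the trained decoder (not shown)
# to produce one frame of the transition between the two generated images.
```

Straight-line interpolation is the simplest choice; it is used here only to illustrate the idea.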

The cost of smoothness. VAEs produce blurrier images than GANs. The reconstruction loss averages over many plausible images consistent with a latent point, and that averaging shows up as soft edges. VAEs are stable to train, give a structured latent space you can inspect and manipulate, and are the backbone of many drug-discovery and molecular design systems where smoothness matters more than photorealism.


Family 3 — Diffusion Models

Diffusion models, popularised by DDPM (Denoising Diffusion Probabilistic Models, 2020) and now powering Stable Diffusion, DALL-E 3, and Midjourney, take the most roundabout path — and produce the best images.

The forward process — adding noise. Start with a real image. Over T steps (typically T = 1000), gradually mix in Gaussian noise according to a fixed schedule, until at step T you have pure random noise. This process is not learned; it is a mathematical definition.
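Because every step adds Gaussian noise, the noisy image at any step t can be computed in one jump rather than by looping. The sketch below assumes the linear noise schedule from the DDPM paper (beta from 1e-4 to 0.02); the schedule values, the name `noisy_sample`, and the three-number "image" are illustrative assumptions.

```python
import math
import random

T = 1000
# Linear beta schedule from 1e-4 to 0.02 (assumed; other schedules exist).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar[t] is the running product of (1 - beta): the share of signal left.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def noisy_sample(x0, t, rng):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bar[t]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
x0 = [0.8, -0.3, 0.1]                     # a toy 3-"pixel" image
x_early, _ = noisy_sample(x0, 10, rng)    # still close to x0
x_late, _ = noisy_sample(x0, T - 1, rng)  # almost entirely noise
print(alpha_bar[0], alpha_bar[-1])
```

With this schedule almost all of the signal survives the first step and almost none survives the last, which is why the final step is treated as pure noise.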

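The reverse process, the only part that is learned. Train a neural network (usually a U-Net with attention layers) to predict, given a noisy image at step t, the noise that was mixed into the clean image to produce it. Subtracting a fraction of that predicted noise moves the sample one step back toward clean data. In other words: learn to denoise one step at a time.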
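Training then reduces to a regression problem: pick a random step, noise the image, and penalise the squared error between the true and predicted noise. The sketch below evaluates that loss for a stand-in predictor that always outputs zero. The linear schedule, the `untrained` function, and the toy image are assumptions for illustration; a real system would use a U-Net and gradient descent.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # assumed schedule
alpha_bar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def training_loss(x0, predict_noise, rng):
    """One Monte Carlo term of the noise-prediction loss for a single image."""
    t = rng.randrange(T)                          # random noise level
    eps = [rng.gauss(0.0, 1.0) for _ in x0]       # the regression target
    a = alpha_bar[t]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    pred = predict_noise(x_t, t)
    return sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(x0)

def untrained(x_t, t):
    """Hypothetical stand-in for the U-Net: always predicts zero noise."""
    return [0.0 for _ in x_t]

rng = random.Random(0)
x0 = [0.8, -0.3, 0.1, 0.5]
losses = [training_loss(x0, untrained, rng) for _ in range(2000)]
mean_loss = sum(losses) / len(losses)
print(mean_loss)  # close to 1, the variance of the target noise
```

A predictor that learns anything useful about the data drives this number below the zero-prediction baseline.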

Sampling at inference time: start from pure Gaussian noise (step T), apply the learned denoiser 1000 times, and arrive at a clean image. Each step removes a small amount of noise. The model never “draws” an image in one shot; it refines it iteratively.
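The sampling loop itself is short. The sketch below follows the DDPM update rule but replaces the trained U-Net with an oracle: it assumes a toy dataset in which every "image" is the single number `TARGET`, so the correct noise is known exactly. The schedule, `TARGET`, and `oracle_denoiser` are all illustrative assumptions.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # assumed schedule
alphas = [1.0 - b for b in betas]
alpha_bar, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bar.append(running)

TARGET = 0.7  # toy "dataset": every image is this single scalar

def oracle_denoiser(x_t, t):
    """Stand-in for the trained U-Net: the exact noise when all data equals TARGET."""
    return (x_t - math.sqrt(alpha_bar[t]) * TARGET) / math.sqrt(1.0 - alpha_bar[t])

def ddpm_sample(denoiser, rng):
    x = rng.gauss(0.0, 1.0)                       # step T: pure Gaussian noise
    for t in reversed(range(T)):                  # one denoiser call per step
        eps_hat = denoiser(x, t)
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bar[t]) * eps_hat) / math.sqrt(alphas[t])
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on the final step
        x = mean + math.sqrt(betas[t]) * noise    # variance choice sigma_t^2 = beta_t
    return x

sample = ddpm_sample(oracle_denoiser, random.Random(0))
print(sample)
```

Starting from random noise, the loop lands on the one value the toy dataset contains. With a real denoiser trained on varied data, different starting noise leads to different images.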

Why does this work so well? Each denoising step is a relatively easy prediction problem — remove a little noise, not reconstruct an image from nothing. The difficulty is spread across 1000 manageable steps. The model also sees data at every noise level during training, which gives it rich coverage of the data distribution.

The cost. Sampling is slow: 1000 forward passes per image. Practical systems cut this down. DDIM-style samplers reach roughly 20–50 steps with modest quality loss, and distilled variants such as LCM (Latent Consistency Models) go down to a handful of steps.
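The speed-up from a DDIM-style sampler comes from visiting only a strided subset of the timesteps and using a deterministic update between them. The sketch below reuses the toy oracle setup (every "image" equals `TARGET`) as a stand-in for a trained network, and counts denoiser calls. The schedule, the stride rule, and all names are assumptions for illustration.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # assumed schedule
alpha_bar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

TARGET = 0.7        # toy "dataset": every image is this single scalar
calls = {"n": 0}    # counts how often the denoiser is evaluated

def oracle_denoiser(x_t, t):
    """Stand-in for the trained U-Net: the exact noise when all data equals TARGET."""
    calls["n"] += 1
    return (x_t - math.sqrt(alpha_bar[t]) * TARGET) / math.sqrt(1.0 - alpha_bar[t])

def ddim_sample(denoiser, num_steps, rng):
    stride = T // num_steps
    timesteps = list(range(T - 1, -1, -stride))   # e.g. 999, 979, ..., 19
    x = rng.gauss(0.0, 1.0)
    for i, t in enumerate(timesteps):
        eps_hat = denoiser(x, t)
        # Estimate the clean image implied by the current noise prediction.
        x0_hat = (x - math.sqrt(1.0 - alpha_bar[t]) * eps_hat) / math.sqrt(alpha_bar[t])
        # Re-noise that estimate to the next, lower noise level (deterministic).
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = math.sqrt(a_prev) * x0_hat + math.sqrt(1.0 - a_prev) * eps_hat
    return x

sample = ddim_sample(oracle_denoiser, num_steps=50, rng=random.Random(0))
print(sample, calls["n"])
```

The oracle makes every step exact, so this toy run shows no quality loss; with a learned denoiser, fewer steps mean larger jumps and some degradation.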


Side-by-side diagram

[Diagram, three panels]
GAN: noise z → Generator → fake image; fake image and real data → Discriminator → real or fake? The adversarial gradient flows back to G. Sharp. Unstable. Mode collapse risk.
VAE: input image x → Encoder → μ, σ → latent → Decoder → reconstructed x. To generate: sample z ~ N(0,1) → Decoder → new image. Stable. Blurrier. Smooth latent space.
Diffusion: clean image + noise over steps 1..T → pure noise. Reverse: start from noise and denoise for T steps with a learned U-Net denoiser. Highest quality. Slow at inference. Stable Diffusion, DALL-E 3.

Three generative families: GAN (adversarial duel), VAE (encode to latent cloud then decode), Diffusion (add noise forward; learn to reverse step by step).


When to reach for which

Goal | Reach for
Fastest inference, sharp faces/art | GAN (StyleGAN2, BigGAN)
Structured latent space, interpolation, drug discovery | VAE
Highest quality, text-conditioned generation | Diffusion (Stable Diffusion, DALL-E 3)
Stable training on limited data | VAE
Real-time video or game assets | GAN

The common thread

All three families are trying to solve the same problem: approximate a high-dimensional data distribution p(x) with a neural network. They differ in how they set up the learning signal:

  • GAN: implicit. Learn the distribution by playing a game, without ever writing down a density.
  • VAE: explicit but approximate. Maximise a lower bound on log p(x) using variational inference.
  • Diffusion: explicit and step-wise. Optimise a variational bound on log p(x) that, in practice, reduces to a mean-squared error on the noise predicted at each step.
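Written out, the three learning signals look roughly as follows. The notation is an assumption added here for reference: q(z|x) is the VAE encoder, p(z) the standard normal prior, ε_θ the diffusion denoiser, and ᾱ_t the fraction of signal remaining at step t. The diffusion line is the simplified loss commonly used in practice rather than the full variational bound.

```latex
% GAN: minimax game between generator G and discriminator D
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]

% VAE: evidence lower bound (ELBO) on log p(x)
\log p(x) \;\ge\;
  \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
  - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big)

% Diffusion: simplified noise-prediction loss
L_{\text{simple}} =
  \mathbb{E}_{t,\, x_0,\, \epsilon}
  \Big[ \big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0
        + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t\big) \big\|^2 \Big]
```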

Each trade-off is real. GANs are the fastest but most fragile. VAEs are the most principled but blurry. Diffusion models are the most powerful but require the most compute to sample.


Next

Explore conditional generation — how to steer a generative model with a class label or text prompt (classifier-free guidance in diffusion, conditional GANs).

Practice this in an interview

What are the high-level differences between GANs, VAEs, and diffusion models?

GANs train a generator and discriminator adversarially to produce sharp samples but suffer from unstable training and mode collapse. VAEs optimise a tractable evidence lower bound for principled probability modelling but generate blurry samples. Diffusion models iteratively denoise from Gaussian noise, achieving state-of-the-art sample quality and diversity at the cost of slow sampling.

What is the difference between discriminative and generative models, and when would you prefer each?

Discriminative models learn the conditional distribution P(y|x) directly and focus entirely on the decision boundary; generative models learn the distribution of the data itself (the joint P(x,y) when labels are present, or P(x) alone as in this lesson) and can generate new samples. Discriminative models typically achieve higher classification accuracy with sufficient labeled data; generative models excel when data is scarce, you need to synthesize data, or the problem requires modeling the input distribution.

What is data augmentation in computer vision and which techniques are most effective?

Data augmentation artificially expands the training set by applying label-preserving transformations to existing images, improving generalisation and regularisation without collecting more data. Geometric transforms (flip, crop, rotation) and colour jitter are universally effective; stronger methods like CutMix, MixUp, and RandAugment consistently improve accuracy on top of basic augmentation.

What is the difference between encoder models like BERT and decoder models like GPT?

BERT is an encoder-only transformer that reads all tokens bidirectionally and is trained with masked language modelling — ideal for tasks requiring a rich contextual representation of an entire sequence, like classification or NER. GPT is a decoder-only transformer that attends only to previous tokens via a causal mask and is trained with next-token prediction — ideal for text generation. Encoder-decoder models like T5 combine both for tasks that map one sequence to another.
