What are the high-level differences between GANs, VAEs, and diffusion models?

GANs train a generator and discriminator adversarially to produce sharp samples but suffer from unstable training and mode collapse. VAEs optimise a tractable evidence lower bound for principled probability modelling but generate blurry samples. Diffusion models iteratively denoise from Gaussian noise, achieving state-of-the-art sample quality and diversity at the cost of slow sampling.

What is the difference between discriminative and generative models, and when would you prefer each?

Discriminative models learn the conditional distribution P(y|x) directly and focus entirely on the decision boundary; generative models learn the joint distribution P(x,y) and can generate new samples. Discriminative models typically achieve higher classification accuracy with sufficient labeled data; generative models excel when data is scarce, you need to synthesize data, or the problem requires modeling the input distribution.

What is data augmentation in computer vision and which techniques are most effective?

Data augmentation artificially expands the training set by applying label-preserving transformations to existing images, improving generalisation and regularisation without collecting more data. Geometric transforms (flip, crop, rotation) and colour jitter are universally effective; stronger methods like CutMix, MixUp, and RandAugment consistently improve accuracy on top of basic augmentation.

What is the difference between encoder models like BERT and decoder models like GPT?

BERT is an encoder-only transformer that reads all tokens bidirectionally and is trained with masked language modelling — ideal for tasks requiring a rich contextual representation of an entire sequence, like classification or NER. GPT is a decoder-only transformer that attends only to previous tokens via a causal mask and is trained with next-token prediction — ideal for text generation. Encoder-decoder models like T5 combine both for tasks that map one sequence to another.

Generative Models: GANs, VAEs & Diffusion — Deep Learning

In 2014 a research group tried to train a neural network to generate realistic bedroom photos. After days of training the network had learned one trick: output a single blurry average of every bedroom it had seen. Ask for 100 samples and you got 100 nearly identical blobs. The model had learned something — but not how to generate the variety that exists in real bedrooms.

That failure has a precise name. The network was modeling p(x) — the full probability distribution over every possible image — and it had collapsed that infinite distribution down to a single point. Fixing it required a completely new family of architectures. Three of those families are now responsible for almost every AI-generated image, voice, and molecule you encounter.

Discriminative vs. generative: the key split

A discriminative model learns p(y | x) — the probability of a label y given an input x. Given a photo, predict “cat” or “dog.” It draws a decision boundary through the data. It never needs to know what a cat looks like from scratch; it only needs to know which side of the boundary a given photo falls on.

A generative model learns p(x) — the probability of the data itself. It must understand the full structure of what a cat image can be so that it can draw a new one by sampling from that distribution. That is a strictly harder problem: you can classify cats without being able to paint one.

Once you have a generative model you get two superpowers:

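Sampling: draw brand-new examples from the distribution.
Density estimation: score how likely a given input is (useful for anomaly detection). This applies to families that expose a likelihood or a bound on it, such as VAEs and diffusion models; a GAN gives you samples but no density.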
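
As a rough illustration of those two capabilities, the sketch below fits the simplest possible generative model, a single 1-D Gaussian, to toy data and then samples from it and scores inputs. The data, the seed, and the function names are assumptions made up for this example; real image models replace the Gaussian with a neural network.

```python
import math
import random
import statistics

# Toy 1-D "dataset": hypothetical sensor readings clustered around 5.0.
rng = random.Random(0)
data = [rng.gauss(5.0, 1.0) for _ in range(1000)]

# "Training": estimate the parameters of p(x), here a single Gaussian.
mu = statistics.fmean(data)
sigma = statistics.stdev(data)

def sample(n):
    """Sampling: draw n brand-new points from the fitted p(x)."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

def log_density(x):
    """Density estimation: log p(x) under the fitted Gaussian."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

new_points = sample(5)
typical_score = log_density(5.0)   # close to the bulk of the data
outlier_score = log_density(12.0)  # far from anything seen in training
```

An anomaly detector built this way would flag inputs whose log-density falls below some threshold, which is why the outlier scores far lower than the typical point.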

Three families have emerged as the dominant approaches, each with a different answer to the question: how do you approximate p(x) with a neural network?

Family 1 — GANs (Generative Adversarial Networks)

Ian Goodfellow’s 2014 insight: instead of writing down a mathematical loss for “looks realistic,” learn the loss from data by training a second network.

A GAN (Generative Adversarial Network) has two networks in opposition:

Generator G takes a random noise vector z (sampled from a simple distribution like a standard normal) and outputs a synthetic image.
Discriminator D takes an image — real or fake — and outputs a score between 0 and 1: “probability this is real.”

They train together in a minimax game. The generator tries to fool D; the discriminator tries not to be fooled. Formally the generator minimises and the discriminator maximises the same objective, so they are playing a zero-sum game. When training works, G gets so good that D can do no better than random guessing (score = 0.5 for everything).
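
The shared objective can be made concrete with a small sketch. The formula used here, V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], is the value function from the original GAN paper and is not spelled out in this lesson; the discriminator scores below are made-up numbers rather than outputs of a real network.

```python
import math

def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator scores on real images, each in (0, 1).
    d_fake: discriminator scores on generated images, each in (0, 1).
    """
    real_term = sum(math.log(s) for s in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - s) for s in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator pushes V up toward its maximum of 0.
v_strong_d = value_function(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])

# The equilibrium described above: D outputs 0.5 for everything.
v_equilibrium = value_function(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

D wants this number high and G wants it low. At the 0.5-everywhere equilibrium the value works out to -log 4, which is the figure usually quoted for a perfectly fooled discriminator.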

What can go wrong. GANs are famously hard to train. Two failure modes matter:

Mode collapse — the generator finds one or a few outputs that fool the discriminator perfectly and stops exploring. You ask for 1000 diverse faces and get 1000 near-copies of the same face.
Training instability — if the discriminator gets too strong too fast, the generator receives near-zero gradient and stops learning.
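
The vanishing-gradient problem can be seen with a little calculus. The sketch below assumes the discriminator ends in a sigmoid, so D = sigmoid(a) for some logit a, and compares the gradient the generator receives under the minimax loss with the gradient under the "non-saturating" alternative proposed in the original GAN paper. That alternative is not covered in this lesson and is included only for contrast.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_minimax(a):
    """d/da of log(1 - sigmoid(a)): the generator loss from the minimax game."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)): the alternative generator loss."""
    return -(1.0 - sigmoid(a))

# a is the discriminator's logit on a generated image.
# A very negative a means D is almost certain the image is fake.
for a in (0.0, -4.0, -8.0):
    print(f"logit={a:5.1f}  minimax grad={grad_minimax(a):.5f}  "
          f"non-saturating grad={grad_non_saturating(a):.5f}")
```

As the discriminator grows more confident (more negative logit), the minimax gradient shrinks toward zero while the non-saturating one stays close to -1, which is one common reason practitioners prefer the latter.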

Despite these issues, well-tuned GANs (StyleGAN2 for faces, BigGAN for ImageNet) produce very sharp samples and, because generation is a single forward pass through G, run at interactive speed once trained.

Aside: GANs for oversampling imbalanced data

GANs aren’t only for images. A practical use on tabular data is fixing class imbalance — say 1,000 fraud cases against 200,000 legitimate ones. The classic fix is SMOTE, which manufactures synthetic minority rows by linearly interpolating between a sample and its nearest neighbours. That works, but interpolation assumes the minority class is locally convex and can smear across real decision boundaries.
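
To make the interpolation idea concrete, here is a minimal sketch of a SMOTE-style sampler. It is a simplified illustration of the technique, not the implementation from any library, and the function name, the `k` default, and the fraud rows are all hypothetical.

```python
import math
import random

def smote_like_sample(minority, k=3, rng=None):
    """Create one synthetic row by interpolating toward a near neighbour.

    minority: list of feature vectors (lists of floats) from the rare class.
    A simplified sketch of the SMOTE idea, not a library implementation.
    """
    rng = rng or random.Random()
    base = rng.choice(minority)
    # k nearest neighbours of `base` by Euclidean distance, excluding itself.
    others = [row for row in minority if row is not base]
    others.sort(key=lambda row: math.dist(base, row))
    neighbour = rng.choice(others[:k])
    # Pick a random point on the straight line between base and neighbour.
    lam = rng.random()
    return [b + lam * (n - b) for b, n in zip(base, neighbour)]

# Hypothetical 2-feature fraud rows (amount, hour of day), for illustration only.
fraud_rows = [[120.0, 2.0], [135.0, 3.0], [110.0, 1.0], [900.0, 23.0]]
synthetic = [smote_like_sample(fraud_rows, k=2, rng=random.Random(i)) for i in range(5)]
```

Every synthetic row lies on a line segment between two real rows, which is exactly the "locally convex" assumption: if legitimate transactions sit between two fraud clusters, the blend can land on top of them.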

A conditional GAN (or the tabular-specific CTGAN) instead learns the minority distribution and samples genuinely new points from it — capturing the non-linear structure and feature correlations SMOTE’s straight-line blends miss. You generate extra fraud-like rows, add them to training, and the classifier sees a balanced problem.

Family 2 — VAEs (Variational Autoencoders)

The 2013 VAE takes a different route: encode data into a probability distribution in a low-dimensional latent space (the compact mathematical space where the model stores its compressed representation of the data), then decode samples from that distribution back into data.

The architecture has two parts:

Encoder — maps an input image x to two vectors: a mean mu and a log-variance log_var, both of dimension d (e.g., d = 128). Together they define a Gaussian distribution N(mu, exp(log_var)) in latent space.
Decoder — takes a point z sampled from that distribution and reconstructs the image.
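
The sampling step between encoder and decoder can be sketched as follows. It uses the reparameterisation z = mu + sigma * eps from the original VAE paper, which this lesson does not name explicitly; the encoder outputs below are invented values with d = 4 rather than 128.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, exp(log_var)) one dimension at a time.

    Written as z = mu + sigma * eps with eps ~ N(0, 1), the usual
    reparameterisation that keeps z differentiable w.r.t. mu and log_var.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(0)
# Hypothetical encoder output for one image, with d = 4 instead of 128.
mu = [0.5, -1.0, 0.0, 2.0]
log_var = [-2.0, -2.0, 0.0, -20.0]  # last dimension is almost deterministic

z = sample_latent(mu, log_var, rng)
```

Predicting the log-variance rather than the variance is a common design choice: the network can output any real number, and exp(0.5 * log_var) is always a valid positive standard deviation.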

The training loss has two terms:

Reconstruction loss: how well does the decoder reproduce the original input? (Pixel-wise MSE or cross-entropy.)
KL divergence: how close is the encoder’s latent distribution to a standard normal N(0, I)? The Kullback-Leibler (KL) divergence measures how far one probability distribution is from another. Here it acts as a regulariser that keeps the latent space smooth and well-organised: keeping it small stops the encoder from memorising each input as a tiny isolated spike.
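
The two loss terms can be written out in a few lines. This is a sketch under stated assumptions: the KL term uses the standard closed form for a diagonal Gaussian against a standard normal, the reconstruction term is a plain mean squared error, and `kl_weight` is a hypothetical knob added for illustration.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, exp(log_var)) || N(0, 1) ), summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def mse(x, x_hat):
    """Pixel-wise mean squared error as the reconstruction term."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def vae_loss(x, x_hat, mu, log_var, kl_weight=1.0):
    # kl_weight is a hypothetical knob for illustration; 1.0 gives the plain sum.
    return mse(x, x_hat) + kl_weight * kl_to_standard_normal(mu, log_var)

# Encoder output that already matches the prior: KL is exactly zero.
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# A "tiny isolated spike": far from the origin with very small variance.
kl_spike = kl_to_standard_normal([3.0, -3.0], [-6.0, -6.0])
```

The spike is penalised heavily while the prior-matching encoding costs nothing, which is how the KL term discourages memorisation. How the two terms are scaled relative to each other (mean versus sum over pixels, the weight on KL) is a modelling decision that varies between implementations.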

To generate a new image: sample z from N(0, 1), pass it through the decoder. Because the latent space is smooth, interpolating between two z vectors gives a meaningful transition — not noise.

The cost of smoothness. VAEs produce blurrier images than GANs. The reconstruction loss averages over many plausible images consistent with a latent point, and that averaging shows up as soft edges. VAEs are stable to train, give a structured latent space you can inspect and manipulate, and are the backbone of many drug-discovery and molecular design systems where smoothness matters more than photorealism.

Family 3 — Diffusion Models

Diffusion models, popularised by DDPM (Denoising Diffusion Probabilistic Models, 2020) and now powering Stable Diffusion, DALL-E 3, and Midjourney, take the most roundabout path — and produce the best images.

The forward process — adding noise. Start with a real image. Over T steps (typically T = 1000), gradually mix in Gaussian noise according to a fixed schedule, until at step T you have pure random noise. This process is not learned; it is a mathematical definition.
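
A minimal sketch of the forward process follows. It assumes the linear schedule used in the DDPM paper (beta rising from 1e-4 to 0.02 over T = 1000 steps) and uses the closed form that jumps straight to any step t instead of adding noise one step at a time. The 4-value "image" is a toy stand-in.

```python
import math
import random

T = 1000
# Linear noise schedule from the DDPM paper: beta rises from 1e-4 to 0.02.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar[t] = product of (1 - beta) up to step t: the fraction of signal left.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def noise_image(x0, t, rng):
    """Jump to step t: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bar[t]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]                # a hypothetical 4-"pixel" image
x_early, _ = noise_image(x0, 10, rng)     # still mostly signal
x_late, _ = noise_image(x0, T - 1, rng)   # almost pure noise
```

By the last step the signal fraction alpha_bar is close to zero, so x_late is statistically almost indistinguishable from a fresh Gaussian sample.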

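The reverse process: the only thing learned. Train a neural network (usually a U-Net with attention layers) to predict, given a noisy image x_t and the step index t, the noise that was mixed into the clean image to produce it. Subtracting a scaled version of that prediction moves the image one small step back toward the data. In other words: learn to denoise one step at a time.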
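
The training signal can be sketched as a simple regression: build a noisy image, remember the noise that went into it, and penalise the squared error of the network's guess. In the code below `placeholder_model` is a stand-in for the U-Net, and the step index and signal fraction are hypothetical values chosen for illustration.

```python
import math
import random

def training_example(x0, alpha_bar_t, rng):
    """Build one (input, target) pair: the noisy image and the noise it contains."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

def noise_prediction_loss(eps_true, eps_pred):
    """Mean squared error between the true and predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)

def placeholder_model(x_t, t):
    # Stand-in for the U-Net: an untrained "network" that always predicts zero noise.
    return [0.0 for _ in x_t]

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]
t, alpha_bar_t = 500, 0.08   # hypothetical step index and its signal fraction
x_t, eps = training_example(x0, alpha_bar_t, rng)

loss_untrained = noise_prediction_loss(eps, placeholder_model(x_t, t))
loss_perfect = noise_prediction_loss(eps, eps)  # what training pushes toward
```

In a real training loop t is drawn at random for every example, so a single network learns to denoise at every noise level.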

Sampling at inference time: start from pure Gaussian noise (step T), apply the learned denoiser 1000 times, and arrive at a clean image. Each step removes a small amount of noise. The model never “draws” an image in one shot; it refines it iteratively.
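
The sampling loop can be sketched end to end on a toy problem. To keep it runnable without a trained network, the example assumes a "dataset" containing a single value, for which the ideal noise prediction has an exact formula (`oracle_noise`, a stand-in for the U-Net). The update rule is the ancestral sampler from the DDPM paper with the linear schedule assumed earlier.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bar, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bar.append(running)

TARGET = 0.7  # toy "dataset": every training image is the single value 0.7

def oracle_noise(x_t, t):
    """Stand-in for the trained network, exact for this one-point dataset."""
    return (x_t - math.sqrt(alpha_bar[t]) * TARGET) / math.sqrt(1.0 - alpha_bar[t])

def ddpm_sample(rng):
    x = rng.gauss(0.0, 1.0)  # start from pure noise
    for t in reversed(range(T)):
        eps_hat = oracle_noise(x, t)
        # Remove the predicted noise for this step.
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bar[t]) * eps_hat) / math.sqrt(alphas[t])
        # Re-inject a little fresh noise on every step except the last.
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + math.sqrt(betas[t]) * noise
    return x

sample = ddpm_sample(random.Random(0))
```

Starting from a random draw, the loop lands on the one value the toy dataset contains. With a real network and real images, the same loop walks from noise to a plausible image instead.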

Why does this work so well? Each denoising step is a relatively easy prediction problem — remove a little noise, not reconstruct an image from nothing. The difficulty is spread across 1000 manageable steps. The model also sees data at every noise level during training, which gives it rich coverage of the data distribution.

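The cost. Sampling is slow: 1000 forward passes per image. Practical systems use faster samplers such as DDIM, which cut this to 20–50 steps with modest quality loss, or distilled models such as LCM (latent consistency models), which need only a handful of steps.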
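
Step skipping can be sketched with the deterministic DDIM update (eta = 0). As a simplifying assumption, the example reuses a toy one-point dataset so that an exact `oracle_noise` function can stand in for the trained network; the evenly spaced step selection is one common choice, not the only one.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bar, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bar.append(running)

TARGET = 0.7  # toy "dataset": every training image is the single value 0.7

def oracle_noise(x_t, t):
    """Stand-in for the trained network, exact for this one-point dataset."""
    return (x_t - math.sqrt(alpha_bar[t]) * TARGET) / math.sqrt(1.0 - alpha_bar[t])

def ddim_sample(num_steps, rng):
    """Deterministic DDIM sampling over num_steps >= 2 of the original T steps."""
    # Evenly spaced subset of the original steps, ordered from noisy to clean.
    steps = [round(i * (T - 1) / (num_steps - 1)) for i in range(num_steps)][::-1]
    x = rng.gauss(0.0, 1.0)
    for i, t in enumerate(steps):
        eps_hat = oracle_noise(x, t)
        # Estimate the clean image implied by the current noise prediction.
        x0_hat = (x - math.sqrt(1.0 - alpha_bar[t]) * eps_hat) / math.sqrt(alpha_bar[t])
        # Signal fraction at the next (earlier) step; 1.0 means fully clean.
        ab_prev = alpha_bar[steps[i + 1]] if i + 1 < len(steps) else 1.0
        # Deterministic update: no fresh noise is added.
        x = math.sqrt(ab_prev) * x0_hat + math.sqrt(1.0 - ab_prev) * eps_hat
    return x

sample_20 = ddim_sample(20, random.Random(0))
```

With a perfect oracle, 20 steps recover the target exactly. With a real network the noise predictions are imperfect, and taking larger jumps compounds those errors, which is where the quality loss comes from.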

Side-by-side diagram

Three generative families: GAN (adversarial duel), VAE (encode to latent cloud then decode), Diffusion (add noise forward; learn to reverse step by step).

When to reach for which

Goal	Reach for
Fastest inference, sharp faces/art	GAN (StyleGAN2, BigGAN)
Structured latent space, interpolation, drug discovery	VAE
Highest quality, text-conditioned generation	Diffusion (Stable Diffusion, DALL-E 3)
Stable training on limited data	VAE
Real-time video or game assets	GAN

The common thread

All three families are trying to solve the same problem: approximate a high-dimensional data distribution p(x) with a neural network. They differ in how they set up the learning signal:

GAN (implicit): learn the distribution by playing a game, without ever writing down p(x).
VAE (explicit but approximate): maximise a lower bound on log p(x) using variational inference.
Diffusion (explicit and step-wise): optimise a variational bound on log p(x) that, in its simplified form, reduces to a mean-squared error on the noise predicted at each step.

Each trade-off is real. GANs are the fastest but most fragile. VAEs are the most principled but blurry. Diffusion models are the most powerful but require the most compute to sample.

In one breath

Discriminative models learn p(y|x) (draw a boundary); generative models learn p(x) (the data itself) so they can sample new examples — a strictly harder problem.
GANs pit a generator against a discriminator in a minimax game: sharpest, fastest samples, but fragile — prone to mode collapse and training instability.
VAEs encode data into a smooth Gaussian latent (reconstruction + KL loss), so you can interpolate and score inputs with the ELBO, a lower bound on log-likelihood. They are stable and structured, but blurrier because reconstruction averages.
Diffusion learns to reverse a fixed noising process one step at a time. It gives the highest quality and diversity but is slow to sample (~1000 steps, cut to 20–50 with DDIM and fewer still with distilled samplers such as LCM); modern systems run it in a VAE latent (latent diffusion) or as flow matching. All three approximate the same p(x); they differ only in how they set up the learning signal.

Quiz

Quick check

Q1. A GAN finishes training but every sample from the generator looks almost identical, even though the training data is diverse. What failure mode is this?

Q2. A VAE is trained on face images. At inference you sample two latent vectors z1 and z2 (one for a young face, one for an old face) and interpolate halfway between them: z_mid = 0.5 * z1 + 0.5 * z2. What do you expect to see when you decode z_mid?

Q3. You are designing a system to detect fraudulent medical scans by scoring how likely a scan is under the data distribution. Which generative family is the best fit and why?

Explore conditional generation — how to steer a generative model with a class label or text prompt (classifier-free guidance in diffusion, conditional GANs).
