datarekha

Linear Discriminant Analysis

A supervised projection that pulls classes apart: LDA maximises between-class separation relative to within-class scatter. Its GATE hook is the contrast with PCA.

7 min read Intermediate GATE DA Lesson 88 of 122

What you'll learn

  • LDA is supervised: it uses class labels to find a separating projection
  • LDA maximises between-class separation relative to within-class scatter
  • The key contrast: LDA maximises class separability; PCA maximises variance and ignores labels
  • Why the maximum-variance direction can be a poor class separator

Before you start

When you project high-dimensional data down to a line or plane, the direction you pick decides everything. Linear Discriminant Analysis (LDA) picks that direction with one goal in mind: keep the classes as far apart as possible. It is a supervised method — it looks at the labels — which is exactly what separates it from PCA, and exactly the distinction GATE tested in 2026. In practice you reach for LDA when you want to compress labelled data before a classifier, or to visualise how separable your classes really are.

What LDA optimises

LDA searches for a projection direction along which the classes are well separated. “Well separated” is made precise with two quantities: the between-class scatter (how far apart the class means are after projecting) and the within-class scatter (how spread out each class is around its own mean). LDA maximises the ratio

        between-class separation
        ─────────────────────────
         within-class scatter

Push the class centres apart, squeeze each class tight, and the projected classes stop overlapping. Because the class means and per-class spreads can only be computed when you know the labels, LDA is inherently supervised. The result can be used directly as a classifier or as a label-aware way to reduce dimensions before another model.
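
One standard way to write this ratio down is Fisher's criterion. The symbols below are notation introduced here for illustration, not taken from the lesson: w is the projection direction, S_B and S_W are the between-class and within-class scatter matrices, mu_c and n_c are the mean and size of class c, and mu is the overall mean.

```latex
J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w},
\qquad
S_B = \sum_{c} n_c\,(\mu_c - \mu)(\mu_c - \mu)^\top,
\qquad
S_W = \sum_{c} \sum_{x_i \in c} (x_i - \mu_c)(x_i - \mu_c)^\top
```

For two classes, and assuming S_W is invertible, the maximising direction is proportional to S_W^{-1}(mu_1 - mu_2): the line joining the class means, corrected for how each class is spread.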

LDA vs PCA — same picture, opposite axes

PCA, which you may have met as an unsupervised tool, asks a completely different question: “along which direction does the data vary the most?” It maximises variance and never looks at the labels. The two directions can point very differently:

[Figure: PCA axis (max variance) and LDA axis (max separation). Project onto PCA and the classes overlap; project onto LDA and they pull apart.]
PCA follows the overall spread; LDA follows the direction that separates the two classes.

               LDA                                      PCA
Uses labels?   Yes (supervised)                         No (unsupervised)
Maximises      class separability                       variance
Typical use    classification / label-aware reduction   unsupervised dimensionality reduction
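
The "uses labels?" difference shows up directly in code. The sketch below assumes scikit-learn is installed and uses its bundled iris dataset purely as a stand-in for any labelled data; neither is part of the lesson itself.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Illustrative data only: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# PCA is fitted on X alone; the labels are never passed in.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA needs y, because class means and within-class scatter depend on the labels.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```

Note that scikit-learn's LDA can return at most (number of classes - 1) components, which is why 2 is the ceiling for a 3-class problem here.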

How GATE asks this

A conceptual MCQ or MSQ: “what does LDA maximise?”, or a side-by-side “which of the following are true of LDA vs PCA?” testing supervised-vs-unsupervised and separation-vs-variance. The 2026 paper asked exactly this LDA-versus-PCA contrast — no eigen-decomposition arithmetic, just the purpose of each method.

Worked example — when max variance is the wrong direction

Picture two classes, each a long thin cloud, sitting side by side and slightly overlapping (like the diagram above). Why might PCA’s top direction be a bad separator while LDA’s direction works?

Walk through what each method “sees”:

  1. PCA looks only at total spread. Both clouds are elongated in roughly the same direction (their shared length), so the single direction of greatest variance runs along that length. PCA picks it — and it is blind to the fact that the two classes are stacked across that direction, not along it.
  2. Project onto the PCA axis and both classes smear over the same interval: the class means barely separate, so the projected clouds overlap heavily. A great variance-capturer, a poor class-separator.
  3. LDA looks at the labels. It notices the class means are offset across the clouds, and that each cloud is narrow in that crosswise direction (small within-class scatter). The separation-to-scatter ratio is largest there, so LDA chooses the crosswise axis.
  4. Project onto the LDA axis and the two class means land far apart while each class stays tight — the classes separate cleanly.

The lesson: the direction of greatest variance is not the direction of greatest class separation. Maximising variance (PCA) and maximising separability (LDA) are different objectives and routinely point different ways.
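
The walkthrough above can be checked numerically. This is a minimal NumPy sketch under assumed numbers: the cloud shapes, means, sample size, and the helper name `fisher_ratio` are all made up for illustration. It builds two long thin clouds offset crosswise, then compares the top PCA direction with the two-class Fisher LDA direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two long, thin clouds: large spread along x, small spread along y,
# with class means offset only in y (the "crosswise" direction).
# All numbers here are illustrative assumptions.
n = 500
cov = np.array([[9.0, 0.0],
                [0.0, 0.25]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=n)
X1 = rng.multivariate_normal([0.0, 2.0], cov, size=n)
X = np.vstack([X0, X1])

# PCA direction: top eigenvector of the total covariance (labels ignored).
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
w_pca = evecs[:, np.argmax(evals)]

# Two-class Fisher LDA direction: S_W^{-1} (mu1 - mu0), then normalised.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
w_lda = np.linalg.solve(S_W, mu1 - mu0)
w_lda /= np.linalg.norm(w_lda)

def fisher_ratio(w):
    """Squared gap between projected class means over summed projected variances."""
    p0, p1 = X0 @ w, X1 @ w
    return (p1.mean() - p0.mean()) ** 2 / (p0.var() + p1.var())

ratio_pca = fisher_ratio(w_pca)
ratio_lda = fisher_ratio(w_lda)
print("PCA axis:", np.round(w_pca, 3), "separation ratio:", ratio_pca)
print("LDA axis:", np.round(w_lda, 3), "separation ratio:", ratio_lda)
```

With this setup the PCA axis should line up with the long x direction and the LDA axis with the crosswise y direction, and the separation ratio along the LDA axis should be far larger than along the PCA axis.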

Quick check

Q1. Which statements correctly distinguish LDA from PCA? (select all that apply)
Q2. What does Linear Discriminant Analysis maximise when choosing a projection direction?
Q3. Two elongated, overlapping class clusters lie side by side. Why can PCA's top direction be a poor separator? (select all that apply)
Q4. You want to reduce 50 features to 2 dimensions specifically so a downstream classifier separates 3 known classes as well as possible. Which method directly targets that goal?
Q5. Which is the better single description of LDA's nature?

Practice this in an interview

What is PCA, when should you use it, and what are its key limitations?

PCA finds the orthogonal directions of maximum variance in the data and projects onto a lower-dimensional subspace, reducing the number of features while retaining most of the variance. It is most useful before distance-based models or when training is bottlenecked by dimensionality. Its main limits are loss of interpretability, sensitivity to feature scale, and an assumption of linear structure.

What is the fundamental difference between L1 (Lasso) and L2 (Ridge) regularization, and when do you choose each?

L1 adds the sum of absolute coefficient values to the loss, which drives some coefficients to exactly zero and performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks all weights proportionally but rarely zeroes any out. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.

What is the difference between discriminative and generative models, and when would you prefer each?

Discriminative models learn the conditional distribution P(y|x) directly and focus entirely on the decision boundary; generative models learn the joint distribution P(x,y) and can generate new samples. Discriminative models typically achieve higher classification accuracy with sufficient labeled data; generative models excel when data is scarce, you need to synthesize data, or the problem requires modeling the input distribution.

What are t-SNE and UMAP, how do they differ from PCA, and what are their limitations for ML workflows?

t-SNE and UMAP are nonlinear dimensionality reduction algorithms designed primarily for 2D/3D visualization of high-dimensional data. Unlike PCA, they preserve local neighborhood structure rather than global variance, producing cleaner cluster separations in plots. They are a risky preprocessing step for training a supervised model: standard t-SNE is transductive and cannot embed unseen points, UMAP can transform new data but does not preserve global distances reliably, and both are stochastic, so the embedding changes between runs unless the random seed is fixed.
