Linear Discriminant Analysis
A supervised projection that pulls classes apart: LDA maximises between-class separation relative to within-class scatter. Its GATE hook is the contrast with PCA.
What you'll learn
- LDA is supervised: it uses class labels to find a separating projection
- LDA maximises between-class separation relative to within-class scatter
- The key contrast: LDA maximises class separability; PCA maximises variance and ignores labels
- Why the maximum-variance direction can be a poor class separator
Before you start
When you project high-dimensional data down to a line or plane, the direction you pick decides everything. Linear Discriminant Analysis (LDA) picks that direction with one goal in mind: keep the classes as far apart as possible. It is a supervised method — it looks at the labels — which is exactly what separates it from PCA, and exactly the distinction GATE tested in 2026. In practice you reach for LDA when you want to compress labelled data before a classifier, or to visualise how separable your classes really are.
What LDA optimises
LDA searches for a projection direction along which the classes are well separated. “Well separated” is made precise with two quantities: the between-class scatter (how far apart the class means are after projecting) and the within-class scatter (how spread out each class is around its own mean). LDA maximises the ratio
$$\frac{\text{between-class separation}}{\text{within-class scatter}}$$
Push the class centres apart, squeeze each class tight, and the projected classes stop overlapping. Because the class means and per-class spreads can only be computed when you know the labels, LDA is inherently supervised. The result can be used directly as a classifier or as a label-aware way to reduce dimensions before another model.
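To make the ratio concrete, here is a minimal sketch that scores a candidate direction on a toy 2-D data set. The data points and the helper names (`project`, `fisher_ratio`) are illustrative assumptions, and the ratio is written in its common two-class form: the squared gap between the projected class means divided by the summed projected within-class scatter.

```python
def project(points, w):
    """Scalar coordinate of each 2-D point along direction w."""
    return [x * w[0] + y * w[1] for x, y in points]

def fisher_ratio(class_a, class_b, w):
    """Squared gap between projected class means / summed within-class scatter."""
    pa, pb = project(class_a, w), project(class_b, w)
    ma, mb = sum(pa) / len(pa), sum(pb) / len(pb)
    between = (ma - mb) ** 2
    within = sum((v - ma) ** 2 for v in pa) + sum((v - mb) ** 2 for v in pb)
    return between / within

# Hypothetical toy data: two thin horizontal clouds, one above the other.
class_a = [(-2.0, 0.9), (-1.0, 1.1), (0.0, 1.0), (1.0, 1.1), (2.0, 0.9)]
class_b = [(-2.0, -1.1), (-1.0, -0.9), (0.0, -1.0), (1.0, -0.9), (2.0, -1.1)]

along = fisher_ratio(class_a, class_b, (1.0, 0.0))   # direction along the clouds
across = fisher_ratio(class_a, class_b, (0.0, 1.0))  # direction across the clouds
print(along, round(across, 2))
```

On this toy data the ratio is zero along the clouds (the projected class means coincide) and about 50 across them, so the crosswise direction is the one this objective prefers. Computing either quantity requires knowing which point belongs to which class.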
LDA vs PCA — same picture, opposite axes
PCA, which you may have met as an unsupervised tool, asks a completely different question: “along which direction does the data vary the most?” It maximises variance and never looks at the labels. The directions the two methods choose can differ sharply, because their objectives differ:
| | LDA | PCA |
|---|---|---|
| Uses labels? | Yes — supervised | No — unsupervised |
| Maximises | class separability | variance |
| Typical use | classification / label-aware reduction | unsupervised dimensionality reduction |
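The "uses labels?" row can be checked directly. The sketch below is a dependency-free illustration under stated assumptions: it is restricted to 2-D, two-class data, it uses the standard closed forms for the top principal axis and for the two-class Fisher direction, and the function names and toy points are hypothetical. Relabelling the same points changes the LDA direction, while the PCA function never receives the labels at all.

```python
import math

def pca_direction(points):
    """Top principal axis of 2-D points. Takes no labels."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the major axis
    return (math.cos(theta), math.sin(theta))

def lda_direction(points, labels):
    """Two-class Fisher direction, proportional to Sw^-1 (mean_0 - mean_1)."""
    sxx = syy = sxy = 0.0
    means = []
    for c in (0, 1):
        pts = [p for p, l in zip(points, labels) if l == c]
        mx = sum(x for x, _ in pts) / len(pts)
        my = sum(y for _, y in pts) / len(pts)
        means.append((mx, my))
        # Accumulate the within-class scatter matrix Sw entry by entry.
        sxx += sum((x - mx) ** 2 for x, _ in pts)
        syy += sum((y - my) ** 2 for _, y in pts)
        sxy += sum((x - mx) * (y - my) for x, y in pts)
    dx = means[0][0] - means[1][0]
    dy = means[0][1] - means[1][1]
    det = sxx * syy - sxy * sxy
    # Apply the explicit 2x2 inverse of Sw to the mean difference.
    wx = (syy * dx - sxy * dy) / det
    wy = (-sxy * dx + sxx * dy) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)

# Hypothetical toy data: a top row and a bottom row of points.
points = [(-2.0, 0.9), (-1.0, 1.1), (0.0, 1.0), (1.0, 1.1), (2.0, 0.9),
          (-2.0, -1.1), (-1.0, -0.9), (0.0, -1.0), (1.0, -0.9), (2.0, -1.1)]
labels_rows = [0] * 5 + [1] * 5                       # top row vs bottom row
labels_cols = [1 if x > 0 else 0 for x, _ in points]  # right side vs the rest

pca_axis = pca_direction(points)              # roughly (1, 0), whatever the labels
lda_rows = lda_direction(points, labels_rows) # roughly (0, 1), up to sign
lda_cols = lda_direction(points, labels_cols) # roughly (1, 0), up to sign
print(pca_axis, lda_rows, lda_cols)
```

The direction is only defined up to sign, so compare absolute values of the components rather than the raw numbers.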
How GATE asks this
Expect a conceptual MCQ or MSQ: “what does LDA maximise?”, or a side-by-side “which of the following are true of LDA vs PCA?” testing supervised-vs-unsupervised and separation-vs-variance. The 2026 paper tested exactly this LDA-versus-PCA contrast: no eigen-decomposition arithmetic, just the purpose of each method.
Worked example — when max variance is the wrong direction
Picture two classes, each a long thin cloud, sitting side by side and slightly overlapping (like the diagram above). Why might PCA’s top direction be a bad separator while LDA’s direction works?
Walk through what each method “sees”:
- PCA looks only at total spread. Both clouds are elongated in roughly the same direction (their shared length), so the single direction of greatest variance runs along that length. PCA picks it — and it is blind to the fact that the two classes are stacked across that direction, not along it.
- Project onto the PCA axis and both classes smear over the same interval: the class means barely separate, so the projected clouds overlap heavily. A great variance-capturer, a poor class-separator.
- LDA looks at the labels. It notices the class means are offset across the clouds, and that each cloud is narrow in that crosswise direction (small within-class scatter). The separation-to-scatter ratio is largest there, so LDA chooses the crosswise axis.
- Project onto the LDA axis and the two class means land far apart while each class stays tight — the classes separate cleanly.
The lesson: the direction of greatest variance is not the direction of greatest class separation. Maximising variance (PCA) and maximising separability (LDA) are different objectives and routinely point different ways.
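The walkthrough above can be reproduced numerically with a brute-force sweep. This is a rough sketch rather than how either method is normally computed: it assumes a simplified toy data set (two thin clouds elongated along x and offset in y, with no overlap), tries every whole-degree direction from 0° to 179°, and records which angle maximises each objective. All names are illustrative.

```python
import math

# Hypothetical toy data: two long thin clouds, stacked across their length.
class_a = [(-2.0, 0.9), (-1.0, 1.1), (0.0, 1.0), (1.0, 1.1), (2.0, 0.9)]
class_b = [(-2.0, -1.1), (-1.0, -0.9), (0.0, -1.0), (1.0, -0.9), (2.0, -1.1)]

def project(points, angle_deg):
    """Coordinate of each point along the direction at angle_deg."""
    t = math.radians(angle_deg)
    return [x * math.cos(t) + y * math.sin(t) for x, y in points]

def scatter(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def variance_objective(angle_deg):
    """PCA-style score: spread of all projected points, labels ignored."""
    return scatter(project(class_a + class_b, angle_deg))

def separability_objective(angle_deg):
    """LDA-style score: squared mean gap over summed within-class scatter."""
    pa, pb = project(class_a, angle_deg), project(class_b, angle_deg)
    gap = sum(pa) / len(pa) - sum(pb) / len(pb)
    return gap ** 2 / (scatter(pa) + scatter(pb))

angles = range(0, 180)
best_variance_angle = max(angles, key=variance_objective)
best_separability_angle = max(angles, key=separability_objective)
print(best_variance_angle, best_separability_angle)  # prints: 0 90
```

On this data the variance objective picks the lengthwise axis (0°) and the separability objective picks the crosswise axis (90°), matching the bullet-by-bullet reasoning.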
Quick check
Practice this in an interview
PCA finds the orthogonal directions of maximum variance in the data and projects onto a lower-dimensional subspace, reducing features while retaining most information. It is most useful before distance-based models or when training is bottlenecked by dimensionality. Its main limits are loss of interpretability, sensitivity to scale, and an assumption of linear structure.
L1 adds the sum of absolute coefficient values to the loss, which drives some coefficients to exactly zero and performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks all weights proportionally but rarely zeroes any out. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.
Discriminative models learn the conditional distribution P(y|x) directly and focus entirely on the decision boundary; generative models learn the joint distribution P(x,y) and can generate new samples. Discriminative models typically achieve higher classification accuracy with sufficient labeled data; generative models excel when data is scarce, you need to synthesize data, or the problem requires modeling the input distribution.
t-SNE and UMAP are nonlinear dimensionality reduction algorithms designed primarily for 2D/3D visualization of high-dimensional data. Unlike PCA, they preserve local neighborhood structure rather than global variance, producing cleaner cluster separations in plots. They are generally a poor choice as a preprocessing step for training a supervised model: standard t-SNE is transductive (it has no built-in way to embed unseen points), and both methods produce embeddings that can change across runs and hyperparameter settings.