What is PCA, when should you use it, and what are its key limitations?

PCA finds the orthogonal directions of maximum variance in the data and projects onto a lower-dimensional subspace, reducing the number of features while retaining most of the variance. It is most useful before distance-based models or when training is bottlenecked by dimensionality. Its main limits are loss of interpretability, sensitivity to feature scale, and an assumption of linear structure.
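
As a rough illustration, the projection can be sketched with NumPy's SVD. The function name `pca_project` and the toy data are assumptions made for this example, and the features here are only centred; in practice you would usually standardise them first, because PCA is sensitive to scale.

```python
import numpy as np

def pca_project(X, n_components):
    """Project rows of X onto the top principal components (a sketch)."""
    # Centre each feature; PCA directions are defined relative to the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions, sorted by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance along each direction is s^2 / (n - 1).
    explained_variance = singular_values**2 / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    return X_centered @ components.T, components, ratio

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples, 3 features, most variance in the first feature.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])
Z, components, ratio = pca_project(X, n_components=2)
```

Here `ratio` holds the fraction of total variance kept by each retained component, which is the usual guide for choosing how many components to keep.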

What is the fundamental difference between L1 (Lasso) and L2 (Ridge) regularization, and when do you choose each?

L1 adds the sum of absolute coefficient values to the loss, which drives some coefficients to exactly zero and so performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks all weights toward zero (proportionally, in the simplest case of orthonormal features) but rarely zeroes any out. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.
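
A minimal sketch of the difference, under a strong simplifying assumption: the features are orthonormal (X^T X = I), the Lasso loss is (1/2)||y - Xw||^2 + lam * ||w||_1, and the Ridge loss is (1/2)||y - Xw||^2 + (lam/2) * ||w||^2. In that special case both solutions have closed forms in terms of the least-squares coefficients. The function names and numbers below are hypothetical.

```python
import numpy as np

def lasso_orthonormal(b, lam):
    """Lasso solution when X^T X = I: soft-threshold the least-squares coefficients."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def ridge_orthonormal(b, lam):
    """Ridge solution when X^T X = I: scale every coefficient by 1 / (1 + lam)."""
    return b / (1.0 + lam)

b_ols = np.array([3.0, -0.4, 0.1, -2.0])  # hypothetical least-squares coefficients
lam = 0.5
w_lasso = lasso_orthonormal(b_ols, lam)   # small coefficients become exactly zero
w_ridge = ridge_orthonormal(b_ols, lam)   # every coefficient shrinks, none vanish
```

Coefficients whose magnitude is below `lam` are cut to zero by the soft threshold, while the Ridge rule only rescales them. With correlated features neither closed form applies, and an iterative solver is needed.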

What is the difference between discriminative and generative models, and when would you prefer each?

Discriminative models learn the conditional distribution P(y|x) directly and focus entirely on the decision boundary; generative models learn the joint distribution P(x,y) and can generate new samples. Discriminative models typically achieve higher classification accuracy with sufficient labeled data; generative models excel when data is scarce, you need to synthesize data, or the problem requires modeling the input distribution.

What are the assumptions and limitations of PCA, and when would it hurt your model?

PCA assumes linear relationships, that variance equals importance, and that components should be orthogonal. It can hurt when the predictive signal lives in low-variance directions, when relationships are nonlinear, or when interpretability matters, since components mix original features. It's also sensitive to scaling and outliers and is unsupervised, so it ignores the target.

Linear Discriminant Analysis — GATE DA

What you'll learn

LDA is supervised: it uses class labels to find a separating projection

LDA maximises between-class separation relative to within-class scatter

The key contrast: LDA maximises class separability; PCA maximises variance and ignores labels

Why the maximum-variance direction can be a poor class separator

Last lesson ended with a question about shapes. Model each class as a smooth Gaussian cloud in feature space, and if every cloud is allowed its own shape, the boundary between them bends into a curve. But force all the clouds to share one shape — the same covariance — and something clean happens: the curved terms cancel, and the boundary that remains is a perfectly straight line. That collapse from curve to line is exactly why the method earns its first name, Linear Discriminant Analysis.
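
The cancellation can be written out for two classes. The symbols are assumptions introduced here: class k has a Gaussian density with mean \mu_k and shared covariance \Sigma, and prior \pi_k. The normalising constants are identical because \Sigma is shared, and the quadratic term in x appears once with each sign, so only terms linear in x survive:

```latex
\begin{aligned}
\log\frac{P(y=1\mid x)}{P(y=0\mid x)}
&= \log\frac{\pi_1}{\pi_0}
 - \tfrac12 (x-\mu_1)^\top \Sigma^{-1} (x-\mu_1)
 + \tfrac12 (x-\mu_0)^\top \Sigma^{-1} (x-\mu_0) \\
&= \underbrace{(\mu_1-\mu_0)^\top \Sigma^{-1}}_{w^\top}\, x
 \;+\; \underbrace{\log\frac{\pi_1}{\pi_0}
 - \tfrac12 \mu_1^\top \Sigma^{-1} \mu_1
 + \tfrac12 \mu_0^\top \Sigma^{-1} \mu_0}_{b}
\end{aligned}
```

Setting this log-odds to zero gives the boundary w^T x + b = 0, which is a line in two dimensions and a hyperplane in general. If each class kept its own covariance, the x^T \Sigma_k^{-1} x terms would not cancel and the boundary would be quadratic.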

So LDA keeps naive Bayes’ generative spirit — describe each class, then let Bayes decide — while honouring the correlations naive Bayes threw away, through that shared covariance. But the way GATE actually tests it is from a second, equivalent angle, and it is the one to hold front of mind: LDA as a projection that pulls the classes apart. In practice you reach for LDA when you want to compress labelled data before a classifier, or to see how separable your classes truly are.

What LDA optimises

When you project high-dimensional data down to a line or plane, the direction you pick decides everything: whether two classes land on top of each other or fall cleanly apart. LDA picks that direction with one goal: keep the classes as far apart as possible relative to their spread. It makes “far apart” precise with two quantities, the between-class scatter (how far apart the class means sit after projecting) and the within-class scatter (how spread out each class is around its own mean), and maximises their ratio:

        between-class separation
        ─────────────────────────
         within-class scatter

Push the class centres apart, squeeze each class tight, and the projected classes stop overlapping. Because the class means and per-class spreads can only be computed when you know the labels, LDA is inherently supervised — and that is the seam GATE pries at.
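
For two classes, this ratio is commonly written as Fisher's criterion. The notation below is an assumption added for illustration: w is the projection direction, \mu_0 and \mu_1 are the class means, and x_i, y_i are the samples and their labels.

```latex
J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w},
\qquad
S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^\top,
\qquad
S_W = \sum_{k \in \{0,1\}} \sum_{i:\, y_i = k} (x_i - \mu_k)(x_i - \mu_k)^\top

w^{*} \;\propto\; S_W^{-1} (\mu_1 - \mu_0)
```

Both scatter matrices are built from per-class means, so neither can be formed without the labels.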

LDA vs PCA — same picture, opposite axes

PCA, the unsupervised tool you will meet at the end of this chapter, asks a completely different question: “along which direction does the data vary the most?” It maximises variance and never once looks at the labels. The two directions can point very differently:

PCA follows the overall spread; LDA follows the direction that separates the two classes.

	LDA	PCA
Uses labels?	Yes — supervised	No — unsupervised
Maximises	class separability	variance
Typical use	classification / label-aware reduction	unsupervised dimensionality reduction

How GATE asks this

A conceptual MCQ or MSQ: “what does LDA maximise?”, or a side-by-side “which of the following are true of LDA vs PCA?” testing supervised-versus-unsupervised and separation-versus-variance. The 2026 paper asked exactly this LDA-against-PCA contrast — no eigen-decomposition arithmetic, just the purpose of each method.

Worked example — when max variance is the wrong direction

Picture two classes, each a long thin cloud, sitting side by side and slightly overlapping (like the diagram above). Why might PCA’s top direction be a bad separator while LDA’s direction works?

Walk through what each method “sees”:

PCA looks only at total spread. Both clouds are elongated in roughly the same direction — their shared length — so the single direction of greatest variance runs along that length. PCA picks it, blind to the fact that the two classes are stacked across that direction, not along it.
Project onto the PCA axis and both classes smear over the same interval: the class means barely separate, so the projected clouds overlap heavily. A great variance-capturer, a poor class-separator.
LDA looks at the labels. It notices the class means are offset across the clouds, and that each cloud is narrow in that crosswise direction (small within-class scatter). The separation-to-scatter ratio is largest there, so LDA chooses the crosswise axis.
Project onto the LDA axis and the two class means land far apart while each class stays tight — the classes separate cleanly.

The lesson: the direction of greatest variance is not the direction of greatest class separation. Maximising variance (PCA) and maximising separability (LDA) are different objectives, and they routinely point different ways.
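
The worked example can be checked numerically with a small simulation. This is a sketch under assumed toy settings: two Gaussian clouds stretched along x and thin along y, with class means offset only in y. The helper `separation` is a hypothetical score defined just for this comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Assumed toy setup: both classes are long along x (std 5) and thin along y (std 0.5),
# with class means offset only in y.
cov = np.diag([5.0**2, 0.5**2])
X0 = rng.multivariate_normal([0.0, -1.0], cov, size=n)
X1 = rng.multivariate_normal([0.0, 1.0], cov, size=n)
X = np.vstack([X0, X1])

# PCA direction: top eigenvector of the pooled covariance, labels ignored.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
w_pca = eigvecs[:, -1]  # eigh returns eigenvalues in ascending order

# Two-class Fisher LDA direction: S_W^{-1} (mu1 - mu0), labels required.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
w_lda = np.linalg.solve(S_w, mu1 - mu0)
w_lda /= np.linalg.norm(w_lda)

def separation(w):
    """Gap between projected class means divided by the pooled projected std."""
    p0, p1 = X0 @ w, X1 @ w
    pooled_std = np.sqrt(0.5 * (p0.var() + p1.var()))
    return abs(p1.mean() - p0.mean()) / pooled_std

sep_pca, sep_lda = separation(w_pca), separation(w_lda)
```

In this setup `w_pca` lies almost along the x axis and `w_lda` almost along the y axis, so `sep_lda` comes out far larger than `sep_pca`: the high-variance axis carries almost none of the class signal.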

In one breath

Linear Discriminant Analysis models each class as a Gaussian sharing one covariance — which is what makes its boundary linear — and, equivalently, finds the supervised projection that maximises the ratio of between-class separation to within-class scatter, pulling class means apart while keeping each class tight; its whole GATE hook is the contrast with PCA, which is unsupervised and maximises variance while ignoring labels, so the maximum-variance direction (PCA) is routinely not the best class-separating direction (LDA).

Practice

Quick check


Q1 (Recall). Which statements correctly distinguish LDA from PCA? (select all that apply)

Q2 (Recall). What does Linear Discriminant Analysis maximise when choosing a projection direction?

Q3 (Recall). Which is the better single description of LDA's nature?

Q4 (Apply). You want to reduce 50 features to 2 dimensions specifically so a downstream classifier separates 3 known classes as well as possible. Which method directly targets that goal?

Q5 (Create). Two elongated, overlapping class clusters lie side by side. Why can PCA's top direction be a poor separator? (select all that apply)

A question to carry forward

LDA draws a straight boundary, and so did logistic regression, and so does the Gaussian story behind both. By now a pattern is glaring: many different principles — probability, class scatter — all arrive at a separating line. But when two classes are cleanly separable, there is not one line that splits them; there are infinitely many, a whole pencil of lines all scoring a perfect zero errors.

So the sharper question is no longer “find a separating line” but “find the best one.” Which of those infinitely many perfect separators is safest — the one that leaves the widest empty corridor between the classes, so a new point near the border is least likely to be misjudged? Here is the thread onward: how do you define and maximise that corridor, why does making it as wide as possible come down to shrinking the weight vector, and which surprising handful of points — and only those points — end up deciding where the line goes?
