Multiple Linear Regression
One line, many features. The normal equation w = (XᵀX)⁻¹Xᵀy solves it in closed form — a formula GATE wants you to apply to tiny matrices, not to derive.
What you'll learn
- Extending the line to many features: the model ŷ = Xw with a bias column
- The normal equation w = (XᵀX)⁻¹Xᵀy as a formula to apply, not derive
- Reading off coefficients, and the matrix shapes of X, w, and y
- Why XᵀX must be invertible (full column rank) for the closed form to exist
Before you start
A house price does not depend on size alone — it also depends on location, age, number of rooms. Multiple linear regression is simple regression with the slope generalized to one weight per feature. The single line becomes a weighted sum of inputs, and the tidy hand-formula for one variable becomes one clean matrix equation.
The model in matrix form
Stack your data so each row of the matrix X is one example and each
column is one feature. Add a leading column of 1s — the bias column — so
the intercept rides along as just another weight. The predictions for all rows at
once are then ŷ = Xw, a single matrix-vector product. With n examples and d columns
(bias included), X is n × d, w is d × 1, and both y and ŷ are n × 1.
Each entry of w is a coefficient: holding the other features fixed, it is the
change in the prediction per one-unit increase in that feature. The bias weight
(the coefficient on the column of 1s) is the intercept.
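As a rough illustration of the matrix form, the sketch below builds a design matrix with a bias column and computes every prediction in one product. The feature values and weights are made-up numbers chosen only for the example, not data from this lesson.

```python
import numpy as np

# Hypothetical toy data: 3 examples, 2 features each.
features = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 0.0],
])

# Prepend the bias column of 1s so the intercept becomes weight w[0].
X = np.column_stack([np.ones(len(features)), features])

# Hypothetical weights: intercept first, then one coefficient per feature.
w = np.array([10.0, 5.0, -1.0])

# All predictions at once: a single matrix-vector product.
y_hat = X @ w

print(X.shape, w.shape, y_hat.shape)  # (3, 3) (3,) (3,)
print(y_hat)                          # [13. 19. 25.]
```

Reading the coefficients: raising the first feature by one unit while holding the second fixed raises the prediction by 5, the weight on that column.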
The normal equation — a formula to apply
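Minimizing the squared error Σ(yᵢ − ŷᵢ)² over all the weights at once has a
single closed-form answer, the normal equation:
    w = (XᵀX)⁻¹Xᵀy
The inverse exists only when X has full column rank, that is, when no column of X
is a linear combination of the others.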
GATE wants you to apply this, not derive it. The mechanical recipe: build the
small d × d matrix XᵀX, build the vector Xᵀy, invert the
matrix, and multiply. For a 2 × 2 matrix the inverse is the familiar
(1/det) times the swap-and-negate pattern, so the whole thing is doable by hand.
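The recipe above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `fit_normal_equation` and the toy points are assumptions made for the example, and the code calls `np.linalg.solve` on XᵀXw = Xᵀy instead of forming the inverse explicitly, which yields the same w whenever XᵀX is invertible.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Return w solving (XᵀX) w = Xᵀy. X must already contain the bias column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # The closed form exists only if X has full column rank (XᵀX invertible).
    if np.linalg.matrix_rank(X) < X.shape[1]:
        raise ValueError("X is not full column rank, so XᵀX is singular")
    XtX = X.T @ X   # d × d matrix
    Xty = X.T @ y   # d-vector
    # Solve the d × d linear system; same result as inverting and multiplying.
    return np.linalg.solve(XtX, Xty)

# Hypothetical data: the points (0, 2), (1, 3), (2, 4), with a bias column.
X = np.array([[1, 0], [1, 1], [1, 2]])
y = np.array([2, 3, 4])
w = fit_normal_equation(X, y)
print(w)  # approximately [2, 1], i.e. y_hat = 2 + x
```

If one column is an exact multiple of another (say a third column equal to twice the second), the rank check fails and the function raises instead of returning meaningless weights.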
How GATE asks this
Either an MCQ that asks you to recognize the normal equation
w = (XᵀX)⁻¹Xᵀy (or pick the correct matrix shapes), or a
NAT that hands you a tiny X and y and asks for one coefficient. With a
2 × 2 matrix XᵀX the inverse is one line of arithmetic. The question setters keep the
numbers clean precisely so the matrix algebra, not the calculator work, is what is
tested.
Worked example — recover a line from three points
Fit ŷ = w0 + w1·x to the points (1, 3), (2, 5), (3, 7) using the normal equation. Find w0 and w1.
With a bias column, the design matrix, weight vector, and target are:
    ⎡ 1  1 ⎤        ⎡ w₀ ⎤        ⎡ 3 ⎤
X = ⎢ 1  2 ⎥    w = ⎣ w₁ ⎦    y = ⎢ 5 ⎥
    ⎣ 1  3 ⎦                      ⎣ 7 ⎦
Build XᵀX (a 2×2) and Xᵀy (a 2-vector). Each entry of XᵀX is a dot product of
two columns of X: top-left is column-1 with itself (1·1+1·1+1·1), the
off-diagonal is column-1 with column-2 (1·1+1·2+1·3), bottom-right is column-2
with itself (1·1+2·2+3·3):
      ⎡ 1+1+1  1+2+3 ⎤   ⎡ 3   6 ⎤
XᵀX = ⎣ 1+2+3  1+4+9 ⎦ = ⎣ 6  14 ⎦

      ⎡ 3 + 5 + 7       ⎤   ⎡ 15 ⎤
Xᵀy = ⎣ 1·3 + 2·5 + 3·7 ⎦ = ⎣ 34 ⎦
Invert the 2×2 (determinant = 3·14 − 6·6 = 42 − 36 = 6):
                  ⎡ 14  −6 ⎤
(XᵀX)⁻¹ = (1/6) · ⎣ −6   3 ⎦

            ⎡ 14·15 − 6·34 ⎤           ⎡ 210 − 204 ⎤           ⎡  6 ⎤   ⎡ 1 ⎤
w = (1/6) · ⎣ −6·15 + 3·34 ⎦ = (1/6) · ⎣ −90 + 102 ⎦ = (1/6) · ⎣ 12 ⎦ = ⎣ 2 ⎦
So w0 = 1, w1 = 2 — the model is ŷ = 1 + 2x. Check: at
x = 1, 2, 3 it gives 3, 5, 7, matching every point exactly (these points are
perfectly collinear, so the residuals are zero and the fit is exact).
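The hand computation above can be double-checked with a few lines of plain Python. This is only a sketch: the variable names (`a`, `b`, `d`, `p`, `q`) are labels invented here for the entries of XᵀX and Xᵀy, and `fractions.Fraction` is used so the arithmetic stays exact, as it would on paper.

```python
from fractions import Fraction

xs = [1, 2, 3]
ys = [3, 5, 7]

# Entries of XᵀX for X = [1, x]: dot products of the two columns.
a = len(xs)                      # ones · ones = 3
b = sum(xs)                      # ones · x    = 6
d = sum(x * x for x in xs)       # x · x       = 14

# Entries of Xᵀy.
p = sum(ys)                               # ones · y = 15
q = sum(x * y for x, y in zip(xs, ys))    # x · y    = 34

det = a * d - b * b              # 3·14 − 6·6 = 6

# 2×2 inverse: swap the diagonal, negate the off-diagonal, divide by det,
# then multiply by Xᵀy.
w0 = Fraction(d * p - b * q, det)
w1 = Fraction(-b * p + a * q, det)

print(w0, w1)  # 1 2
```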
Quick check
Practice this in an interview
OLS minimizes the sum of squared residuals. Setting the gradient of the loss to zero yields the normal equations XᵀXw = Xᵀy. When X has full column rank, their unique solution is w = (XᵀX)⁻¹Xᵀy, and the fitted values Xw are the orthogonal projection of y onto the column space of X. The matrix X(XᵀX)⁻¹Xᵀ that performs this projection is called the hat matrix.
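A brief sketch of that derivation, writing the coefficient vector as w (as in the normal equation earlier in this lesson) and assuming XᵀX is invertible:

```latex
\begin{aligned}
L(w) &= \lVert y - Xw \rVert^2 = (y - Xw)^\top (y - Xw) \\
\nabla_w L(w) &= -2\,X^\top (y - Xw) = 0 \\
X^\top X\, w &= X^\top y \\
w &= (X^\top X)^{-1} X^\top y
\end{aligned}
```

The second line says the residual y − Xw is orthogonal to every column of X, which is why the fitted values Xw are the orthogonal projection of y onto the column space of X.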
The normal equation gives an exact closed-form solution, but forming XᵀX costs O(np²) and solving the resulting p × p system costs O(p³), so it becomes impractical when the number of features p is large (a common rule of thumb is beyond roughly 10,000). Gradient descent costs O(np) per iteration, which makes it the usual choice for large feature spaces and for online learning.
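For comparison, here is a minimal gradient descent sketch on the three points from the worked example. The step size `lr` and the iteration count are assumptions picked so that this tiny problem converges; they are not recommended defaults.

```python
import numpy as np

# Design matrix (with bias column) and targets from the worked example.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])

w = np.zeros(2)   # start from all-zero weights
lr = 0.05         # assumed step size
for _ in range(5000):
    # Gradient of the mean squared error; costs O(np) per iteration.
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= lr * grad

print(np.round(w, 4))  # [1. 2.]
```

The iterations approach the same w = (1, 2) that the normal equation returns in one step, without ever forming or inverting XᵀX.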
OLS linear regression rests on five assumptions: linearity, independence of errors, homoscedasticity, normality of residuals, and no perfect multicollinearity. Violating any one of them degrades coefficient estimates, standard errors, or the validity of hypothesis tests.
Linear regression predicts unbounded real values, so it can output probabilities below 0 or above 1, and its loss function penalizes confident correct predictions. Logistic regression fixes this by applying the sigmoid to map any real score to (0,1) and optimizing log-loss, which is a proper scoring rule aligned with probability calibration.