How does Ordinary Least Squares derive the coefficient vector, and what is the closed-form solution?

OLS minimizes the sum of squared residuals. Setting the gradient of the loss to zero yields the normal equations XᵀXβ = Xᵀy. When X has full column rank, their unique solution is β = (XᵀX)⁻¹Xᵀy, and the fitted values Xβ are the orthogonal projection of y onto the column space of X. The hat matrix is that projection itself, H = X(XᵀX)⁻¹Xᵀ, so ŷ = Hy.
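As a rough numerical check of the answer above, the sketch below solves the normal equations on a small made-up dataset (the numbers are illustrative assumptions, not from the source), then confirms that the gradient of the squared loss vanishes at the solution and that the hat matrix reproduces the fitted values.

```python
import numpy as np

# Illustrative data (made up for this sketch): 5 samples, bias column + 2 features.
X = np.array([[1.0, 0.5, 2.0],
              [1.0, 1.5, 0.3],
              [1.0, 2.0, 1.1],
              [1.0, 3.2, 0.7],
              [1.0, 4.1, 2.5]])
y = np.array([2.1, 2.9, 4.2, 5.8, 8.0])

# Solve the normal equations (X^T X) beta = X^T y without forming an explicit inverse.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient of sum((y - X beta)^2) with respect to beta is -2 X^T (y - X beta).
residual = y - X @ beta
gradient = -2.0 * X.T @ residual

# Hat matrix H = X (X^T X)^{-1} X^T projects y onto the column space of X.
H = X @ np.linalg.solve(X.T @ X, X.T)
fitted = H @ y

print("gradient at beta:", np.round(gradient, 10))
print("H @ y matches X @ beta:", np.allclose(fitted, X @ beta))
```

The gradient is zero only up to floating-point rounding, which is why the check uses a tolerance rather than exact equality.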

When should you use gradient descent over the normal equation to fit a linear regression?

The normal equation gives an exact closed-form solution, but forming XᵀX costs O(np²) and solving the resulting p × p system costs O(p³), so it becomes impractical when the number of features p is large (a common rule of thumb is above roughly 10,000). Gradient descent costs O(np) per iteration, which makes it the usual choice for large feature spaces, for data that does not fit in memory, and for online learning.
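For comparison, here is a minimal sketch of batch gradient descent next to the closed form. The synthetic data, the learning rate, and the iteration count are all assumptions picked for this toy problem, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): n = 200 samples, bias column + 3 features.
n, p = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
true_w = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)

# Closed form: solve the normal equations once.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the mean squared error.
w_gd = np.zeros(p + 1)
learning_rate = 0.1          # hand-picked for this toy problem
for _ in range(2000):
    gradient = (2.0 / n) * X.T @ (X @ w_gd - y)   # O(n p) work per step
    w_gd -= learning_rate * gradient

print("closed form     :", np.round(w_closed, 4))
print("gradient descent:", np.round(w_gd, 4))
```

Both routes aim at the same minimum. The iterative one trades a single expensive solve for many cheap passes over the data.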

What are the core assumptions of linear regression, and what breaks when each is violated?

OLS linear regression rests on five assumptions: linearity, independence of errors, homoscedasticity, normality of residuals, and no perfect multicollinearity. Violating any one of them degrades coefficient estimates, standard errors, or the validity of hypothesis tests.

Why is linear regression unsuitable for binary classification, and what specific problems does logistic regression fix?

Linear regression predicts unbounded real values, so it can output probabilities below 0 or above 1, and its loss function penalizes confident correct predictions. Logistic regression fixes this by applying the sigmoid to map any real score to (0,1) and optimizing log-loss, which is a proper scoring rule aligned with probability calibration.
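The range problem is easy to see on a toy example. The sketch below fits a least-squares line to made-up 0/1 labels and evaluates it away from the training range. It does not fit a logistic model; the `sigmoid` helper is only there to show how any real score is squashed into (0, 1).

```python
import numpy as np

# Toy 1-D binary data (made up for illustration): label is 1 for larger x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
labels = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Fit a straight line to the 0/1 labels by least squares.
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(X.T @ X, X.T @ labels)

def linear_score(x_new):
    """Prediction of the least-squares line at x_new."""
    return w[0] + w[1] * x_new

def sigmoid(z):
    """Map a real score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Far from the training range, the line leaves [0, 1].
print("linear output at x = 20 :", linear_score(20.0))
print("linear output at x = -10:", linear_score(-10.0))

# The sigmoid keeps any real score strictly between 0 and 1.
print("sigmoid(8.0)  =", sigmoid(8.0))
print("sigmoid(-8.0) =", sigmoid(-8.0))

# Squared error charges a correct but large score on a positive example.
print("squared error for score 3.0 on label 1:", (3.0 - 1.0) ** 2)
```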

Multiple Linear Regression — GATE DA

What you'll learn

Extending the line to many features: the model ŷ = Xw with a bias column

The normal equation w = (XᵀX)⁻¹Xᵀy as a formula to apply, not derive

Reading off coefficients, and the matrix shapes of X, w, and y

Why XᵀX must be invertible (full column rank) for the closed form to exist

Last lesson left the line straining against its own limit: a single slope, when a house price answers to size and location and age and room count, all at once. One number cannot carry four influences. So the line has to grow a weight for each feature — and, happily, there is still a single closed-form solution waiting on the other side.

Multiple linear regression is simple regression with the lone slope generalised to one weight per feature. The single line becomes a weighted sum of inputs, and the tidy scalar hand-formula Σxy / Σx² becomes one clean matrix equation. The bookkeeping moves from numbers to matrices, but the idea — minimise the squared error — does not budge an inch.

The model in matrix form

Stack your data so each row of the matrix X is one example and each column is one feature. Then add a leading column of 1s — the bias column — so the intercept rides along as just another weight. The predictions for every row at once become ŷ = Xw, a single matrix-vector product:

  ŷ = X · w

Shapes must chain: (n×d)·(d×1) = (n×1). With a bias column, d counts the intercept too.
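A minimal sketch of the shape bookkeeping in NumPy. The feature values and weights are hypothetical house numbers chosen for illustration, not taken from the lesson.

```python
import numpy as np

# Hypothetical raw features: 4 houses, 2 features (size in 100 m^2, age in decades).
features = np.array([[1.2, 3.0],
                     [0.8, 1.0],
                     [2.0, 0.5],
                     [1.5, 2.0]])

# Prepend the bias column of 1s so the intercept is just another weight.
X = np.column_stack([np.ones(len(features)), features])

# Hypothetical weights: intercept first, then one weight per feature.
w = np.array([[50.0], [100.0], [-5.0]])   # shape (d, 1) with d = 3

y_hat = X @ w

print("X shape    :", X.shape)      # (n, d)
print("w shape    :", w.shape)      # (d, 1)
print("y_hat shape:", y_hat.shape)  # (n, 1)
```

If the inner dimensions disagree, for example when the bias column is forgotten, `X @ w` raises a `ValueError` instead of silently producing a wrong answer.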

Each entry of w is a coefficient: holding the other features fixed, it is the change in the prediction per one-unit increase in that feature. The bias weight — the coefficient on the column of 1s — is the intercept the previous lesson called c.
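To make the "holding the other features fixed" reading concrete, the sketch below bumps one feature by a single unit and compares predictions. The weights and the `predict` helper are hypothetical, introduced only for this illustration.

```python
import numpy as np

# Hypothetical fitted weights: [intercept, weight for feature 1, weight for feature 2].
w = np.array([50.0, 100.0, -5.0])

def predict(feature_1, feature_2):
    # The leading 1.0 plays the role of the bias column.
    return np.array([1.0, feature_1, feature_2]) @ w

base = predict(1.2, 3.0)
bumped = predict(2.2, 3.0)   # feature 1 up by one unit, feature 2 held fixed

# The difference matches w[1] up to floating-point rounding.
print("change in prediction:", bumped - base)

# With every feature at zero, only the intercept w[0] is left.
print("prediction with all features at zero:", predict(0.0, 0.0))
```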

The normal equation — a formula to apply

Minimising the squared error Σ(yᵢ − ŷᵢ)² over all the weights at once has a single closed-form answer, the normal equation:

  w = (XᵀX)⁻¹ Xᵀy

Form XᵀX (small, d×d), invert it, multiply by Xᵀy. The product is the optimal weight vector.

GATE wants you to apply this, not derive it. The mechanical recipe: build the small d × d matrix XᵀX, build the vector Xᵀy, invert the matrix, and multiply. For a 2 × 2 matrix the inverse is the familiar (1/det) times the swap-and-negate pattern, so the whole thing is doable by hand.
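The swap-and-negate pattern for a 2 × 2 inverse can be sketched in a few lines. The helper name `inverse_2x2` and the example matrix are assumptions for illustration; the library call is shown only as a cross-check.

```python
import numpy as np

def inverse_2x2(m):
    """Invert a 2x2 matrix [[a, b], [c, d]] with the swap-and-negate rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular; no inverse exists")
    # Swap the diagonal entries, negate the off-diagonal ones, divide by det.
    return np.array([[d, -b], [-c, a]]) / det

# Example matrix chosen for illustration.
A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

print(inverse_2x2(A))
print(np.linalg.inv(A))   # library result for comparison
```

A zero determinant is exactly the case where the closed form breaks down, so the helper refuses to divide rather than return infinities.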

How GATE asks this

Either an MCQ that asks you to recognise the normal equation w = (XᵀX)⁻¹Xᵀy (or pick the correct matrix shapes), or a NAT that hands you a tiny X and y and asks for one coefficient. With a 2 × 2 matrix XᵀX the inverse is one line of arithmetic. The graders keep the numbers clean on purpose, so the matrix algebra, not the calculator work, is what gets tested.

Worked example — recover a line from three points

Fit ŷ = w₀ + w₁·x to the points (1, 3), (2, 5), (3, 7) using the normal equation. Find w₀ and w₁.

With a bias column, the design matrix, weight vector, and target are:

      ⎡ 1  1 ⎤            ⎡ w₀ ⎤          ⎡ 3 ⎤
  X = ⎢ 1  2 ⎥      w =   ⎣ w₁ ⎦     y =  ⎢ 5 ⎥
      ⎣ 1  3 ⎦                            ⎣ 7 ⎦

Build XᵀX (a 2×2) and Xᵀy (a 2-vector). Each entry of XᵀX is a dot product of two columns of X: top-left is column-1 with itself (1·1+1·1+1·1), the off-diagonal is column-1 with column-2 (1·1+1·2+1·3), and bottom-right is column-2 with itself (1·1+2·2+3·3):

         ⎡ 1+1+1     1+2+3  ⎤   ⎡  3   6 ⎤
  XᵀX =  ⎣ 1+2+3   1+4+9    ⎦ = ⎣  6  14 ⎦

         ⎡ 3 + 5 + 7        ⎤   ⎡ 15 ⎤
  Xᵀy =  ⎣ 1·3 + 2·5 + 3·7  ⎦ = ⎣ 34 ⎦

Invert the 2×2 (determinant = 3·14 − 6·6 = 42 − 36 = 6):

              1  ⎡ 14  −6 ⎤
  (XᵀX)⁻¹ =  ───  ⎣ −6   3 ⎦
              6

         1  ⎡ 14·15 − 6·34 ⎤    1  ⎡ 210 − 204 ⎤    1  ⎡ 6  ⎤   ⎡ 1 ⎤
  w  =  ───  ⎣ −6·15 + 3·34 ⎦ = ───  ⎣ −90 + 102 ⎦ = ───  ⎣ 12 ⎦ = ⎣ 2 ⎦
         6                       6                     6

So w₀ = 1, w₁ = 2 — the model is ŷ = 1 + 2x. The same steps in NumPy confirm it:

import numpy as np

X = np.array([[1, 1],
              [1, 2],
              [1, 3]], dtype=float)   # bias column + one feature
y = np.array([3, 5, 7], dtype=float)

XtX = X.T @ X
Xty = X.T @ y
w   = np.linalg.inv(XtX) @ Xty

print("XtX =", XtX.tolist())
print("Xty =", Xty.tolist())
print("w   =", w.tolist())

XtX = [[3.0, 6.0], [6.0, 14.0]]
Xty = [15.0, 34.0]
w   = [1.0, 2.0]

Check it against the data: at x = 1, 2, 3 the model gives 3, 5, 7, matching every point exactly. These three points are perfectly collinear, so the residuals are all zero and the fit is exact.
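The explicit inverse mirrors the hand calculation, but the same weights can also be obtained from a linear solver. The sketch below is one possible alternative rather than the lesson's method: it uses `np.linalg.solve` and `np.linalg.lstsq` on the same three points and checks the residuals.

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # bias column + one feature
y = np.array([3.0, 5.0, 7.0])

# Solve (X^T X) w = X^T y directly, without forming the inverse.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Or let NumPy's least-squares routine work on X and y themselves.
w_lstsq, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ w_solve

print("solve :", np.round(w_solve, 6))
print("lstsq :", np.round(w_lstsq, 6))
print("rank of X:", rank)
print("max |residual|:", np.abs(residuals).max())
```

Solving the system avoids computing an inverse that is used only once, which tends to be both cheaper and numerically gentler on larger problems.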

In one breath

Multiple linear regression stacks the data into a design matrix X (rows = samples, columns = features, plus a leading bias column for the intercept) and predicts all rows at once as ŷ = Xw; minimising the squared error over every weight together has the single closed-form normal equation w = (XᵀX)⁻¹Xᵀy, which you apply by forming the small d×d matrix XᵀX, inverting it, and multiplying by Xᵀy — and which only works when XᵀX is invertible, i.e. when no feature column is a linear combination of the others.
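The invertibility condition can be checked before applying the formula. In this sketch the design matrix is hypothetical and deliberately built so that one column is an exact multiple of another, which is the failure case described above.

```python
import numpy as np

# Hypothetical design matrix: the third column is exactly 2 x the second,
# so the columns are linearly dependent.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0],
              [1.0, 4.0, 8.0]])

n_columns = X.shape[1]
rank = np.linalg.matrix_rank(X)

print("columns of X:", n_columns)
print("rank of X   :", rank)

# Full column rank needs rank == number of columns. Here it falls short,
# so X^T X is singular and the closed form has no unique answer.
has_unique_solution = rank == n_columns
print("closed form usable:", has_unique_solution)
```

Dropping one of the two dependent columns restores full column rank, after which the normal equation applies again.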

Practice

Quick check

Q1 (Recall): The normal equation for least-squares multiple regression is w = (XᵀX)⁻¹Xᵀy. If X is n×d (with the bias column counted in d), what is the shape of XᵀX?

Q2 (Recall): Which statements about the normal equation w = (XᵀX)⁻¹Xᵀy are TRUE? (Select all that apply.)

Q3 (Trace): Using the worked example, the normal equation gave w = (1, 2), so ŷ = w₀ + w₁x. What is the predicted ŷ at x = 4? (Numerical answer.)

Q4 (Trace): A normal-equation problem reduces to XᵀX = [[2, 0], [0, 5]] and Xᵀy = [6, 10]. Find the first weight w₀. (Numerical answer.)

Q5 (Apply): Why does ridge regression add a term λI to form (XᵀX + λI)⁻¹?

A question to carry forward

The normal equation is a marvel: one shot, no iteration, the exact best weights. But read its cost again — it forms XᵀX and then inverts it. Inverting a d × d matrix costs on the order of d³ operations, and the inverse only exists when the columns are independent. With a handful of features that is nothing. With ten thousand features, or columns that very nearly collide, the one-shot formula becomes slow, memory-hungry, or simply undefined.

So we need a second route to the same minimum — one that never forms an inverse, never even builds XᵀX, but instead creeps toward the best weights a little at a time. Here is the thread onward: if you are standing somewhere on the bowl-shaped cost surface and want to reach its bottom, which direction is downhill, how far should you step, and what single update rule turns that into an algorithm?
