datarekha

Multiple Linear Regression

One line, many features. The normal equation w = (XᵀX)⁻¹Xᵀy solves it in closed form — a formula GATE wants you to apply to tiny matrices, not to derive.

8 min read Intermediate GATE DA Lesson 79 of 122

What you'll learn

  • Extending the line to many features: the model ŷ = Xw with a bias column
  • The normal equation w = (XᵀX)⁻¹Xᵀy as a formula to apply, not derive
  • Reading off coefficients, and the matrix shapes of X, w, and y
  • Why XᵀX must be invertible (full column rank) for the closed form to exist

Before you start

A house price does not depend on size alone — it also depends on location, age, number of rooms. Multiple linear regression is simple regression with the slope generalized to one weight per feature. The single line becomes a weighted sum of inputs, and the tidy hand-formula for one variable becomes one clean matrix equation.

The model in matrix form

Stack your data so each row of the matrix X is one example and each column is one feature. Add a leading column of 1s — the bias column — so the intercept rides along as just another weight. The predictions for all rows at once are then ŷ = Xw, a single matrix-vector product:

[Figure: X (n × d; rows = samples, columns = features) · w (d × 1; one weight per feature) = ŷ (n × 1; one prediction per row). The inner dimensions (d) cancel; the outer dimensions give n × 1.]
Shapes must chain: (n×d)·(d×1) = (n×1). With a bias column, d counts the intercept too.

Each entry of w is a coefficient: holding the other features fixed, it is the change in the prediction per one-unit increase in that feature. The bias weight (the coefficient on the column of 1s) is the intercept.
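The shape bookkeeping above can be sketched in a few lines. This is a minimal illustration assuming NumPy; the feature values and weights are made-up numbers, not data from this lesson.

```python
import numpy as np

# Hypothetical raw features: 4 examples, 2 features each (illustrative numbers).
features = np.array([
    [1.0, 2.0],
    [2.0, 0.0],
    [3.0, 1.0],
    [4.0, 3.0],
])

# Prepend the bias column of 1s so the intercept becomes weight w[0].
X = np.column_stack([np.ones(len(features)), features])

# Hypothetical weights: intercept 0.5, then one weight per feature.
# NumPy stores w as a 1-D array of length d rather than a d x 1 column.
w = np.array([0.5, 2.0, -1.0])

# (n x d) times (d) gives one prediction per row.
y_hat = X @ w

print(X.shape, w.shape, y_hat.shape)
```

Here n = 4 and d = 3, because d counts the bias column as well as the two features.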

The normal equation — a formula to apply

Minimizing the squared error Σ(yᵢ − ŷᵢ)² over all the weights at once has a single closed-form answer, the normal equation:

w = (XᵀX)⁻¹ Xᵀy, where (XᵀX)⁻¹ is the inverse of a d×d matrix and Xᵀy is a d×1 vector.
Form XᵀX (small, d×d), invert it, multiply by Xᵀy. The product is the optimal weight vector.

GATE wants you to apply this, not derive it. The mechanical recipe: build the small d × d matrix XᵀX, build the vector Xᵀy, invert the matrix, and multiply. For a 2 × 2 matrix the inverse is the familiar (1/det) times the swap-and-negate pattern, so the whole thing is doable by hand.
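The by-hand recipe for the 2 × 2 case can be written out as a short sketch. The function name `solve_2x2_normal_equation` and the input numbers are hypothetical, chosen only for illustration; `Fraction` is used so the arithmetic stays exact, as it would on paper.

```python
from fractions import Fraction


def solve_2x2_normal_equation(xtx, xty):
    """Return w = (XᵀX)⁻¹ Xᵀy for a 2x2 XᵀX with integer entries."""
    (a, b), (c, d) = xtx
    det = a * d - b * c
    if det == 0:
        raise ValueError("XᵀX is singular, so the closed form does not exist")
    # Inverse of [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]]:
    # swap the diagonal entries, negate the off-diagonal entries.
    inv = [[Fraction(d, det), Fraction(-b, det)],
           [Fraction(-c, det), Fraction(a, det)]]
    p, q = xty
    return [inv[0][0] * p + inv[0][1] * q,
            inv[1][0] * p + inv[1][1] * q]


# Illustrative inputs (not taken from the lesson): det = 4*3 - 2*2 = 8.
w = solve_2x2_normal_equation([[4, 2], [2, 3]], [10, 7])
print(w)
```

For these inputs the weights come out as whole numbers, and substituting them back into XᵀXw reproduces Xᵀy.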

How GATE asks this

Either an MCQ that asks you to recognize the normal equation w = (XᵀX)⁻¹Xᵀy (or pick the correct matrix shapes), or a NAT that hands you a tiny X and y and asks for one coefficient. With a 2 × 2 matrix XᵀX the inverse is one line of arithmetic. The question setters keep the numbers clean precisely so that the matrix algebra, not the calculator work, is what is tested.

Worked example — recover a line from three points

Fit ŷ = w₀ + w₁x to the points (1, 3), (2, 5), (3, 7) using the normal equation. Find w₀ and w₁.

With a bias column, the design matrix, weight vector, and target are:

      ⎡ 1  1 ⎤            ⎡ w₀ ⎤          ⎡ 3 ⎤
  X = ⎢ 1  2 ⎥      w =   ⎣ w₁ ⎦     y =  ⎢ 5 ⎥
      ⎣ 1  3 ⎦                            ⎣ 7 ⎦

Build XᵀX (a 2×2) and Xᵀy (a 2-vector). Each entry of XᵀX is a dot product of two columns of X: top-left is column-1 with itself (1·1+1·1+1·1), the off-diagonal is column-1 with column-2 (1·1+1·2+1·3), bottom-right is column-2 with itself (1·1+2·2+3·3):

         ⎡ 1+1+1     1+2+3  ⎤   ⎡  3   6 ⎤
  XᵀX =  ⎣ 1+2+3   1+4+9    ⎦ = ⎣  6  14 ⎦

         ⎡ 3 + 5 + 7        ⎤   ⎡ 15 ⎤
  Xᵀy =  ⎣ 1·3 + 2·5 + 3·7  ⎦ = ⎣ 34 ⎦

Invert the 2×2 (determinant = 3·14 − 6·6 = 42 − 36 = 6):

              1  ⎡ 14  −6 ⎤
  (XᵀX)⁻¹ =  ───  ⎣ −6   3 ⎦
              6

Multiply the inverse by Xᵀy to get the weights:

         1  ⎡ 14·15 − 6·34 ⎤    1  ⎡ 210 − 204 ⎤    1  ⎡ 6  ⎤   ⎡ 1 ⎤
  w  =  ───  ⎣ −6·15 + 3·34 ⎦ = ───  ⎣ −90 + 102 ⎦ = ───  ⎣ 12 ⎦ = ⎣ 2 ⎦
         6                       6                     6

So w₀ = 1 and w₁ = 2, giving the model ŷ = 1 + 2x. Check: at x = 1, 2, 3 it gives 3, 5, 7, matching every point exactly (these points are perfectly collinear, so the residuals are zero and the fit is exact).
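The hand calculation can be cross-checked numerically. The sketch below assumes NumPy and uses `np.linalg.solve` on the system XᵀXw = Xᵀy instead of forming the inverse explicitly; that is a common numerical habit and gives the same weights as the inverse-based formula.

```python
import numpy as np

# The three points from the worked example.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

# Design matrix: bias column of 1s, then the feature x.
X = np.column_stack([np.ones_like(x), x])

XtX = X.T @ X   # expected [[3, 6], [6, 14]]
Xty = X.T @ y   # expected [15, 34]

# Solve XᵀX w = Xᵀy for w.
w = np.linalg.solve(XtX, Xty)

# Residuals should vanish because the points are collinear.
residuals = y - X @ w

print(XtX)
print(Xty)
print(w)
print(residuals)
```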

Quick check

Q1. The normal equation for least-squares multiple regression is w = (XᵀX)⁻¹Xᵀy. If X is n×d (with the bias column counted in d), what is the shape of XᵀX?

Q2. Using the worked example, the normal equation gave XᵀX = [[3, 6], [6, 14]] and Xᵀy = [15, 34], yielding w = (1, 2). What is the predicted ŷ at x = 4? (Model: ŷ = w₀ + w₁x. Numerical answer.)

Q3. A normal-equation problem reduces to XᵀX = [[2, 0], [0, 5]] and Xᵀy = [6, 10]. Find the first weight w₀. (Numerical answer.)

Q4. Which statements about the normal equation w = (XᵀX)⁻¹Xᵀy are TRUE? (Select all that apply.)

Q5. Why does ridge regression add a term λI to form (XᵀX + λI)⁻¹?

Practice this in an interview

How does Ordinary Least Squares derive the coefficient vector, and what is the closed-form solution?

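OLS minimizes the sum of squared residuals. Setting the gradient of the loss to zero yields the normal equations XᵀXβ = Xᵀy. When X has full column rank they have the unique solution β = (XᵀX)⁻¹Xᵀy, and the fitted values ŷ = Xβ are the orthogonal projection of y onto the column space of X. The matrix X(XᵀX)⁻¹Xᵀ that performs this projection is called the hat matrix.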
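The projection view can be checked numerically. This is a minimal sketch assuming NumPy and a small made-up dataset that is deliberately not collinear, so the residuals are nonzero. It verifies that H = X(XᵀX)⁻¹Xᵀ reproduces the fitted values, that the residual is orthogonal to every column of X, and that H is idempotent, as a projection should be.

```python
import numpy as np

# Illustrative noisy data (not from the lesson).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 5.0])
X = np.column_stack([np.ones_like(x), x])

# Coefficients from the normal equations XᵀX beta = Xᵀy.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix: maps y to the fitted values y_hat.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
residual = y - y_hat

fitted_match = np.allclose(y_hat, X @ beta)          # H y equals X beta
orthogonal = np.allclose(X.T @ residual, 0.0)        # residual is orthogonal to col(X)
idempotent = np.allclose(H @ H, H)                   # projecting twice changes nothing

print(fitted_match, orthogonal, idempotent)
```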

When should you use gradient descent over the normal equation to fit a linear regression?

The normal equation gives an exact closed-form solution, but it costs O(np²) to form XᵀX plus O(p³) to invert or factor it, so it becomes impractical when the number of features p is large (a common rule of thumb is above roughly 10,000). Gradient descent costs O(np) per iteration, which makes it the usual choice for large feature spaces and for online learning.
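For comparison, here is a minimal batch gradient descent sketch on the mean squared error, assuming NumPy. The function name `fit_gradient_descent`, the learning rate, and the iteration count are illustrative choices tuned to this tiny dataset, not general recommendations.

```python
import numpy as np


def fit_gradient_descent(X, y, learning_rate=0.05, n_iters=5000):
    """Minimize mean squared error with plain batch gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        # Gradient of (1/n) * sum((Xw - y)^2); costs O(n*d) per step.
        grad = (2.0 / n) * (X.T @ (X @ w - y))
        w -= learning_rate * grad
    return w


# Reuse the three points (1, 3), (2, 5), (3, 7) with a bias column.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
X = np.column_stack([np.ones_like(x), x])

w_gd = fit_gradient_descent(X, y)
w_exact = np.linalg.solve(X.T @ X, X.T @ y)

print(w_gd)
print(w_exact)
```

On a problem this small both routes agree to many decimal places; the trade-off only matters once p is large enough that forming and factoring XᵀX dominates the cost.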

What are the core assumptions of linear regression, and what breaks when each is violated?

OLS linear regression rests on five assumptions: linearity, independence of errors, homoscedasticity, normality of residuals, and no perfect multicollinearity. Violating any one of them degrades coefficient estimates, standard errors, or the validity of hypothesis tests.
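The multicollinearity assumption ties directly to the invertibility of XᵀX. As a small sketch assuming NumPy, the design matrix below has a third column that is exactly twice the second (an illustrative construction), so X does not have full column rank and XᵀX cannot be inverted.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Columns: bias, x, and 2*x. The last column duplicates the information in x.
X = np.column_stack([np.ones_like(x), x, 2.0 * x])
XtX = X.T @ X

n_columns = X.shape[1]
rank_X = np.linalg.matrix_rank(X)
rank_XtX = np.linalg.matrix_rank(XtX)

# Rank below the number of columns means XᵀX is singular.
print(n_columns, rank_X, rank_XtX)
```

With three columns but rank two, infinitely many weight vectors give the same predictions, which is why the closed form has no unique answer here.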

Why is linear regression unsuitable for binary classification, and what specific problems does logistic regression fix?

Linear regression predicts unbounded real values, so it can output probabilities below 0 or above 1, and its loss function penalizes confident correct predictions. Logistic regression fixes this by applying the sigmoid to map any real score to (0,1) and optimizing log-loss, which is a proper scoring rule aligned with probability calibration.
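The loss-function point can be made concrete with one number. In this sketch (standard library only; the raw score of 4.0 is a made-up value), a positive example receives a large linear score. Squared error treats that score as being far from the target 1, while the sigmoid plus log-loss treats it as a confident, nearly correct probability.

```python
import math


def sigmoid(z):
    """Map any real score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def squared_error(y_true, prediction):
    return (y_true - prediction) ** 2


def log_loss(y_true, p):
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


y_true = 1          # the example really is positive
raw_score = 4.0     # hypothetical linear output, well above 1

penalty_linear = squared_error(y_true, raw_score)        # (1 - 4)^2 = 9.0
probability = sigmoid(raw_score)                         # strictly between 0 and 1
penalty_logistic = log_loss(y_true, probability)         # small, since p is near 1

print(penalty_linear, probability, penalty_logistic)
```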
