How does Ordinary Least Squares derive the coefficient vector, and what is the closed-form solution?

OLS minimizes the sum of squared residuals. Setting the gradient of the loss to zero yields the normal equations XᵀXβ = Xᵀy. When X has full column rank, their unique solution is β = (XᵀX)⁻¹Xᵀy, and the fitted values Xβ are the orthogonal projection of y onto the column space of X. The matrix X(XᵀX)⁻¹Xᵀ that performs this projection is the hat matrix.
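A minimal sketch of the closed form, assuming NumPy is available. The data are made up for illustration, and the code solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, then cross-checks against `np.linalg.lstsq`.

```python
import numpy as np

# Hypothetical toy data: five observations, an intercept column plus one feature.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 8.8])
X = np.column_stack([np.ones_like(x), x])

# Normal equations: (X^T X) beta = X^T y. Solving the linear system is
# numerically safer than computing the inverse explicitly.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The hat matrix H = X (X^T X)^{-1} X^T projects y onto the column space of X.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

print(beta)
```

Here `y_hat` and `X @ beta` are the same fitted values reached by two routes, which is the projection statement above in numerical form.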

What are the core assumptions of linear regression, and what breaks when each is violated?

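OLS linear regression rests on five assumptions: linearity, independence of errors, homoscedasticity, normality of errors, and no perfect multicollinearity. Each violation has its own cost: a misspecified functional form biases the coefficients; correlated or heteroscedastic errors leave the coefficients unbiased but make the usual standard errors wrong; non-normal errors undermine exact small-sample t and F tests; and perfect multicollinearity leaves the coefficients without a unique solution.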
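The multicollinearity case is the easiest one to see numerically. The sketch below uses hypothetical data in which one feature is exactly twice the other, and shows that XᵀX then has a zero determinant, so it cannot be inverted.

```python
# Hypothetical data: the second feature is exactly twice the first,
# so the two columns are perfectly collinear.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0 * v for v in x1]

# Entries of the 2x2 matrix X^T X (no intercept, to keep the algebra small).
s11 = sum(a * a for a in x1)
s12 = sum(a * b for a, b in zip(x1, x2))
s22 = sum(b * b for b in x2)

# A zero determinant means X^T X cannot be inverted:
# the normal equations have infinitely many solutions.
det = s11 * s22 - s12 * s12
print(f"det(X^T X) = {det}")
```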

When should you use gradient descent over the normal equation to fit a linear regression?

The normal equation gives an exact closed-form solution, but forming XᵀX costs O(np²) and solving the resulting system costs O(p³), so it becomes impractical when the number of features p is large (a common rule of thumb is above roughly 10,000). Gradient descent costs O(np) per iteration, which makes it the usual choice for large feature spaces, and its stochastic variants suit online learning, where data arrive one example at a time.
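For contrast with the closed form, here is a minimal batch gradient descent sketch in plain Python. The data, the learning rate, and the iteration count are all assumptions chosen so that this small example converges; they are not recommended defaults.

```python
# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
n = len(xs)

m, c = 0.0, 0.0          # start from the flat line y = 0
learning_rate = 0.05     # assumed step size, small enough to converge here

for _ in range(5000):
    # Residuals under the current line: one pass over the n data points.
    residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
    # Gradient of the mean squared error with respect to m and c.
    grad_m = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    grad_c = -2.0 / n * sum(residuals)
    m -= learning_rate * grad_m
    c -= learning_rate * grad_c

print(f"m = {m:.4f}, c = {c:.4f}")
```

On this data the loop recovers the generating line, m = 2 and c = 1, without ever building or inverting XᵀX.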

Why is linear regression unsuitable for binary classification, and what specific problems does logistic regression fix?

Linear regression predicts unbounded real values, so its outputs can fall below 0 or above 1 and cannot be read as probabilities, and its squared-error loss penalizes confident correct predictions (a score of 4 for a label of 1 counts as an error of 3). Logistic regression fixes this by applying the sigmoid to map any real score to (0,1) and optimizing log-loss, which is a proper scoring rule aligned with probability calibration.
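A short sketch of both fixes, using hypothetical raw scores. The helper names `sigmoid` and `log_loss` are illustrative, not taken from any library.

```python
import math

def sigmoid(z):
    """Map any real score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, p):
    """Negative log-likelihood of one binary label y under probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Hypothetical raw scores from a linear model w*x + b.
scores = [-3.0, 0.0, 4.0]
probs = [sigmoid(z) for z in scores]

# Squared error against the label 1 penalises the raw score 4.0 for
# overshooting, even though it is on the correct side of the boundary.
squared_error_on_raw = (1 - 4.0) ** 2
# Log-loss on the sigmoid output is small for the same confident score.
log_loss_on_prob = log_loss(1, sigmoid(4.0))

print(probs)
print(squared_error_on_raw, log_loss_on_prob)
```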

Simple Linear Regression — GATE DA

What you'll learn

The line model y = mx + c, and the through-origin model y = wx

The squared-error cost Σ(yᵢ − ŷᵢ)² and why we minimize it

The least-squares normal equations, and the closed form w = Σ(xᵢyᵢ)/Σ(xᵢ²) for a line through the origin

Solving a real GATE DA 2025 NAT: fit y = wx to three points

Last lesson ended on a sharp question: out of the infinitely many lines you could draw through a cloud of points, what makes one of them the best fit — and is there a formula that hands it to you? Here is the answer, and it starts with an instinct you already have.

Give a child a scatter of dots and a ruler, and they lay the ruler where it “looks right” — close to as many dots as it can manage. That instinct is the whole idea; we only have to make “looks right” precise. Once it is precise, the vague “close to most dots” becomes a single number to minimise, and minimising a single number is something a formula can do for us. It is also the model every data team reaches for first — the honest baseline a fancier model has to beat before anyone will trust it.

The model and the cost

A line carries the model y = mx + c, where m is the slope and c the intercept. Sometimes we force the line through the origin, y = wx, with a single weight w and no intercept at all. For each data point the line predicts ŷᵢ = m·xᵢ + c, and the gap between truth and prediction, yᵢ − ŷᵢ, is the residual.

Least squares pins down “looks right” by choosing the line that minimises the sum of squared residuals:

cost(m, c) = Σ (yᵢ − ŷᵢ)² = Σ (yᵢ − (m·xᵢ + c))²

Square every vertical gap, add them up, and pick the line that makes the total smallest.
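As a sketch, the cost is a one-line function of the candidate slope and intercept. The points below are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def sum_squared_residuals(points, m, c):
    """Cost of the line y = m*x + c on a list of (x, y) pairs."""
    return sum((y - (m * x + c)) ** 2 for x, y in points)

# Hypothetical points, chosen only to illustrate the calculation.
points = [(1, 2), (2, 3), (3, 5)]

cost_a = sum_squared_residuals(points, m=1.0, c=1.0)   # residuals 0, 0, 1
cost_b = sum_squared_residuals(points, m=1.5, c=0.0)   # residuals 0.5, 0, 0.5
print(cost_a, cost_b)
```

The second line misses two points instead of one, yet its cost is lower (0.5 against 1.0), because two small gaps squared add up to less than one large gap squared.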

Squaring does two jobs at once. It makes every error positive, so gaps above and below the line cannot cancel out, and it punishes a big miss far more than a small one. In the picture below each shaded square is one residual squared — literally a square whose side is the gap. The least-squares line is the one that makes the total shaded area smallest. Drag the endpoints to try to beat it, then press Solve to snap to the optimum.

[Interactive figure: "Drag the line to shrink the squares, then let OLS win." Each square's area is one squared residual, and OLS is the line that makes the total shaded area as small as possible.]

The normal equations

Setting the derivative of the cost with respect to each parameter to zero gives the normal equations — the conditions the best line must satisfy. For the through-origin model y = wx there is just one parameter, and the solution is beautifully compact:

w = Σ xᵢyᵢ / Σ xᵢ²

Through the origin, the optimal weight is the cross-term sum over the squared-x sum.
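The one-parameter derivation is short enough to write out in full. This is a sketch in the notation used so far, with J(w) as an assumed name for the cost:

```latex
\begin{aligned}
J(w) &= \sum_i (y_i - w x_i)^2 \\
\frac{dJ}{dw} &= -2 \sum_i x_i (y_i - w x_i) = 0 \\
\Rightarrow\quad w \sum_i x_i^2 &= \sum_i x_i y_i
\quad\Rightarrow\quad
w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}
\end{aligned}
```

The second derivative is 2 Σ xᵢ², which is positive whenever at least one xᵢ is non-zero, so this stationary point is a minimum.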

For the full line y = mx + c the normal equations give two formulas, but GATE’s 2025 question used the through-origin form above — so that is the one to keep at your fingertips.
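For reference, the two formulas for the full line are m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and c = ȳ − m·x̄, where x̄ and ȳ are the sample means. A minimal sketch, with `fit_line` as an illustrative helper name and hypothetical data:

```python
def fit_line(points):
    """Least-squares slope m and intercept c for y = m*x + c."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    m = sxy / sxx
    c = y_bar - m * x_bar
    return m, c

# Hypothetical points lying exactly on y = 2x + 1.
m, c = fit_line([(0, 1), (1, 3), (2, 5), (3, 7)])
print(m, c)
```

Because the points sit exactly on a line, the fit recovers it: the script prints `2.0 1.0`.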

How GATE asks this

The bread-and-butter form is a NAT: you are handed 3 to 5 points and asked for the slope (or a prediction) of the least-squares fit. GATE DA 2025 asked exactly this — fit y = wx to three points and report w. The recipe never changes: form Σ xᵢ yᵢ and Σ xᵢ², then divide. The occasional MCQ probes the concept instead — what least squares minimises, or what an outlier does to the line.
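Both flavours of question can be sketched in a few lines. The helper name `fit_through_origin` and the data are hypothetical: three points on y = 2x, then the same three plus one outlier.

```python
def fit_through_origin(points):
    """Least-squares weight w for y = w*x: sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)

# Hypothetical points lying exactly on y = 2x.
clean = [(1, 2), (2, 4), (3, 6)]
w_clean = fit_through_origin(clean)

# Add one outlier far above the trend and refit.
w_outlier = fit_through_origin(clean + [(4, 20)])

# A prediction is just w times the new x.
prediction = w_clean * 5
print(w_clean, w_outlier, prediction)
```

One outlier drags the slope from 2.0 to 3.6, which is the squared penalty at work: the line bends toward the big miss because shrinking that one gap reduces the cost most.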

Worked example — a real GATE DA 2025 NAT

Fit the model y = wx (through the origin) to the points (−1, 1), (2, −5), and (3, 5) by least squares. Find w. (Real GATE DA 2025 question.)

Build the two sums column by column, then divide:

point        x·y                 x²
(−1,  1)    (−1)(1)  = −1        (−1)² = 1
( 2, −5)    (2)(−5)  = −10        (2)² = 4
( 3,  5)    (3)(5)   =  15        (3)² = 9
            ─────────────        ──────────
   Σ x·y  = −1 − 10 + 15 = 4     Σ x² = 1 + 4 + 9 = 14

        Σ x·y       4
  w  =  ───────  =  ──  =  0.2857…  ≈  0.286
         Σ x²       14

The same arithmetic, written as a few lines of Python, lands on the same number:

```python
xs = [-1, 2, 3]
ys = [ 1, -5, 5]

sxy = sum(x*y for x, y in zip(xs, ys))   # Σ x·y
sxx = sum(x*x for x in xs)               # Σ x²
w   = sxy / sxx

print(f"Σ x·y = {sxy}")
print(f"Σ x²  = {sxx}")
print(f"w     = {w:.4f}")
```

```
Σ x·y = 4
Σ x²  = 14
w     = 0.2857
```

So w ≈ 0.286. Notice the fit passes through none of the three points — it balances all of them so that the squared vertical gaps are collectively smallest, exactly as the drag-the-line widget hinted.
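A quick sanity check, sketched under the same three points: nudging w away from 4/14 in either direction should only raise the cost. The step of 0.1 is an arbitrary choice for illustration.

```python
points = [(-1, 1), (2, -5), (3, 5)]

def cost(w):
    """Sum of squared residuals for the through-origin line y = w*x."""
    return sum((y - w * x) ** 2 for x, y in points)

w_star = 4 / 14
residuals = [y - w_star * x for x, y in points]

# Nudging w in either direction should only increase the cost.
print(f"cost(w*)        = {cost(w_star):.4f}")
print(f"cost(w* - 0.1)  = {cost(w_star - 0.1):.4f}")
print(f"cost(w* + 0.1)  = {cost(w_star + 0.1):.4f}")
```

The cost at the optimum is about 49.8571, and each nudge adds 14 × 0.1² = 0.14, giving about 49.9971 on both sides. The minimum cost is far from zero because no line through the origin can get close to all three points at once.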

In one breath

Linear regression fits the line y = mx + c (or y = wx through the origin) by least squares — choosing the line that minimises the sum of squared vertical residuals Σ(yᵢ − ŷᵢ)², since squaring both kills sign-cancellation and punishes big misses hardest; setting the cost’s derivative to zero gives the normal equations, which for the through-origin model collapse to the one-line formula w = Σ xᵢyᵢ / Σ xᵢ², and the squaring is also exactly what makes the fit sensitive to outliers.

Practice

Quick check

Q1 (Recall): In the model y = mx + c, what does the intercept c represent?

Q2 (Recall): Which statements about ordinary least-squares linear regression are TRUE? (select all that apply)

Q3 (Trace): A data point currently has a residual of −4 (the line over-predicts by 4). How much does this single point contribute to the squared-error cost? (numerical answer)

Q4 (Trace): Fit y = wx (through the origin) by least squares to the points (1, 3), (2, 4), (3, 8). Find w to 2 decimals. (numerical answer)

Q5 (Apply): For the real 2025 points (−1, 1), (2, −5), (3, 5), the least-squares through-origin slope is w = 4/14. What is w to 3 decimals? (numerical answer)

A question to carry forward

One feature, one slope, one tidy formula. But the house whose price we set out to predict never depended on size alone — it depends on location, on age, on the number of rooms, all at once. A single slope cannot carry all of that.

So the line has to grow: one weight per feature, the prediction becoming a weighted sum of many inputs. The hand-formula Σxy / Σx² will not stretch to cover it. Here is the thread onward: when a model has many features at once, is there still a single closed-form solution for the best weights — and what does the least-squares formula look like once the bookkeeping is done in matrices instead of scalars?
