Simple Linear Regression
Fit the best straight line through points by minimizing squared vertical errors: the least-squares solution you can compute by hand. A GATE DA 2025 NAT (numerical answer type question), solved step by step.
What you'll learn
- The line model y = mx + c, and the through-origin model y = wx
- The squared-error cost Σ(yᵢ − ŷᵢ)² and why we minimize it
- The least-squares normal equations, and the closed form w = Σ(xᵢyᵢ)/Σ(xᵢ²) for a line through the origin
- Solving a real GATE DA 2025 NAT: fit y = wx to three points
Before you start
Give a child a scatter of dots and a ruler, and they will lay the ruler where it “looks right” — close to as many dots as possible. Linear regression is that instinct made precise: find the single straight line whose total error against the points is as small as possible. The only thing we must pin down is what “smallest error” means. It is also the model every data team reaches for first — the honest baseline a fancier model has to beat before anyone trusts it.
The model and the cost
A line has the model y = mx + c, where m is the slope and c the
intercept. Sometimes we force the line through the origin, y = wx, with a
single weight w and no intercept. For each data point the line predicts
ŷᵢ = m·xᵢ + c, and the gap between truth and prediction,
yᵢ − ŷᵢ, is the residual.
Least squares chooses the line that minimizes the sum of squared residuals:
Σ(yᵢ − ŷᵢ)² = Σ(yᵢ − m·xᵢ − c)²
Squaring does two jobs: it makes every error non-negative (so gaps above and below the line cannot cancel out), and it punishes big misses far more than small ones. Geometrically, each squared residual is the area of a square whose side is the gap, and the least-squares line is the one that makes the total area of those squares as small as possible.
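To make the two jobs of squaring concrete, here is a minimal sketch of the cost as a function. The name `squared_error_cost` and the two data points are made up for illustration; they are not part of the original lesson.

```python
def squared_error_cost(points, m, c=0.0):
    """Sum of squared vertical gaps between each point and the line y = m*x + c."""
    total = 0.0
    for x, y in points:
        residual = y - (m * x + c)  # truth minus prediction
        total += residual ** 2      # squaring keeps every term non-negative
    return total


# Hypothetical data: against the flat line y = 2 the residuals are +1 and -1.
points = [(1, 3), (2, 1)]
raw_sum = sum(y - 2.0 for _, y in points)             # the raw gaps cancel
cost = squared_error_cost(points, m=0.0, c=2.0)       # the squared gaps do not

# A single miss of 3 costs as much as nine misses of 1.
big_miss = squared_error_cost([(0, 3)], m=0.0, c=0.0)
```

Here the raw residuals sum to zero even though the line misses both points, while the squared cost stays positive, which is the reason for squaring in the first place.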
The normal equations
Setting the derivative of the cost with respect to each parameter to zero gives the
normal equations, the conditions the best line must satisfy. For the
through-origin model y = wx there is just one parameter. Differentiating
Σ(yᵢ − w·xᵢ)² with respect to w gives −2·Σ xᵢ(yᵢ − w·xᵢ) = 0, and the solution is
beautifully compact:
w = Σ(xᵢyᵢ) / Σ(xᵢ²)
For the full line y = mx + c the normal equations give two formulas, but GATE’s
2025 question used the through-origin form above, so that is the one to have at your
fingertips.
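The lesson does not spell out the two formulas for the full line. The standard textbook result is m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and c = ȳ − m·x̄, where x̄ and ȳ are the means. A minimal sketch, with a hypothetical function name and made-up points that lie exactly on y = 2x + 1:

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept c for the model y = m*x + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Centered sums: how x and y vary together, and how x varies alone.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    m = sxy / sxx
    c = y_mean - m * x_mean
    return m, c


# Hypothetical data lying exactly on y = 2x + 1, so the fit should recover it.
m, c = fit_line([0, 1, 2], [1, 3, 5])
```

Because the points sit on a line, the fit returns slope 2 and intercept 1 with zero residuals; on noisy data the same two formulas give the best compromise.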
How GATE asks this
The bread-and-butter form is a NAT: you are handed 3 to 5 points and asked
for the slope (or a prediction) of the least-squares fit. GATE DA 2025 asked
exactly this — fit y = wx to three points and report w. The recipe never
changes: form Σ xᵢ yᵢ and Σ xᵢ², then divide.
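The recipe translates almost word for word into code. This is a sketch only: `fit_through_origin` is a hypothetical helper, and the four points are made up to show both a slope and a prediction.

```python
def fit_through_origin(xs, ys):
    """Least-squares weight w for y = w*x: form the two sums, then divide."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    if sum_xx == 0:
        raise ValueError("every x is zero, so w is undefined")
    return sum_xy / sum_xx


# Hypothetical data, not from any exam paper.
xs = [1, 2, 3, 4]
ys = [2, 3, 7, 8]
w = fit_through_origin(xs, ys)  # sum_xy = 61, sum_xx = 30
prediction_at_5 = w * 5         # a prediction is just w times the new x
```

If the question asks for a prediction instead of the slope, the only extra step is the final multiplication.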
The occasional MCQ probes the concept — what least squares minimizes, or what
happens with an outlier.
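The outlier question is easy to explore numerically. The sketch below uses made-up points (three on y = 2x, then the same set with one y value corrupted) and the through-origin formula; it is an illustration, not a GATE question.

```python
def fit_through_origin(xs, ys):
    """Least-squares weight w for y = w*x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


xs = [1, 2, 3]
clean_ys = [2, 4, 6]     # hypothetical points exactly on y = 2x
outlier_ys = [2, 4, 20]  # same points, but the last y is corrupted

w_clean = fit_through_origin(xs, clean_ys)      # 28 / 14
w_outlier = fit_through_origin(xs, outlier_ys)  # 70 / 14
```

One bad point drags the weight from 2 to 5, because its residual is squared and it also sits at the largest x, where it has the most leverage.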
Worked example — a real GATE DA 2025 NAT
Fit the model y = wx (through the origin) to the points (−1, 1), (2, −5), and (3, 5) by least squares. Find w. (Real GATE DA 2025 question.)
Build the two sums column by column, then divide:
point       x·y                  x²
(−1,  1)    (−1)(1)  = −1        (−1)² = 1
( 2, −5)    (2)(−5)  = −10       (2)²  = 4
( 3,  5)    (3)(5)   = 15        (3)²  = 9
Σ x·y = −1 − 10 + 15 = 4
Σ x²  = 1 + 4 + 9 = 14
w = Σ x·y / Σ x² = 4 / 14 = 0.2857… ≈ 0.286
So w ≈ 0.286. Notice the fit does not pass through any single point —
it balances all three so the squared vertical gaps are collectively smallest.
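A short check of this answer, using exact fractions so no rounding creeps in. The variable and function names are illustrative choices; the three points are the ones from the question.

```python
from fractions import Fraction

points = [(-1, 1), (2, -5), (3, 5)]

sum_xy = sum(x * y for x, y in points)  # -1 - 10 + 15
sum_xx = sum(x * x for x, _ in points)  # 1 + 4 + 9
w = Fraction(sum_xy, sum_xx)            # exact ratio, reduced to lowest terms

# Residuals at the optimum: none is zero, so the line hits no point exactly.
residuals = [y - w * x for x, y in points]


def cost(weight):
    """Sum of squared residuals for the through-origin line y = weight * x."""
    return sum((y - weight * x) ** 2 for x, y in points)


# Nudging w in either direction should only make the cost worse.
step = Fraction(1, 100)
is_minimum = cost(w) < cost(w + step) and cost(w) < cost(w - step)
```

The exact weight is 2/7, which rounds to 0.286, and the residuals weighted by x sum to zero, which is the normal equation for this model.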
Quick check
Practice this in an interview
OLS minimizes the sum of squared residuals. Setting the gradient of the loss to zero yields the normal equations XᵀXβ = Xᵀy. When X has full column rank the solution is unique, β = (XᵀX)⁻¹Xᵀy, and the fitted values Xβ are the orthogonal projection of y onto the column space of X. The hat matrix is that projection itself, H = X(XᵀX)⁻¹Xᵀ.
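The projection claim can be checked by hand for a design matrix with two columns (a column of ones plus x). The sketch below solves the 2×2 normal equations with Cramer's rule in plain Python; the function name and the data are made up for illustration, and a real project would more likely call a linear algebra library.

```python
def ols_two_columns(X, y):
    """Solve (X^T X) beta = X^T y for a design matrix with exactly two columns."""
    # Entries of the symmetric 2x2 matrix X^T X.
    a = sum(row[0] * row[0] for row in X)
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X)
    # Entries of the vector X^T y.
    p = sum(row[0] * yi for row, yi in zip(X, y))
    q = sum(row[1] * yi for row, yi in zip(X, y))
    det = a * d - b * b
    if det == 0:
        raise ValueError("X^T X is singular: the columns are linearly dependent")
    # Cramer's rule for [[a, b], [b, d]] @ [beta0, beta1] = [p, q].
    beta0 = (p * d - b * q) / det
    beta1 = (a * q - b * p) / det
    return beta0, beta1


# Hypothetical data: intercept column of ones, then x.
xs = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0, 5.0]
X = [[1.0, x] for x in xs]

beta0, beta1 = ols_two_columns(X, y)
residuals = [yi - (beta0 * row[0] + beta1 * row[1]) for row, yi in zip(X, y)]

# Projection property: the residual is orthogonal to every column of X.
dot_with_ones = sum(row[0] * r for row, r in zip(X, residuals))
dot_with_x = sum(row[1] * r for row, r in zip(X, residuals))
```

Both dot products come out as zero up to floating-point error, which is exactly what it means for Xβ to be the projection of y onto the column space of X.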
OLS linear regression rests on five assumptions: linearity, independence of errors, homoscedasticity, normality of residuals, and no perfect multicollinearity. Violating any one of them degrades coefficient estimates, standard errors, or the validity of hypothesis tests.
The normal equation gives an exact closed-form solution, but forming XᵀX costs O(np²) and solving the resulting system costs O(p³), so it becomes impractical when the number of features p is large (typically above ~10,000). Gradient descent scales as O(np) per iteration, which makes it the usual choice for large feature spaces or online learning.
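On a one-parameter problem both routes are cheap, which makes it a convenient place to see them agree. The sketch below runs plain gradient descent on the through-origin cost for the three GATE points; the learning rate and step count are assumptions picked so that this small example converges, not recommended defaults.

```python
def gradient_descent_through_origin(points, learning_rate=0.01, steps=500):
    """Minimize sum((y - w*x)^2) over w by repeated gradient steps."""
    w = 0.0
    for _ in range(steps):
        # d/dw of sum((y - w*x)^2) is -2 * sum(x * (y - w*x)).
        gradient = -2.0 * sum(x * (y - w * x) for x, y in points)
        w -= learning_rate * gradient
    return w


points = [(-1, 1), (2, -5), (3, 5)]
w_iterative = gradient_descent_through_origin(points)
w_closed_form = 4 / 14  # sum of x*y divided by sum of x squared
```

Each step here costs one pass over the data, and the iterate settles on the same value as the closed form. With a learning rate that is too large for the data, the updates would diverge instead.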
Linear regression predicts unbounded real values, so it can output "probabilities" below 0 or above 1, and its squared-error loss penalizes confidently correct predictions (a score of 3 for a label of 1 counts as an error of 2). Logistic regression fixes this by applying the sigmoid to map any real score to (0, 1) and optimizing log-loss, which is a proper scoring rule aligned with probability calibration.
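A small sketch of the first half of that answer: fit an ordinary least-squares line to 0/1 labels and look at its predictions away from the training range. The data and function names are made up, and the sigmoid is applied only to show the squashing; this does not fit a logistic regression.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_mean - m * x_mean


def sigmoid(z):
    """Map any real score into the open interval (0, 1), avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical binary labels: class 0 for small x, class 1 for large x.
xs = [0, 1, 2, 3]
labels = [0, 0, 1, 1]
m, c = fit_line(xs, labels)

score_high = m * 5 + c    # linear output at x = 5
score_low = m * (-1) + c  # linear output at x = -1
squashed_high = sigmoid(score_high)
squashed_low = sigmoid(score_low)
```

The linear outputs land above 1 and below 0, so they cannot be read as probabilities, while the sigmoid of any score stays strictly between 0 and 1.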