Why is linear regression unsuitable for binary classification, and what specific problems does logistic regression fix?

Linear regression predicts unbounded real values, so it can output "probabilities" below 0 or above 1, and its squared-error loss penalizes predictions that are confidently correct: a score well beyond the target on the right side of the boundary still counts as error. Logistic regression fixes this by applying the sigmoid to map any real score to (0, 1) and optimizing log-loss, a proper scoring rule that rewards calibrated probabilities.

Explain the relationship between the sigmoid function, odds, and log-odds in logistic regression.

Logistic regression models log-odds as a linear function of the features. Exponentiating the coefficients gives odds ratios, and applying the sigmoid to the linear score converts it to a probability. These three representations are equivalent reformulations of the same model.
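A minimal sketch of how the three representations line up numerically. The weight, bias, and feature value here are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(z):
    """Map a log-odds score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: map a probability back to log-odds."""
    return math.log(p / (1.0 - p))

# Hypothetical one-feature model (illustrative values, not from the lesson).
w, b = 0.8, -1.0
x = 2.0

z = w * x + b          # log-odds: linear in the feature
odds = math.exp(z)     # odds = p / (1 - p)
p = sigmoid(z)         # probability of the positive class

# Raising x by one unit multiplies the odds by exp(w): the odds ratio.
odds_next = math.exp(w * (x + 1) + b)
odds_ratio = odds_next / odds

print(f"log-odds={z:.3f}  odds={odds:.3f}  p={p:.3f}  odds ratio={odds_ratio:.3f}")
```

Converting in either direction loses nothing: odds / (1 + odds) recovers p, and logit(p) recovers z.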

What loss function does logistic regression optimize, and why is it convex?

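Logistic regression minimizes binary cross-entropy (log-loss), which is the negative log-likelihood of the Bernoulli distribution given the sigmoid-transformed linear predictions. The Hessian of log-loss is positive semi-definite everywhere, so the loss is convex and any local minimum is a global minimum. The minimizer need not be unique or even finite: with collinear features or perfectly separable classes, the loss can keep shrinking as the weights grow unless regularization is added.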
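The convexity claim can be sketched as a short derivation. The notation is an assumption for illustration: n examples (xᵢ, yᵢ), predicted probabilities pᵢ, and an arbitrary direction v. It uses the identity σ′(z) = σ(z)(1 − σ(z)).

```latex
\begin{aligned}
L(w) &= -\frac{1}{n}\sum_{i=1}^{n}\bigl[\,y_i \log p_i + (1-y_i)\log(1-p_i)\bigr],
        \qquad p_i = \sigma(w^\top x_i + b),\\[4pt]
\nabla_w L &= \frac{1}{n}\sum_{i=1}^{n} (p_i - y_i)\,x_i,\\[4pt]
\nabla_w^2 L &= \frac{1}{n}\sum_{i=1}^{n} p_i(1-p_i)\,x_i x_i^\top,\\[4pt]
v^\top \bigl(\nabla_w^2 L\bigr)\, v &= \frac{1}{n}\sum_{i=1}^{n} p_i(1-p_i)\,(x_i^\top v)^2 \;\ge\; 0 .
\end{aligned}
```

Every term in the last line is non-negative because 0 < pᵢ < 1, which is what makes the Hessian positive semi-definite.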

What is log loss and why does it penalise confident wrong predictions more than uncertain ones?

Log loss (cross-entropy loss) measures how well a model's predicted probabilities match the true labels: it is the negative log of the probability assigned to the correct class, averaged over examples. It penalises confident wrong predictions severely because −log(p) grows without bound as the probability p given to the true class approaches zero. Predicting 0.99 for the wrong class leaves 0.01 for the true class and costs −ln(0.01) ≈ 4.61, about five times the −ln(0.4) ≈ 0.92 cost of predicting 0.6 for the wrong class. A perfect model achieves 0; a binary classifier that always predicts 0.5 achieves ln(2) ≈ 0.693.
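The penalties can be checked with a few lines of Python. This is an illustrative sketch; the helper name log_loss is chosen here and computes the loss for a single example.

```python
import math

def log_loss(y, p):
    """Log-loss for one example: y is the true label (0 or 1),
    p is the predicted probability of class 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# True label is 0 in both cases, so any p above 0.5 leans the wrong way.
confident_wrong = log_loss(0, 0.99)   # -ln(0.01)
mild_wrong = log_loss(0, 0.6)         # -ln(0.4)
coin_flip = log_loss(1, 0.5)          # ln(2), whatever the label

print(round(confident_wrong, 3))  # 4.605
print(round(mild_wrong, 3))       # 0.916
print(round(coin_flip, 3))        # 0.693
```

The loss is unbounded above: pushing the wrong-class probability from 0.99 towards 1 makes the penalty grow without limit.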

Logistic Regression — GATE DA

What you'll learn

Logistic regression is classification, not regression — the output is a probability in (0,1)

The sigmoid σ(z) = 1/(1 + e⁻ᶻ) maps the linear score z = wᵀx + b to a probability

The decision boundary is the line z = 0, where σ = 0.5 — linear in x

It is trained with log-loss / cross-entropy, not squared error

Last lesson left every metric waiting for a model to grade — a model that emits a class, or better, a score you can threshold, the very thing the ROC curve sweeps. We have built six regressors, and not one of them does that; each predicts a bare number on the open line. So we need to take that number and bend it into a probability of being positive, a value safely between 0 and 1. The model that does exactly this is, despite its name, the first true classifier — and the name is the trap.

Logistic regression is a classifier. It earns “regression” only because, under the hood, it first computes a plain linear score z = wᵀx + b — the same weighted sum of features you have used since the very first regression lesson — and then bends that score into a probability. Two stages, then: a familiar linear score, followed by a squashing function that turns any real z into a probability in (0, 1). That squasher is the sigmoid, and it is the whole reason the linear machinery now works for classification. It remains the default first classifier in industry — fast, interpretable, and the baseline a neural network has to justify beating.

The sigmoid turns a score into a probability

The score z = wᵀx + b can be any real number, large positive or large negative. We need a probability in (0, 1). The sigmoid delivers exactly that: σ(z) = 1 / (1 + e⁻ᶻ).

The sigmoid is an S-curve: large positive z → near 1, large negative z → near 0, and exactly 0.5 at z = 0.

Three facts GATE leans on, read straight off the curve:

As z grows large and positive, e⁻ᶻ → 0, so σ(z) → 1.
As z grows large and negative, e⁻ᶻ → ∞, so σ(z) → 0.
At z = 0, e⁻ᶻ = 1, so σ(0) = 1 / (1 + 1) = 0.5 exactly.
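The three facts above can be sketched in code. The branch on the sign of z is an implementation choice made here, not something the lesson prescribes: it avoids the overflow that a direct 1 / (1 + exp(−z)) would hit for very negative z.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid: never exponentiates a large positive number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # z < 0, so exp(z) lies in (0, 1)
    return ez / (1.0 + ez)

print(sigmoid(0))       # 0.5
print(sigmoid(1000))    # 1.0
print(sigmoid(-1000))   # 0.0
```

Mathematically σ(z) never reaches 0 or 1; the printed 1.0 and 0.0 are floating-point rounding at extreme scores.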

The output σ(z) is read as P(y = 1 | x) — the model’s estimated probability that the point belongs to the positive class. That is precisely the score the ROC curve from last lesson was sweeping.

The decision boundary is z = 0

To turn a probability into a class, threshold at 0.5: predict positive when σ(z) ≥ 0.5, negative otherwise. But σ(z) = 0.5 happens exactly when z = 0, so the decision boundary is the set of points where wᵀx + b = 0. That is a straight line — a hyperplane in higher dimensions — so the boundary is linear in x, even though the sigmoid mapping itself is curved.
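Because σ(z) ≥ 0.5 exactly when z ≥ 0, a prediction only needs the sign of the linear score. A small sketch follows, with hypothetical weights and points picked for illustration:

```python
def score(w, b, x):
    """Linear score z = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Thresholding the probability at 0.5 is the same as testing z >= 0."""
    return 1 if score(w, b, x) >= 0 else 0

# Hypothetical model: the boundary is the line 2*x1 - x2 + 1 = 0.
w, b = (2.0, -1.0), 1.0

for point in [(2.0, 1.0), (0.0, 3.0), (1.0, 3.0)]:
    print(point, score(w, b, point), predict(w, b, point))
```

The last point sits exactly on the line (z = 0, probability 0.5) and is labelled positive under the "≥ 0.5" rule.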

Drag the boundary below to separate the two classes by hand, then press Fit to watch the model find the separator for you:

[Interactive figure: decision boundary. Drag the line to separate class 0 from class 1, then let gradient descent fit it; the shading shows P(class 1).]

The line is where the model is 50/50. The shading is the sigmoid: points deep in a colour are confident, points near the line are uncertain. Classification is just finding the line that best splits the two classes.

The model is trained with log-loss (cross-entropy), −[y log p + (1 − y) log(1 − p)], not squared error. Log-loss punishes a confident wrong prediction far more harshly than squared error does: for p = 0.99 when the true label is 0, log-loss is −ln(0.01) ≈ 4.6, whereas squared error can never exceed 1. That is exactly what you want from a probability model, and unlike squared error applied to the sigmoid output, log-loss keeps the optimisation convex and well-behaved.
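Training can be sketched as plain batch gradient descent on the mean log-loss. This is a minimal illustration under stated assumptions, not the method any particular library uses: the one-feature toy data, learning rate, and step count are all made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_log_loss(w, b, xs, ys):
    """Average cross-entropy over the data set."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

def fit(xs, ys, lr=0.1, steps=5000):
    """Batch gradient descent; the gradient of log-loss is mean((p - y) * x)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y   # p - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical toy data: larger x tends to mean class 1, with one overlap.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

initial_loss = mean_log_loss(0.0, 0.0, xs, ys)   # ln(2): every p is 0.5
w, b = fit(xs, ys)
final_loss = mean_log_loss(w, b, xs, ys)

print(f"loss {initial_loss:.3f} -> {final_loss:.3f}, boundary at x = {-b / w:.2f}")
```

Starting from zero weights, every prediction is 0.5 and the loss is ln(2); each step then moves the weights against the gradient, and the fitted boundary is the point where w·x + b = 0.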

How GATE asks this

Usually an MCQ or NAT probing one of three things: evaluate the sigmoid at a given score (typically a NAT, with the relevant e value supplied), identify the decision boundary (the answer is the linear equation wᵀx + b = 0, never a curve), or name the loss (cross-entropy / log-loss, never mean squared error). A favourite distractor claims logistic regression outputs an unbounded continuous quantity like linear regression. It does not: it outputs a class probability confined to (0, 1).

Worked example — evaluate the sigmoid

A logistic model produces score z for a point. Find σ(z) for z = 0, z = 2, and z = −2. Use e⁻² ≈ 0.135. Which class is z = 2?

Apply σ(z) = 1 / (1 + e⁻ᶻ) term by term:

σ(0)  = 1 / (1 + e⁰)    = 1 / (1 + 1)     = 0.5      ← on the boundary
σ(2)  = 1 / (1 + e⁻²)   = 1 / (1 + 0.135) = 1/1.135  ≈ 0.881
σ(−2) = 1 / (1 + e²)    = 1 / (1 + 7.389) = 1/8.389  ≈ 0.119

A quick shortcut to check σ(−2): the sigmoid is symmetric about the point (0, 0.5), so σ(−z) = 1 − σ(z) and σ(−2) = 1 − 0.881 = 0.119. ✓ Equivalently, σ(2) + σ(−2) = 1.

Since σ(2) ≈ 0.881 > 0.5, the point with z = 2 is classified positive, with about 88% confidence. The point with z = −2 would be classified negative (only about a 12% chance of being positive).
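The worked example can be reproduced in a few lines. This sketch uses the exact exponential from Python's math module rather than the rounded value e⁻² ≈ 0.135, and the answers agree to three decimals.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Probability and predicted class for each score in the worked example.
results = {z: round(sigmoid(z), 3) for z in (0, 2, -2)}
for z, p in results.items():
    label = "positive" if p >= 0.5 else "negative"
    print(f"z = {z:>2}: sigma(z) = {p:.3f} -> {label}")

# Symmetry check: sigma(-z) = 1 - sigma(z).
symmetry_gap = abs(sigmoid(2) + sigmoid(-2) - 1.0)
```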

In one breath

Logistic regression is a classifier that computes the familiar linear score z = wᵀx + b, then squashes it through the sigmoid σ(z) = 1/(1 + e⁻ᶻ) into a probability P(y=1|x) in (0, 1) — near 1 for big positive z, near 0 for big negative z, exactly 0.5 at z = 0; thresholding at 0.5 puts the decision boundary at wᵀx + b = 0, a straight line linear in x, and the model is trained by minimising log-loss (cross-entropy), not squared error, because log-loss harshly penalises confident wrong probabilities.

Practice

Quick check

Q1. Recall: Which statements about logistic regression are TRUE? (select all that apply)

Q2. Recall: What is the score z that makes σ(z) = 0.5? (numerical answer)

Q3. Trace: A logistic regression model computes a score z = 2 for a sample. Given e⁻² ≈ 0.135, what probability σ(z) does it assign to the positive class? (numerical answer, 3 decimals)

Q4. Trace: A model outputs σ(z) = 0.881 for the positive class. By the sigmoid's symmetry σ(−z) = 1 − σ(z), what probability would a sample with score −z receive for the positive class? (numerical answer, 3 decimals)

Q5. Apply: For a logistic model with weights w = (1, 2) and bias b = −5, the decision boundary is the set of points (x₁, x₂) satisfying which equation?

Q6. Apply: Why is squared error a poor loss for logistic regression compared with log-loss?

A question to carry forward

Logistic regression hands us a real classifier at last — but look at the shape of what it learned: a single straight line, fixed once, drawn from all the data at once. That global commitment is its strength (interpretable, fast) and its cage. Hand it two classes coiled around each other like a spiral, and no single straight cut can separate them.

So swing to the opposite extreme. What if a classifier learned nothing in advance — no weights, no boundary, no training at all — and instead, to label a brand-new point, simply looked at the handful of training points sitting closest to it and took a vote? Here is the thread onward: can “you are like your neighbours” be a whole classification algorithm, what exactly does closest mean, and what new dial — the number of neighbours you poll — quietly slides you right back along the bias-variance curve?
