What loss function does logistic regression optimize, and why is it convex?

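Logistic regression minimizes binary cross-entropy (log-loss), which is the negative log-likelihood of Bernoulli labels whose success probability is the sigmoid of a linear score. Its Hessian with respect to the weights is proportional to Xᵀ diag(p(1−p)) X, which is positive semi-definite everywhere, so the loss is convex and every local minimum is global. The minimizer is unique only when the loss is strictly convex (for example, with full-column-rank features or added L2 regularization); on perfectly separable data no finite minimizer exists at all.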
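As a rough numerical check of the positive semi-definite claim, the sketch below builds the log-loss Hessian at several random weight vectors and looks at its smallest eigenvalue. The design matrix `X`, the helper name `log_loss_hessian`, and the sampling ranges are illustrative assumptions, not part of the lesson.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_hessian(w, X):
    """Hessian of the mean binary cross-entropy w.r.t. w: (1/n) X^T diag(p(1-p)) X."""
    p = sigmoid(X @ w)
    s = p * (1.0 - p)                  # per-sample curvature weight, between 0 and 0.25
    return (X.T * s) @ X / X.shape[0]  # note: the labels y do not appear in the Hessian

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))           # hypothetical data: 50 samples, 3 features

# Smallest Hessian eigenvalue at 20 randomly drawn weight vectors.
min_eigs = [
    np.linalg.eigvalsh(log_loss_hessian(rng.normal(scale=3.0, size=3), X)).min()
    for _ in range(20)
]
hessian_is_psd = all(e >= -1e-12 for e in min_eigs)
print("smallest eigenvalue seen:", min(min_eigs))
print("PSD at every sampled w?", hessian_is_psd)
```

Sampling cannot prove convexity, but it is a cheap sanity check that agrees with the algebraic argument.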

Why use cross-entropy loss instead of MSE for classification?

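MSE treats class probabilities as continuous values; its gradient with respect to the pre-activation carries a factor p(1−p), which becomes tiny when a sigmoid output is near 0 or 1, so learning stalls even when the prediction is badly wrong. Cross-entropy is the proper log-likelihood loss for categorical distributions; with a sigmoid or softmax output its gradient with respect to the logit is simply p − y, so gradients stay large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.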
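The difference is easy to see numerically. The following sketch, assuming a single sigmoid output with true label 1, compares the gradient of each loss with respect to the logit `z` as the prediction becomes more confidently wrong; the chosen `z` values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0                                        # true label
for z in [0.0, -2.0, -5.0, -10.0]:             # increasingly confident and wrong
    p = sigmoid(z)
    grad_mse = 2.0 * (p - y) * p * (1.0 - p)   # d/dz of (p - y)^2
    grad_ce = p - y                            # d/dz of -[y ln p + (1 - y) ln(1 - p)]
    print(f"z={z:6.1f}  p={p:.5f}  dMSE/dz={grad_mse:+.6f}  dCE/dz={grad_ce:+.6f}")
```

The MSE gradient shrinks toward zero as `z` becomes more negative, while the cross-entropy gradient approaches −1.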

What is log loss and why does it penalise confident wrong predictions more than uncertain ones?

Log loss (cross-entropy loss) measures how well a model's predicted probabilities match the true labels: it is the negative log of the probability assigned to the correct class. It penalises confident wrong predictions severely because −log(p) grows without bound as p approaches zero. Predicting 0.99 for the wrong class leaves 0.01 for the correct one and costs −ln(0.01) ≈ 4.61, about five times the −ln(0.4) ≈ 0.92 incurred by predicting 0.6 for the wrong class, and the penalty keeps growing as the confidence approaches 1. A perfect model achieves 0; a binary classifier that always predicts 0.5 achieves ln(2) ≈ 0.693.
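The per-example penalties can be computed directly. This is a small sketch; the helper name `log_loss_single` is hypothetical and takes the probability the model assigned to the true class.

```python
import math

def log_loss_single(p_correct):
    """Log loss for one example, given the probability assigned to the true class."""
    return -math.log(p_correct)

confident_wrong = log_loss_single(0.01)   # model put 0.99 on the wrong class
unsure_wrong = log_loss_single(0.40)      # model put 0.60 on the wrong class
coin_flip = log_loss_single(0.50)         # model always predicts 0.5

print(f"confident wrong: {confident_wrong:.3f}")   # 4.605
print(f"unsure wrong:    {unsure_wrong:.3f}")      # 0.916
print(f"coin flip:       {coin_flip:.3f}")         # 0.693
print(f"ratio:           {confident_wrong / unsure_wrong:.2f}")
```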

Why does sigmoid saturation cause vanishing gradients, and why is tanh only a partial fix?

Sigmoid's derivative peaks at 0.25 and approaches zero in both tails, so backpropagation multiplies in a factor of at most 0.25 per sigmoid layer and the gradient can shrink geometrically with depth. Tanh's derivative peaks at 1 and its output is zero-centered, which helps keep weight updates balanced, but it still saturates at large magnitudes, where its gradient also shrinks to near zero.
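A short sketch of both derivatives makes the saturation concrete. The depth of 10 is an arbitrary assumption, and the product shown ignores the weight matrices that also appear in the real chain rule, so it is only the activation-derivative part of the story.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

for z in [0.0, 2.0, 5.0]:
    print(f"z={z:3.1f}  sigmoid'={sigmoid_grad(z):.4f}  tanh'={tanh_grad(z):.4f}")

depth = 10                                 # hypothetical number of sigmoid layers
best_case_sigmoid = 0.25 ** depth          # every layer sitting exactly at its peak slope
print("best-case product of sigmoid slopes over", depth, "layers:", best_case_sigmoid)
```

Even in the best case the sigmoid factors alone shrink the gradient by about six orders of magnitude over ten layers, and tanh's slope at z = 5 is already well below 0.001.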

Convexity: why some losses are easy and others aren't — Math for ML

What you'll learn

What makes a set and a function convex (the chord-above-the-curve picture)

Why convexity is the dividing line: any local minimum is the global minimum

The Hessian test — convex iff the Hessian is positive semi-definite

Which ML losses are convex (linear, ridge, logistic, SVM) and which aren't (neural nets)

Why L2 regularization makes a problem more convex and better-behaved

The last lesson ended on a “what if”: what if the Hessian were positive in every direction at every point? Here is the payoff. When people say a loss is convex, they are making exactly that promise in plain words — every path downhill ends at the same bottom. That single guarantee is why linear and logistic regression train reliably from any starting point, while deep nets need careful initialization, learning-rate schedules, and a little luck.

What “convex” means

A convex set is one with no dents: the straight line between any two points in it stays inside. A convex function is shaped like a bowl — formally, the chord connecting any two points on the graph lies on or above the curve:

f(t·a + (1−t)·b) ≤ t·f(a) + (1−t)·f(b)     for all t in [0, 1]

The consequence is everything: a convex function has no local minima other than the global one. There is nowhere to get stuck.
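The chord inequality can be spot-checked by sampling. The sketch below draws random pairs of points and mixing weights and reports whether any chord dips below the curve; `violates_chord`, its interval, and its tolerance are illustrative choices. Finding no violation does not prove convexity, but a single violation disproves it.

```python
import numpy as np

def violates_chord(f, lo=-5.0, hi=5.0, trials=10_000, seed=0):
    """Return True if some sampled chord dips below the curve, so f is not convex."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(lo, hi, trials)
    b = rng.uniform(lo, hi, trials)
    t = rng.uniform(0.0, 1.0, trials)
    on_curve = f(t * a + (1 - t) * b)          # left-hand side of the inequality
    on_chord = t * f(a) + (1 - t) * f(b)       # right-hand side of the inequality
    return bool(np.any(on_curve > on_chord + 1e-9))

print("x^2 violates the chord test?   ", violates_chord(lambda x: x ** 2))  # False
print("sin(x) violates the chord test?", violates_chord(np.sin))            # True
```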

[Interactive figure: gradient descent run from a chosen starting x.]

Convex: every start lands at the bottom. Non-convex: the answer depends on where you began, so the same algorithm can get trapped in a local minimum and return a different, sometimes worse, result.
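The same experiment can be reproduced in a few lines. This is a minimal sketch with two hand-picked 1D functions (assumptions for illustration, not the curves used on the page): a convex bowl x² and a tilted double well x⁴ − 3x² + x.

```python
def gradient_descent(grad, x0, lr=0.01, steps=2000):
    """Plain fixed-step gradient descent on a 1D function, given its derivative."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def convex_grad(x):
    return 2.0 * x                       # derivative of f(x) = x^2

def nonconvex_grad(x):
    return 4.0 * x**3 - 6.0 * x + 1.0    # derivative of g(x) = x^4 - 3x^2 + x

for start in (-2.0, 2.0):
    end_convex = gradient_descent(convex_grad, start)
    end_nonconvex = gradient_descent(nonconvex_grad, start)
    print(f"start={start:+.1f}  convex ends at {end_convex:+.3f}  non-convex ends at {end_nonconvex:+.3f}")
```

Both runs on the bowl finish at 0. On the double well, the start at −2 settles near x ≈ −1.30 (the global minimum), while the start at +2 settles near x ≈ 1.13, a shallower local minimum.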

The Hessian test

You don’t have to eyeball it. A twice-differentiable function is convex exactly when its Hessian is positive semi-definite (all eigenvalues ≥ 0) everywhere. In 1D that’s just f″(x) ≥ 0 — the curve only ever bends upward.

import numpy as np

# Hessian of a quadratic ½xᵀAx is just A. Convex iff A is PSD (eigenvalues >= 0).
def is_convex(A):
    return np.all(np.linalg.eigvalsh(A) >= -1e-9)

bowl   = np.array([[3.0, 0.5], [0.5, 2.0]])   # eigenvalues > 0 -> convex
saddle = np.array([[1.0, 0.0], [0.0, -1.0]])  # one negative -> not convex

print("bowl   eigenvalues:", np.linalg.eigvalsh(bowl).round(2),  "convex?", is_convex(bowl))
print("saddle eigenvalues:", np.linalg.eigvalsh(saddle).round(2), "convex?", is_convex(saddle))

bowl   eigenvalues: [1.79 3.21] convex? True
saddle eigenvalues: [-1.  1.] convex? False

The bowl’s eigenvalues are both positive, so it curves up in every direction — convex. The saddle has one negative eigenvalue, a single direction that bends down, and that lone direction is enough to disqualify it. Convexity is unforgiving: it must hold along every direction at once.

Which losses are convex?

Convex (train reliably): linear regression (MSE), ridge & lasso, logistic regression, softmax/cross-entropy in the linear case, SVM hinge loss. For these, gradient descent with a suitable step size (subgradient or proximal steps for the non-smooth lasso and hinge terms), or a convex solver, reaches the global optimum.
Non-convex (the hard, interesting world): anything with a hidden layer. Composing layers makes the loss non-convex in the weights, so neural nets have many local minima and saddle points. We don’t find the global optimum; we find a good enough one, and remarkably, in high dimensions, good-enough is usually plenty.
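One way to see the non-convexity of a hidden layer is through hidden-unit symmetry: swapping two hidden units gives a different weight vector but the same function, and the midpoint between the two weight vectors is generally worse than both. The sketch below uses a hypothetical two-unit tanh network with hand-picked weights; if the loss were convex in the weights, the midpoint loss could not exceed the average of the endpoint losses.

```python
import numpy as np

def tiny_net(params, x):
    """One hidden layer with two tanh units: sum_j v_j * tanh(w_j * x)."""
    w, v = params[:2], params[2:]
    return np.tanh(np.outer(x, w)) @ v

def mse(params, x, y):
    return float(np.mean((tiny_net(params, x) - y) ** 2))

x = np.linspace(-2.0, 2.0, 41)
theta_a = np.array([2.0, -1.0, 1.0, 3.0])   # (w1, w2, v1, v2), hypothetical weights
theta_b = np.array([-1.0, 2.0, 3.0, 1.0])   # the same network with hidden units swapped
y = tiny_net(theta_a, x)                    # targets generated by theta_a itself

midpoint = 0.5 * (theta_a + theta_b)
loss_a, loss_b, loss_mid = mse(theta_a, x, y), mse(theta_b, x, y), mse(midpoint, x, y)
print("loss at theta_a: ", loss_a)
print("loss at theta_b: ", loss_b)
print("loss at midpoint:", loss_mid)
```

Both endpoints fit the data essentially perfectly, yet the midpoint has a clearly positive loss, which violates the chord inequality.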

In one breath

A function is convex when the chord between any two points on its graph lies on or above the curve: a single bowl in which every local minimum is global, so downhill always leads home. The one-line test: a twice-differentiable function is convex exactly when its Hessian is positive semi-definite (all eigenvalues ≥ 0) everywhere; in 1D that is just f″(x) ≥ 0. Linear/ridge/lasso regression, logistic regression, and the SVM hinge loss are convex, so a suitably tuned descent method or convex solver reaches the global optimum; anything with a hidden layer is not, so neural nets settle for a good-enough minimum among countless saddles. Adding λ‖w‖² (L2) lifts every Hessian eigenvalue by 2λ, which makes a convex loss strongly convex and better-conditioned.
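The eigenvalue shift from L2 regularization can be checked on a small quadratic. In this sketch the matrix `A` (a Hessian with one completely flat direction) and the strength `lam` are made-up values for illustration.

```python
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 1.0]])   # hypothetical Hessian with a flat direction (eigenvalues 0 and 2)
lam = 0.1                                # hypothetical regularization strength
A_reg = A + 2.0 * lam * np.eye(2)        # Hessian after adding lam * ||w||^2 to the loss

eig_before = np.linalg.eigvalsh(A)
eig_after = np.linalg.eigvalsh(A_reg)
print("before:", eig_before.round(3))
print("after: ", eig_after.round(3))
print("shift: ", (eig_after - eig_before).round(3))
```

Every eigenvalue moves up by 2λ = 0.2, so the flat direction gains positive curvature and the minimizer becomes unique.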

Practice

Quick check

Q1. A loss is convex. You run gradient descent twice from two very different starting points. What can you say about the results?

Q2. How do you test whether a twice-differentiable function is convex?

Q3. Why does adding λ‖w‖² (L2) make optimization better-behaved beyond reducing overfitting?

A question to carry forward

Convexity hands you a guarantee — but look hard at the situation it guarantees. Gradient descent, Newton, every method so far has been free to wander anywhere in parameter space, chasing the bottom with no fences in the way. Real problems are rarely so unbounded. You minimise a cost subject to a budget; you fit weights but keep them inside an L2 ball (which is exactly what ridge regularization is, seen from another angle); you estimate a probability distribution whose entries must sum to 1; you maximise variance along a direction constrained to unit length — that was PCA. The valley you want can sit outside the region you are allowed to enter.

So here is the thread onward, and it closes the calculus chapter: how do you find the lowest point of a function when you are not free to go anywhere — when the answer must lie on a constraint surface? The trick is quietly beautiful: at the constrained optimum, the function’s gradient lines up exactly with the constraint’s gradient, and the number that links them — the Lagrange multiplier — turns out to mean something concrete, the “price” of the constraint. What is that method, and how does a single multiplier convert a hard constrained problem back into an ordinary one you already know how to solve?
