Convexity: why some losses are easy and others aren't
"Convex loss" is the quiet reason classical ML just works and deep nets need babysitting. Convexity is the guarantee that downhill always leads home — and you can test for it with the Hessian.
What you'll learn
- What makes a set and a function convex (the chord-above-the-curve picture)
- Why convexity is the dividing line: any local minimum is the global minimum
- The Hessian test — convex iff the Hessian is positive semi-definite
- Which ML losses are convex (linear, ridge, logistic, SVM) and which aren't (neural nets)
- Why L2 regularization makes a problem more convex and better-behaved
When people say a loss is convex, they’re making a promise: every path downhill ends at the same bottom. That one guarantee is why linear and logistic regression train reliably from any starting point, while deep nets need careful initialization, learning-rate schedules, and a bit of luck.
What “convex” means
A convex set is one with no dents: the straight line between any two points in it stays inside. A convex function is shaped like a bowl — formally, the chord connecting any two points on the graph lies on or above the curve:
f(t·a + (1−t)·b) ≤ t·f(a) + (1−t)·f(b) for all t in [0, 1]
The consequence is everything: a convex function has no local minima other than the global one. There is nowhere to get stuck.
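The chord inequality above can be spot-checked numerically. The sketch below is only an illustration: the helper name `chord_violations`, the sampling interval, and the tolerance are assumptions made for this example. Random sampling can expose a violation, but it cannot prove that a function is convex.

```python
import numpy as np

def chord_violations(f, lo, hi, trials=10_000, seed=0):
    """Count sampled (a, b, t) where f(t*a + (1-t)*b) > t*f(a) + (1-t)*f(b)."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(lo, hi, trials)
    b = rng.uniform(lo, hi, trials)
    t = rng.uniform(0.0, 1.0, trials)
    lhs = f(t * a + (1 - t) * b)      # value of the curve at the blended point
    rhs = t * f(a) + (1 - t) * f(b)   # height of the chord at the same blend
    # Small tolerance so floating-point rounding is not counted as a violation.
    return int(np.sum(lhs > rhs + 1e-9))

convex_violations = chord_violations(np.square, -3.0, 3.0)   # x^2 is convex
nonconvex_violations = chord_violations(np.sin, -3.0, 3.0)   # sin is not
print(convex_violations, nonconvex_violations)
```

For x² no sampled triple breaks the inequality, while sin(x) breaks it many times on the part of the interval where it curves downward.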
Convex: every start lands at the bottom. Non-convex: the answer depends on where you began — the same algorithm gives different, sometimes worse, results.
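A minimal sketch of that contrast, assuming two hand-picked 1D functions (they are illustrative choices, not taken from the lesson): plain gradient descent is run from four starting points on a convex bowl and on a non-convex quartic with two valleys.

```python
def gradient_descent(grad, x0, lr=0.01, steps=5000):
    """Plain fixed-step gradient descent in one dimension."""
    x = float(x0)
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def convex_grad(x):
    # Gradient of f(x) = (x - 2)^2, a convex bowl with its minimum at x = 2.
    return 2.0 * (x - 2.0)

def nonconvex_grad(x):
    # Gradient of g(x) = x^4 - 3x^2 + x, which has two separate valleys.
    return 4.0 * x**3 - 6.0 * x + 1.0

starts = [-2.0, -0.5, 0.5, 2.0]
convex_ends = [round(gradient_descent(convex_grad, s), 3) for s in starts]
nonconvex_ends = [round(gradient_descent(nonconvex_grad, s), 3) for s in starts]
print(convex_ends)     # one shared end point
print(nonconvex_ends)  # two different end points, depending on the start
```

On the bowl every start ends at the same point. On the quartic the two negative starts settle in the left valley and the two positive starts in the right one, which is the shallower of the two.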
The Hessian test
You don’t have to eyeball it. A twice-differentiable function on a convex domain is convex exactly when its Hessian is positive semi-definite (all eigenvalues ≥ 0) at every point of that domain. In 1D that’s just f″(x) ≥ 0: the curve only ever bends upward.
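As a rough sketch of the test, the snippet below approximates a Hessian with central finite differences and looks at its smallest eigenvalue. The helper names and the two example functions are assumptions for illustration. Checking one point can rule convexity out, but it does not establish it everywhere.

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-4):
    """Approximate the Hessian of f at x with central finite differences."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n)
            e_j = np.zeros(n)
            e_i[i] = eps
            e_j[j] = eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

def min_hessian_eigenvalue(f, x):
    H = numerical_hessian(f, x)
    # Symmetrize to remove tiny rounding asymmetries before eigvalsh.
    return float(np.linalg.eigvalsh((H + H.T) / 2).min())

def bowl(v):
    return v[0]**2 + v[0] * v[1] + v[1]**2   # exact Hessian [[2, 1], [1, 2]]

def saddle(v):
    return v[0]**2 - v[1]**2                 # exact Hessian [[2, 0], [0, -2]]

point = [0.3, -1.2]
bowl_min_eig = min_hessian_eigenvalue(bowl, point)      # close to 1: passes
saddle_min_eig = min_hessian_eigenvalue(saddle, point)  # close to -2: fails
print(bowl_min_eig, saddle_min_eig)
```

A negative smallest eigenvalue at any point means the function bends downward in some direction there, so it cannot be convex.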
Which losses are convex?
- Convex (train reliably): linear regression (MSE), ridge & lasso, logistic regression, softmax/cross-entropy in the linear case, SVM hinge loss. For these, gradient descent with a suitable step size (or a subgradient or proximal method for the non-smooth lasso and hinge losses) or a convex solver converges to the global optimum.
- Non-convex (the hard, interesting world): anything with a hidden layer. Stacking non-linear activations destroys convexity, so neural nets have countless local minima and saddles. We don’t find the global optimum; we find a good enough one — and remarkably, in high dimensions, good-enough is usually plenty.
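For the convex losses in the list above, the Hessian can often be written down directly. The sketch below does this for MSE and ridge, assuming the loss is scaled as (1/n)·‖Xw − y‖² plus λ·‖w‖². The synthetic data, the deliberately redundant feature, and the value of `lam` are assumptions for illustration. It also shows one way to read the claim that L2 regularization makes a problem "more convex".

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
# The second column is an exact multiple of the first, so the features are redundant.
X = np.column_stack([x1, 2.0 * x1, rng.normal(size=n)])

lam = 0.1  # hypothetical regularization strength

# Hessian of (1/n) * ||Xw - y||^2 with respect to w; it does not depend on y.
H_mse = 2.0 / n * X.T @ X
# Adding lam * ||w||^2 shifts every eigenvalue up by 2 * lam.
H_ridge = H_mse + 2.0 * lam * np.eye(X.shape[1])

min_eig_mse = float(np.linalg.eigvalsh(H_mse).min())      # about 0: a flat direction
min_eig_ridge = float(np.linalg.eigvalsh(H_ridge).min())  # about 2 * lam
print(min_eig_mse, min_eig_ridge)
```

With redundant features the MSE surface is convex but has a flat valley, so many weight vectors are equally good. The ridge term lifts the smallest eigenvalue to 2λ, which removes the flat direction and leaves a single minimizer.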
Practice this in an interview
Logistic regression minimizes binary cross-entropy (log-loss), which is the negative log-likelihood of the Bernoulli distribution given the sigmoid-transformed linear predictions. The Hessian of log-loss is positive semi-definite everywhere, guaranteeing a convex surface on which any local minimum is global. The minimizer is unique when the features are linearly independent and the classes are not perfectly separable, or when L2 regularization is added.
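A small numerical check of that Hessian claim, assuming the mean log-loss of a linear model with sigmoid output, whose Hessian with respect to the weights is (1/n)·Xᵀ·diag(p(1−p))·X. The random data and the function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_hessian(X, w):
    """Hessian of mean binary cross-entropy w.r.t. w: (1/n) X^T diag(p(1-p)) X."""
    p = sigmoid(X @ w)
    s = p * (1.0 - p)                 # per-example curvature, always in [0, 0.25]
    return (X * s[:, None]).T @ X / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Evaluate the smallest Hessian eigenvalue at 20 random weight vectors.
min_eigs = [
    float(np.linalg.eigvalsh(log_loss_hessian(X, rng.normal(scale=3.0, size=3))).min())
    for _ in range(20)
]
smallest = min(min_eigs)
print(smallest)  # never meaningfully below zero
```

The labels do not appear in the Hessian at all, and every weight p(1−p) is non-negative, which is why the matrix is positive semi-definite for any weights.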
MSE treats class probabilities as continuous values and produces tiny, saturating gradients when a sigmoid output is near 0 or 1, stalling learning. Cross-entropy is the proper log-likelihood loss for categorical distributions; it keeps gradients large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.
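The saturation effect is easy to see for a single example. The sketch below assumes one sigmoid output with true label 1 and a logit of −10 (a confidently wrong prediction), and compares the gradient with respect to the logit under each loss. The function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_cross_entropy(z, y):
    # d/dz of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z) simplifies to p - y.
    return sigmoid(z) - y

def grad_mse(z, y):
    # d/dz of (p - y)^2 with p = sigmoid(z); the p*(1-p) factor causes saturation.
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

z, y = -10.0, 1.0                   # confidently wrong: p is about 4.5e-5
ce_grad = grad_cross_entropy(z, y)  # close to -1
mse_grad = grad_mse(z, y)           # on the order of -1e-4
print(ce_grad, mse_grad)
```

Under cross-entropy the sigmoid's derivative cancels out of the gradient, so the update stays proportional to the error. Under MSE the extra p(1−p) factor shrinks the gradient by several orders of magnitude in this example.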
Log loss (cross-entropy loss) measures how well a model's predicted probabilities match the true labels: it is the negative log-likelihood of the correct class. It penalises confident wrong predictions severely because log(p) approaches negative infinity as the probability p assigned to the true class approaches zero. In a binary problem, predicting 0.99 for the wrong class (p = 0.01, loss ≈ 4.61) incurs roughly 5x the penalty of predicting 0.6 for the wrong class (p = 0.4, loss ≈ 0.92). A perfect model achieves 0; a binary classifier that always predicts 0.5 achieves ln(2) ≈ 0.693.
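The per-example penalties can be worked out directly. This sketch assumes a binary problem, so predicting probability q for the wrong class leaves 1 − q for the true class. The variable names are illustrative.

```python
import math

def log_loss(p_true_class):
    """Per-example log loss given the probability assigned to the true class."""
    return -math.log(p_true_class)

loss_confident_wrong = log_loss(1 - 0.99)  # 0.99 on the wrong class
loss_mild_wrong = log_loss(1 - 0.6)        # 0.6 on the wrong class
loss_coin_flip = log_loss(0.5)             # ln(2)
penalty_ratio = loss_confident_wrong / loss_mild_wrong

# The loss grows without bound as the true-class probability shrinks.
for p in (0.5, 0.1, 0.01, 0.001):
    print(p, round(log_loss(p), 3))
print(round(penalty_ratio, 2))
```

Each tenfold drop in the true-class probability adds ln(10) ≈ 2.3 to the loss, so the penalty grows steadily and without bound.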
Sigmoid's derivative peaks at 0.25 and approaches zero in both tails, so the product of per-layer derivatives in backpropagation can shrink exponentially with depth. Tanh's derivative peaks at 1 and its output is zero-centered, which helps keep weight updates balanced, but it still saturates at large magnitudes, where the gradient again shrinks to near zero.
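The derivative peaks can be confirmed on a grid. This is a simplified sketch: the ten-layer figure ignores the weight matrices and counts only the activation-derivative factor, so it illustrates the best case for sigmoid rather than modelling a real network.

```python
import numpy as np

z = np.linspace(-6.0, 6.0, 1201)   # grid with 0.01 spacing that includes zero

sig = 1.0 / (1.0 + np.exp(-z))
d_sigmoid = sig * (1.0 - sig)      # derivative of sigmoid
d_tanh = 1.0 - np.tanh(z) ** 2     # derivative of tanh

max_d_sigmoid = float(d_sigmoid.max())  # 0.25, reached at z = 0
max_d_tanh = float(d_tanh.max())        # 1.0, reached at z = 0

# Best-case activation factor after 10 sigmoid layers (weights ignored).
best_case_10_layers = 0.25 ** 10        # about 9.5e-7
print(max_d_sigmoid, max_d_tanh, best_case_10_layers)
print(float(d_sigmoid[0]), float(d_tanh[0]))  # both are tiny at z = -6
```

Even at its peak, each sigmoid layer scales the backpropagated signal by at most a quarter. Tanh avoids that ceiling near zero but is close to flat once inputs are a few units away from it.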