Convexity & Single-Variable Optimization
A convex function is bowl-shaped: f''(x) ≥ 0 everywhere. The payoff — any local minimum is automatically global, and a strict bowl has at most one minimizer.
What you'll learn
- Convex = graph lies below every chord (bowl-shaped); for twice-differentiable f, this is exactly f''(x) ≥ 0 everywhere
- f'' > 0 everywhere ⇒ strictly convex; the payoff is that any local minimum is global
- A strictly convex function has at most one minimizer — but it need not have one at all
- Why convex loss functions are the well-behaved ones to optimize in ML
Before you start
So far, finding a minimum has meant solving f'(x) = 0 and praying it’s the right
one. Plenty of functions have several dips, and “local minimum” is the most you can
ever honestly claim. Convexity is the property that makes the worry disappear —
a convex function is one big bowl, so the first dip you find is the dip. There’s
nothing lower hiding elsewhere. It’s the reason so much of machine learning is built
on convex loss functions.
What convex means
Picture a bowl. A function is convex if its graph lies below (or on) any chord — pick two points on the curve, draw the straight segment between them, and the curve never pokes above that segment. A non-convex curve has wiggles: it rises above some chords and dips into several separate valleys.
The test you actually use: the second derivative
The chord definition is the meaning, but for a twice-differentiable function the working test is much simpler. The curve bends upward exactly when its slope is increasing, i.e. when the second derivative is non-negative:
f is convex ⇔ f''(x) ≥ 0 for all x
f''(x) > 0 for all x ⇒ f is strictly convex
So convexity is a statement about f'' everywhere, not at a single point. A
parabola x² has f'' = 2 > 0, so it is strictly convex. So is eˣ (f'' = eˣ > 0).
A line has f'' = 0, so it is convex but not strictly so.
Why convexity is the prize
Two consequences make convex functions the ones optimizers love:
- Any local minimum is a global minimum. With no rival valleys, the moment you find a dip you are done — there is nothing lower hiding elsewhere. Gradient descent cannot get trapped in a bad local minimum, because there are none.
- A strictly convex function has at most one minimizer. Two distinct lowest points would force a flat-or-bulging stretch between them, contradicting strict convexity. So the solution, if it exists, is unique.
This is exactly why linear regression (squared loss) and logistic regression have convex objectives: a single global optimum, reachable by simple descent.
Watch it happen. Drop the marker anywhere on the convex bowl below and run gradient descent — wherever you start, the path slides to the same point at the bottom. No “bad starts,” no second valleys to fall into. That’s the convexity payoff in one picture.
How GATE asks this
The signature question is an MSQ: “f is twice differentiable with f''(x) > 0
for all x — which of the following are always true?” You must tick the genuine
consequences (at most one stationary point; any local min is global) and reject
the tempting “f has a minimum,” which the eˣ counterexample kills. Occasionally it
appears as an MCQ asking for the single guaranteed property, or a NAT pinning down
the unique minimizer of a convex quadratic.
Worked example — a real GATE DA 2025 question
Let
fbe twice differentiable withf''(x) > 0for all realx. Which of the following statements are always true?
fis strictly convex.fhas at most one point wheref'(x) = 0.- Any local minimum of
fis a global minimum.fhas a minimum.
Take the verdicts one at a time.
- (1) TRUE.
f'' > 0everywhere is the definition of strictly convex. - (2) TRUE.
f'' > 0meansf'is strictly increasing, sof'can cross zero at most once. Hence at most one stationary point. - (3) TRUE. For a convex function every local minimum is global — no lower valley can exist elsewhere.
- (4) FALSE — the trap. Strict convexity does not force a minimum to exist. Take
f(x) = eˣ: heref'' = eˣ > 0for allx, so it satisfies the hypothesis, yetf' = eˣis never zero andfhas no minimum (it slides toward0asx → −∞without attaining it).
So the always-true statements are (1), (2), and (3); statement (4) is false. This is a real GATE DA 2025 MSQ — the “f has a minimum” option is precisely the distractor most students tick.
Quick check
Quick check
Practice this in an interview
All questionsLogistic regression minimizes binary cross-entropy (log-loss), which is the negative log-likelihood of the Bernoulli distribution given the sigmoid-transformed linear predictions. The Hessian of log-loss is positive semi-definite everywhere, guaranteeing a convex surface with a unique global minimum.
Sigmoid's derivative peaks at 0.25 and approaches zero in both tails, so the chain of gradient multiplications collapses exponentially in deep networks. Tanh's derivative peaks at 1 and is zero-centered, which helps weight update symmetry, but it still saturates at large magnitudes and the gradient still shrinks to near-zero in both tails.
Sigmoid squashes to (0,1) and saturates at extremes, causing vanishing gradients. Tanh is zero-centered but still saturates. ReLU avoids saturation for positive inputs and trains fast but can produce dead neurons. Leaky ReLU fixes dying neurons. GELU is smooth and probabilistic, now the default in most transformer architectures.
L1 adds the sum of absolute coefficient values to the loss, which drives some coefficients to exactly zero and performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks all weights proportionally but rarely zeroes any out. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.