datarekha

Convexity & Single-Variable Optimization

A convex function is bowl-shaped: f''(x) ≥ 0 everywhere. The payoff — any local minimum is automatically global, and a strict bowl has at most one minimizer.

8 min read Advanced GATE DA Lesson 45 of 122

What you'll learn

  • Convex = graph lies below every chord (bowl-shaped); for twice-differentiable f, this is exactly f''(x) ≥ 0 everywhere
  • f'' > 0 everywhere ⇒ strictly convex; the payoff is that any local minimum is global
  • A strictly convex function has at most one minimizer — but it need not have one at all
  • Why convex loss functions are the well-behaved ones to optimize in ML

Before you start

So far, finding a minimum has meant solving f'(x) = 0 and praying it’s the right one. Plenty of functions have several dips, and “local minimum” is the most you can ever honestly claim. Convexity is the property that makes the worry disappear — a convex function is one big bowl, so the first dip you find is the dip. There’s nothing lower hiding elsewhere. It’s the reason so much of machine learning is built on convex loss functions.

What convex means

Picture a bowl. A function is convex if its graph lies below (or on) any chord — pick two points on the curve, draw the straight segment between them, and the curve never pokes above that segment. A non-convex curve has wiggles: it rises above some chords and dips into several separate valleys.

convex (bowl)chord stays above curveone minimumnon-convexcurve rises above chordtwo local minima
Convex: every chord sits above the curve, so there is a single valley. Non-convex: chords get crossed, and separate valleys appear.

The test you actually use: the second derivative

The chord definition is the meaning, but for a twice-differentiable function the working test is much simpler. The curve bends upward exactly when its slope is increasing, i.e. when the second derivative is non-negative:

f is convex  ⇔  f''(x) ≥ 0  for all x

f''(x) > 0 for all x   ⇒   f is strictly convex

So convexity is a statement about f'' everywhere, not at a single point. A parabola has f'' = 2 > 0, so it is strictly convex. So is (f'' = eˣ > 0). A line has f'' = 0, so it is convex but not strictly so.

Why convexity is the prize

Two consequences make convex functions the ones optimizers love:

  • Any local minimum is a global minimum. With no rival valleys, the moment you find a dip you are done — there is nothing lower hiding elsewhere. Gradient descent cannot get trapped in a bad local minimum, because there are none.
  • A strictly convex function has at most one minimizer. Two distinct lowest points would force a flat-or-bulging stretch between them, contradicting strict convexity. So the solution, if it exists, is unique.

This is exactly why linear regression (squared loss) and logistic regression have convex objectives: a single global optimum, reachable by simple descent.

Watch it happen. Drop the marker anywhere on the convex bowl below and run gradient descent — wherever you start, the path slides to the same point at the bottom. No “bad starts,” no second valleys to fall into. That’s the convexity payoff in one picture.

How GATE asks this

The signature question is an MSQ: “f is twice differentiable with f''(x) > 0 for all x — which of the following are always true?” You must tick the genuine consequences (at most one stationary point; any local min is global) and reject the tempting “f has a minimum,” which the counterexample kills. Occasionally it appears as an MCQ asking for the single guaranteed property, or a NAT pinning down the unique minimizer of a convex quadratic.

Worked example — a real GATE DA 2025 question

Let f be twice differentiable with f''(x) > 0 for all real x. Which of the following statements are always true?

  1. f is strictly convex.
  2. f has at most one point where f'(x) = 0.
  3. Any local minimum of f is a global minimum.
  4. f has a minimum.

Take the verdicts one at a time.

  • (1) TRUE. f'' > 0 everywhere is the definition of strictly convex.
  • (2) TRUE. f'' > 0 means f' is strictly increasing, so f' can cross zero at most once. Hence at most one stationary point.
  • (3) TRUE. For a convex function every local minimum is global — no lower valley can exist elsewhere.
  • (4) FALSE — the trap. Strict convexity does not force a minimum to exist. Take f(x) = eˣ: here f'' = eˣ > 0 for all x, so it satisfies the hypothesis, yet f' = eˣ is never zero and f has no minimum (it slides toward 0 as x → −∞ without attaining it).

So the always-true statements are (1), (2), and (3); statement (4) is false. This is a real GATE DA 2025 MSQ — the “f has a minimum” option is precisely the distractor most students tick.

Quick check

Quick check

0/6
Q1Which of these functions are convex on the whole real line? (select all that apply)select all that apply
Q2Find the value of x at which the convex function f(x) = x² − 4x + 7 attains its minimum. (NAT)numerical answer — type a number
Q3For the same f(x) = x² − 4x + 7, what is the minimum VALUE of f? (NAT)numerical answer — type a number
Q4Let f be twice differentiable with f''(x) > 0 for all x. Which statements are ALWAYS true? (select all that apply)select all that apply
Q5A student claims: 'My loss function has f'' > 0 everywhere, so gradient descent is guaranteed to converge to a minimum.' What is the flaw?
Q6Which statements about a CONVEX (not necessarily strictly convex) function are always true? (select all that apply)select all that apply

Practice this in an interview

All questions
What loss function does logistic regression optimize, and why is it convex?

Logistic regression minimizes binary cross-entropy (log-loss), which is the negative log-likelihood of the Bernoulli distribution given the sigmoid-transformed linear predictions. The Hessian of log-loss is positive semi-definite everywhere, guaranteeing a convex surface with a unique global minimum.

Why does sigmoid saturation cause vanishing gradients, and why is tanh only a partial fix?

Sigmoid's derivative peaks at 0.25 and approaches zero in both tails, so the chain of gradient multiplications collapses exponentially in deep networks. Tanh's derivative peaks at 1 and is zero-centered, which helps weight update symmetry, but it still saturates at large magnitudes and the gradient still shrinks to near-zero in both tails.

Compare sigmoid, tanh, ReLU, leaky ReLU, and GELU — when would you pick each?

Sigmoid squashes to (0,1) and saturates at extremes, causing vanishing gradients. Tanh is zero-centered but still saturates. ReLU avoids saturation for positive inputs and trains fast but can produce dead neurons. Leaky ReLU fixes dying neurons. GELU is smooth and probabilistic, now the default in most transformer architectures.

What is the fundamental difference between L1 (Lasso) and L2 (Ridge) regularization, and when do you choose each?

L1 adds the sum of absolute coefficient values to the loss, which drives some coefficients to exactly zero and performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks all weights proportionally but rarely zeroes any out. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content