Why does sigmoid saturation cause vanishing gradients, and why is tanh only a partial fix?

Sigmoid's derivative peaks at 0.25 and approaches zero in both tails, so backpropagation through many sigmoid layers multiplies factors of at most 0.25 and the gradient shrinks exponentially with depth. Tanh's derivative peaks at 1 and its output is zero-centered, which helps weight-update symmetry, but tanh still saturates: for inputs of large magnitude its gradient also shrinks to near zero.
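A minimal sketch of those numbers, assuming plain NumPy and a handful of hypothetical input values chosen only for illustration. It evaluates both derivatives at the peak and out in the tails, then computes the best-case activation factor for ten stacked sigmoid layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    # Derivative of sigmoid: s * (1 - s), largest at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    # Derivative of tanh: 1 - tanh(z)^2, largest at z = 0
    return 1.0 - np.tanh(z) ** 2

zs = np.array([0.0, 2.0, 5.0, 10.0])   # illustrative inputs, not from the lesson
sig_grads = d_sigmoid(zs)
tanh_grads = d_tanh(zs)
for z, ds, dt in zip(zs, sig_grads, tanh_grads):
    print(f"z={z:5.1f}  sigmoid'={ds:.2e}  tanh'={dt:.2e}")

# Best case for sigmoid: every layer sits exactly at its peak derivative of 0.25
best_case_10_layers = 0.25 ** 10
print("largest possible activation factor after 10 sigmoid layers:", best_case_10_layers)
```

Even in the best case the sigmoid factor falls below one in a million after ten layers, and both derivatives are far smaller once the inputs drift into the tails.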

What loss function does logistic regression optimize, and why is it convex?

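Logistic regression minimizes binary cross-entropy (log-loss), which is the negative log-likelihood of a Bernoulli distribution whose parameter is the sigmoid of the linear prediction. The Hessian of log-loss is positive semi-definite everywhere, which guarantees a convex surface, so any local minimum is a global minimum. The minimizer is unique only when the Hessian is positive definite (for example, with a full-column-rank design matrix or added L2 regularization), and on perfectly separable data no finite minimizer exists.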
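The convexity claim can be checked numerically. The sketch below assumes a small random design matrix and an arbitrary weight vector (both hypothetical) and builds the Hessian of the mean log-loss, Xᵀ diag(p(1 − p)) X / n, which does not depend on the labels.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # hypothetical design matrix: 200 examples, 3 features
w = rng.normal(size=3)                 # an arbitrary weight vector

p = 1.0 / (1.0 + np.exp(-(X @ w)))     # predicted probabilities
S = p * (1.0 - p)                      # per-example weights, each in (0, 0.25]
H = (X.T * S) @ X / len(X)             # Hessian of the mean log-loss: X^T diag(S) X / n

eigvals = np.linalg.eigvalsh(H)        # real eigenvalues of a symmetric matrix, ascending
print("Hessian eigenvalues:", eigvals)
```

Whatever weight vector is chosen, the eigenvalues should come out non-negative, because each per-example weight p(1 − p) is positive.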

When should you use gradient descent over the normal equation to fit a linear regression?

The normal equation gives an exact closed-form solution, but forming XᵀX costs O(np²) and solving the resulting p×p system costs O(p³), so it becomes impractical when the number of features p is large (a common rule of thumb is above ~10,000). Gradient descent costs O(np) per iteration, making it the practical choice for large feature spaces, and its stochastic variants also suit online learning.
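For a small problem the two approaches agree, which a short sketch can confirm. The synthetic data, the learning rate of 0.1, and the 500 iterations below are illustrative assumptions, not recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))                      # synthetic features
true_w = np.array([2.0, -1.0, 0.5])              # hypothetical ground-truth weights
y = X @ true_w + 0.01 * rng.normal(size=n)       # targets with a little noise

# Normal equation: solve (X^T X) w = X^T y without forming an explicit inverse
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on mean squared error
w_gd = np.zeros(p)
lr = 0.1
for _ in range(500):
    grad = 2.0 / n * X.T @ (X @ w_gd - y)        # gradient of mean((Xw - y)^2)
    w_gd -= lr * grad

print("closed form:     ", w_closed)
print("gradient descent:", w_gd)
```

With only three features the closed form is the obvious choice; the comparison matters because the cubic solve is the part that stops scaling as p grows.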

What is the vanishing gradient problem and how do you fix it?

Vanishing gradients occur when repeated multiplication of small derivatives during backpropagation drives gradients toward zero, starving early layers of learning signal. The main fixes are better activations (ReLU/GELU), residual connections, batch normalization, and careful weight initialization.
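Since backpropagation multiplies one Jacobian per layer, the shrinkage can be watched directly. The sketch below assumes a hypothetical stack of 20 bias-free sigmoid layers of width 16 with random weights, and tracks the spectral norm of the running Jacobian product.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 16, 20                   # illustrative sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = rng.normal(size=width)              # hypothetical input activations
J_total = np.eye(width)                 # running product of layer Jacobians
norms = []
for _ in range(depth):
    W = rng.normal(size=(width, width)) / np.sqrt(width)
    h = sigmoid(W @ h)
    # Jacobian of this layer's output w.r.t. its input: diag(sigmoid'(z)) @ W
    J_layer = (h * (1.0 - h))[:, None] * W
    J_total = J_layer @ J_total         # chain rule: later layers multiply on the left
    norms.append(np.linalg.norm(J_total, 2))

print("norm after 1 layer:  ", norms[0])
print("norm after 20 layers:", norms[-1])
```

The norm of the product bounds how much gradient signal can reach the first layer, so a value near zero is the "starved early layers" effect in numeric form.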

Jacobian, Hessian & Taylor: slope, curvature, and Newton — Math for ML

What you'll learn

Taylor expansion: linear (slope) and quadratic (curvature) approximations of a function

The Jacobian: stacking the gradients of a vector-valued function — and why backprop is a product of them

The Hessian: the matrix of second derivatives, and what its eigenvalues say about minima, maxima, and saddles

The multivariate Taylor formula and how Newton's method uses curvature to take smarter steps

Why saddle points — not local minima — dominate deep-learning loss surfaces

Backprop left us with two loose ends. A layer outputs a vector, not a number, so its derivative is something larger than a gradient — and all through gradient descent we watched one coordinate sprint while another crawled, because the loss curved harder in one direction than the other and the gradient was blind to it. This lesson ties off both. Gradient descent knows just one thing about your loss — which way is downhill — but the surface is also curving, bending sharply in some directions and gently in others, and reading that curvature is the whole leap from first-order to second-order optimization. It begins with the Taylor expansion.

Taylor: approximate any function with a slope and a bend

Near a point x₀, any smooth function is well-approximated by:

f(x₀ + Δ) ≈ f(x₀) + f′(x₀)·Δ + ½·f″(x₀)·Δ²
            └ value ┘ └ slope ┘   └ curvature ┘

The first two terms give the tangent line (linearization). Add the third — the second derivative — and you get a parabola that bends with the curve.

[Interactive plot: expansion point x₀ = -1.5, where f′(x₀) = -0.18 (slope / gradient) and f″(x₀) = 1.09 (curvature / Hessian)]

The tangent line only knows the slope, so it drifts off fast. Adding the second derivative (curvature) lets the parabola bend with the curve — that's why second-order methods converge faster.
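To see the drift in numbers, here is a small sketch using exp(x) around x₀ = 0 as a stand-in function (an assumption for illustration, not the function from the plot). It compares the tangent-line and parabola approximations at growing distances from the expansion point.

```python
import numpy as np

f = np.exp                     # stand-in function; for exp, f = f' = f''
x0 = 0.0
value, slope, curvature = f(x0), f(x0), f(x0)

errors = []
for delta in [0.1, 0.5, 1.0]:
    exact = f(x0 + delta)
    linear = value + slope * delta                       # tangent line
    quadratic = linear + 0.5 * curvature * delta ** 2    # add the curvature term
    errors.append((abs(exact - linear), abs(exact - quadratic)))
    print(f"delta={delta:.1f}  linear error={errors[-1][0]:.4f}  "
          f"quadratic error={errors[-1][1]:.4f}")
```

Both errors grow with the step, but the quadratic one stays several times smaller at every distance tried here.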

Scaling up: gradient → Jacobian → Hessian

With many variables, each derivative becomes an array.

Gradient ∇f — for a scalar function, the vector of first partial derivatives. The direction of steepest ascent (you met it already).
Jacobian J — for a vector-valued function f: ℝⁿ → ℝᵐ, the m×n matrix whose rows are the gradients of each output. Backpropagation is nothing but multiplying Jacobians together via the chain rule, layer by layer.
Hessian H — the n×n matrix of second partials, Hᵢⱼ = ∂²f/∂xᵢ∂xⱼ. It encodes curvature in every direction at once. It’s symmetric (when the second partials are continuous), so its eigenvalues are real, and at a critical point (∇f = 0) their signs classify the point:

Hessian eigenvalues	Critical point
all positive	local minimum (bowl up)
all negative	local maximum (bowl down)
mixed signs	saddle point (up one way, down another)
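The table translates into a few lines of code. This sketch assumes a hypothetical helper named classify and a small tolerance for deciding when an eigenvalue counts as zero; it also reports the degenerate case the table leaves out, where a zero eigenvalue makes the second-order test inconclusive.

```python
import numpy as np

def classify(H, tol=1e-10):
    """Classify a critical point from its (symmetric) Hessian. Hypothetical helper."""
    eig = np.linalg.eigvalsh(H)          # real eigenvalues, ascending order
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if eig[0] < -tol and eig[-1] > tol:
        return "saddle point"
    return "inconclusive"                # some eigenvalue is (numerically) zero

print(classify(np.array([[2.0, 0.0], [0.0, 2.0]])))    # Hessian of x^2 + y^2
print(classify(np.array([[-2.0, 0.0], [0.0, -2.0]])))  # Hessian of -x^2 - y^2
print(classify(np.array([[2.0, 0.0], [0.0, -2.0]])))   # Hessian of x^2 - y^2
print(classify(np.array([[2.0, 0.0], [0.0, 0.0]])))    # Hessian of x^2 + y^4 at the origin
```

The last example shows why the tolerance matters: with a zero eigenvalue the Hessian alone cannot tell a minimum from a saddle, and higher-order terms decide.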

The multivariate Taylor expansion ties them together:

f(x + Δ) ≈ f(x) + ∇f(x)ᵀ Δ + ½ Δᵀ H(x) Δ
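A quick numeric check of this formula, using a hypothetical two-variable function f(x, y) = eˣ + x·y² whose gradient and Hessian are written out by hand. The expansion point and the step Δ are arbitrary illustrative choices.

```python
import numpy as np

def f(v):
    x, y = v
    return np.exp(x) + x * y ** 2

def grad(v):
    x, y = v
    return np.array([np.exp(x) + y ** 2, 2.0 * x * y])

def hessian(v):
    x, y = v
    return np.array([[np.exp(x), 2.0 * y],
                     [2.0 * y,   2.0 * x]])

x0 = np.array([0.5, 1.0])        # expansion point (illustrative)
delta = np.array([0.1, 0.1])     # step away from it (illustrative)

exact = f(x0 + delta)
first_order = f(x0) + grad(x0) @ delta
second_order = first_order + 0.5 * delta @ hessian(x0) @ delta

err1 = abs(exact - first_order)
err2 = abs(exact - second_order)
print(f"first-order error:  {err1:.5f}")
print(f"second-order error: {err2:.5f}")
```

For this step the curvature term removes most of the first-order error, which is the multivariate version of the parabola hugging the curve.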

Newton’s method: use the curvature

Gradient descent steps −η∇f and guesses the step size η. Newton’s method uses the Hessian to compute the ideal step: Δ = −H⁻¹∇f. On a quadratic bowl it lands on the minimum in a single jump.

import numpy as np

# A quadratic bowl  f(x) = ½ xᵀ A x − bᵀ x
A = np.array([[3.0, 0.5], [0.5, 1.0]])   # the (constant) Hessian
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b

x = np.array([4.0, -3.0])
print("start:", x)

# One Newton step uses the Hessian (here, A) -- lands on the minimum exactly
x_newton = x - np.linalg.solve(A, grad(x))
print("after ONE Newton step:", x_newton.round(4))

# Gradient descent needs many small steps and a tuned learning rate
x_gd = x.copy()
for _ in range(50):
    x_gd = x_gd - 0.25 * grad(x_gd)
print("after 50 GD steps:    ", x_gd.round(4))

start: [ 4. -3.]
after ONE Newton step: [0. 2.]
after 50 GD steps:     [0. 2.]

One Newton step lands exactly on the minimum [0, 2] — because on a quadratic, the second-order Taylor model is the function, so solving it once is solving the whole problem. Gradient descent reaches the same point, but only after all 50 small, learning-rate-tuned steps: it feels the slope and never the curvature. The catch is that the Hessian is n×n, so for a billion-parameter model it is unthinkable to form or invert — which is why deep learning uses first-order methods (Adam) plus approximations of curvature (L-BFGS, K-FAC) instead.
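The scale problem is easy to put in numbers. The back-of-the-envelope sketch below assumes float32 storage (4 bytes per entry) and a dense Hessian with no structure exploited.

```python
n_params = 1_000_000_000            # a billion-parameter model
bytes_per_entry = 4                 # assumption: float32 storage

hessian_entries = n_params ** 2     # the Hessian is n x n
hessian_bytes = hessian_entries * bytes_per_entry

print(f"{hessian_entries:.1e} entries")
print(f"{hessian_bytes / 1e18:.1f} exabytes just to store it")
```

Storage is only the first obstacle; solving a dense n×n system also scales cubically in n, so methods that avoid forming the matrix are the only practical route at this size.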

In one breath

Gradient descent reads only the slope, but a loss surface also curves, and the Taylor expansion captures both: f(x + Δ) ≈ f(x) + ∇fᵀΔ + ½ ΔᵀHΔ, which is value, then slope, then curvature. The derivatives climb a ladder: the gradient (first derivatives of a scalar function), the Jacobian (the m×n matrix of a vector-valued function’s partials; backprop is just a product of these, layer by layer), and the Hessian (the symmetric matrix of second partials, whose eigenvalues classify a critical point: all-positive = minimum, all-negative = maximum, mixed = saddle). Newton’s method uses the Hessian for the ideal step Δ = −H⁻¹∇f, exact in one jump on a quadratic. But the Hessian is n×n and so infeasible to form or invert at deep-learning scale, where saddle points, not bad minima, dominate.

Practice

Quick check

Q1. At a critical point (∇f = 0), the Hessian has eigenvalues [5, -2]. What kind of point is it?

Q2. Why does backpropagation relate to Jacobians?

Q3. Newton's method converges in one step on a quadratic but is rarely used for deep nets. Why?

A question to carry forward

The Hessian’s eigenvalues just gave us a classifier for a single point: all-positive means a bowl, mixed means a saddle. But that was a purely local verdict — it described the curvature exactly where we stood and said nothing about the rest of the surface. So ask the bolder question: what if a function’s Hessian were positive everywhere, in every direction, at every point?

Then something remarkable follows. The surface could never bend downward to open a second valley, so there would be exactly one bottom, and every downhill path — from any starting point — would have to arrive at it. No saddles to stall on, no bad local minima to fear. That property has a name, convexity, and it is the quiet dividing line between the losses that train effortlessly and the ones that need babysitting. Here is the thread onward: what exactly makes a function convex, how does the Hessian test for it in a single line, and why does knowing a loss is convex turn optimization from an art back into a guarantee?
