datarekha

Jacobian, Hessian & Taylor: slope, curvature, and Newton

Gradient descent only uses the slope. But every loss surface also has curvature — and the Jacobian, the Hessian, and the Taylor expansion are how you read it. Curvature is the difference between crawling and converging.

9 min read Advanced Math for ML Lesson 15 of 30

What you'll learn

  • Taylor expansion: linear (slope) and quadratic (curvature) approximations of a function
  • The Jacobian: stacking the gradients of a vector-valued function — and why backprop is a product of them
  • The Hessian: the matrix of second derivatives, and what its eigenvalues say about minima, maxima, and saddles
  • The multivariate Taylor formula and how Newton's method uses curvature to take smarter steps
  • Why saddle points — not local minima — dominate deep-learning loss surfaces


Gradient descent knows one thing about your loss: which way is downhill. But the surface is also curving — bending sharply in some directions, gently in others. Reading that curvature is the leap from first-order to second-order optimization, and it all starts with the Taylor expansion.

Taylor: approximate any function with a slope and a bend

Near a point x₀, a smooth function is well approximated, for small Δ, by:

f(x₀ + Δ) ≈ f(x₀) + f′(x₀)·Δ + ½·f″(x₀)·Δ²
            └ value ┘ └ slope ┘   └ curvature ┘

The first two terms give the tangent line (linearization). Add the third — the second derivative — and you get a parabola that bends with the curve.
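
To see the difference between the tangent line and the parabola numerically, here is a minimal sketch. The function name `taylor_approximations` and the choice of f(x) = exp(x) at x₀ = 0 are illustrative assumptions, not part of the lesson.

```python
import math

def taylor_approximations(f, df, d2f, x0, delta):
    """Return (exact, linear, quadratic) values of f at x0 + delta."""
    exact = f(x0 + delta)
    linear = f(x0) + df(x0) * delta                  # value + slope term
    quadratic = linear + 0.5 * d2f(x0) * delta ** 2  # add the curvature term
    return exact, linear, quadratic

# Illustrative choice: f(x) = exp(x), whose first and second derivatives are exp(x) too.
x0 = 0.0
for delta in (0.5, 0.1, 0.01):
    exact, linear, quadratic = taylor_approximations(
        math.exp, math.exp, math.exp, x0, delta
    )
    print(
        f"delta={delta:<5} "
        f"linear error={abs(exact - linear):.2e}  "
        f"quadratic error={abs(exact - quadratic):.2e}"
    )
```

As Δ shrinks, the linear error falls roughly like Δ² and the quadratic error roughly like Δ³, which is why the curvature term pays off most close to x₀.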

Scaling up: gradient → Jacobian → Hessian

With many variables, each derivative becomes an array.

  • Gradient ∇f: for a scalar function, the vector of first partial derivatives. It points in the direction of steepest ascent (you met it already).
  • Jacobian J: for a vector-valued function f: ℝⁿ → ℝᵐ, the m×n matrix whose rows are the gradients of each output. Backpropagation is the chain rule applied layer by layer: the gradient of the loss is a product of the layers' Jacobians, evaluated from the output backward.
  • Hessian H: the n×n matrix of second partials, Hᵢⱼ = ∂²f/∂xᵢ∂xⱼ. It encodes curvature in every direction at once. When the second partials are continuous it is symmetric, so its eigenvalues are real, and at a critical point (∇f = 0) their signs classify the point:

Hessian eigenvalues    Critical point
all positive           local minimum (bowl up)
all negative           local maximum (bowl down)
mixed signs            saddle point (up one way, down another)

If some eigenvalues are zero and the rest share a sign, the test is inconclusive and higher-order terms decide.
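
The eigenvalue test can be sketched for the two-variable case, where a symmetric 2×2 Hessian has closed-form eigenvalues. The helper names `eigenvalues_sym_2x2` and `classify_critical_point`, the tolerance, and the example functions are assumptions made for illustration.

```python
import math

def eigenvalues_sym_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius

def classify_critical_point(a, b, c, tol=1e-12):
    """Classify a critical point from its Hessian [[a, b], [b, c]]."""
    smallest, largest = eigenvalues_sym_2x2(a, b, c)
    if smallest > tol:
        return "local minimum"
    if largest < -tol:
        return "local maximum"
    if smallest < -tol and largest > tol:
        return "saddle point"
    return "inconclusive"  # a zero eigenvalue: second-order test cannot decide

# f(x, y) = x^2 + y^2 at the origin: Hessian [[2, 0], [0, 2]]
print(classify_critical_point(2.0, 0.0, 2.0))    # local minimum
# f(x, y) = x^2 - y^2 at the origin: Hessian [[2, 0], [0, -2]]
print(classify_critical_point(2.0, 0.0, -2.0))   # saddle point
# f(x, y) = x * y at the origin: Hessian [[0, 1], [1, 0]], eigenvalues -1 and 1
print(classify_critical_point(0.0, 1.0, 0.0))    # saddle point
```

For larger Hessians the same logic applies, but the eigenvalues would come from a numerical routine rather than a closed form.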

The multivariate Taylor expansion ties them together:

f(x + Δ) ≈ f(x) + ∇f(x)ᵀ Δ + ½ Δᵀ H(x) Δ
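
As a quick numerical check of this formula, the sketch below compares the first-order and second-order models against the exact value. The test function f(x, y) = x²y + sin(y), the expansion point, and the step are all illustrative assumptions; the gradient and Hessian are written out by hand.

```python
import math

def f(x, y):
    return x * x * y + math.sin(y)

def grad(x, y):
    """Gradient of f: (df/dx, df/dy)."""
    return (2 * x * y, x * x + math.cos(y))

def hessian(x, y):
    """Hessian of f as a 2x2 tuple of rows."""
    return ((2 * y, 2 * x),
            (2 * x, -math.sin(y)))

def taylor_models(x, y, dx, dy):
    """Return (linear, quadratic) Taylor estimates of f(x + dx, y + dy)."""
    g = grad(x, y)
    H = hessian(x, y)
    first_order = g[0] * dx + g[1] * dy
    # 0.5 * delta^T H delta, written out for two variables
    second_order = 0.5 * (
        dx * (H[0][0] * dx + H[0][1] * dy)
        + dy * (H[1][0] * dx + H[1][1] * dy)
    )
    linear = f(x, y) + first_order
    return linear, linear + second_order

x, y = 1.0, 0.5        # expansion point (assumed for the example)
dx, dy = 0.1, -0.05    # step delta (assumed for the example)
exact = f(x + dx, y + dy)
linear, quadratic = taylor_models(x, y, dx, dy)
print(f"linear error    = {abs(exact - linear):.2e}")
print(f"quadratic error = {abs(exact - quadratic):.2e}")
```

The quadratic model should be noticeably closer to the exact value than the linear one, because it accounts for how the surface bends along the step.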

Newton’s method: use the curvature

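Gradient descent steps −η∇f and guesses the step size η. Newton’s method instead picks the step that minimizes the quadratic Taylor model: setting the model's gradient to zero gives HΔ = −∇f, so Δ = −H⁻¹∇f. On a quadratic bowl (H positive definite) the model is exact, so Newton lands on the minimum in a single jump.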

On such a bowl one Newton step nails it, while gradient descent can still be creeping in after 50 steps. The catch: the Hessian is n×n, so for a billion-parameter model it has 10¹⁸ entries, far too many to form, let alone invert. That is why deep learning relies on first-order methods (Adam) and, where curvature is used at all, on approximations of it (L-BFGS, K-FAC).
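
The Newton-versus-gradient-descent comparison can be sketched on a small quadratic bowl. Everything specific here is an assumption chosen for illustration: the matrix `A`, the vector `b`, the start point, the learning rate of 0.1, and the helper names.

```python
# Quadratic bowl f(x) = 0.5 * x^T A x - b^T x with a symmetric positive-definite A.
# A, b, the start point, and the learning rate are illustrative assumptions.
A = ((10.0, 2.0), (2.0, 1.0))
b = (1.0, 1.0)

def gradient(x):
    """Gradient of the bowl: A x - b."""
    return (A[0][0] * x[0] + A[0][1] * x[1] - b[0],
            A[1][0] * x[0] + A[1][1] * x[1] - b[1])

def solve_2x2(M, v):
    """Solve M d = v for a 2x2 matrix M using the closed-form inverse."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return ((M[1][1] * v[0] - M[0][1] * v[1]) / det,
            (M[0][0] * v[1] - M[1][0] * v[0]) / det)

def newton_step(x):
    """One Newton step x - H^{-1} grad f(x); here the Hessian is the constant A."""
    d = solve_2x2(A, gradient(x))
    return (x[0] - d[0], x[1] - d[1])

def gradient_descent(x, learning_rate, steps):
    """Run a fixed number of plain gradient-descent steps."""
    for _ in range(steps):
        g = gradient(x)
        x = (x[0] - learning_rate * g[0], x[1] - learning_rate * g[1])
    return x

def distance(p, q):
    """Euclidean distance between two 2D points."""
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

x_star = solve_2x2(A, b)   # exact minimizer, where A x = b
start = (5.0, 5.0)
newton_error = distance(newton_step(start), x_star)
gd_error = distance(gradient_descent(start, 0.1, 50), x_star)
print(f"Newton, 1 step:             error = {newton_error:.2e}")
print(f"Gradient descent, 50 steps: error = {gd_error:.2e}")
```

The learning rate 0.1 is stable here because it is below 2 divided by the largest Hessian eigenvalue (about 10.4). Gradient descent's leftover error lives in the flat direction, where the eigenvalue is only about 0.58 and each step removes just a few percent of the remaining distance.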

Quick check

Q1. At a critical point (∇f = 0), the Hessian has eigenvalues [5, -2]. What kind of point is it?
Q2. Why does backpropagation relate to Jacobians?
Q3. Newton's method converges in one step on a quadratic but is rarely used for deep nets. Why?


Practice this in an interview

Why does sigmoid saturation cause vanishing gradients, and why is tanh only a partial fix?

Sigmoid's derivative peaks at 0.25 and approaches zero in both tails, so a chain of such factors shrinks exponentially with depth. Tanh's derivative peaks at 1 and its output is zero-centered, which helps weight-update symmetry, but it still saturates: for large-magnitude inputs its gradient also shrinks toward zero.
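
The numbers in this answer are easy to check. The sketch below uses hypothetical helper names (`sigmoid_grad`, `tanh_grad`) and ignores the weight matrices, so the product shown is only the activation-derivative part of the backpropagated gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), largest at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2, largest at z = 0."""
    return 1.0 - math.tanh(z) ** 2

for z in (0.0, 2.0, 5.0):
    print(f"z={z}: sigmoid'={sigmoid_grad(z):.4f}  tanh'={tanh_grad(z):.4f}")

# Best case for sigmoid: every layer sits at its peak derivative of 0.25.
for depth in (5, 10, 20):
    print(f"depth={depth}: upper bound on product = {0.25 ** depth:.2e}")
```

Even in the best case the sigmoid factor is 0.25 per layer, so the bound decays geometrically with depth; tanh avoids that ceiling near zero but both derivatives collapse once inputs move into the tails.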

What loss function does logistic regression optimize, and why is it convex?

Logistic regression minimizes binary cross-entropy (log-loss), the negative log-likelihood of a Bernoulli model whose probability is the sigmoid of a linear predictor. The Hessian of log-loss is positive semi-definite everywhere, so the surface is convex and any local minimum is a global minimum. Uniqueness additionally requires strict convexity, which L2 regularization provides.

When should you use gradient descent over the normal equation to fit a linear regression?

The normal equation gives an exact closed-form solution, but forming XᵀX costs O(np²) and solving the resulting system costs O(p³), so it becomes impractical when the number of features p is large (typically above ~10,000). Gradient descent costs O(np) per iteration, which makes it the practical choice for large feature spaces or online learning.

What is the vanishing gradient problem and how do you fix it?

Vanishing gradients occur when repeated multiplication of small derivatives during backpropagation drives gradients toward zero, starving early layers of learning signal. The main fixes are better activations (ReLU/GELU), residual connections, batch normalization, and careful weight initialization.
