Compare sigmoid, tanh, ReLU, leaky ReLU, and GELU — when would you pick each?

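Sigmoid squashes inputs to (0, 1) and saturates at both extremes, which causes vanishing gradients; it remains the natural choice when an output must be read as a binary probability. Tanh is zero-centered but still saturates. ReLU avoids saturation for positive inputs and trains fast, but can produce dead neurons that output 0 for every input. Leaky ReLU mitigates dying neurons by keeping a small non-zero slope for negative inputs. GELU is smooth and probabilistically motivated, and is the default in many transformer architectures such as BERT and GPT-2.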
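A minimal sketch of the five activations, using only the Python standard library, makes the comparison concrete. The `negative_slope=0.01` value for leaky ReLU is an assumed, commonly used setting rather than anything fixed by the text, and GELU is written in its exact erf form.

```python
import math

def sigmoid(x):
    # Squashes to (0, 1); flattens out for large |x|.
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Identity for positive inputs, 0 otherwise.
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # negative_slope=0.01 is an assumed value, not taken from the text.
    return x if x > 0 else negative_slope * x

def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF written via erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

activations = {
    "sigmoid": sigmoid,
    "tanh": math.tanh,
    "relu": relu,
    "leaky_relu": leaky_relu,
    "gelu": gelu,
}

# Print each activation at a few sample inputs to compare their shapes.
for name, fn in activations.items():
    row = ", ".join(f"{fn(x):+.4f}" for x in (-3.0, -1.0, 0.0, 1.0, 3.0))
    print(f"{name:>10}: {row}")
```

Reading down the column for x = -3.0 shows the contrast: ReLU is exactly 0, leaky ReLU is slightly negative, and sigmoid and tanh are already close to their lower limits.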

What does the Universal Approximation Theorem guarantee — and what doesn't it guarantee?

The theorem states that a single-hidden-layer network with enough neurons and a suitable non-linear (non-polynomial) activation can approximate any continuous function on a compact domain to arbitrary precision. It guarantees existence, not learnability: it says nothing about how many neurons are needed, whether gradient descent will find the solution, or how the network will generalize.

Why does sigmoid saturation cause vanishing gradients, and why is tanh only a partial fix?

Sigmoid's derivative peaks at 0.25 and approaches zero in both tails, so each sigmoid layer multiplies the backpropagated gradient by at most 0.25 and the product shrinks exponentially with depth. Tanh's derivative peaks at 1 and its output is zero-centered, which helps weight update symmetry, but it still saturates at large magnitudes and its gradient still shrinks to near zero in both tails.
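The numbers behind this can be checked with a short sketch. The helper names and the depth of 10 are illustrative assumptions, and the bound looks only at the activation derivatives, ignoring the weight matrices that also appear in the real chain of multiplications.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

# Peak values, both reached at x = 0.
print(sigmoid_grad(0.0))  # 0.25
print(tanh_grad(0.0))     # 1.0

# Both derivatives are tiny in the tails (saturation).
print(sigmoid_grad(10.0), tanh_grad(10.0))

# Best-case product of sigmoid derivatives through `depth` layers
# (hypothetical depth; weights are ignored in this bound).
depth = 10
best_case = sigmoid_grad(0.0) ** depth
print(best_case)  # about 9.5e-07
```

Even in the best case, ten sigmoid layers scale the gradient by roughly one millionth, whereas tanh only shrinks it once inputs move away from 0.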

What is GELU and why does it outperform ReLU in transformer models?

GELU (Gaussian Error Linear Unit) multiplies the input by the probability that a standard Gaussian random variable is smaller than it, GELU(x) = x · Φ(x), producing a smooth, non-monotonic curve that approximates ReLU but with a stochastic regularization flavor. A common explanation for its popularity in transformers is that the smooth gradient near zero helps optimization in deep attention-based architectures.
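As a rough illustration of "smooth, non-monotonic, approximates ReLU", the sketch below evaluates x · Φ(x) at a few hand-picked points. The sample inputs are assumptions chosen to show the dip, not values from the text.

```python
import math

def gelu(x):
    # x * Phi(x): Phi is the standard normal CDF, written via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

# Sample points chosen by hand to expose the dip on the negative side.
for x in (-3.0, -0.75, 0.0, 0.75, 3.0):
    print(f"x={x:+.2f}  gelu={gelu(x):+.4f}  relu={relu(x):+.4f}")

# Non-monotonic: the curve dips below zero, then climbs back toward 0
# as x becomes more negative.
dips = gelu(-0.75) < gelu(-3.0) < 0.0
print("dips below zero and recovers:", dips)
```

For large positive x the two columns nearly coincide, which is the sense in which GELU approximates ReLU.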

Differentiability — GATE DA

What you'll learn

f is differentiable at a when f'(a) exists — a single, unique tangent slope

Differentiable ⇒ continuous, but the converse fails (|x| and ReLU at 0)

Sums, products, quotients (denominator ≠ 0), and compositions of differentiable functions are differentiable

The ML tie-in: ReLU = max(0, x) is continuous everywhere but not differentiable at 0 (a real 2025 question)

Last lesson ended on a complaint about |x|: unbroken, yes, but kinked at the origin, with no single slope to call its own. The cure is a stronger kind of good behaviour, and it begins with one simple act — zoom in. Zoom in on a smooth curve far enough and it starts to look like a straight line. That line is the tangent, and its slope is the derivative.

A function is differentiable at a point when that zoom-in works — when one clean tangent line is waiting for you at the bottom of the zoom. Where it fails is exactly at the corners that continuity could not rule out. A sharp kink means the curve arrives with one slope from the left and leaves with another to the right, so no single tangent exists. The most famous example in deep learning sits right at the origin: the ReLU activation max(0, x) is continuous everywhere, yet its corner at 0 makes it non-differentiable there. GATE DA 2025 asked exactly this — so it is a fact worth understanding, not memorising.

Differentiable at a point

f is differentiable at a if the derivative

            f(a + h) − f(a)
f'(a) = lim ───────────────
        h→0        h

exists — meaning the limit returns one finite number whether h approaches 0 from the left or the right. Geometrically, the secant line settles onto a single tangent. Drag the point along the smooth curve below and animate h → 0 to watch the secant rotate into that one unique tangent slope — the same secant-settling you met in the limits lesson, now read as a slope.

[Interactive figure: Derivatives, slope of the tangent. Two panels: f(x) with its tangent and secant, and f′(x) traced as the point is dragged. The secant line touches the curve at two points a distance h apart; as h shrinks toward 0 it rotates into the tangent, and that limiting slope is the derivative f′(x). Sample reading: x = 1.2, f(x) = 1.44, h = 0.8, secant slope (f(x+h) − f(x))/h = 3.2, tangent slope f′(x) = 2.4.]

When the left-hand slope and the right-hand slope disagree, no single limit exists, and the function is not differentiable there — even when the graph itself is perfectly unbroken.

Left: a smooth curve has one tangent. Right: at the corner of |x| the left slope (−1) and right slope (+1) disagree, so it is not differentiable at 0.
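The agree-or-disagree test can be sketched numerically with one-sided difference quotients. The helper name `one_sided_slopes`, the step h = 1e-6, and the smooth example f(x) = x² (an assumption, chosen because it matches the reading f(1.2) = 1.44 with slope 2.4) are illustrative. A finite h only approximates the limit, so this is a check rather than a proof.

```python
def one_sided_slopes(func, a, h=1e-6):
    # Left quotient uses the point a - h, right quotient uses a + h.
    left = (func(a) - func(a - h)) / h
    right = (func(a + h) - func(a)) / h
    return left, right

def square(x):
    # Assumed smooth example, f(x) = x^2.
    return x * x

# Smooth curve: both sides settle on the same slope, about 2.4.
smooth_left, smooth_right = one_sided_slopes(square, 1.2)
print(smooth_left, smooth_right)

# Corner of |x| at 0: the sides disagree, so no single tangent exists.
corner_left, corner_right = one_sided_slopes(abs, 0.0)
print(corner_left, corner_right)  # -1.0 1.0
```

Shrinking h further leaves the |x| result unchanged at −1 and +1, which is exactly why the two-sided limit fails there.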

The one-way implication

Here is the relationship GATE tests most, and it runs in one direction only:

differentiable at a   ⇒   continuous at a          (always)
continuous at a       ⇒   differentiable at a      (NOT in general)

If a function has a tangent slope it certainly cannot jump, so differentiability forces continuity — the stronger property carries the weaker one for free. The reverse fails, and |x| and ReLU = max(0, x) are why: continuous everywhere, yet their corner at 0 kills differentiability. Continuity is the weaker condition; differentiability is the stronger one, and that asymmetry is the single fact most of these questions turn on.

Building differentiable functions

Differentiability is preserved by the usual algebra, so you can certify a big expression from its small pieces. If f and g are both differentiable at a, then so are:

their sum f + g and difference f − g,
their product f · g,
their quotient f / g — provided g(a) ≠ 0,
their composition f(g(x)) (the chain rule).

So a product f·g is differentiable wherever both factors are, and a quotient inherits differentiability everywhere the denominator stays non-zero. The one place to stay alert is that quotient caveat — a vanishing denominator breaks it.
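A quick numerical sketch of these rules: with f = sin and g(x) = x² (both assumed examples, differentiable everywhere), the left and right slopes of every combination agree at a = 1, while the quotient is not even defined at 0, where g vanishes. The helper `one_sided_slopes` and the step size are hypothetical choices, and agreement at a finite h is a sanity check, not a proof.

```python
import math

def one_sided_slopes(func, a, h=1e-6):
    left = (func(a) - func(a - h)) / h
    right = (func(a + h) - func(a)) / h
    return left, right

def f(x):
    return math.sin(x)

def g(x):
    # Differentiable everywhere, but g(0) = 0.
    return x * x

combos = {
    "f + g": lambda x: f(x) + g(x),
    "f * g": lambda x: f(x) * g(x),
    "f / g": lambda x: f(x) / g(x),
    "f(g(x))": lambda x: f(g(x)),
}

# At a = 1 the denominator g(1) = 1 is non-zero, so all four qualify.
a = 1.0
for name, func in combos.items():
    left, right = one_sided_slopes(func, a)
    print(f"{name:>8}: left={left:.5f} right={right:.5f}")

# Quotient caveat: at 0 the denominator vanishes and f / g breaks.
try:
    combos["f / g"](0.0)
    quotient_defined_at_zero = True
except ZeroDivisionError:
    quotient_defined_at_zero = False
print("f / g defined at 0:", quotient_defined_at_zero)
```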

How GATE asks this

The 2025-era favourite is an MSQ (multiple-select question): a list of functions or claims, “select all that are differentiable” or “select all true statements.” There are two recurring hooks. First, the ReLU / |x| corner: continuous but not differentiable at 0. Second, the one-way implication: differentiable ⇒ continuous, while the converse fails in general. GATE DA 2025 tested exactly the ReLU fact. Expect also combination questions: sums, products, and compositions of differentiable functions stay differentiable.

Worked example — ReLU at the origin (GATE DA 2025)

Is ReLU(x) = max(0, x) continuous at x = 0? Is it differentiable there?

Write ReLU as a piecewise function and inspect the seam at 0:

ReLU(x) = 0   for x ≤ 0,     ReLU(x) = x   for x > 0

Continuity at 0: the left-hand limit is 0, the right-hand limit is 0, and ReLU(0) = 0. All three agree, so by last lesson’s test ReLU is continuous at 0 (and everywhere).

Differentiability at 0: compare the one-sided slopes.

left slope  (x < 0):  d/dx [0] = 0
right slope (x > 0):  d/dx [x] = 1

0  ≠  1   →   the slopes disagree   →   NOT differentiable at x = 0

So ReLU is continuous everywhere but not differentiable at x = 0 — the corner, exactly. The identical reasoning applies to |x|: left slope −1, right slope +1, so it too has a non-differentiable corner at 0. Away from 0, both are perfectly differentiable. And for a combination such as h(x) = f(x)·g(x), h is differentiable wherever both f and g are — so x · sin x, a product of two everywhere-differentiable pieces, is differentiable for all real x.
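The worked example can be replayed in a few lines. This is a sketch with an assumed small step h = 1e-6 standing in for the limit: the three values near the seam agree (continuity), while the two one-sided slopes do not (no derivative at 0).

```python
def relu(x):
    return max(0.0, x)

h = 1e-6  # assumed small step standing in for h -> 0

# Continuity at 0: values just left, just right, and at 0 all agree.
left_value = relu(-h)
right_value = relu(h)
value_at_zero = relu(0.0)
print(left_value, right_value, value_at_zero)

# Differentiability at 0: compare the one-sided difference quotients.
left_slope = (relu(0.0) - relu(-h)) / h
right_slope = (relu(h) - relu(0.0)) / h
print(left_slope, right_slope)  # 0.0 1.0

differentiable_at_zero = left_slope == right_slope
print("differentiable at 0:", differentiable_at_zero)  # False
```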

A question to carry forward

We can now say when a slope exists — when the zoom-in lands on one tangent. But knowing a derivative exists is not the same as having it in hand, and grinding the limit (f(a+h) − f(a))/h by hand for every function would be punishing. Already, though, a pattern is showing: the slope of x is 1, the slope of a constant is 0, and the building rules above hint that products and compositions follow tidy laws of their own. Here is the thread onward: is there a compact set of shortcuts — for powers, products, quotients, and compositions — that hands you any derivative in seconds, with the limit definition tucked safely out of sight?

In one breath

Differentiable at a = the limit f'(a) = lim_{h→0} (f(a+h) − f(a))/h exists — one finite slope from both sides, a single tangent at the bottom of the zoom.
A corner (mismatched left/right slopes) kills it: |x| and ReLU = max(0,x) are continuous yet not differentiable at 0 (GATE DA 2025).
One-way implication: differentiable ⇒ continuous (always); continuous ⇒ differentiable is false in general. Differentiability is the stronger property.
Building rules: sums, differences, products, compositions (chain rule) of differentiable functions are differentiable; quotients too, wherever g ≠ 0.
Reflex: a smooth graph is more than an unbroken one — hunt for corners and zero denominators before declaring “differentiable.”

Practice

Quick check

Q1. Recall: which statements are TRUE? (select all that apply)

Q2. Recall: the left-hand derivative of f at a is 2 and the right-hand derivative is 2. Is f differentiable at a, and if so what is f'(a)? Enter f'(a). (integer)

Q3. Trace: compute the right-hand derivative of ReLU(x) = max(0, x) at x = 0 (the slope for x > 0). (integer)

Q4. Apply: at how many points is f(x) = |x − 3| NOT differentiable? (integer)

Q5. Apply: which of these functions are differentiable for ALL real x? (select all that apply)

Q6. Create: which statements about combining differentiable functions are TRUE? (select all that apply)
