Why do neural networks need activation functions at all?

Without a non-linear activation, any stack of linear layers collapses to a single linear transformation, so the model is no more expressive than one linear layer (linear regression, or logistic regression if a sigmoid is applied at the output). Activation functions break linearity so the network can approximate far more complex functions.
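The collapse can be checked numerically. The sketch below uses arbitrary illustrative matrices (W1, W2 and x are assumptions, not values from the text) and omits biases for brevity; with biases the stack collapses to a single affine map instead.

```python
# Pure-Python sketch: two stacked linear layers equal one linear layer.
# W1, W2 and x are arbitrary illustrative values.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A (m x k) by matrix B (k x n)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W1 = [[1, 2], [0, -1], [3, 1]]   # layer 1: 2 inputs -> 3 units
W2 = [[2, 0, 1], [-1, 1, 0]]     # layer 2: 3 units -> 2 outputs
x = [4, -3]

stacked = matvec(W2, matvec(W1, x))     # W2 (W1 x), no activation in between
collapsed = matvec(matmul(W2, W1), x)   # (W2 W1) x, a single layer

print(stacked, collapsed)  # [5, 5] [5, 5]
```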

What does a single artificial neuron (perceptron) actually compute?

A neuron takes a weighted sum of its inputs, adds a bias, and passes the result through an activation function. The weights encode learned feature importance, the bias shifts the decision boundary, and the activation introduces the non-linearity needed for complex mappings.
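As a minimal sketch of that computation, assuming a sigmoid activation and made-up inputs, weights, and bias (the function name `neuron` is illustrative only):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs, plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical values chosen so the pre-activation z is exactly 0.
out = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(out)  # sigmoid(0) = 0.5
```

Changing the bias alone moves z away from 0 and so shifts where the output crosses 0.5, which is the decision-boundary shift described above.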

Walk me through the forward pass of a neural network end-to-end.

The forward pass feeds an input through every layer in sequence: each layer computes a linear transform followed by an activation, caching the intermediate values needed later for backpropagation. The final layer produces a prediction, which is compared to the label via a loss function.
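One way to sketch this in plain Python is shown below. The 2 → 2 → 1 network, its weights, the squared-error loss, and the cache layout are all assumptions made for illustration, not a prescribed implementation.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def forward(x, layers):
    """Run x through each (W, b, activation) layer, caching values per layer."""
    cache = []
    a = x
    for W, b, act in layers:
        # Linear transform: one weighted sum plus bias per unit.
        z = [sum(w * ai for w, ai in zip(row, a)) + bi for row, bi in zip(W, b)]
        a_next = [act(zi) for zi in z]
        # Backpropagation later needs the layer input and the pre-activation.
        cache.append({"input": a, "z": z, "a": a_next})
        a = a_next
    return a, cache

# Hypothetical 2 -> 2 -> 1 network with hand-picked weights.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),
    ([[2.0, 1.0]], [0.5], identity),
]
x, label = [3.0, 1.0], 4.0
prediction, cache = forward(x, layers)
loss = (prediction[0] - label) ** 2   # squared error, one common choice

print(prediction, loss)  # [5.5] 2.25
```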

Compare sigmoid, tanh, ReLU, leaky ReLU, and GELU — when would you pick each?

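Sigmoid squashes to (0, 1) and saturates at both extremes, causing vanishing gradients; it suits binary output layers. Tanh is zero-centered but still saturates. ReLU avoids saturation for positive inputs and trains fast, but units whose input stays negative receive zero gradient and can die. Leaky ReLU keeps a small nonzero slope for negative inputs, which mitigates dying neurons. GELU is smooth and weights each input by the standard normal CDF, x·Φ(x); it is a common choice in transformer architectures such as BERT and GPT-2.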
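For reference, the five functions can be written with the standard library alone. This is a sketch: the leaky slope of 0.01 is a common but arbitrary choice, and `gelu` uses the exact erf-based form rather than the tanh approximation some libraries offer.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):   # slope is an assumed, commonly used default
    return x if x > 0 else slope * x

def gelu(x):
    # Exact form x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Compare the outputs at a negative, zero, and positive input.
for x in (-2.0, 0.0, 2.0):
    print(x, [round(f(x), 4) for f in (sigmoid, tanh, relu, leaky_relu, gelu)])
```

At x = −2 the difference between the ReLU family members is visible: ReLU returns 0, leaky ReLU returns a small negative value, and GELU returns a small negative value that fades toward 0 as x decreases further.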

Multi-Layer Perceptron & Activations — GATE DA

What you'll learn

An MLP stacks fully-connected layers; nonlinear activations stop them collapsing

A layer from a inputs to b units has a·b weights plus b biases

Counting total trainable parameters, with and without bias terms

Sigmoid, tanh, ReLU = max(0,x); ReLU is continuous but not differentiable at 0

Last lesson left the lone perceptron defeated by XOR, with the cure already named: feed several neurons into another neuron, building a hidden layer. Do exactly that and you have a multi-layer perceptron (MLP) — the plainest neural network, and the architecture that ended the XOR impasse and launched deep learning. It is a stack of fully-connected layers, each one taking the previous layer’s outputs, mixing them with a weight matrix, adding a bias, and passing the result through a nonlinear activation. Logistic regression is just the one-layer special case; the MLP stacks more.

But the perceptron’s other parting warning was sharp: the hard sign step has to go. And there is a deeper reason than its flat gradient. Stack two linear layers with nothing between them and W₂(W₁x) collapses into a single linear map (W₂W₁)x — no more powerful than one layer, XOR still unsolved. The nonlinear activation between layers is what stops the collapse, and it is the smooth heir to the perceptron’s sign. With it, an MLP can bend boundaries into any shape; without it, depth buys nothing. Every deep network you will train — a tabular classifier, the dense blocks inside a transformer — is built from these fully-connected-plus-activation pairs, so counting their parameters is the first thing you do when sizing a model to fit in memory.
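To make the point concrete, here is a sketch of a 2 → 2 → 1 network that computes XOR with ReLU. The weights are hand-picked for illustration (not learned, and not from the lesson); swapping the activation for the identity leaves a purely linear function that gets XOR wrong.

```python
def relu(z):
    return max(0, z)

def xor_mlp(x1, x2, activation=relu):
    """2 -> 2 -> 1 network with hand-picked (not learned) weights."""
    h1 = activation(x1 + x2)        # hidden unit 1: weights (1, 1), bias 0
    h2 = activation(x1 + x2 - 1)    # hidden unit 2: weights (1, 1), bias -1
    return h1 - 2 * h2              # output unit: weights (1, -2), bias 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

with_relu = [xor_mlp(a, b) for a, b in inputs]
without_activation = [xor_mlp(a, b, activation=lambda z: z) for a, b in inputs]

print(with_relu)           # [0, 1, 1, 0]  -> XOR
print(without_activation)  # [2, 1, 1, 0]  -> the linear map 2 - (x1 + x2)
```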

Layers, weights, and biases

A layer that maps a inputs to b units has a weight for every input-to-unit connection — a·b of them — plus one bias per unit, so b biases.

In a network diagram, every arrow between two units is one weight. Add one bias per unit in a layer if biases are used.

Total trainable parameters is the sum over layers. Per layer:

with bias: a·b + b
without bias: a·b
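Written as a single formula, assuming the layer sizes are labelled n_0 (inputs) through n_L (outputs), a notation introduced here only for illustration:

```latex
P_{\text{with bias}} = \sum_{l=1}^{L} \left( n_{l-1}\, n_l + n_l \right),
\qquad
P_{\text{without bias}} = \sum_{l=1}^{L} n_{l-1}\, n_l
```

The two totals differ by the sum of n_1 through n_L, that is, one bias for every non-input unit.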

The classic activation choices are sigmoid (squashes to (0, 1)), tanh (squashes to (−1, 1)), and ReLU = max(0, x). ReLU is cheap to compute and, because its gradient is exactly 1 for every positive input, largely free of the vanishing-gradient problem. It is continuous everywhere but NOT differentiable at x = 0 (the kink). What matters in practice is how each function shapes both its input and its gradient.
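The kink at 0 can be seen numerically by comparing one-sided slopes. The helper below is a hypothetical illustration: the left and right difference quotients of ReLU at 0 stay at 0 and 1 however small the step, so no single derivative exists there.

```python
def relu(x):
    return max(0.0, x)

def one_sided_slopes(f, x, h):
    """Return (left, right) difference quotients of f at x for a step h > 0."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right

for h in (1e-1, 1e-3, 1e-6):
    print(h, one_sided_slopes(relu, 0.0, h))   # always (0.0, 1.0)
```

ReLU is still continuous at 0, since both sides approach the value 0. In practice, implementations simply pick a value for the derivative at exactly 0 (commonly 0), which rarely matters because an input of exactly 0 is rare.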

It's the gradient that makes or breaks an activation

Plot each activation's value f(x) and its derivative f′(x) on the same axes. Move x into the tails of sigmoid or tanh and the gradient collapses toward 0; on ReLU, move x below 0 and you land in the dead zone, where the gradient is exactly 0. That gradient story is why ReLU, GELU, and SiLU took over deep nets.

Sigmoid: saturating, gradient vanishes in the tails

formula: σ(x) = 1 / (1 + e⁻ˣ)
range: (0, 1)
saturates? yes
zero-centered? no
dead zone? no
example reading at x = 2.4: f(x) = 0.917, f′(x) = 0.076

Sigmoid squashes to (0, 1). Both tails saturate, so f′ collapses toward 0: gradients vanish and deep stacks barely learn. It lives on now as a binary-output activation, not a hidden one.
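A short sketch reproduces the sigmoid reading at x = 2.4 and shows how quickly the gradient shrinks further out. It relies on the standard identity σ′(x) = σ(x)(1 − σ(x)); the other sample points are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)    # derivative of sigmoid: s(x) * (1 - s(x))

for x in (0.0, 2.4, 5.0, 10.0):
    print(f"x={x:5.1f}  f={sigmoid(x):.3f}  f'={sigmoid_grad(x):.6f}")

# The derivative never exceeds 0.25 (its value at x = 0), so a chain of
# n sigmoid layers scales a gradient by at most 0.25 ** n from this factor alone.
print(0.25 ** 10)
```

The weights along the path also scale the gradient, so 0.25 ** n bounds only the activation's contribution, but it already shows why deep sigmoid stacks learn slowly.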

How GATE asks this

The signature question is a NAT (numerical answer type): an architecture is given as a chain of layer sizes (e.g. 30 → 4 → 3 → 1) and you count the trainable parameters. The one thing that trips students is bias: the question states whether biases are included, and you must read it. MCQ/MSQ (multiple-choice and multiple-select) items test activation properties: ReLU’s non-differentiability at 0, and why a nonlinearity is needed at all.

Worked example — the no-bias and with-bias cases

Count the trainable parameters of two networks: 30 → 4 → 3 → 1 with no bias (a real GATE DA 2026 question), and 5 → 10 → 3 with bias.

(GATE DA 2026) Network 30 → 4 → 3 → 1, no bias. Multiply consecutive layer sizes and add:

weights = 30·4 + 4·3 + 3·1
        = 120  + 12  + 3
        = 135

So 135 trainable parameters.

Same idea, now 5 → 10 → 3 with bias. Each layer adds one bias per output unit:

layer 1 (5 → 10): 5·10 + 10 = 50 + 10 = 60
layer 2 (10 → 3): 10·3 +  3 = 30 +  3 = 33
total = 60 + 33 = 93

So 93 trainable parameters. The same logic, written once as a function:

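def params(sizes, bias=True):
    """Count the trainable parameters of an MLP from its layer sizes."""
    total = 0
    for a, b in zip(sizes, sizes[1:]):     # consecutive layer sizes
        total += a * b + (b if bias else 0)   # a*b weights, plus b biases if used
    return total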

print("2026  30->4->3->1, no bias:", params([30, 4, 3, 1], bias=False))
print("with bias  5->10->3       :", params([5, 10, 3],    bias=True))

2026  30->4->3->1, no bias: 135
with bias  5->10->3       : 93
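As a cross-check on the closed-form count, one can allocate the weight matrices and bias vectors and count the stored numbers directly. This is a sketch; `build_mlp` and `count_entries` are hypothetical helper names introduced for illustration.

```python
def build_mlp(sizes, bias=True):
    """Allocate zero-filled weight matrices (and bias vectors) for each layer."""
    layers = []
    for a, b in zip(sizes, sizes[1:]):
        W = [[0.0] * a for _ in range(b)]    # b rows of a weights each
        bvec = [0.0] * b if bias else []     # one bias per unit, if used
        layers.append((W, bvec))
    return layers

def count_entries(layers):
    """Count every stored number: all weights plus all biases."""
    return sum(sum(len(row) for row in W) + len(bvec) for W, bvec in layers)

with_bias = count_entries(build_mlp([5, 10, 3], bias=True))
without_bias = count_entries(build_mlp([5, 10, 3], bias=False))
print(with_bias, without_bias, with_bias - without_bias)  # 93 80 13
```

The gap of 13 is 10 + 3, one bias for each unit outside the input layer, which is why reading the bias condition in the question matters.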

In one breath

A multi-layer perceptron stacks fully-connected layers — a layer from a inputs to b units holds a·b weights plus b biases — and inserts a nonlinear activation (sigmoid → (0,1), tanh → (−1,1), ReLU = max(0,x)) between them, without which the whole stack collapses to a single linear map; total trainable parameters are the sum over layers of a·b (+ b if biased), and the one activation fact GATE checks is that ReLU is continuous everywhere but not differentiable at 0.

Practice

Quick check

Q1 (Recall): Which statements about activation functions are TRUE? (Select all that apply.)

Q2 (Recall): Why does an MLP need a nonlinear activation between its layers?

Q3 (Trace): An MLP has architecture 4 → 6 → 2 with bias terms. How many trainable parameters does it have? (Numerical answer.)

Q4 (Trace): An MLP has architecture 10 → 5 → 1 with NO bias. How many trainable parameters? (Numerical answer.)

Q5 (Trace): Take the network 8 → 4 → 4 → 2. How many MORE parameters does it have WITH bias than WITHOUT bias? (Numerical answer.)

Q6 (Apply): A network is 100 → 50 → 10 with bias. Trainable parameters? (Numerical answer.)

A question to carry forward

So an MLP with a hidden layer and a smooth activation can carve XOR, and any region you like — and it can carry thousands of weights doing it. Which raises the obstacle that stalled neural networks for two decades: how do you train all those weights? Gradient descent needs ∂L/∂w for every single one, and the loss reaches each weight only after threading through layer upon layer of mixing and squashing.

Computing those gradients one weight at a time, from scratch, would be hopeless. But the chain rule from calculus suggests a shortcut — if you knew the gradient at a layer’s output, could you cheaply push it back to the layer’s inputs, and keep pushing, layer by layer, all the way to the first weight? Here is the thread onward: how does walking the chain rule backward through the network’s computation graph hand you every weight’s gradient in a single sweep — and what is the one local derivative the ReLU contributes as the gradient passes through it?
