Multi-Layer Perceptron & Activations
An MLP stacks fully-connected layers with nonlinear activations. Counting its trainable parameters is a recurring GATE DA NAT (numerical answer type) question, asked both with and without bias terms.
What you'll learn
- An MLP stacks fully-connected layers; nonlinear activations stop them collapsing
- A layer from a inputs to b units has a·b weights plus b biases
- Counting total trainable parameters, with and without bias terms
- Sigmoid, tanh, ReLU = max(0,x); ReLU is continuous but not differentiable at 0
Before you start
A multi-layer perceptron (MLP) is the plainest neural network: a stack of fully-connected layers, each one taking the previous layer’s outputs, mixing them with a weight matrix, adding a bias, and passing the result through a nonlinear activation. Logistic regression is the one-layer special case (a single unit with a sigmoid activation); the MLP just stacks more layers. Fully-connected-plus-activation blocks appear throughout deep learning, from tabular classifiers to the dense layers inside a transformer, so counting their parameters is one of the first things you do when sizing a model to fit memory.
Layers, weights, and biases
A layer that maps a inputs to b units has a weight for every input-to-unit
connection — a·b of them — plus one bias per unit, so b biases.
Total trainable parameters is the sum over layers. Per layer:
- with bias: a·b + b
- without bias: a·b
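The per-layer rule can be turned into a short counting routine. This is a minimal sketch: the function names `layer_params` and `count_params` and the toy architecture 2 → 3 → 1 are illustrative choices, not part of the original lesson.

```python
def layer_params(a, b, bias=True):
    """Parameters in one fully-connected layer from a inputs to b units."""
    weights = a * b             # one weight per input-to-unit connection
    biases = b if bias else 0   # one bias per unit, only if biases are used
    return weights + biases


def count_params(sizes, bias=True):
    """Total trainable parameters for a chain of layer sizes, e.g. [2, 3, 1]."""
    # Pair each layer size with the next one: (2, 3), (3, 1), ...
    return sum(layer_params(a, b, bias) for a, b in zip(sizes, sizes[1:]))


print(count_params([2, 3, 1], bias=True))   # (2*3 + 3) + (3*1 + 1) = 13
print(count_params([2, 3, 1], bias=False))  # 2*3 + 3*1 = 9
```

Note that the input layer itself contributes no parameters; only the connections between consecutive layers do.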
The activation between layers is what makes the stack expressive. Without a
nonlinearity, two stacked linear layers W₂(W₁x) collapse into a single linear
map (W₂W₁)x — no more powerful than one layer. The nonlinearity is what lets
an MLP bend decision boundaries.
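The collapse can be checked numerically. The sketch below uses small hand-picked integer matrices (an assumption made purely for illustration) and plain Python lists, and compares W₂(W₁x) with (W₂W₁)x, then repeats the stacked computation with a ReLU between the layers.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both given as lists of rows)."""
    cols = list(zip(*B))  # columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Apply ReLU element-wise to a vector."""
    return [max(0, t) for t in v]


W1 = [[1, -2], [3, 1], [-1, 2]]  # 2 inputs -> 3 units
W2 = [[2, 1, -1]]                # 3 inputs -> 1 unit
x = [1, 1]

stacked = matvec(W2, matvec(W1, x))          # W2(W1 x)
collapsed = matvec(matmul(W2, W1), x)        # (W2 W1) x
with_relu = matvec(W2, relu(matvec(W1, x)))  # W2 relu(W1 x)

print(stacked, collapsed)  # both are [1]: the two linear layers act as one
print(with_relu)           # [3]: the ReLU network is not that single linear map
```

A single input does not prove nonlinearity in general, but it does show that inserting the ReLU produces a function different from the collapsed matrix W₂W₁.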
The classic choices: sigmoid (squashes to (0, 1)), tanh (squashes to
(−1, 1)), and ReLU = max(0, x). ReLU is cheap to compute and, because its
gradient is exactly 1 for positive inputs, largely free of the
vanishing-gradient problem. It is continuous everywhere but NOT differentiable
at x = 0 (the kink).
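To see these properties concretely, here is a small sketch using only Python's `math` module. The helper `slope` (a one-sided finite difference) and the step size 1e-6 are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    return max(0.0, x)


def slope(f, x, h):
    """One-sided finite-difference slope of f at x; h may be negative."""
    return (f(x + h) - f(x)) / h


# Output ranges: sigmoid stays in (0, 1), tanh in (-1, 1), ReLU in [0, inf).
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), math.tanh(x), relu(x))

# Sigmoid's gradient peaks at 0.25 and shrinks toward 0 for large |x|.
print(sigmoid_grad(0.0), sigmoid_grad(10.0))

# ReLU at the kink: the slope from the left is 0, from the right is 1.
left = slope(relu, 0.0, -1e-6)
right = slope(relu, 0.0, 1e-6)
print(left, right)
```

Because the left and right slopes disagree, no single derivative exists at x = 0, even though the function has no jump there.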
How GATE asks this
The signature question is a NAT: an architecture is given as a chain of
layer sizes (e.g. 30 → 4 → 3 → 1) and you count the trainable parameters.
The single thing that trips students is bias — the question states whether
biases are included, and you must read it. MCQ/MSQ items test activation
properties: ReLU’s non-differentiability at 0, and why a nonlinearity is
needed at all.
Worked example — the no-bias and with-bias cases
Count layer by layer. The first network is a real GATE DA question; the second shows the with-bias variant on the same kind of architecture.
(GATE DA 2026) Network 30 → 4 → 3 → 1, no bias. Multiply consecutive
layer sizes and add:
weights = 30·4 + 4·3 + 3·1
= 120 + 12 + 3
= 135
So 135 trainable parameters.
Same idea, now 5 → 10 → 3 with bias. Each layer adds one bias per
output unit:
layer 1 (5 → 10): 5·10 + 10 = 50 + 10 = 60
layer 2 (10 → 3): 10·3 + 3 = 30 + 3 = 33
total = 60 + 33 = 93
So 93 trainable parameters.
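A quick way to cross-check both answers is to count weights and biases separately: the bias total is simply the sum of every layer size except the input. The sketch below applies this to the two architectures above; the helper names `weight_count` and `bias_count` are illustrative.

```python
def weight_count(sizes):
    """Sum of products of consecutive layer sizes."""
    return sum(a * b for a, b in zip(sizes, sizes[1:]))


def bias_count(sizes):
    """One bias per unit in every layer except the input layer."""
    return sum(sizes[1:])


for sizes in ([30, 4, 3, 1], [5, 10, 3]):
    w, b = weight_count(sizes), bias_count(sizes)
    print(sizes, "no bias:", w, "with bias:", w + b)
```

For 30 → 4 → 3 → 1 this gives 135 weights and 4 + 3 + 1 = 8 biases, so the with-bias version of the same network would have 143 parameters. For 5 → 10 → 3 it gives 80 weights and 13 biases, matching the 93 computed above.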
Quick check
Practice this in an interview
Without a non-linear activation, any stack of linear layers collapses to a single linear transformation, giving a model no more expressive than logistic regression. Activation functions break linearity so the network can approximate arbitrarily complex functions.
A neuron takes a weighted sum of its inputs, adds a bias, and passes the result through an activation function. The weights encode learned feature importance, the bias shifts the decision boundary, and the activation introduces the non-linearity needed for complex mappings.
The forward pass feeds an input through every layer in sequence: each layer computes a linear transform followed by an activation, caching the intermediate values needed later for backpropagation. The final layer produces a prediction, which is compared to the label via a loss function.
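The forward pass described above can be sketched in plain Python. Everything concrete here is an assumption for illustration: the tiny 2 → 2 → 1 network, its hand-picked weights, the choice of ReLU then sigmoid, and binary cross-entropy as the loss.

```python
import math


def dense(W, b, x):
    """Linear transform: one weighted sum plus bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, t) for t in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-t)) for t in v]


def forward(layers, x):
    """Run x through each (W, b, activation) layer, caching intermediates."""
    cache = []
    a = x
    for W, b, act in layers:
        z = dense(W, b, a)
        cache.append((a, z))  # layer input and pre-activation, kept for backprop
        a = act(z)
    return a, cache


def binary_cross_entropy(p, y):
    """Loss comparing predicted probability p with label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),  # 2 inputs -> 2 units
    ([[1.0, -2.0]], [0.0], sigmoid),                 # 2 inputs -> 1 unit
]

pred, cache = forward(layers, [2.0, 1.0])
loss = binary_cross_entropy(pred[0], 1.0)
print(pred, loss)
```

With these weights the hidden pre-activations are [1.0, 0.5], the output pre-activation is 0, and the prediction is 0.5, so the loss against label 1 is ln 2.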
Sigmoid squashes to (0,1) and saturates at extremes, causing vanishing gradients. Tanh is zero-centered but still saturates. ReLU avoids saturation for positive inputs and trains fast but can produce dead neurons. Leaky ReLU keeps a small nonzero slope for negative inputs, which mitigates dying neurons. GELU is smooth, weighting each input by the standard normal CDF, and is widely used in transformer architectures.
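For reference, the two newer activations mentioned here are short to write down. This is a sketch under stated assumptions: the negative slope `alpha=0.01` is a commonly used value rather than something fixed by the text, and `gelu` uses the exact form x·Φ(x), where Φ is the standard normal CDF computed with `math.erf`.

```python
import math


def leaky_relu(x, alpha=0.01):
    """ReLU with a small slope alpha for negative inputs (alpha is an assumed default)."""
    return x if x > 0 else alpha * x


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, leaky_relu(x), gelu(x))
```

Unlike ReLU, Leaky ReLU passes a small nonzero gradient for negative inputs, and GELU has no kink at 0: it is slightly negative for moderately negative inputs and approaches x for large positive ones.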