datarekha

Multi-Layer Perceptron & Activations

An MLP stacks fully-connected layers with nonlinear activations. Counting its trainable parameters is a recurring GATE DA NAT — once with bias, once without.

9 min read Intermediate GATE DA Lesson 92 of 122

What you'll learn

  • An MLP stacks fully-connected layers; nonlinear activations stop them collapsing
  • A layer from a inputs to b units has a·b weights plus b biases
  • Counting total trainable parameters, with and without bias terms
  • Sigmoid, tanh, ReLU = max(0,x); ReLU is continuous but not differentiable at 0

A multi-layer perceptron (MLP) is the plainest neural network: a stack of fully-connected layers, each one taking the previous layer’s outputs, mixing them with a weight matrix, adding a bias, and passing the result through a nonlinear activation. Logistic regression is the one-layer special case with a single sigmoid unit; the MLP stacks more layers and more units. These fully-connected-plus-activation blocks appear in most deep networks you will train, from a tabular classifier to the feed-forward layers inside a transformer, so counting their parameters is one of the first things you do when sizing a model to fit memory.

Layers, weights, and biases

A layer that maps a inputs to b units has a weight for every input-to-unit connection — a·b of them — plus one bias per unit, so b biases.

[Figure: a 3 → 4 → 2 network. input (3) → hidden (4) → output (2); 3×4 = 12 weights into the hidden layer, 4×2 = 8 weights into the output layer.]
Every arrow is one weight. Add one bias per unit in a layer if biases are used.

The total number of trainable parameters is the sum over layers. Per layer:

  • with bias: a·b + b
  • without bias: a·b
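The per-layer rule can be sketched as a short function. This is a minimal illustration, not a library API: the name `count_parameters` and its `bias` flag are assumptions made for this example, and the 3 → 4 → 2 network is the one from the figure.

```python
def count_parameters(layer_sizes, bias=True):
    """Count trainable parameters of a fully-connected network.

    layer_sizes: e.g. [3, 4, 2] for the chain 3 -> 4 -> 2.
    (Hypothetical helper written for this lesson.)
    """
    total = 0
    # Pair each layer size with the next one: (3, 4), (4, 2), ...
    for a, b in zip(layer_sizes, layer_sizes[1:]):
        total += a * b      # one weight per input-to-unit connection
        if bias:
            total += b      # one bias per unit in the receiving layer
    return total


print(count_parameters([3, 4, 2], bias=False))  # 12 + 8 = 20
print(count_parameters([3, 4, 2], bias=True))   # 20 + 4 + 2 = 26
```

Note that the input layer never contributes biases: only layers that receive connections have them.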

The activation between layers is what makes the stack expressive. Without a nonlinearity, two stacked linear layers W₂(W₁x) collapse into a single linear map (W₂W₁)x — no more powerful than one layer. The nonlinearity is what lets an MLP bend decision boundaries.
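The collapse can be checked numerically. The sketch below assumes NumPy is available; the matrices `W1`, `W2` and the input `x` are small made-up values chosen so that one hidden pre-activation is negative, which is what lets ReLU change the result.

```python
import numpy as np

# Made-up example values: 3 inputs -> 4 hidden units -> 2 outputs, no bias.
W1 = np.array([[1.0, 0.0, -1.0],
               [0.0, 1.0, 1.0],
               [2.0, -1.0, 0.0],
               [-1.0, 1.0, 1.0]])      # shape (4, 3)
W2 = np.array([[1.0, 1.0, 0.0, -1.0],
               [2.0, 0.0, 1.0, 1.0]])  # shape (2, 4)
x = np.array([1.0, 2.0, 3.0])

two_layers = W2 @ (W1 @ x)   # two linear layers applied in sequence
one_layer = (W2 @ W1) @ x    # one layer using the merged matrix W2 W1
linear_collapses = np.allclose(two_layers, one_layer)

# Insert a ReLU between the layers: the merged matrix no longer matches.
with_relu = W2 @ np.maximum(0.0, W1 @ x)
relu_differs = not np.allclose(with_relu, one_layer)

print(two_layers, one_layer, with_relu)
print(linear_collapses, relu_differs)
```

Here the hidden pre-activations are (−2, 5, 0, 4). ReLU zeroes the −2, so the output moves from (−1, 0) to (1, 4), which no single merged matrix applied to `x` reproduces for all inputs.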

The classic choices: sigmoid (squashes to (0, 1)), tanh (squashes to (−1, 1)), and ReLU = max(0, x). ReLU is cheap to compute and, because its gradient is exactly 1 for every positive input, it suffers far less from vanishing gradients than sigmoid or tanh. It is continuous everywhere but NOT differentiable at x = 0 (the kink): the slope is 0 just to the left of zero and 1 just to the right.
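A minimal sketch of the three activations in plain Python, with a one-sided difference quotient to expose the ReLU kink. The helper name `slope` and the step size `1e-6` are assumptions made for this illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def slope(f, x, h):
    """One-sided difference quotient (f(x + h) - f(x)) / h; h may be negative."""
    return (f(x + h) - f(x)) / h


# Output ranges: sigmoid stays in (0, 1), tanh in (-1, 1), ReLU in [0, inf).
for x in (-5.0, 0.0, 5.0):
    print(x, sigmoid(x), math.tanh(x), relu(x))

# At x = 0 the two one-sided slopes of ReLU disagree, so no derivative exists there.
left = slope(relu, 0.0, -1e-6)   # approaching 0 from the left
right = slope(relu, 0.0, 1e-6)   # approaching 0 from the right
print(left, right)
```

The left slope comes out as 0 and the right slope as 1. Both one-sided limits exist, so ReLU is continuous at 0, but they differ, so it is not differentiable there.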

How GATE asks this

The signature question is a NAT: an architecture is given as a chain of layer sizes (e.g. 30 → 4 → 3 → 1) and you count the trainable parameters. The single thing that trips students is bias — the question states whether biases are included, and you must read it. MCQ/MSQ items test activation properties: ReLU’s non-differentiability at 0, and why a nonlinearity is needed at all.

Worked example — the no-bias and with-bias cases

Count layer by layer. The first network is a real GATE DA question; the second shows the with-bias variant on the same kind of architecture.

(GATE DA 2026) Network 30 → 4 → 3 → 1, no bias. Multiply consecutive layer sizes and add:

weights = 30·4 + 4·3 + 3·1
        = 120  + 12  + 3
        = 135

So 135 trainable parameters.

Same idea, now 5 → 10 → 3 with bias. Each layer adds one bias per output unit:

layer 1 (5 → 10): 5·10 + 10 = 50 + 10 = 60
layer 2 (10 → 3): 10·3 +  3 = 30 +  3 = 33
total = 60 + 33 = 93

So 93 trainable parameters.
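Both totals can be cross-checked by actually allocating the arrays and counting their entries. This is a sketch assuming NumPy; `build_layers` and `total_size` are hypothetical helpers written for this check, and the arrays are zero-filled because only their shapes matter.

```python
import numpy as np


def build_layers(layer_sizes, bias=True):
    """Allocate a zero weight matrix (and optional bias vector) per layer."""
    layers = []
    for a, b in zip(layer_sizes, layer_sizes[1:]):
        W = np.zeros((a, b))                # a*b weights
        c = np.zeros(b) if bias else None   # b biases, or none at all
        layers.append((W, c))
    return layers


def total_size(layers):
    """Sum the number of entries across all weight and bias arrays."""
    return sum(W.size + (0 if c is None else c.size) for W, c in layers)


print(total_size(build_layers([30, 4, 3, 1], bias=False)))  # 135
print(total_size(build_layers([5, 10, 3], bias=True)))      # 93
```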

Quick check

Q1. An MLP has architecture 4 → 6 → 2 with bias terms. How many trainable parameters does it have? (numerical answer)
Q2. An MLP has architecture 10 → 5 → 1 with NO bias. How many trainable parameters? (numerical answer)
Q3. Take the network 8 → 4 → 4 → 2. How many MORE parameters does it have WITH bias than WITHOUT bias? (numerical answer)
Q4. Which statements about activation functions are TRUE? (select all that apply)
Q5. Why does an MLP need a nonlinear activation between its layers?
Q6. A network is 100 → 50 → 10 with bias. Trainable parameters? (numerical answer)
