Perceptron & the Update Rule
The original neural unit: predict sign(wᵀx + b), then nudge the weights toward every misclassified point until the classes are separated.
What you'll learn
- The perceptron predicts the sign of the linear score: ŷ = sign(wᵀx + b)
- It learns by the update rule w ← w + η(y − ŷ)x, applied only to misclassified points
- Each update rotates the decision boundary toward classifying the missed point correctly
- It converges only if the data is linearly separable — a single layer cannot solve XOR
Before you start
The perceptron is the original artificial neuron — and the ancestor of every neural
network. Its prediction is brutally simple: compute the linear score z = wᵀx + b,
then output its sign. If z is positive, predict +1; otherwise predict
−1. The boundary z = 0 is a line in two dimensions (a hyperplane in general), just as in logistic regression, but instead of
a smooth probability, the perceptron commits to a hard ±1.
What makes it historically important is how it learns. There is no calculus and no probability — just a tiny correction applied every time it gets a point wrong, repeated until it stops making mistakes (if it can).
Predict with a sign, learn from mistakes
The prediction:
ŷ = sign(wᵀx + b) → +1 if wᵀx + b > 0, −1 otherwise
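As a minimal sketch, the prediction rule can be written in plain Python. The helper names `score` and `predict` and the sample weights are illustrative assumptions, not part of the lesson; a score of exactly zero maps to −1, following the "−1 otherwise" convention above.

```python
def score(w, b, x):
    """Linear score z = w.x + b for equal-length sequences w and x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Return +1 if the score is strictly positive, otherwise -1."""
    return 1 if score(w, b, x) > 0 else -1

# Illustrative (hypothetical) weights and points
w, b = (1.0, -1.0), 0.5
print(predict(w, b, (2.0, 1.0)))  # score = 2 - 1 + 0.5 = 1.5  -> prints 1
print(predict(w, b, (0.0, 2.0)))  # score = 0 - 2 + 0.5 = -1.5 -> prints -1
```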
The learning rule walks through the training points. When the prediction ŷ
matches the true label y, do nothing. When it is wrong, nudge the weights:
w ← w + η(y − ŷ)x
The key is the error term (y − ŷ). When the prediction is right, y − ŷ = 0 and
the weights don’t move. When it is wrong — say y = +1 but ŷ = −1 —
y − ŷ = +2, so we add a multiple of x to w. That pushes the score wᵀx
up for this exact point, dragging it toward the positive side. Geometrically, each
update rotates the decision boundary toward correctly classifying the point it
just missed.
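A single update step might look like the sketch below. The function name `update` is hypothetical, and the bias update `b + η(y − ŷ)` is an assumption: the rule above is stated for w only, and treating the bias as a weight on a constant input of 1 is a common convention.

```python
def predict(w, b, x):
    """Return +1 if w.x + b > 0, otherwise -1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else -1

def update(w, b, x, y, eta=1.0):
    """Apply one perceptron step to the point (x, y); returns (new_w, new_b)."""
    y_hat = predict(w, b, x)
    error = y - y_hat  # 0 when correct, +2 or -2 when wrong
    new_w = [wi + eta * error * xi for wi, xi in zip(w, x)]
    # Assumption: the bias is updated like a weight on a constant input of 1.
    new_b = b + eta * error
    return new_w, new_b

# Misclassified point: score is 0, so y_hat = -1 while y = +1; the weights move.
w1, b1 = update([0.0, 0.0], 0.0, (1, 2), y=1)
# Correctly classified point: the error is 0, so nothing changes.
w2, b2 = update(w1, b1, (1, 2), y=1)
```

Because the error term is zero for correct predictions, the same function covers both cases without an explicit "only if misclassified" branch.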
This repeats over the data. The Perceptron Convergence Theorem guarantees the process halts with zero errors — but only if the classes are linearly separable. If no straight line can separate them, the perceptron never settles.
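The loop can be sketched as follows on a small linearly separable dataset (logical AND with ±1 labels). The function name `train`, the zero initialisation, the bias update, and the `max_epochs` cap are all assumptions made for illustration; the cap exists only so the loop also terminates on data that is not separable.

```python
def predict(w, b, x):
    """Return +1 if w.x + b > 0, otherwise -1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else -1

def train(points, labels, eta=1.0, max_epochs=100):
    """Repeat passes over the data until one full pass makes no mistakes.

    Returns (w, b, epochs_used); epochs_used is None if the cap was hit.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x, y in zip(points, labels):
            y_hat = predict(w, b, x)
            if y_hat != y:
                step = eta * (y - y_hat)
                w = [wi + step * xi for wi, xi in zip(w, x)]
                b += step  # assumption: bias treated as a weight on input 1
                mistakes += 1
        if mistakes == 0:
            return w, b, epoch
    return w, b, None

# Logical AND with labels in {-1, +1}: linearly separable.
AND_POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_LABELS = [-1, -1, -1, 1]

w, b, epochs = train(AND_POINTS, AND_LABELS)
print(w, b, epochs)
```

A pass with zero mistakes means the weights did not change during that pass, so every later pass would be identical; that is why the loop can stop there.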
How GATE asks this
Almost always an MCQ or NAT asking for the effect of a single update: given
w, b, a misclassified point x, its true label y, and the learning rate η,
compute the new weights or show that the score wᵀx moves toward the correct
side. This single-neuron update is also the building block behind the neural-network
questions GATE DA has asked every year (2024–2026). A conceptual variant asks why a
single-layer perceptron cannot learn XOR (the answer: XOR is not linearly
separable).
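The XOR claim can be checked empirically with a short sketch. The function name `mistakes_per_epoch`, the epoch count, and the bias update are illustrative assumptions. Since no single line separates XOR, every pass over the four points must contain at least one mistake.

```python
def predict(w, b, x):
    """Return +1 if w.x + b > 0, otherwise -1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else -1

def mistakes_per_epoch(points, labels, eta=1.0, epochs=50):
    """Run the perceptron rule for a fixed number of passes; record mistakes per pass."""
    w = [0.0] * len(points[0])
    b = 0.0
    history = []
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            y_hat = predict(w, b, x)
            if y_hat != y:
                step = eta * (y - y_hat)
                w = [wi + step * xi for wi, xi in zip(w, x)]
                b += step  # assumption: bias treated as a weight on input 1
                mistakes += 1
        history.append(mistakes)
    return history

# XOR with labels in {-1, +1}: not linearly separable.
XOR_POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
XOR_LABELS = [-1, 1, 1, -1]

history = mistakes_per_epoch(XOR_POINTS, XOR_LABELS)
print(min(history))  # never reaches 0, however many epochs are run
```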
Worked example — one update step
Current weights: w = (1, 0), bias ignored for this step. A point x = (2, 1) has true label y = +1, but the perceptron currently predicts ŷ = −1. With learning rate η = 1, find the new weights and check that wᵀx improves.
First check the old score:
wᵀx = (1)(2) + (0)(1) = 2
On its own, wᵀx = 2 > 0 would predict +1, so the stated prediction ŷ = −1 must
come from the bias (e.g. b = −3 gives z = 2 − 3 = −1 < 0). Either way the
point is misclassified, so apply the update. The error term is
y − ŷ = +1 − (−1) = 2:
w_new = w + η·(y − ŷ)·x
= (1, 0) + 1 · 2 · (2, 1)
= (1, 0) + (4, 2)
= (5, 2)
Now recompute the score for the same point with the new weights:
w_newᵀx = (5)(2) + (2)(1) = 10 + 2 = 12
The score wᵀx jumped from 2 to 12, an increase of η(y − ŷ)‖x‖² = 1 · 2 · 5 = 10. Even with
the example bias b = −3 left unchanged, z = 12 − 3 = 9 > 0, so the update pushed
this point firmly onto the correct (positive) side, exactly as intended. The weight
vector grew in the direction of x, which is what rotates the boundary toward the
missed point.
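The arithmetic of this worked example can be reproduced in a few lines. The variable names and the `dot` helper are illustrative; the numbers are the ones given in the example, with the prediction ŷ = −1 taken as stated rather than recomputed.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

w = (1, 0)    # current weights
x = (2, 1)    # misclassified point
y = 1         # true label
y_hat = -1    # prediction as stated in the example
eta = 1       # learning rate

# w_new = w + eta * (y - y_hat) * x
w_new = tuple(wi + eta * (y - y_hat) * xi for wi, xi in zip(w, x))

old_score = dot(w, x)
new_score = dot(w_new, x)
gain = eta * (y - y_hat) * dot(x, x)  # the score rises by eta * (y - y_hat) * ||x||^2

print(w_new)      # (5, 2)
print(old_score)  # 2
print(new_score)  # 12
print(gain)       # 10
```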
Quick check
Practice this in an interview
A neuron takes a weighted sum of its inputs, adds a bias, and passes the result through an activation function. The weights encode learned feature importance, the bias shifts the decision boundary, and the activation introduces the non-linearity needed for complex mappings.
The forward pass feeds an input through every layer in sequence: each layer computes a linear transform followed by an activation, caching the intermediate values needed later for backpropagation. The final layer produces a prediction, which is compared to the label via a loss function.
Backpropagation is the algorithm that computes the gradient of the loss with respect to every parameter by applying the chain rule layer by layer in reverse. It turns a single backward pass through the computation graph into exact gradients for all weights simultaneously.
Dropout randomly zeroes each neuron's output with probability p during training, forcing the network to learn redundant representations and preventing co-adaptation of neurons. At inference, dropout is disabled and all neurons are active. To keep expected activations consistent between the two modes, inverted dropout (the common implementation) scales the surviving outputs by 1/(1−p) during training, so no rescaling is needed at inference. Forgetting to switch modes produces incorrect, noisy predictions.