datarekha

Perceptron & the Update Rule

The original neural unit: predict sign(wᵀx + b), then nudge the weights toward every misclassified point until the classes are separated.

7 min read Intermediate GATE DA Lesson 91 of 122

What you'll learn

  • The perceptron predicts the sign of the linear score: ŷ = sign(wᵀx + b)
  • It learns by the update rule w ← w + η(y − ŷ)x, applied only to misclassified points
  • Each update rotates the decision boundary toward classifying the missed point correctly
  • It converges only if the data is linearly separable — a single layer cannot solve XOR

Before you start

The perceptron is the original artificial neuron — and the ancestor of every neural network. Its prediction is brutally simple: compute the linear score z = wᵀx + b, then output its sign. If z is positive, predict +1; if negative, predict −1. The boundary z = 0 is a line, just like logistic regression — but instead of a smooth probability, the perceptron commits to a hard ±1.

What makes it historically important is how it learns. There is no calculus and no probability — just a tiny correction applied every time it gets a point wrong, repeated until it stops making mistakes (if it can).

Predict with a sign, learn from mistakes

The prediction:

ŷ = sign(wᵀx + b)   →   +1 if wᵀx + b > 0,   −1 otherwise
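As a minimal sketch of this prediction rule, assuming w and x are plain Python lists of equal length (the helper name `predict` is illustrative, not part of the lesson):

```python
def predict(w, x, b=0.0):
    """Return +1 if the linear score w.x + b is strictly positive, else -1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else -1

print(predict([1, 0], [2, 1]))        # score 2 > 0   -> 1
print(predict([1, 0], [2, 1], b=-3))  # score -1 <= 0 -> -1
```

A score of exactly zero falls into the "−1 otherwise" branch, matching the rule as stated above.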

The learning rule walks through the training points. When the prediction ŷ matches the true label y, do nothing. When it is wrong, nudge the weights:

w ← w + η (y − ŷ) x
b ← b + η (y − ŷ)

Here η is the learning rate, and (y − ŷ) is 0 when correct, ±2 when wrong.
The error term (y − ŷ) is zero on correct points, so only mistakes change the weights.

The key is the error term (y − ŷ). When the prediction is right, y − ŷ = 0 and the weights don’t move. When it is wrong (say y = +1 but ŷ = −1), y − ŷ = +2, so we add a multiple of x to w. That pushes the score wᵀx up for this exact point, dragging it toward the positive side. Geometrically, each update rotates the decision boundary toward correctly classifying the point it just missed.
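One way to write a single update step is sketched below. It assumes list-valued w and x and labels in {+1, −1}; the function name `perceptron_update` is a hypothetical helper introduced only for illustration.

```python
def perceptron_update(w, b, x, y, eta=1.0):
    """Apply one perceptron update to (w, b) for the example (x, y).

    If x is already classified correctly the error term is 0,
    so the returned weights and bias equal the inputs.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    y_hat = 1 if z > 0 else -1
    error = y - y_hat  # 0 when correct, +2 or -2 when wrong
    new_w = [wi + eta * error * xi for wi, xi in zip(w, x)]
    new_b = b + eta * error
    return new_w, new_b

# A positive point that currently scores 2 - 3 = -1, so it is misclassified.
print(perceptron_update([1, 0], -3, [2, 1], y=1))  # ([5.0, 2.0], -1.0)
```

Because the error term multiplies the whole correction, no explicit "if misclassified" branch is needed in this form.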

[Figure: the old boundary (z = 0, solid line) and the boundary after the update (dashed line). The positive point (y = +1) sat on the wrong side of the solid line; after one update the boundary has rotated toward the missed point, so the point is now classified +.]

This repeats over the data. The Perceptron Convergence Theorem guarantees the process halts with zero errors — but only if the classes are linearly separable. If no straight line can separate them, the perceptron never settles.
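The contrast between separable and non-separable data can be sketched with a small training loop. The function name `train_perceptron`, the epoch cap `max_epochs`, and the AND/XOR toy datasets (with labels recoded to ±1) are assumptions made for this illustration; the cap exists only so the loop stops on data where the perceptron would otherwise cycle forever.

```python
def train_perceptron(data, eta=1.0, max_epochs=100):
    """Train on a list of (x, y) pairs with y in {+1, -1}.

    Returns (w, b, converged), where converged is True if some full pass
    over the data produced zero mistakes within max_epochs.
    """
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            y_hat = 1 if z > 0 else -1
            if y_hat != y:
                w = [wi + eta * (y - y_hat) * xi for wi, xi in zip(w, x)]
                b += eta * (y - y_hat)
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False

AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
XOR = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

print(train_perceptron(AND)[2])  # True: AND is linearly separable
print(train_perceptron(XOR)[2])  # False: no line separates XOR
```

Raising `max_epochs` does not help on XOR, since no choice of w and b classifies all four points correctly.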

How GATE asks this

Almost always an MCQ or NAT asking for the effect of a single update: given w, b, a misclassified point x, its true label y, and the learning rate η, compute the new weights or show that the score wᵀx moves toward the correct side. This single-neuron update is also the building block behind the neural-network questions GATE DA has asked every year (2024–2026). A conceptual variant asks why a single-layer perceptron cannot learn XOR (the answer: XOR is not linearly separable).
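For the "show that the score moves toward the correct side" variant, a useful identity follows directly from the update rule: w_newᵀx = wᵀx + η(y − ŷ)‖x‖², so the score on the updated point shifts by η(y − ŷ)‖x‖², which has the same sign as y on a mistake. A small sketch of that shift (the helper name `score_shift` is hypothetical, and the bias update is left out):

```python
def score_shift(x, y, y_hat, eta=1.0):
    """Change in w.x caused by one update on x: eta * (y - y_hat) * ||x||^2."""
    return eta * (y - y_hat) * sum(xi * xi for xi in x)

print(score_shift([2, 1], y=1, y_hat=-1))  # 1 * 2 * 5 = 10.0
print(score_shift([2, 1], y=1, y_hat=1))   # correct point, no shift: 0.0
```

If the bias is updated as well, the full score z = wᵀx + b shifts by an additional η(y − ŷ).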

Worked example — one update step

Current weights w = (1, 0); the bias update is ignored for this step. A point x = (2, 1) has true label y = +1, but the perceptron currently predicts ŷ = −1. With learning rate η = 1, find the new weights and check that wᵀx improves.

First compute the old score:

wᵀx = (1)(2) + (0)(1) = 2

On its own, wᵀx = 2 > 0 would predict +1, so the stated prediction ŷ = −1 implies a negative bias at work (for example, b = −3 gives z = 2 − 3 = −1 < 0). The point is therefore misclassified, so apply the update. The error term is y − ŷ = +1 − (−1) = 2:

w_new = w + η·(y − ŷ)·x
      = (1, 0) + 1 · 2 · (2, 1)
      = (1, 0) + (4, 2)
      = (5, 2)

Now recompute the score for the same point with the new weights:

w_newᵀx = (5)(2) + (2)(1) = 10 + 2 = 12

The score wᵀx jumped from 2 to 12. With the illustrative bias b = −3 held fixed, z = wᵀx + b moves from −1 to 9, so the point now sits firmly on the correct (positive) side, exactly as intended. The weight vector grew in the direction of x, which is what rotates the boundary toward the missed point.
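The arithmetic of this worked example can be checked with a few lines of plain Python. This is only a verification sketch; the helper `dot` and the variable names are chosen for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

w, x = [1, 0], [2, 1]
y, y_hat, eta = 1, -1, 1

# Weight update from the lesson: w_new = w + eta * (y - y_hat) * x
w_new = [wi + eta * (y - y_hat) * xi for wi, xi in zip(w, x)]

print(dot(w, x))      # 2
print(w_new)          # [5, 2]
print(dot(w_new, x))  # 12
```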

Quick check

Q1. Weights w = (1, 0), learning rate η = 1. A misclassified point x = (2, 1) has true label y = +1 and prediction ŷ = −1. After one update, what is the first component of the new weight vector w_new? (Numerical answer.)
Q2. Continuing the example, after the update to w = (5, 2), what is the new score wᵀx for the same point x = (2, 1)? (Numerical answer.)
Q3. When the perceptron's prediction ŷ already equals the true label y, what does the update rule w ← w + η(y − ŷ)x do?
Q4. Which statements about the single-layer perceptron are TRUE? (Select all that apply.)
Q5. A perceptron is trained on data that is NOT linearly separable. What happens?
Q6. Weights w = (0, 1), η = 1. A point x = (3, −2) with true label y = +1 is misclassified as ŷ = −1. What is the SECOND component of w_new? (Numerical answer.)

Practice this in an interview

What does a single artificial neuron (perceptron) actually compute?

A neuron takes a weighted sum of its inputs, adds a bias, and passes the result through an activation function. The weights encode learned feature importance, the bias shifts the decision boundary, and the activation introduces the non-linearity needed for complex mappings.

Walk me through the forward pass of a neural network end-to-end.

The forward pass feeds an input through every layer in sequence: each layer computes a linear transform followed by an activation, caching the intermediate values needed later for backpropagation. The final layer produces a prediction, which is compared to the label via a loss function.

What is backpropagation and how does the chain rule make it work?

Backpropagation is the algorithm that computes the gradient of the loss with respect to every parameter by applying the chain rule layer by layer in reverse. It turns a single backward pass through the computation graph into exact gradients for all weights simultaneously.

How does dropout work, and why must it behave differently during training and inference?

Dropout randomly zeroes each neuron's output with probability p during training, forcing the network to learn redundant representations and preventing co-adaptation of neurons. At inference, dropout is disabled and all neurons are active. To keep expected activations consistent between the two modes, the common "inverted dropout" implementation scales the surviving activations by 1/(1−p) during training, so no rescaling is needed at inference; the original formulation instead scales activations by (1−p) at inference. Forgetting to switch modes produces incorrect, noisy predictions.
