Walk me through the forward pass of a neural network end-to-end.
The forward pass feeds an input through every layer in sequence: each layer computes a linear transform followed by an activation, caching the intermediate values needed later for backpropagation. The final layer produces a prediction, which is compared to the label via a loss function.
How to think about it
Given an L-layer network, layer l computes:
z^[l] = W^[l] · a^[l-1] + b^[l]
a^[l] = f^[l](z^[l])
where a^[0] = x (the raw input). The final layer produces a^[L], the network’s prediction.
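The recurrence above can be sketched as an explicit loop. This is a minimal illustration, not a production implementation: the `forward` helper, the `cache` dictionary, the layer widths, and the choice of an identity activation on the last layer are all assumptions made for the example. Examples are stored as columns so the code matches z = W · a + b literally.

```python
import torch

def forward(x, params, activations):
    """Apply z = W a + b, then a = f(z), layer by layer, caching every z and a."""
    cache = {"a0": x}            # a^[0] is the raw input
    a = x
    for l, ((W, b), f) in enumerate(zip(params, activations), start=1):
        z = W @ a + b            # linear step for layer l
        a = f(z)                 # activation step for layer l
        cache[f"z{l}"] = z
        cache[f"a{l}"] = a
    return a, cache

torch.manual_seed(0)
sizes = [4, 3, 2]                # hypothetical widths: input 4, hidden 3, output 2
params = [(torch.randn(n_out, n_in), torch.zeros(n_out, 1))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
activations = [torch.relu, lambda z: z]   # identity on the final layer (assumption)

x = torch.randn(4, 5)            # 5 examples, one per column
a_L, cache = forward(x, params, activations)
print(a_L.shape)                 # torch.Size([2, 5])
print(sorted(cache))             # ['a0', 'a1', 'a2', 'z1', 'z2']
```

The cache holds exactly the values that backpropagation would later read: each layer's input a^[l-1] and pre-activation z^[l].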
Full mini-batch forward pass in PyTorch:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerNet(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)

    def forward(self, x):
        z1 = self.fc1(x)    # linear: W1·x + b1
        a1 = F.relu(z1)     # activation; kept in the graph for the backward pass
        z2 = self.fc2(a1)   # linear: W2·a1 + b2
        return z2           # logits (no softmax here; CrossEntropyLoss applies log-softmax itself)

model = TwoLayerNet(784, 256, 10)
x_batch = torch.randn(32, 784)           # placeholder mini-batch of 32 random inputs (stand-in for real data)
y_batch = torch.randint(0, 10, (32,))    # placeholder integer class labels in [0, 10)
logits = model(x_batch)                  # forward pass, shape (32, 10)
loss = nn.CrossEntropyLoss()(logits, y_batch)
PyTorch builds a computation graph dynamically during the forward pass, recording each operation performed on tensors that require gradients. Each intermediate tensor (z, a) carries a grad_fn that links it to the operation that produced it. loss.backward() traverses this graph in reverse to compute gradients, so nothing in the forward pass is “wasted”.
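One way to see the recorded graph is to inspect the `grad_fn` attribute of the intermediates. The sketch below uses small, arbitrary layer sizes chosen only for illustration; the exact class names printed for `grad_fn` depend on the PyTorch version, so no specific output is claimed for them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
fc1, fc2 = nn.Linear(4, 3), nn.Linear(3, 2)   # illustrative sizes
x = torch.randn(5, 4)                          # plain input, does not require grad

z1 = fc1(x)
a1 = F.relu(z1)
z2 = fc2(a1)

# Each intermediate remembers the operation that produced it.
print(type(z1.grad_fn).__name__)   # name of the backward node for the linear step
print(type(a1.grad_fn).__name__)   # name of the backward node for ReLU
print(x.grad_fn)                   # None: x is a leaf, not the result of an operation

# Under torch.no_grad() nothing is recorded, so no graph is built.
with torch.no_grad():
    z1_no_graph = fc1(x)
print(z1_no_graph.requires_grad)   # False
```

This is also why inference code typically wraps the forward pass in `torch.no_grad()`: with no backward pass coming, there is no reason to keep the intermediates alive.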
Why cache intermediates? The gradient of the loss w.r.t. W^[l] requires a^[l-1], which was computed during the forward pass: for a single example, dL/dW^[l] = δ^[l] · (a^[l-1])^T, where δ^[l] = dL/dz^[l]. Storing these values avoids recomputation.
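The dependence on the cached activation can be checked numerically. The sketch below rebuilds the last layer's weight gradient by hand from a1 and compares it with what autograd computes. The layer sizes and random data are assumptions for illustration. Because `nn.Linear` stores examples as rows, the formula takes the form delta2^T · a1, and delta2 uses the standard result that the gradient of mean cross-entropy w.r.t. the logits is (softmax − one-hot) / batch size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
fc1, fc2 = nn.Linear(4, 3), nn.Linear(3, 2)   # illustrative sizes
x = torch.randn(5, 4)                          # 5 examples as rows
y = torch.randint(0, 2, (5,))                  # random class labels

# Forward pass: a1 is the intermediate the backward pass will need.
a1 = F.relu(fc1(x))
z2 = fc2(a1)
loss = F.cross_entropy(z2, y)                  # mean reduction by default
loss.backward()                                # autograd fills fc2.weight.grad

with torch.no_grad():
    # delta2 = dL/dz2 for mean cross-entropy: (softmax - one_hot) / batch size
    one_hot = F.one_hot(y, num_classes=2).float()
    delta2 = (F.softmax(z2, dim=1) - one_hot) / x.shape[0]
    # Weight gradient uses the cached activation a1; shape (2, 3) like fc2.weight
    manual_grad_W2 = delta2.T @ a1

print(torch.allclose(manual_grad_W2, fc2.weight.grad, atol=1e-6))  # True
```

Without a1 in memory, the only way to form this product would be to rerun the first layer.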