datarekha
Deep Learning · Easy · Asked at Amazon, Microsoft, Apple

Walk me through the forward pass of a neural network end-to-end.

The short answer

The forward pass feeds an input through every layer in sequence: each layer computes a linear transform followed by an activation, caching the intermediate values needed later for backpropagation. The final layer produces a prediction, which is compared to the label via a loss function.

How to think about it

Given an L-layer network, layer l computes:

z^[l] = W^[l] · a^[l-1] + b^[l]
a^[l] = f^[l](z^[l])

where a^[0] = x (the raw input). The final layer produces a^[L], the network’s prediction.
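The two equations above can be sketched as a plain loop over layers. This is a minimal illustration in standard-library Python for a single example, not how a framework implements it; the helper names (`linear`, `forward`), the `cache` layout, and the tiny hand-picked weights are all assumptions made for the sketch.

```python
def relu(z):
    return [max(0.0, v) for v in z]

def identity(z):
    return list(z)

def linear(W, a_prev, b):
    """z = W · a_prev + b for one example; W is a list of rows."""
    return [sum(w * a for w, a in zip(row, a_prev)) + b_i
            for row, b_i in zip(W, b)]

def forward(x, layers):
    """Run x through every layer; return the prediction and the cache."""
    a = list(x)                        # a^[0] = x
    cache = [{"a": a}]                 # keep a^[0]; backprop needs it
    for W, b, f in layers:
        z = linear(W, a, b)            # z^[l] = W^[l] · a^[l-1] + b^[l]
        a = f(z)                       # a^[l] = f^[l](z^[l])
        cache.append({"z": z, "a": a})
    return a, cache

# Hypothetical network: 3 inputs -> 2 hidden units (ReLU) -> 1 output (identity)
layers = [
    ([[1.0, -1.0, 0.5], [-2.0, 0.0, 1.0]], [0.0, -1.5], relu),
    ([[2.0, -1.0]], [0.25], identity),
]
x = [1.0, 2.0, 3.0]

y_hat, cache = forward(x, layers)
print(cache[1]["z"])   # pre-activations of the hidden layer
print(cache[1]["a"])   # hidden activations after ReLU
print(y_hat)           # a^[L], the prediction
```

With these weights the second hidden unit has a negative pre-activation, so ReLU zeroes it before the output layer sees it. The cache holds one entry per layer plus the input.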

Full mini-batch forward pass in PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerNet(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)

    def forward(self, x):
        z1 = self.fc1(x)        # linear: W1·x + b1 (applied to each row of the batch)
        a1 = F.relu(z1)         # activation; recorded in the autograd graph
        z2 = self.fc2(a1)       # linear: W2·a1 + b2
        return z2               # logits (no softmax here; use CrossEntropyLoss)

# Dummy mini-batch for illustration: 32 flattened 28x28 inputs, 10 classes
x_batch = torch.randn(32, 784)
y_batch = torch.randint(0, 10, (32,))

model = TwoLayerNet(784, 256, 10)
logits = model(x_batch)         # forward pass, output shape (32, 10)
loss = nn.CrossEntropyLoss()(logits, y_batch)
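
The network returns raw logits because the cross-entropy loss in PyTorch applies log-softmax internally. The short check below, on made-up logits and targets, is a sketch of that equivalence; an explicit softmax is only needed when you want probabilities for reporting or inference.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)             # made-up logits: batch of 4, 10 classes
targets = torch.tensor([3, 0, 7, 1])    # made-up class indices

# Cross-entropy on logits ...
loss_ce = F.cross_entropy(logits, targets)
# ... matches log-softmax followed by negative log-likelihood.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Probabilities, if you need them; each row sums to 1.
probs = F.softmax(logits, dim=1)
```

Applying softmax inside `forward` and then passing the result to `CrossEntropyLoss` would normalize twice and distort the loss.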

PyTorch builds a computation graph during the forward pass, recording each operation applied to tensors that require gradients. Each intermediate tensor (z1, a1, z2) keeps a grad_fn reference to the operation that produced it. loss.backward() traverses this graph in reverse to compute gradients, so nothing in the forward pass is “wasted”.
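
The graph can be inspected directly. The sketch below uses a single made-up `nn.Linear` layer and random input, so the shapes and the sum-based loss are illustrative assumptions rather than part of the network above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc = nn.Linear(3, 2)
x = torch.randn(5, 3)             # made-up input batch; plain data, not tracked

z = fc(x)                         # tracked, because fc's parameters require grad
a = torch.relu(z)
loss = a.sum()                    # toy scalar loss for illustration

print(x.requires_grad)            # input data is not part of what gets differentiated
print(z.grad_fn, a.grad_fn)       # each intermediate points at the op that made it

grad_before = fc.weight.grad      # None: nothing has been backpropagated yet
loss.backward()                   # walk the recorded graph in reverse
grad_after = fc.weight.grad       # now a tensor with the same shape as fc.weight

# Under no_grad, the forward pass records nothing (typical for inference).
with torch.no_grad():
    z_inference = fc(x)
```

Running the forward pass under `torch.no_grad()` skips the recording and saves memory, which is why it is the usual choice at inference time.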

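Why cache intermediates? The gradient of the loss w.r.t. W^[l] is δ^[l] · (a^[l-1])^T, where δ^[l] is the gradient of the loss w.r.t. z^[l]. It therefore requires a^[l-1], which was already computed during the forward pass. Storing these values avoids recomputing them, at the cost of extra memory.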
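
To make the dependence on a^[l-1] concrete, here is a minimal standard-library sketch for one linear layer and one example. The toy loss L = 0.5 · Σ z_i², the weights, and the helper names are assumptions chosen so the analytic gradient is easy to read; the finite-difference check is only there to confirm the formula.

```python
def linear(W, a_prev, b):
    """z = W · a_prev + b for one example; W is a list of rows."""
    return [sum(w * a for w, a in zip(row, a_prev)) + b_i
            for row, b_i in zip(W, b)]

def loss_fn(z):
    """Toy loss L = 0.5 * sum(z_i^2), chosen so that dL/dz = z."""
    return 0.5 * sum(v * v for v in z)

W = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
b = [0.1, -0.2]
a_prev = [1.0, 2.0, -1.0]          # a^[l-1], produced by the forward pass

# Forward pass: keep a_prev and z around for the backward step.
z = linear(W, a_prev, b)

# Backward step: delta = dL/dz, then dL/dW[i][j] = delta[i] * a_prev[j].
delta = list(z)
grad_W = [[d * a for a in a_prev] for d in delta]

def numeric_grad(i, j, eps=1e-6):
    """Central finite difference of the loss w.r.t. W[i][j]."""
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[i][j] += eps
    W_minus[i][j] -= eps
    return (loss_fn(linear(W_plus, a_prev, b))
            - loss_fn(linear(W_minus, a_prev, b))) / (2 * eps)

max_error = max(abs(grad_W[i][j] - numeric_grad(i, j))
                for i in range(len(W)) for j in range(len(W[0])))
print(max_error)                   # should be tiny if the formula is right
```

The analytic gradient is an outer product of `delta` and the cached `a_prev`. Without the cached activation, the backward step would have to rerun the forward pass up to layer l-1 to get it.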
