Deep Learning | Easy | Asked at Google, Meta, Amazon, Microsoft

How does dropout work, and why must it behave differently during training and inference?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

Dropout randomly zeroes each neuron's output with probability p during training, which discourages co-adaptation of neurons and pushes the network to learn redundant representations. With inverted dropout, the modern standard, the surviving outputs are scaled by 1/(1−p) during training so that expected activations stay the same. At inference, dropout is disabled, all neurons are active, and no further scaling is applied. Forgetting to switch modes leaves dropout on at inference and produces noisy, non-deterministic predictions.

How to think about it

Dropout (Srivastava et al., 2014) is a regularisation technique that prevents overfitting by stochastically deactivating neurons, forcing the network not to rely on any single pathway.

Training mode

Each neuron output is independently set to zero with probability p (commonly 0.1–0.5) on each forward pass. The surviving activations are scaled up by 1/(1−p) — this is inverted dropout, the modern standard — so the expected value of each activation is unchanged.

y = x · mask / (1 − p),   mask_i ~ Bernoulli(1−p)

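Each forward pass samples a fresh mask, so every training step updates a different random sub-network, all of which share weights. Running the full network at inference approximates averaging the predictions of this ensemble of sub-networks.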
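The formula above can be sketched without any framework. This is a minimal illustration, not how PyTorch implements it: the function name `inverted_dropout`, its `training` and `rng` parameters, and the all-ones input are assumptions made for the example. It checks that roughly a fraction p of the values are zeroed while the mean stays close to the original.

```python
import random

def inverted_dropout(x, p, training=True, rng=random):
    """Inverted dropout on a list of floats (illustrative sketch only)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(x)              # inference: identity, no rescaling
    keep = 1.0 - p
    # Keep each element with probability (1 - p) and scale survivors by 1/(1 - p)
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

rng = random.Random(0)              # seeded generator for reproducibility
x = [1.0] * 100_000                 # assumed toy input: all activations equal 1
y = inverted_dropout(x, p=0.3, rng=rng)

mean_y = sum(y) / len(y)            # should stay close to the input mean of 1.0
zero_frac = y.count(0.0) / len(y)   # should be close to p = 0.3
print(f"mean after dropout: {mean_y:.3f}, fraction zeroed: {zero_frac:.3f}")
```

Because the rescaling happens here, during training, the inference path can simply return its input unchanged.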

Inference mode

Dropout is turned off: no neurons are zeroed. Because inverted dropout already corrected the scale during training, no further scaling is needed at inference.

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(512, 256)
        self.drop = nn.Dropout(p=0.3)   # 30% dropout
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        return self.fc2(self.drop(torch.relu(self.fc1(x))))

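model = MLP()

# Placeholder inputs so the snippet runs; substitute real data batches
x_train = torch.randn(32, 512)
x_test = torch.randn(8, 512)

# --- Training ---
model.train()
logits = model(x_train)   # dropout is active

# --- Inference ---
model.eval()
with torch.no_grad():
    logits = model(x_test)  # dropout is disabled by eval(), not by no_grad()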
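The cost of forgetting to switch modes can be seen with a small check. The sketch below uses a hypothetical stand-in network and random input (neither comes from the example above) and compares two forward passes on the same batch in each mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model and input, chosen only for this illustration
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(16, 2),
)
x = torch.randn(4, 8)

# Eval mode: dropout is an identity, so repeated passes agree exactly
model.eval()
with torch.no_grad():
    eval_a = model(x)
    eval_b = model(x)

# Train mode: each pass samples a new mask, so outputs differ between passes
model.train()
with torch.no_grad():   # no_grad() does not turn dropout off
    train_a = model(x)
    train_b = model(x)

same_in_eval = torch.equal(eval_a, eval_b)
same_in_train = torch.equal(train_a, train_b)
print("identical outputs in eval mode: ", same_in_eval)
print("identical outputs in train mode:", same_in_train)
```

A check like this is a cheap guard in an evaluation script, since `torch.no_grad()` alone leaves `nn.Dropout` active.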

Where to place dropout

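After fully-connected layers, usually after the activation, as in the example above. With ReLU the order does not matter, because zeroing and scaling by a positive constant commute with ReLU; with other activations the two placements can differ.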
Avoid dropout directly before batch normalisation: the noise disrupts BN’s variance estimates.
Transformer implementations typically apply dropout to the attention weights (after the softmax) and to each sub-layer’s output before it is added to the residual connection.
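The ordering question in the first placement note can be checked directly for ReLU. The following is a framework-free sketch under stated assumptions: the helper names `relu` and `apply_mask`, the random pre-activations, and p = 0.3 are all made up for illustration. It applies the same dropout mask before and after ReLU and compares the results.

```python
import random

def relu(v):
    return [max(0.0, vi) for vi in v]

def apply_mask(v, mask, p):
    """Inverted dropout with a fixed, pre-sampled 0/1 mask."""
    keep = 1.0 - p
    return [vi * mi / keep for vi, mi in zip(v, mask)]

rng = random.Random(0)
p = 0.3
pre_act = [rng.uniform(-2.0, 2.0) for _ in range(1000)]   # assumed pre-activations
mask = [1.0 if rng.random() < 1.0 - p else 0.0 for _ in pre_act]

drop_after = apply_mask(relu(pre_act), mask, p)    # activation, then dropout
drop_before = relu(apply_mask(pre_act, mask, p))   # dropout, then activation

orders_match = drop_after == drop_before
print("same result in either order for ReLU:", orders_match)
```

The equality relies on ReLU mapping zero to zero and commuting with positive scaling. An activation such as the sigmoid maps zero to 0.5, so a unit dropped before the activation would still emit a non-zero value.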

Dropout rate guidelines

Context	Typical p
Large FC layers	0.5
Smaller layers / CNNs	0.1–0.3
Transformers (attention)	0.1
Fine-tuning pretrained	0.0–0.1
