Deep Learning | Easy | Asked at Google, NVIDIA, Meta

What is pooling, and when would you choose max pooling over average pooling?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

Pooling downsamples a feature map by aggregating values in a local window, reducing spatial dimensions and building position tolerance. Max pooling takes the strongest activation in each window; average pooling takes the mean. Max pooling dominates inside classification backbones; average pooling is preferred for global summarisation (the global average pooling head) and wherever a smooth summary of a region matters more than its peak.

How to think about it

Don’t just define them — give the intuition for when each wins, then mention global average pooling as a modern replacement for the fully-connected head.

What pooling does

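After a conv layer produces a feature map, pooling applies a fixed reduction over local windows, independently for each channel. The windows are usually non-overlapping (stride equal to the window size), but they overlap when the stride is smaller than the window. For a 2×2 pool with stride 2 on an H × W map:

Output size: floor(H/2) × floor(W/2) (with no padding, an odd trailing row or column is dropped)
No learned parameters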

Pooling achieves two things: it compresses the representation (fewer computations downstream) and introduces a degree of local translation tolerance (the feature still fires if the pattern shifts by one pixel inside the window).
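The two effects above can be checked on a toy example. This is a minimal pure-Python sketch, not any framework's implementation; the helper name pool2d and the 4×4 inputs are made up for illustration, and it assumes no padding.

```python
def pool2d(fmap, size=2, stride=2, reduce=max):
    """Apply `reduce` to every size x size window of a 2-D list (no padding)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            reduce([fmap[i + di][j + dj] for di in range(size) for dj in range(size)])
            for j in range(0, w - size + 1, stride)
        ]
        for i in range(0, h - size + 1, stride)
    ]

# A single strong activation (9) in the top-left 2x2 window.
a = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# The same activation shifted by one pixel, still inside that window.
b = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

pooled_a = pool2d(a)
pooled_b = pool2d(b)
print(pooled_a)  # [[9, 0], [0, 0]]
print(pooled_b)  # [[9, 0], [0, 0]]  (same output despite the shift)

# Odd input size: a 5x5 map shrinks to 2x2 because the last row/column is dropped.
odd = [[0] * 5 for _ in range(5)]
pooled_odd = pool2d(odd)
print(len(pooled_odd), len(pooled_odd[0]))  # 2 2
```

The tolerance is strictly local: moving the 9 to row 1, column 2 would put it in the neighbouring window and change the output.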

Max pooling

output = max(values in window)

Preserves the strongest signal. If any position in the window detected the feature, the map says “yes, feature present here.” This is ideal for classification tasks where you care whether a pattern exists, not its exact sub-window location.

Average pooling

output = mean(values in window)

Weights every position in the window equally, so a single strong activation is diluted by its neighbours. Useful when you want a smooth, holistic summary of a region rather than the sharpest activation.
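To make the contrast concrete, the sketch below applies both reductions to two hand-picked 2×2 windows (flattened to four values). The window contents are invented for illustration only.

```python
from statistics import mean

# Two hypothetical windows with the same average but very different peaks.
windows = {
    "one strong activation": [0.0, 0.0, 0.0, 8.0],
    "uniform moderate response": [2.0, 2.0, 2.0, 2.0],
}

results = {name: (max(vals), mean(vals)) for name, vals in windows.items()}
for name, (mx, avg) in results.items():
    print(f"{name}: max={mx}, mean={avg}")
# one strong activation: max=8.0, mean=2.0
# uniform moderate response: max=2.0, mean=2.0
```

Average pooling maps both windows to the same value, while max pooling keeps the peak. That is the intuition for preferring max pooling when a sparse, strong response is the signal, and average pooling when the overall level is what matters.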

Global average pooling (GAP) collapses each H × W feature map to a single scalar per channel: output[c] = mean over all H*W positions for channel c. Modern architectures (ResNet, MobileNet) apply GAP before the final classifier instead of flattening the whole feature map into large dense layers, which drastically cuts parameters and acts as a regulariser against overfitting.
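A rough sketch of GAP and of the parameter saving follows. The function name global_average_pool is hypothetical, and the head sizes (512 channels, a 7×7 final map, 1000 classes) are illustrative assumptions rather than figures taken from the text; only weights are counted, biases are ignored.

```python
def global_average_pool(fmaps):
    """fmaps: list of C channels, each an H x W list of lists -> list of C means."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in fmaps
    ]

fmaps = [
    [[1.0, 2.0], [3.0, 4.0]],  # channel 0
    [[0.0, 0.0], [0.0, 8.0]],  # channel 1
]
gap = global_average_pool(fmaps)
print(gap)  # [2.5, 2.0]

# Weight count of one dense classifier layer under the assumed sizes.
C, H, W, num_classes = 512, 7, 7, 1000
flatten_head_weights = C * H * W * num_classes  # flatten, then dense
gap_head_weights = C * num_classes              # GAP, then dense
print(flatten_head_weights)  # 25088000
print(gap_head_weights)      # 512000
```

Under these assumptions the GAP head needs H*W = 49 times fewer weights, and it no longer depends on the input resolution.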

Comparison

Aspect	Max pooling	Average pooling
Preserves	Strongest activations	Overall activation level
Good for	Detecting whether a feature is present	Smooth features, GAP
Gradients	Only the window's maximum receives gradient	Spread equally across all positions in the window
Common use	VGG, ResNet stem	GAP head, inception modules
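The gradient row of the table can be sketched for a single window. The function names below are hypothetical, and sending the gradient to the first maximum on ties is an assumption made here for simplicity; real frameworks may break ties differently.

```python
def max_pool_backward(window, upstream):
    """Route the upstream gradient to the position of the maximum only."""
    winner = window.index(max(window))  # assumption: first maximum wins ties
    return [upstream if i == winner else 0.0 for i in range(len(window))]

def avg_pool_backward(window, upstream):
    """Share the upstream gradient equally among all positions."""
    n = len(window)
    return [upstream / n] * n

window = [1.0, 5.0, 2.0, 0.0]  # a flattened 2x2 window
grad_max = max_pool_backward(window, 1.0)
grad_avg = avg_pool_backward(window, 1.0)
print(grad_max)  # [0.0, 1.0, 0.0, 0.0]
print(grad_avg)  # [0.25, 0.25, 0.25, 0.25]
```

With max pooling, positions that lose the comparison receive no learning signal on that step; with average pooling, every position gets a small, equal share.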