datarekha
Deep Learning · Easy · Asked at Google · Asked at NVIDIA · Asked at Meta

What is pooling, and when would you choose max pooling over average pooling?

The short answer

Pooling downsamples a feature map by aggregating values in a local window, reducing spatial dimensions and building position tolerance. Max pooling takes the strongest activation in each window; average pooling takes the mean. Max pooling dominates in classification backbones; average pooling is preferred for global summarisation and wherever a smooth summary of a region matters more than its sharpest response.

How to think about it

Don’t just define them: give the intuition for when each wins, then mention global average pooling as a modern replacement for the flatten-plus-dense head.

What pooling does

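After a conv layer produces a feature map, pooling slides a window over it and reduces each window to a single value. The windows are non-overlapping when the stride equals the window size and overlap when the stride is smaller. For a 2×2 pool with stride 2 on an H × W map:

  • Output size: H/2 × W/2 (rounded down when H or W is odd, assuming no padding); the channel count is unchanged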
  • No learned parameters
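
The output-size rule generalises beyond the 2×2, stride-2 case. A minimal sketch in plain Python, assuming no padding and floor rounding; the helper names pooled_size and pooled_shape are made up for illustration:

```python
def pooled_size(size: int, window: int = 2, stride: int = 2) -> int:
    """Length of one spatial axis after pooling (no padding, floor rounding)."""
    if size < window:
        raise ValueError("window is larger than the input")
    return (size - window) // stride + 1


def pooled_shape(h: int, w: int, window: int = 2, stride: int = 2) -> tuple:
    """Apply the same rule independently to height and width."""
    return pooled_size(h, window, stride), pooled_size(w, window, stride)


print(pooled_shape(224, 224))                      # (112, 112)
print(pooled_shape(7, 7))                          # (3, 3): the odd last row/column is dropped
print(pooled_shape(112, 112, window=3, stride=2))  # (55, 55): overlapping windows
```

Frameworks usually expose padding and ceiling-rounding options that change these numbers, so treat this as the simplest case only.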

Pooling achieves two things: it compresses the representation (fewer computations downstream) and introduces a degree of local translation tolerance (the feature still fires if the pattern shifts by one pixel inside the window).
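
The translation tolerance is easy to see on a toy input. The sketch below uses a 1D row and max pooling to keep it short (an illustrative simplification; max_pool_1d is a hypothetical helper, not a library function):

```python
def max_pool_1d(row, window=2, stride=2):
    """Max over each window of a 1D list (no padding)."""
    return [max(row[i:i + window])
            for i in range(0, len(row) - window + 1, stride)]


original       = [0, 0, 9, 0, 0, 0]  # pattern at index 2, inside window (2, 3)
shifted_inside = [0, 0, 0, 9, 0, 0]  # moved one pixel, still inside window (2, 3)
shifted_across = [0, 0, 0, 0, 9, 0]  # moved again, now inside window (4, 5)

print(max_pool_1d(original))        # [0, 9, 0]
print(max_pool_1d(shifted_inside))  # [0, 9, 0]  same output: the shift is absorbed
print(max_pool_1d(shifted_across))  # [0, 0, 9]  output changes: tolerance is only local
```

The tolerance holds only while the pattern stays inside the same window, which is why it is described as local rather than full translation invariance.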

Max pooling

output = max(values in window)

Preserves the strongest signal. If any position in the window detected the feature, the map says “yes, feature present here.” This is ideal for classification tasks where you care whether a pattern exists, not its exact sub-window location.

Average pooling

output = mean(values in window)

Weights every position in the window equally, so the output reflects the overall activation level rather than the single sharpest response. Useful when you want a smooth, holistic summary of a region.
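
To make the contrast concrete, here is a small sketch that runs both reductions over the same 4×4 map with a 2×2 window and stride 2. It is plain Python with made-up values; pool2d and mean are illustrative helpers, not a framework API:

```python
def mean(values):
    return sum(values) / len(values)


def pool2d(fmap, reduce_fn, window=2, stride=2):
    """Apply reduce_fn to each window of a 2D nested list (no padding)."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(0, h - window + 1, stride):
        row = []
        for j in range(0, w - window + 1, stride):
            patch = [fmap[i + di][j + dj]
                     for di in range(window) for dj in range(window)]
            row.append(reduce_fn(patch))
        out.append(row)
    return out


fmap = [
    [0, 0, 2, 2],
    [0, 8, 2, 2],
    [1, 1, 0, 0],
    [1, 1, 0, 4],
]

max_out = pool2d(fmap, max)
avg_out = pool2d(fmap, mean)
print(max_out)  # [[8, 2], [1, 4]]
print(avg_out)  # [[2.0, 2.0], [1.0, 1.0]]
```

The top-left window (one spike of 8) and the top-right window (a flat 2 everywhere) have the same average but very different maxima: max pooling reports the spike, average pooling reports the overall level.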

Global average pooling (GAP) collapses each H × W feature map to a single scalar per channel: output[c] = mean over all H*W positions for channel c. Modern architectures (ResNet, MobileNet) apply GAP before the final classifier instead of flattening the whole feature map into large dense layers, which drastically cuts parameters and helps regularise against overfitting.
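
A rough sketch of GAP and of the parameter saving it buys. The feature-map size (512 channels of 7×7) and the 1000 classes are illustrative assumptions, and the helper names are hypothetical:

```python
def global_average_pool(fmaps):
    """fmaps: list of C channels, each an H x W nested list -> list of C means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in fmaps]


fmaps = [
    [[1, 3], [5, 7]],  # channel 0
    [[0, 0], [0, 2]],  # channel 1
]
print(global_average_pool(fmaps))  # [4.0, 0.5]


def dense_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


# Assumed sizes, for illustration only.
C, H, W, classes = 512, 7, 7, 1000
flatten_head = dense_params(C * H * W, classes)  # classifier on the flattened map
gap_head = dense_params(C, classes)              # classifier on the GAP output
print(flatten_head)  # 25089000
print(gap_head)      # 513000
```

Under these assumed sizes the classifier after GAP needs roughly 49 times fewer parameters, because the H*W factor disappears from its input width.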

Comparison

              Max pooling                  Average pooling
Preserves     Strongest activations        Overall activation level
Good for      Object detection signals     Smooth features, GAP
Gradients     Only winner gets gradient    Spread across all positions
Common use    VGG, early ResNets           GAP head, inception modules
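
The gradient row of the comparison can be sketched for a single flattened window. This is a hand-written illustration, not how any particular framework implements it; ties are sent to the first maximum here, which is an assumption:

```python
def max_pool_grad(window, upstream):
    """Gradient of max(window) w.r.t. each input: all of it goes to the winner."""
    winner = max(range(len(window)), key=window.__getitem__)  # first index of the max
    return [upstream if i == winner else 0.0 for i in range(len(window))]


def avg_pool_grad(window, upstream):
    """Gradient of mean(window) w.r.t. each input: an equal share for everyone."""
    return [upstream / len(window)] * len(window)


window = [0.2, 1.5, 0.7, 0.1]  # one flattened 2x2 window
print(max_pool_grad(window, 1.0))  # [0.0, 1.0, 0.0, 0.0]
print(avg_pool_grad(window, 1.0))  # [0.25, 0.25, 0.25, 0.25]
```

With max pooling only the winning position receives a learning signal on that step; with average pooling every position gets a smaller, equal share.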
