Deep Learning Medium Asked at GoogleAsked at MetaAsked at NVIDIAAsked at Microsoft

What is the receptive field of a neuron in a CNN and how does it grow with depth?

For Data Scientist ML Engineer AI / LLM Engineer

The short answer

The receptive field is the region in the original input that influences a single activation in a given layer. With a 3x3 kernel and stride 1, each layer adds 2 pixels to the receptive field in each direction; stacking L layers gives a theoretical receptive field of (2L+1) x (2L+1). Stride, pooling, and dilated convolutions all expand the receptive field faster.

How to think about it

Interviewers test whether you understand why depth matters and how architecture choices control context. Show the growth formula, then link it to practical architecture design.

Definition

A single unit in layer l depends on a r x r patch of the original input — this is its receptive field. Two units in the same layer that are spatially adjacent may have overlapping receptive fields; their intersection is the “shared context.”

Growth formula (stride-1, kernel k)

After one conv layer of kernel size k: receptive field = k

After adding another layer of kernel size k: receptive field = k + (k-1) = 2k - 1

In general, for L stacked 3×3 layers with stride 1:

RF = 2*L + 1

Layers L	RF (3×3 stack)	RF (5×5 single)
1	3	5
2	5	—
3	7	—
5	11	—

This is why VGG replaced large kernels with stacks of 3×3 convolutions: three 3×3 layers give the same receptive field as one 7×7 layer with fewer parameters (3*(9*C^2) vs 49*C^2) and two extra nonlinearities.

Effect of stride and pooling

A stride-2 layer or a 2×2 pooling layer doubles the jump between positions, so subsequent layers expand the receptive field twice as fast.

For a network with strides S1, S2, ...SL and kernel sizes k1, k2, ...kL, the effective receptive field is computed recursively:

RF_l = RF_{l-1} + (k_l - 1) * product of all prior strides

Dilated (atrous) convolutions

Dilation rate d spaces kernel positions d pixels apart, giving a receptive field of k + (k-1)*(d-1) for a single layer without adding parameters or losing resolution. Used in semantic segmentation (DeepLab) to maintain spatial resolution while capturing large context.

Effective vs theoretical receptive field

Empirically, not all input pixels inside the theoretical receptive field contribute equally — central pixels have much higher influence than edge pixels. The “effective receptive field” follows a roughly Gaussian distribution, not a flat square.

Learn it properly Convolutional neural networks