What does a convolution operation do in a CNN?
A convolution slides a small learned weight matrix (kernel) across the input, computing a dot product at each position to produce a feature map. Each kernel learns to detect one spatial pattern — an edge, a corner, a texture — regardless of where it appears in the image.
How to think about it
Nail the spatial mechanics first, then explain what the learned values actually represent — that’s what separates a rehearsed answer from a real one.
The mechanics
A 2-D convolution takes an input of shape H x W x C and a kernel of shape k x k x C. At every spatial position (i, j) the operation computes:
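output[i, j] = bias + sum over (di, dj, c) of input[i+di, j+dj, c] * kernel[di, dj, c]
Here di and dj range over 0..k-1 and c over 0..C-1, and the bias is added once per output value, not once per term. This form assumes stride 1 and no padding; with stride S the input indices become i*S+di and j*S+dj.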
The kernel slides across the full spatial extent, producing one scalar per position, so a single kernel yields one H_out x W_out feature map. Stack C_out different kernels, each with its own bias, and the output has shape H_out x W_out x C_out.
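The mechanics above can be sketched as a deliberately slow, loop-based NumPy function. This is a minimal illustration, not how any framework implements it: the name conv2d_naive, the (C_out, k, k, C) kernel layout, and the zero-padding choice are all assumptions made for this example.

```python
import numpy as np

def conv2d_naive(x, kernels, bias, stride=1, padding=0):
    """x: (H, W, C); kernels: (C_out, k, k, C); bias: (C_out,).

    Returns an array of shape (H_out, W_out, C_out).
    """
    c_out, k, _, _ = kernels.shape
    # Zero-pad the two spatial axes only, never the channel axis.
    x = np.pad(x, ((padding, padding), (padding, padding), (0, 0)))
    h_out = (x.shape[0] - k) // stride + 1
    w_out = (x.shape[1] - k) // stride + 1
    out = np.zeros((h_out, w_out, c_out))
    for i in range(h_out):
        for j in range(w_out):
            # The k x k x C patch under the kernel at output position (i, j).
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            for o in range(c_out):
                # Dot product of patch and kernel, plus one bias per kernel.
                out[i, j, o] = np.sum(patch * kernels[o]) + bias[o]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5, 3))           # H = W = 5, C = 3
kernels = rng.normal(size=(4, 3, 3, 3))  # C_out = 4, k = 3
bias = np.zeros(4)
y = conv2d_naive(x, kernels, bias)
print(y.shape)  # (3, 3, 4)
```

Real libraries replace the three loops with vectorized or hardware-specific routines, but the arithmetic per output value is the same.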
What the kernel is learning
Each kernel is a small template: its output is largest where the input patch matches the pattern of its weights. After training, kernels in the first layers typically resemble Gabor-like edge detectors, while kernels in deeper layers respond to textures and object parts. The network discovers these patterns through gradient descent on the task loss, with no hand-engineering required.
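To make the "template" idea concrete, here is a small sketch that applies a hand-set Sobel-style kernel to a toy image. The kernel values are chosen by hand purely for illustration (a trained network would learn its own weights), and the 6x6 image is made up for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy 6x6 grayscale image: dark left half (0), bright right half (1).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-set Sobel-style kernel: negative weights on the left column,
# positive weights on the right, so it responds to dark-to-bright edges.
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

# All 3x3 patches of the image (stride 1, no padding): shape (4, 4, 3, 3).
windows = sliding_window_view(image, (3, 3))
# Dot product of every patch with the kernel.
response = np.einsum("ijkl,kl->ij", windows, kernel)
print(response[0])  # [0. 4. 4. 0.]
```

The response is zero over the flat regions and peaks where the window straddles the vertical edge, whichever rows it falls on, which is the position-independent detection described above.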
Key output size formula
H_out = floor((H_in + 2P - k) / S) + 1
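where H_in is the input height, k the kernel size, P the padding added to each side, and S the stride. The same formula gives W_out from W_in.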
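As a quick check of the formula, the helper below evaluates it for a few common settings. The function name conv_output_size is hypothetical, and it assumes symmetric padding and no dilation.

```python
def conv_output_size(size_in, k, padding=0, stride=1):
    """Output length along one spatial axis: floor((size_in + 2P - k) / S) + 1."""
    # Integer floor division implements the floor in the formula.
    return (size_in + 2 * padding - k) // stride + 1

print(conv_output_size(32, 3, padding=1, stride=1))   # 32: 3x3 kernel with P=1 preserves size
print(conv_output_size(224, 7, padding=3, stride=2))  # 112: stride 2 roughly halves the size
print(conv_output_size(5, 3))                         # 3: no padding shrinks the map by k - 1
```

Working one case by hand: for H_in = 224, k = 7, P = 3, S = 2, the numerator is 224 + 6 - 7 = 223, floor(223 / 2) = 111, and adding 1 gives 112.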