datarekha
Deep Learning · Easy · Asked at Google, Meta, NVIDIA

What does a convolution operation do in a CNN?

The short answer

A convolution slides a small learned weight matrix (kernel) across the input, computing a dot product at each position to produce a feature map. Each kernel learns to detect one spatial pattern — an edge, a corner, a texture — regardless of where it appears in the image.

How to think about it

Nail the spatial mechanics first, then explain what the learned values actually represent — that’s what separates a rehearsed answer from a real one.

The mechanics

A 2-D convolution takes an input of shape H x W x C and a kernel of shape k x k x C. At every spatial position (i, j) the operation computes:

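output[i, j] = (sum over (di, dj, c) of input[i+di, j+dj, c] * kernel[di, dj, c]) + bias

Here di and dj range over 0..k-1 and c over the C input channels. The bias is added once per output position, not once per term. Strictly speaking this is cross-correlation (the kernel is not flipped), which is what deep learning libraries compute under the name "convolution".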

The kernel slides across the full spatial extent, producing one scalar per position, so a single kernel yields one H_out x W_out feature map. Stack C_out different kernels, each with its own bias, and the output has shape H_out x W_out x C_out.
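To make the formula above concrete, here is a minimal NumPy sketch of it, assuming stride 1 and no padding. The function name conv2d and the (C_out, k, k, C) kernel layout are illustrative choices, not something the formula prescribes, and real libraries use far faster implementations.

```python
import numpy as np

def conv2d(x, kernels, bias):
    """Naive 2-D convolution, stride 1, no padding (illustrative only).

    x:       input, shape (H, W, C)
    kernels: shape (C_out, k, k, C), one k x k x C kernel per output channel
    bias:    shape (C_out,), one scalar per kernel
    """
    H, W, C = x.shape
    C_out, k, _, _ = kernels.shape
    H_out, W_out = H - k + 1, W - k + 1
    out = np.zeros((H_out, W_out, C_out))
    for i in range(H_out):
        for j in range(W_out):
            # k x k x C window whose top-left corner sits at (i, j)
            patch = x[i:i + k, j:j + k, :]
            for o in range(C_out):
                # Dot product of window and kernel, plus that kernel's bias
                out[i, j, o] = np.sum(patch * kernels[o]) + bias[o]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 3))           # H=5, W=5, C=3
kernels = rng.standard_normal((4, 3, 3, 3))  # C_out=4, k=3
bias = np.zeros(4)

y = conv2d(x, kernels, bias)
print(y.shape)  # (3, 3, 4)
```

The triple loop mirrors the formula one-to-one: the two outer loops pick the position (i, j), and the inner loop applies each of the C_out kernels to the same window.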

What the kernel is learning

Each kernel is a small template. After training, first-layer kernels typically resemble Gabor-like edge detectors, while kernels in deeper layers respond to textures and object parts. The network discovers these patterns purely from gradient descent on the task loss, with no hand-engineering required.
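A small sketch can show what "template" means in practice. The kernel below is set by hand as a stand-in for a learned vertical-edge detector (a trained network would arrive at its own values), and the helper name correlate2d_valid is made up for this example. The same kernel is applied to two images whose edge sits in different columns.

```python
import numpy as np

def correlate2d_valid(img, kernel):
    """Slide a k x k kernel over a 2-D image (stride 1, no padding)."""
    k = kernel.shape[0]
    H, W = img.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + k, j:j + k] * kernel)
    return out

# Hand-set vertical-edge template: responds where brightness rises left to right.
edge_kernel = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)

# Two 6x6 images: dark on the left, bright on the right, edge at different columns.
img_a = np.zeros((6, 6))
img_a[:, 3:] = 1.0
img_b = np.zeros((6, 6))
img_b[:, 4:] = 1.0

resp_a = correlate2d_valid(img_a, edge_kernel)
resp_b = correlate2d_valid(img_b, edge_kernel)
print(resp_a[0])  # [0. 3. 3. 0.]
print(resp_b[0])  # [0. 0. 3. 3.]
```

The response is zero over flat regions and peaks where the window straddles the edge. Moving the edge one column to the right moves the peak one column to the right, which is the sense in which one kernel detects its pattern wherever it appears.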

[Figure: Input (5×5), 3×3 kernel, Feature map (3×3)]
A 3×3 kernel slides over a 5×5 input (stride 1, no padding) to yield a 3×3 feature map.

Key output size formula

H_out = floor((H_in + 2P - k) / S) + 1

where P is the padding added to each side, S is the stride, and k is the kernel size. The same formula gives W_out from W_in.
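The formula is easy to check with a few numbers. The helper below is a direct transcription, with a function name chosen for this example. The 5×5 input with a 3×3 kernel is the case from the figure, and the other two calls are common configurations picked for illustration.

```python
def conv_output_size(h_in, k, padding=0, stride=1):
    """H_out = floor((H_in + 2P - k) / S) + 1 for one spatial dimension."""
    return (h_in + 2 * padding - k) // stride + 1

print(conv_output_size(5, 3))                        # 3   (no padding, stride 1)
print(conv_output_size(5, 3, padding=1))             # 5   (padding 1 preserves size)
print(conv_output_size(224, 7, padding=3, stride=2)) # 112 (stride 2 halves the size)
```

With stride 1 and an odd kernel size, padding of (k - 1) / 2 keeps the output the same size as the input, which is why 3×3 kernels are so often paired with padding 1.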
