What are filters and feature maps in a CNN, and what do they represent?
A filter (kernel) is the set of learned weights that the network applies at each spatial position; a feature map is the spatial grid of responses produced when one filter slides over the input. Each filter detects one type of pattern, and the full stack of feature maps across all filters constitutes the layer's output representation.
How to think about it
This is often the entry-point question before deeper CNN probes. Be precise about shapes and what each dimension represents, then connect to visualisation work that shows what filters actually learn.
Filters (kernels)
A filter is a 3-D tensor of shape k x k x C_in. Each element is a learned weight; the filter scans the input, and at every spatial position it computes the dot product between its weights and the k x k x C_in input patch under it (usually plus a learned bias). A conv layer has C_out such filters, one per output channel.
Filter shape: k x k x C_in — spatial extent times depth of input
Number of filters: C_out — one per output channel
Total weight tensor: k x k x C_in x C_out
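To make the shapes concrete, here is a minimal sketch that counts the weights in one conv layer. The function name conv_weight_shape and the example sizes (3x3 kernels, 3 input channels, 64 filters) are illustrative assumptions, and biases are left out because the shapes above cover weights only.

```python
def conv_weight_shape(k, c_in, c_out):
    """Return (per-filter shape, full weight-tensor shape, weight count)."""
    filter_shape = (k, k, c_in)          # one filter spans all input channels
    weight_shape = (k, k, c_in, c_out)   # C_out filters stacked together
    n_weights = k * k * c_in * c_out     # biases not included
    return filter_shape, weight_shape, n_weights


# Hypothetical first layer: 3x3 kernels over an RGB image, 64 filters.
filter_shape, weight_shape, n_weights = conv_weight_shape(k=3, c_in=3, c_out=64)
print(filter_shape)   # (3, 3, 3)
print(weight_shape)   # (3, 3, 3, 64)
print(n_weights)      # 1728
```

Note that the count does not depend on the input's height or width: the same filter weights are reused at every spatial position.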
After training, filters in early layers typically resemble Gabor-like edge detectors (oriented bars at different frequencies). Deeper filters respond to complex compositions: eyes, wheels, fur textures.
Feature maps
When a single filter slides over the input, it produces a 2-D grid of scalars, one number per spatial position. This grid is the feature map for that filter. Its value at position (i, j) measures how strongly that filter’s pattern is present in the k x k input patch that corresponds to (i, j).
Feature map shape for one filter: H_out x W_out
Full layer output: H_out x W_out x C_out — a stack of C_out feature maps
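The sliding-window computation can be sketched in plain Python. This is a naive illustration, not how a framework implements it: it assumes stride 1 and no padding (so H_out = H - k + 1 and W_out = W - k + 1), stores tensors as nested lists, and uses a hypothetical helper name, feature_map. Like most deep learning libraries, it computes a cross-correlation, meaning the filter is not flipped.

```python
def feature_map(image, filt):
    """Slide one filter over the input (stride 1, no padding).

    image: nested lists indexed [row][col][channel], shape H x W x C_in
    filt:  nested lists indexed [row][col][channel], shape k x k x C_in
    Returns a 2-D list of shape (H - k + 1) x (W - k + 1).
    """
    h, w, c_in = len(image), len(image[0]), len(image[0][0])
    k = len(filt)
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            # Dot product between the filter and the k x k x C_in patch at (i, j).
            total = 0.0
            for di in range(k):
                for dj in range(k):
                    for c in range(c_in):
                        total += image[i + di][j + dj][c] * filt[di][dj][c]
            row.append(total)
        out.append(row)
    return out


# 4x4 single-channel image: dark left half (0), bright right half (1).
image = [[[0.0], [0.0], [1.0], [1.0]] for _ in range(4)]
# 2x2 filter that responds to a dark-to-bright step from left to right.
vertical_edge = [[[-1.0], [1.0]],
                 [[-1.0], [1.0]]]

fmap = feature_map(image, vertical_edge)
for row in fmap:
    print(row)   # each of the 3 rows prints [0.0, 2.0, 0.0]
```

The 3 x 3 feature map is non-zero only in the middle column, which is where the filter window straddles the dark-to-bright boundary.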
Visualising filters
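The Zeiler & Fergus (2014) deconvnet visualisation, which projects feature-map activations back to pixel space to reveal the input patterns that excite each feature, showed that: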
- Layer 1 filters detect oriented edges and colour blobs
- Layer 2 detects corners and simple textures
- Layer 3+ detects increasingly complex and class-specific patterns
Relationship to channels
Input channels and output channels have different roles:
- Input channels (C_in): each filter covers all input channels simultaneously, whether colour channels or prior-layer features
- Output channels (C_out): each filter produces one feature map; all C_out maps together form the new representation
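The two roles can be seen in a small sketch of a whole layer. It assumes stride 1, no padding, no bias and nested-list tensors; the name conv_layer and the 1x1 filters are illustrative choices that isolate the channel mixing from the spatial part.

```python
def conv_layer(x, filters):
    """Apply C_out filters to x (stride 1, no padding, no bias).

    x:       [H][W][C_in] nested lists
    filters: list of C_out filters, each [k][k][C_in]
    Returns  [H_out][W_out][C_out].
    """
    h, w = len(x), len(x[0])
    k = len(filters[0])
    out = []
    for i in range(h - k + 1):
        out_row = []
        for j in range(w - k + 1):
            # One response per filter at this position -> C_out values.
            responses = [
                sum(
                    x[i + di][j + dj][c] * f[di][dj][c]
                    for di in range(k)
                    for dj in range(k)
                    for c in range(len(f[di][dj]))
                )
                for f in filters
            ]
            out_row.append(responses)
        out.append(out_row)
    return out


# 2x2 input with C_in = 2 channels.
x = [[[1, 10], [2, 20]],
     [[3, 30], [4, 40]]]

# Three 1x1 filters (C_out = 3), each weighting both input channels.
filters = [
    [[[1, 0]]],   # copies input channel 0
    [[[0, 1]]],   # copies input channel 1
    [[[1, 1]]],   # sums the two input channels
]

y = conv_layer(x, filters)
print(len(y), len(y[0]), len(y[0][0]))  # 2 2 3
print(y[0][0])                          # [1, 10, 11]
print(y[1][1])                          # [4, 40, 44]
```

Every filter reads both input channels, and every filter contributes exactly one channel to the output, so the output depth equals the number of filters regardless of C_in.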