For a stack of convolution layers (dilation 1), the receptive field of an output pixel is the set of input pixels that can influence it. With kernel sizes \(k_1,\dots,k_L\) and per-layer strides \(s_m\), the receptive-field size along one axis after \(L\) layers is
\[ r_L \;=\; 1 + \sum_{\ell=1}^{L} (k_\ell-1) \prod_{m=1}^{\ell-1} s_m, \qquad \text{which for } s_m = 1 \text{ reduces to } \quad r_L = 1 + \sum_{\ell=1}^{L} (k_\ell-1). \]
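The formula translates directly into a few lines of code. Below is a minimal Python sketch (the helper name `receptive_field_size` is illustrative, not from any particular library) that accumulates the per-layer contributions:

```python
def receptive_field_size(kernel_sizes, strides=None):
    """Receptive-field size along one axis: r_L = 1 + sum_l (k_l - 1) * prod_{m<l} s_m."""
    if strides is None:
        strides = [1] * len(kernel_sizes)  # the stride-1 case discussed here
    rf = 1     # a single input pixel sees only itself before any layer
    jump = 1   # product of the strides of all preceding layers
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf
```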
The effective receptive field (ERF) describes how influence is distributed within that receptive field. While the theoretical RF defines the maximum spatial extent, the ERF shows that not all pixels contribute equally. With random weights, the ERF is approximately Gaussian, centered in the RF, with influence decaying toward the edges (Luo et al., 2016). However, training can significantly reshape the ERF: the network may learn to concentrate influence on specific spatial patterns, potentially producing non-Gaussian or multi-modal ERFs depending on the task and data.
Architectural Effects. Adding layers increases the RF size linearly: each additional layer with kernel size \(k\) adds \(k-1\) to the total RF. Larger kernels grow the RF faster per layer; a single 7×7 layer adds as much as three stacked 3×3 layers (\(7-1 = 6\) vs \(3\times(3-1) = 6\)). However, stacking smaller kernels often gives better representational capacity and training dynamics than a single large kernel, since it interleaves more nonlinearities while using fewer parameters.
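Using the `receptive_field_size` sketch above (our own illustrative helper), the equivalence is easy to verify:

```python
print(receptive_field_size([7]))        # 7: one 7x7 layer
print(receptive_field_size([3, 3, 3]))  # 7: three stacked 3x3 layers, stride 1
```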
We approximate the untrained ERF here by convolving a delta function \(L\) times with a uniform \(k\times k\) kernel and then visualizing the normalized weights. This shows the "default" influence pattern before training.
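A minimal sketch of that procedure, assuming NumPy and SciPy are available (the function name `untrained_erf` is ours):

```python
import numpy as np
from scipy.signal import convolve2d

def untrained_erf(k=3, num_layers=5):
    """Approximate the 'default' ERF: convolve a delta with a uniform
    k x k kernel num_layers times, then normalize the result to sum to 1."""
    size = 1 + num_layers * (k - 1)      # theoretical RF size from the formula above
    erf = np.zeros((size, size))
    erf[size // 2, size // 2] = 1.0      # delta function at the center
    kernel = np.full((k, k), 1.0 / (k * k))
    for _ in range(num_layers):
        erf = convolve2d(erf, kernel, mode="same")
    return erf / erf.sum()

erf = untrained_erf(k=3, num_layers=5)
# Mass is concentrated near the center of the 11x11 theoretical RF and
# decays toward its edges, approximating a Gaussian profile.
```

Because each normalized uniform kernel acts like one step of an i.i.d. random offset, repeated convolution corresponds to summing those offsets, which is why the untrained influence pattern tends toward a Gaussian by the central limit theorem.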