Recent work has shown that the outputs of convolutional neural networks become Gaussian process (GP) distributed as the number of channels is taken to infinity. In principle, these infinite networks should perform very well, both because they allow for exact Bayesian inference and because widening networks is generally thought to improve (or at least not diminish) performance. However, Bayesian infinite networks perform poorly in comparison to finite networks, and our goal here is to explain this discrepancy. We note that the high-level representation induced by an infinite network has very little flexibility: it depends only on network hyperparameters such as depth, and as such the network cannot learn a good high-level representation of data. In contrast, finite networks correspond to a rich prior over high-level representations, which play the role of kernel hyperparameters. We analyse this flexibility from the perspective of the prior (looking at the structured prior covariance of the top-level kernel) and from the perspective of the posterior, showing that the representation in a learned, finite deep linear network slowly transitions from the kernel induced by the inputs towards the kernel induced by the outputs, both for gradient descent and for Langevin sampling. Finally, we explore representation learning in deep, convolutional, nonlinear networks, showing that learned representations differ dramatically from those of the corresponding infinite network.
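As a rough illustration of the prior perspective, the sketch below samples fully connected deep linear networks from a Gaussian prior and measures how far the top-layer Gram matrix strays from the input kernel. It is not the experimental setup of the paper: the function names (`top_layer_kernel`, `kernel_spread`), the 1/fan-in weight variance, and the input shapes are all assumptions chosen for illustration. Under these assumptions the expected top-layer kernel equals the input kernel, and its spread across prior samples should shrink as the width grows, which is one way to see why the infinite-width representation is fixed while the finite-width one is not.

```python
import numpy as np

def top_layer_kernel(X, width, depth, rng):
    """Sample one deep linear network from the prior and return its top-layer Gram matrix.

    Weights are i.i.d. Gaussian with variance 1/fan_in (an assumed scaling),
    so the expected Gram matrix equals the input kernel X X^T / D.
    """
    H = X
    for _ in range(depth):
        fan_in = H.shape[1]
        W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, width))
        H = H @ W
    return H @ H.T / width

def kernel_spread(X, width, depth, n_samples=100, seed=0):
    """Mean Frobenius distance between sampled top-layer kernels and the input kernel."""
    rng = np.random.default_rng(seed)
    K_input = X @ X.T / X.shape[1]
    distances = [
        np.linalg.norm(top_layer_kernel(X, width, depth, rng) - K_input)
        for _ in range(n_samples)
    ]
    return float(np.mean(distances))

if __name__ == "__main__":
    # Hypothetical toy inputs: 5 data points with 10 features each.
    X = np.random.default_rng(1).normal(size=(5, 10))
    for width in (8, 64, 512):
        print(width, kernel_spread(X, width=width, depth=3))
```

Running the script prints the spread for a few widths. The narrow networks retain substantial variability in their top-layer kernel, whereas the wide ones sit close to the fixed input kernel.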
Publication status: Accepted/In press, 10 Nov 2019