A number of people have suggested that deep learning works because it is very flexible: an infinitely wide neural network is a universal approximator[^1].
However, flexibility alone cannot be the reason. Many other regression methods can learn any function, yet they do not achieve the results that deep learning gets. One example is the Gaussian process, or the support vector machine, with an RBF kernel.
The RBF (Radial Basis Function) kernel is a consistent prior. This implies that, for any underlying function, a model using this kernel will fit that function in the limit of infinitely many data points. Thus, RBF kernel machines can represent any[^2] function.
The trouble comes when any of these methods needs to generalise to unseen inputs. Indeed, even “interpolation” tasks in relatively dense regions of training data need generalisation: we do not have all the points of the function, only a few of them. What value, then, should we assign to the function between these points?
[Figure: four candidate fits to the same data points, labelled A, B, C and D.]
All of them explain the data equally well, but we might say that some are more sensible than others.
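To make this concrete, here is a minimal sketch (not taken from the figure above) of several RBF kernel interpolants fitted to the same five points. The data, the length scales and the helper names `rbf_kernel` and `interpolate` are all assumptions chosen for illustration. Each fit reproduces the training values almost exactly, yet the fits disagree about what happens between them.

```python
import numpy as np

def rbf_kernel(a, b, length_scale):
    """Squared-exponential (RBF) kernel matrix between 1-D point sets a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

def interpolate(x_train, y_train, x_query, length_scale, jitter=1e-10):
    """Kernel interpolant, i.e. a GP posterior mean with near-zero noise."""
    K = rbf_kernel(x_train, x_train, length_scale)
    # The tiny jitter only keeps the linear solve numerically stable.
    weights = np.linalg.solve(K + jitter * np.eye(len(x_train)), y_train)
    return rbf_kernel(x_query, x_train, length_scale) @ weights

# Illustrative data: five samples of some unknown function.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 1.0, 0.0, -1.0, 0.0])
x_between = np.array([0.5, 1.5, 2.5, 3.5])  # unseen inputs between the samples

fits = {}
for length_scale in (0.2, 0.5, 1.0):
    on_train = interpolate(x_train, y_train, x_train, length_scale)
    between = interpolate(x_train, y_train, x_between, length_scale)
    fits[length_scale] = (on_train, between)
    print(f"length_scale={length_scale}: "
          f"max train error={np.abs(on_train - y_train).max():.1e}, "
          f"values between points={between.round(3)}")
```

The only thing that changes between the fits is the length scale, which is part of the prior rather than the data. The data cannot tell the fits apart; only the assumptions built into the model can.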
So making machine learning models better at some task is the process of making them more and more specific to that task (or to the set of real-world tasks), but no more specific than the task allows.
[^1]: More formally, any bounded function within a bounded domain.

[^2]: Any function that is continuous almost everywhere, i.e. everywhere but a set of zero Lebesgue measure.