A number of people have suggested that deep learning works because it is extremely flexible: an infinitely wide neural network is a universal approximator[^1].

However, flexibility alone cannot be the explanation. Many other regression
methods can learn *any* function, yet none of them achieves the results deep
learning does. Examples include Gaussian processes and support vector machines
with an RBF kernel.

The RBF (Radial Basis Function) kernel induces a *consistent* prior: given any
underlying function, in the limit of infinitely many data points the fit will
recover it. Thus, RBF kernel machines can represent any[^2] function.
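To make this concrete, here is a minimal NumPy sketch of kernel ridge regression with an RBF kernel (the length scale, regulariser, and target function are arbitrary choices for illustration, not anything prescribed by the text):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """k(x, y) = exp(-(x - y)^2 / (2 * length_scale^2)) for 1-D inputs."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * length_scale ** 2))

def fit_rbf(x_train, y_train, reg=1e-6):
    """Kernel ridge regression: solve (K + reg*I) alpha = y_train."""
    K = rbf_kernel(x_train, x_train)
    alpha = np.linalg.solve(K + reg * np.eye(len(x_train)), y_train)
    return lambda x: rbf_kernel(np.atleast_1d(x), x_train) @ alpha

# With enough samples, the fit tracks the target arbitrarily closely.
target = lambda x: np.sin(2 * np.pi * x)
x = np.linspace(0.0, 1.0, 30)
predict = fit_rbf(x, target(x))
print(np.max(np.abs(predict(x) - target(x))))  # near zero on the training set
```

Adding more training points drives the error down on an ever-denser grid, which is the consistency property in action.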

The trouble comes when any of these methods must generalise to unseen inputs. Even “interpolation” tasks in relatively dense regions of training data require generalisation: we have only a few points of the function, not all of them. What value, then, should we assign to the function between these points?

[Figure: four candidate functions, labelled A–D, each passing through the same training points]

All of them explain the data equally well, but we might say that some are more sensible than others.
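The point can be demonstrated in a few lines of NumPy (the specific functions and training points below are made up for illustration): two very different functions can agree on every training point while disagreeing everywhere in between.

```python
import numpy as np

# Five training points sampled from a smooth function.
x_train = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y_train = np.sin(2 * np.pi * x_train)

# Two candidate explanations of the same data:
smooth = lambda x: np.sin(2 * np.pi * x)
# sin(4*pi*x) vanishes at every training point, so adding it changes
# nothing on the data but everything in between.
wiggly = lambda x: np.sin(2 * np.pi * x) + 3 * np.sin(4 * np.pi * x)

# Both have zero training error...
assert np.allclose(smooth(x_train), y_train)
assert np.allclose(wiggly(x_train), y_train)
# ...but disagree wildly between the points.
print(smooth(0.125), wiggly(0.125))  # ≈ 0.707 vs ≈ 3.707
```

Nothing in the data favours one over the other; only a prior over functions can.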

So the process of making machine learning models better at some task is the
process of making them more and more *specific* to that task (or to the set of
real-world tasks), while being *no more* specific than the data warrants.

[^1]: More formally, any bounded function on a bounded domain.

[^2]: That is, any function continuous almost everywhere, i.e. everywhere but a set of zero Lebesgue measure.