Analysis of a Two-Layer Neural Network via Displacement Convexity
Abstract
We consider the problem of learning a function defined on a compact domain, using linear combinations of a large number of “bump-like” components (neurons). This idea lies at the core of a variety of methods from two-layer neural networks to kernel regression, to boosting. In general, the resulting risk minimization problem is non-convex and is solved by gradient descent or its variants. Nonetheless, little is known about global convergence properties of these approaches. In this work, we show that, as the number of neurons diverges and the bump width tends to zero, the gradient flow has a limit which is a viscous porous medium equation. By virtue of a property named “displacement convexity,” we show an exponential dimension-free convergence rate for gradient descent.
This post is based on the paper (Javanmard et al. 2020).
Fitting a function with a linear combination of components. In supervised learning, we are given data points $\{(y_j, x_j)\}_{j \le n}$, which are often assumed to be independent and identically distributed. Here $x_j \in \mathbb{R}^d$ is a feature vector, and $y_j \in \mathbb{R}$ is a label or response variable. We would like to fit a model $\hat{f} : \mathbb{R}^d \to \mathbb{R}$ to predict the labels at new points $x \in \mathbb{R}^d$. One of the most fruitful ideas in this context is to use functions that are linear combinations of simple components:
$$\hat{f}(x; w) = \frac{1}{N} \sum_{i=1}^{N} \sigma(x; w_i),$$
where $\sigma(\,\cdot\,; w_i)$ is a simple component (a neuron) with parameters $w_i$.
A common approach to fitting parametric models is risk minimization:
$$R_N(w) = \mathbb{E}\bigg\{\Big[y - \frac{1}{N} \sum_{i=1}^{N} \sigma(x; w_i)\Big]^2\bigg\}.$$
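As a concrete illustration of this parameterization, the following is a minimal sketch of the finite-$N$ model together with a Monte Carlo proxy for the risk $R_N(w)$. The choice of component $\sigma$ (a tanh ridge function with $w_i = (a_i, b_i)$) and the toy data-generating process are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Finite-N model f_hat(x; w) = (1/N) sum_i sigma(x; w_i) with a tanh component
# (illustrative choice), and a Monte Carlo proxy for the population risk R_N(w).

def f_hat(x, A, b):
    """Model prediction: x is (n, d), A is (N, d) of weights a_i, b is (N,) of offsets b_i."""
    return np.tanh(x @ A.T + b).mean(axis=1)       # average of the N components

def empirical_risk(x, y, A, b):
    """Empirical estimate of R_N(w) = E[(y - f_hat(x; w))^2]."""
    return np.mean((y - f_hat(x, A, b)) ** 2)

# Toy data with E[y | x] = f(x) for an arbitrary smooth f (illustrative).
rng = np.random.default_rng(0)
n, d, N = 500, 2, 50
x = rng.uniform(-1.0, 1.0, size=(n, d))
y = np.sin(np.pi * x[:, 0]) + 0.1 * rng.normal(size=n)
A = rng.normal(size=(N, d))
b = rng.normal(size=N)
print("empirical risk:", empirical_risk(x, y, A, b))
```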
Despite the impressive practical success of these methods, the risk function $R_N(w)$ is highly non-convex and little is known about the global convergence of algorithms that try to minimize it. The main objective of this work is to introduce a nonparametric underlying regression model for which a global convergence result can be proved for stochastic gradient descent (SGD), in the limit of a large number of neurons.
Setting and main result. Let $\Omega \subset \mathbb{R}^d$ be a compact convex set with smooth boundary, and let $\{(y_j, x_j)\}_{j \ge 1}$ be i.i.d. with $x_j \sim \mathrm{Unif}(\Omega)$ and $\mathbb{E}(y_j \mid x_j) = f(x_j)$, where the function $f$ is smooth. We try to fit these data using a combination of bumps, namely
$$\hat{f}(x; w) = \frac{1}{N} \sum_{i=1}^{N} K_\delta(x - w_i),$$
where $K_\delta$ is a bump-like kernel of width $\delta$.
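To make the bump parameterization concrete, here is a short sketch in $d = 1$ on $\Omega = [0, 1]$. The Gaussian form of $K_\delta$ is an assumption made purely for illustration; it is not claimed to be the kernel used in the paper.

```python
import numpy as np

# Bump model f_hat(x; w) = (1/N) sum_i K_delta(x - w_i) in d = 1,
# with K_delta(z) = delta^{-1} K(z / delta) and K the standard Gaussian
# density (illustrative choice of bump).

def K_delta(z, delta):
    """Bump kernel of width delta."""
    return np.exp(-0.5 * (z / delta) ** 2) / (delta * np.sqrt(2.0 * np.pi))

def f_hat(x, w, delta):
    """Evaluate the model on a grid x of shape (n,); w holds the N bump centers."""
    return K_delta(x[:, None] - w[None, :], delta).mean(axis=1)

rng = np.random.default_rng(1)
N, delta = 200, 1 / 10
w = rng.uniform(0.0, 1.0, size=N)            # bump centers in Omega = [0, 1]
x_grid = np.linspace(0.0, 1.0, 11)
print(np.round(f_hat(x_grid, w, delta), 3))  # close to 1 where centers are dense
```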
We prove that, for sufficiently large $N$ and small $\delta$, gradient descent algorithms converge to weights $w$ with nearly optimum prediction error, provided $f$ is strongly concave. Let us emphasize that the resulting population risk $R_N(w)$ is non-convex regardless of the concavity properties of $f$. Our proof unveils a novel mechanism by which global convergence takes place. Convergence results for non-convex empirical risk minimization are generally proved by carefully ruling out local minima in the cost function. Instead we prove that, as $N \to \infty$ and $\delta \to 0$, the gradient descent dynamics converges to a gradient flow in Wasserstein space, and that the corresponding cost function is 'displacement convex.' Breakthrough results in optimal transport theory guarantee dimension-free convergence rates for this limiting dynamics (Carrillo, McCann, and Villani 2003). In particular, we expect the cost function $R_N(w)$ to have many local minima, which are however completely neglected by gradient descent.
Proof idea. Let us start by providing some high-level insight into our approach. Think of each model parameter as a particle moving under the effect of the other particles according to the SGD updates. Instead of studying the microscopic dynamics of this system of particles, we analyze the macroscopic dynamics of the medium when the number of particles (i.e., the size of the hidden layer of the neural network) goes to infinity. These dynamics are described by a partial differential equation (more specifically, a viscous porous medium equation) that governs the evolution of the mass density over space and time. The appealing feature of this approach is that, while the SGD trajectory is a random object, in the large-particle limit it concentrates around the deterministic solution of this partial differential equation (PDE).
For a rigorous analysis and implementation of this idea, we use propagation-of-chaos techniques. Specifically, we show that, in the large-$N$ limit, the evolution of the weights $\{w_i\}_{i=1}^{N}$ under gradient descent can be replaced by the evolution of a probability distribution $\rho_t^\delta$ which satisfies the viscous porous medium PDE (with Neumann boundary conditions). This PDE can also be described as the Wasserstein $W_2$ gradient flow for the following effective risk:
$$R_\delta(\rho) = \frac{1}{|\Omega|} \int_\Omega \big[f(x) - K_\delta * \rho(x)\big]^2 \, dx.$$
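As a sanity check on this correspondence, one can recover the bulk form of the gradient flow from standard Wasserstein calculus: the $W_2$ gradient flow of a functional is the continuity equation driven by (minus) the gradient of its first variation. The sketch below assumes a symmetric kernel and omits the viscosity term and the Neumann boundary conditions.

```latex
% First variation of R_delta (using the symmetry of K_delta):
\[
\frac{\delta R_\delta}{\delta \rho}(x)
  = -\frac{2}{|\Omega|}\, K_\delta * \bigl(f - K_\delta * \rho\bigr)(x).
\]
% W_2 gradient flow = continuity equation with velocity -grad(first variation):
\[
\partial_t \rho_t
  = \nabla \cdot \Bigl(\rho_t\, \nabla \frac{\delta R_\delta}{\delta \rho}\Bigr)
  = -\frac{2}{|\Omega|}\, \nabla \cdot \Bigl(\rho_t\, \nabla\, K_\delta * \bigl(f - K_\delta * \rho_t\bigr)\Bigr).
\]
```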
Note that even though the cost $R_\delta(\rho)$ is quadratic and convex in $\rho$, its $W_2$ gradient flow can have multiple fixed points, and hence global convergence cannot be guaranteed. Indeed, the mathematical property that controls global convergence of $W_2$ gradient flows is not ordinary convexity but displacement convexity. Roughly speaking, displacement convexity is convexity along geodesics of the $W_2$ metric. Note that the risk function $R_\delta(\rho)$ is not even displacement convex. However, for small $\delta$, we can formally approximate $K_\delta * \rho \approx \rho$, and hence hope to replace the risk function $R_\delta(\rho)$ with the simpler one
$$R(\rho) = \frac{1}{|\Omega|} \int_\Omega \big[f(x) - \rho(x)\big]^2 \, dx.$$
Remarkably, the risk function $R(\rho)$ is strongly displacement convex (provided $f$ is strongly concave). A long line of work in PDE and optimal transport theory establishes dimension-free convergence rates for its $W_2$ gradient flow (Carrillo, McCann, and Villani 2003). By putting everything together, we are able to show that SGD converges exponentially fast to a near-global optimum, with a rate that is controlled by the convexity parameter of $f$.
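A heuristic way to see where the concavity of $f$ enters is to expand the square in $R(\rho)$ and read off two classical displacement-convex building blocks; this is only a sketch using the standard whole-space criteria, while the rigorous argument in the paper also has to deal with the boundary of $\Omega$.

```latex
% The \int f^2 term does not depend on rho. The linear term is a potential
% energy with potential V = -2 f / |Omega|, which is (strongly) displacement
% convex when V is (strongly) convex, i.e., when f is (strongly) concave.
% The quadratic term is an internal energy with U(rho) = rho^2, which
% satisfies McCann's criterion and is therefore displacement convex.
\[
R(\rho)
  = \underbrace{\frac{1}{|\Omega|}\int_\Omega f(x)^2\, dx}_{\text{constant in } \rho}
  + \underbrace{\int_\Omega V(x)\, \rho(x)\, dx}_{\text{potential energy},\ V = -2f/|\Omega|}
  + \underbrace{\frac{1}{|\Omega|}\int_\Omega \rho(x)^2\, dx}_{\text{internal energy},\ U(\rho) = \rho^2}.
\]
```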
A numerical illustration. We demonstrate in a simple numerical example that the convergence rate predicted by our asymptotic theory is in excellent agreement with simulations. In the left column of the figure below, we plot the true function $f$ together with the neural network estimate obtained by running stochastic gradient descent (SGD) with $N = 200$ neurons at several points in time $t$. Different plots correspond to different values of $\delta$, with $\delta \in \{1/5, 1/10, 1/20\}$. We observe that the network estimates converge to a limit curve which is an approximation of the true function $f$. As expected, the quality of the approximation improves as $\delta$ gets smaller. In the right column, we report the evolution of the population risk and we compare it to the risk predicted by the limit PDE (corresponding to $\delta = 0$). The PDE curve appears to capture well the evolution of SGD towards optimality.
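The following self-contained sketch mimics this experiment in $d = 1$ (it is not the authors' code): a Gaussian bump kernel, a particular strongly concave $f$ on $[0, 1]$, an $N$-scaled step size matching the $1/N$ parameterization of the model, and clipping to $[0, 1]$ as a crude stand-in for the reflecting (Neumann) boundary condition are all illustrative assumptions.

```python
import numpy as np

# Online SGD for f_hat(x; w) = (1/N) sum_i K(x - w_i) on Omega = [0, 1],
# monitoring a grid proxy for the population risk.
rng = np.random.default_rng(0)
N, delta, lr, n_steps = 200, 1 / 10, 0.002, 200_000

def f_true(x):
    """Strongly concave target on [0, 1] (illustrative choice)."""
    return 1.0 + 2.0 * x * (1.0 - x)

def K(z):
    """Gaussian bump kernel of width delta (illustrative choice)."""
    return np.exp(-0.5 * (z / delta) ** 2) / (delta * np.sqrt(2.0 * np.pi))

def dK(z):
    """Derivative of the bump kernel with respect to its argument."""
    return -(z / delta ** 2) * K(z)

def risk(w, xs):
    """Grid proxy for the population risk (1/|Omega|) int (f - f_hat)^2 dx."""
    return np.mean((f_true(xs) - K(xs[:, None] - w[None, :]).mean(axis=1)) ** 2)

xs = np.linspace(0.0, 1.0, 201)
w = rng.uniform(0.0, 1.0, size=N)           # initial bump centers ~ Unif(Omega)
for t in range(n_steps):
    x = rng.uniform(0.0, 1.0)               # x ~ Unif(Omega)
    y = f_true(x) + 0.1 * rng.normal()      # noisy label with E[y | x] = f(x)
    resid = y - K(x - w).mean()             # y - f_hat(x; w)
    grad = (2.0 / N) * resid * dK(x - w)    # d/dw_i of the loss (y - f_hat(x; w))^2
    w -= lr * N * grad                      # N-scaled SGD step (mean-field scaling)
    w = np.clip(w, 0.0, 1.0)                # crude stand-in for reflecting boundaries
    if t % 50_000 == 0:
        print(f"step {t:7d}   risk ~ {risk(w, xs):.4f}")
print(f"final risk ~ {risk(w, xs):.4f}")
```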
References

Carrillo, J. A., R. J. McCann, and C. Villani. 2003. "Kinetic Equilibration Rates for Granular Media and Related Equations: Entropy Dissipation and Mass Transportation Estimates." Revista Matemática Iberoamericana.

Javanmard, A., M. Mondelli, and A. Montanari. 2020. "Analysis of a Two-Layer Neural Network via Displacement Convexity." Annals of Statistics.