A small step towards understanding Generative Adversarial Networks
Introduction
In the last decade, there have been spectacular advances on the practical side of machine learning. One of the most impressive may be the success of Generative Adversarial Networks (GANs) for image generation (Goodfellow et al. 2014). State-of-the-art models are capable of producing portraits of fake persons that look perfectly authentic to you and me (see e.g. (Salimans et al. 2016) and (Karras et al. 2018)). Other domains, such as inpainting, text-to-image synthesis and speech, have also seen outstanding results (see (Goodfellow 2016) and (Jabbar, Li, and Bourahla 2020)).
Since their introduction by (Goodfellow et al. 2014), GANs have generated considerable excitement in the machine learning community, leading to a large number of variants and possible applications, often referred to as the GAN Zoo. However, despite increasingly spectacular applications, until a few years ago little was known about the statistical properties of GANs.
This post sketches the paper entitled “Some Theoretical Properties of GANs’’ (G. Biau, B. Cadre, M. Sangnier and U. Tanielian, The Annals of Statistics, 2020), which builds a statistical analysis of GANs in order to better understand their mathematical mechanism. In particular, it proves a non-asymptotic bound on the excess Jensen-Shannon error and the asymptotic normality of the parametric estimator.
Mathematical framework
Overview of the method
The objective of GANs is to randomly generate artificial contents similar to some data. Put another way, they are aimed at sampling according to an unknown distribution P⋆, based solely on i.i.d. observations X1,…,Xn drawn according to P⋆. Obviously, a naive approach would be to:
- Estimate the distribution P⋆ by some ˆP.
- Sample according to ˆP.
However, both tasks are difficult in themselves. In particular, density estimation is made arduous by the complexity and high dimensionality of the data involved in this domain, rendering both standard parametric and nonparametric approaches unworkable. Thus, GANs offer a completely different way to achieve our goal, often compared to the struggle between a police team, trying to distinguish true banknotes (the observed data X1,…,Xn) from false ones (the generated data), and a team of counterfeiters (the generator), striving to produce banknotes as credible as possible in order to mislead the police.
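To make the naive two-step route concrete, here is a minimal sketch in Python (using NumPy and scikit-learn, which are illustrative choices and not tools from the paper): estimate the density with a kernel density estimator, then sample from the estimate. This works in the toy two-dimensional example below, but becomes unworkable for images or other high-dimensional data, which is precisely what motivates GANs.

```python
# Naive approach: (1) estimate the density, (2) sample from the estimate.
# Illustrative only: kernel density estimation breaks down in high dimension.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Observations X_1, ..., X_n drawn from the unknown distribution P*
# (here, a 2-D Gaussian mixture plays the role of P*).
n = 1000
X = np.concatenate([
    rng.normal(loc=-2.0, scale=0.5, size=(n // 2, 2)),
    rng.normal(loc=+2.0, scale=0.5, size=(n // 2, 2)),
])

# Step 1: estimate P* by some P_hat (a kernel density estimate).
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X)

# Step 2: sample new points according to P_hat.
new_points = kde.sample(n_samples=5, random_state=0)
print(new_points)
```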
To be a bit more specific, there are two brilliant ideas at the core of GANs:
- Sample data in a very straightforward manner thanks to the transform method: let G={Gθ:Rℓ→Rd, θ∈Θ}, where ℓ is the dimension of what is called the latent space and Θ⊂Rp, be a class of measurable functions, called generators (in practice G is often a class of neural networks with p parameters). Now, let us sample Z∼N(0,Iℓ) and compute U=Gθ(Z). Then, U is an observation drawn according to the distribution Pθ=Gθ#N(0,Iℓ), the push-forward measure of the latent distribution N(0,Iℓ) through Gθ. In other words, the statistical model for the estimation of P⋆ has the form P={Pθ=Gθ#N(0,Iℓ), θ∈Θ} and it is definitely straightforward to sample according to Pθ (see the sketch after this list).
- Assess the proximity between Pθ and P⋆ by comparing two samples X1,…,Xn i.i.d. ∼ P⋆ and U1,…,Un i.i.d. ∼ Pθ. What does comparing mean? Assume the group of X1,…,Xn is very difficult to “separate’’ from the group of U1,…,Un, or, put another way, that it is very difficult to distinguish the class of X1,…,Xn from the class of U1,…,Un. Would you be convinced that the two distributions Pθ and P⋆ are very close (at least for large n)? That is exactly the point.
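To fix ideas on the first point, here is a minimal sketch of the transform method, assuming a generator implemented as a small fully connected network in PyTorch (the library, the architecture and the dimensions are illustrative choices): sampling according to Pθ=Gθ#N(0,Iℓ) simply amounts to pushing Gaussian noise through Gθ.

```python
# Transform method: U = G_theta(Z) with Z ~ N(0, I_l) is a draw from
# the push-forward distribution P_theta = G_theta # N(0, I_l).
import torch
import torch.nn as nn

latent_dim = 8    # dimension l of the latent space
data_dim = 2      # dimension d of the data space

# A small fully connected generator G_theta (illustrative architecture).
generator = nn.Sequential(
    nn.Linear(latent_dim, 64),
    nn.ReLU(),
    nn.Linear(64, data_dim),
)

# Sampling according to P_theta is straightforward:
z = torch.randn(100, latent_dim)   # Z_1, ..., Z_100 ~ N(0, I_l)
u = generator(z)                   # U_1, ..., U_100 ~ P_theta
print(u.shape)                     # torch.Size([100, 2])
```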
Comparing distributions
At this point, Task 2 (comparing the two samples) is still a bit blurry and deserves further details about how to quantify the difficulty (or the ease) of separating the two classes X1,…,Xn and U1,…,Un. This problem is actually closely related to supervised learning, and in particular to classification: assume that a classifier, say h:Rd→{0,1}, manages to perfectly discriminate the two classes, that is P(h(X1)=1)=P(h(U1)=0)=1; then we can say that the two distributions Pθ and P⋆ are different. Conversely, if the classifier is fooled, that is P(h(X1)=1)=P(h(U1)=0)=1/2, we may accept that the two distributions are identical.
This classification setting is formalized as follows:
let (X,Y) be a pair of random variables taking values in Rd×{0,1} such that:
$$X \mid Y = 1 \sim P^\star \quad \text{and} \quad X \mid Y = 0 \sim P_\theta.$$
The sample $\{(X_1, 1), \dots, (X_n, 1), (U_1, 0), \dots, (U_n, 0)\}$, built by putting together observed and generated data, fits this classification model and can serve to estimate a classifier, namely a discriminator Dα chosen in a class of discriminators D={Dα:Rd→(0,1), α∈Λ} (in practice, again a class of neural networks), by maximizing the conditional log-likelihood:
$$\hat{\alpha} \in \operatorname*{argmax}_{\alpha \in \Lambda} \hat{L}(\theta, \alpha), \quad \text{where} \quad \hat{L}(\theta, \alpha) = \frac{1}{n} \sum_{i=1}^n \log\big(D_\alpha(X_i)\big) + \frac{1}{n} \sum_{i=1}^n \log\big(1 - D_\alpha(U_i)\big).$$
In statistical terms, P⋆ can be estimated by Pˆθ, where:
$$\hat{\theta} \in \operatorname*{argmin}_{\theta \in \Theta} \, \sup_{\alpha \in \Lambda} \hat{L}(\theta, \alpha).$$
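Concretely, this min-max problem is typically tackled by alternating stochastic gradient steps: an ascent step on α (the discriminator tries to maximize ˆL) and a descent step on θ (the generator tries to minimize the resulting criterion). Below is a minimal, self-contained sketch assuming PyTorch, with a toy data set and arbitrary architectures, optimizers and step sizes (illustrative choices, not those of the paper).

```python
# Alternating optimization of the GAN criterion
# L_hat(theta, alpha) = mean log D_alpha(X_i) + mean log(1 - D_alpha(U_i)).
import torch
import torch.nn as nn

latent_dim, data_dim, batch_size = 8, 2, 128

generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

real_data = torch.randn(10_000, data_dim) + 3.0  # toy stand-in for X_1, ..., X_n
eps = 1e-7                                       # numerical safety inside the logarithms

for step in range(1000):
    x = real_data[torch.randint(0, real_data.shape[0], (batch_size,))]
    z = torch.randn(batch_size, latent_dim)

    # Ascent step on alpha: maximize L_hat(theta, alpha), i.e. minimize its opposite.
    u = generator(z).detach()
    L_hat = torch.log(discriminator(x) + eps).mean() + torch.log(1 - discriminator(u) + eps).mean()
    opt_d.zero_grad()
    (-L_hat).backward()
    opt_d.step()

    # Descent step on theta: minimize sup_alpha L_hat, approximated with the current alpha.
    u = generator(torch.randn(batch_size, latent_dim))
    L_hat = torch.log(discriminator(x) + eps).mean() + torch.log(1 - discriminator(u) + eps).mean()
    opt_g.zero_grad()
    L_hat.backward()
    opt_g.step()
```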
The story of GANs is, however, not quite that simple since, in practice, we never have access to Pˆθ, which may be a very complicated object, but only to the generator Gˆθ. Fortunately, our aim is only to sample according to P⋆, which can be achieved (up to the estimation error) by computing Gˆθ(Z), where Z∼N(0,Iℓ).
Actually, in this work, Pθ is just a mathematical object that helps to understand GANs. To go into details, the forthcoming results are based on the assumption that all distributions in play are absolutely continuous with respect to a known measure μ (typically the Hausdorff measure on some submanifold of Rd); probability density functions are denoted by lowercase letters (in particular P⋆=p⋆dμ and Pθ=pθdμ).
Results
Concerning the comparison of distributions
In order to give a mathematical foundation to our intuition in Task 2, it may be useful to analyze the large-sample case, where
$$\hat{L}(\theta, \alpha) \approx \mathbb{E}\big[\hat{L}(\theta, \alpha)\big] = \mathbb{E}\big[\log\big(D_\alpha(X_1)\big)\big] + \mathbb{E}\big[\log\big(1 - D_\alpha \circ G_\theta(Z_1)\big)\big].$$
If the class of discriminators D={Dα, α∈Λ} is rich enough to contain the “optimal’’ discriminator $D^\star_\theta = \frac{p^\star}{p_\theta + p^\star}$ for all θ∈Θ,
then
$$\sup_{\alpha \in \Lambda} \mathbb{E}\big[\hat{L}(\theta, \alpha)\big] = 2\, D_{\mathrm{JS}}(P^\star, P_\theta) - \log 4,$$
where $D_{\mathrm{JS}}$ denotes the Jensen-Shannon divergence.
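For completeness, here is the classical computation behind this identity (essentially Proposition 1 of Goodfellow et al. 2014), written with the densities introduced above. For a fixed θ, the population criterion can be written, for any measurable discriminator D with values in (0,1), as
$$\mathbb{E}\big[\log D(X_1)\big] + \mathbb{E}\big[\log\big(1 - D(G_\theta(Z_1))\big)\big] = \int \Big( p^\star \log D + p_\theta \log(1 - D) \Big)\, \mathrm{d}\mu.$$
Pointwise, $t \mapsto a \log t + b \log(1 - t)$ is maximized at $t = a/(a+b)$, which yields the optimal discriminator $D^\star_\theta = p^\star/(p^\star + p_\theta)$. Plugging it back into the criterion gives
$$\int p^\star \log \frac{p^\star}{p^\star + p_\theta}\, \mathrm{d}\mu + \int p_\theta \log \frac{p_\theta}{p^\star + p_\theta}\, \mathrm{d}\mu = D_{\mathrm{KL}}\Big(P^\star \,\Big\|\, \frac{P^\star + P_\theta}{2}\Big) + D_{\mathrm{KL}}\Big(P_\theta \,\Big\|\, \frac{P^\star + P_\theta}{2}\Big) - \log 4 = 2\, D_{\mathrm{JS}}(P^\star, P_\theta) - \log 4.$$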
Two things can be learned from this first result (still assuming that D contains “optimal’’ discriminators):
- Up to the approximation capacity of D, $\sup_{\alpha \in \Lambda} \hat{L}(\theta, \alpha)$ does reflect the proximity between Pθ and P⋆ (via an approximation of the Jensen-Shannon divergence, up to an affine transformation).
- ˆθ cannot be better than $\theta^\star \in \operatorname{argmin}_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \mathbb{E}\big[\hat{L}(\theta, \alpha)\big] = \operatorname{argmin}_{\theta \in \Theta} D_{\mathrm{JS}}(P^\star, P_\theta)$, which leads to the approximation Pθ⋆ of P⋆ obtained by minimizing the Jensen-Shannon divergence.
Non-asymptotic bound on Jensen-Shannon error
In practice, GANs depart from this ideal picture in two ways:
- A limited approximation capacity for the class of discriminators D (which may not contain the “optimal’’ discriminator $D^\star_\theta = \frac{p^\star}{p_\theta + p^\star}$ for all θ∈Θ): $\sup_{\alpha \in \Lambda} \mathbb{E}\big[\hat{L}(\theta, \alpha)\big] < 2\, D_{\mathrm{JS}}(P^\star, P_\theta) - \log 4$.
- A finite-sample approximation: the criterion actually optimized is ˆL(θ,α) instead of its expectation E[ˆL(θ,α)].
These limitations introduce two kinds of error in the estimation procedure: an approximation error (or bias), governed by the richness of D, and an estimation error (or variance), arising from the finiteness of the sample.
This can be formalized in the following manner: assume some first-order regularity conditions on the models P, G and D, and assume that the optimal discriminators D⋆θ can be approximated by elements of D up to an error ϵ>0 in L2-norm. Then:
$$\mathbb{E}\big[D_{\mathrm{JS}}(P^\star, P_{\hat{\theta}})\big] - D_{\mathrm{JS}}(P^\star, P_{\theta^\star}) = O\Big(\epsilon^2 + \frac{1}{\sqrt{n}}\Big).$$
This result explains quantitatively why the discriminators in GANs have to be tuned carefully: on the one hand, poor discriminators induce an uncontrolled gap between $\sup_{\alpha \in \Lambda} \mathbb{E}\big[\hat{L}(\theta, \alpha)\big]$ and $2\, D_{\mathrm{JS}}(P^\star, P_\theta) - \log 4$; on the other hand, very flexible discriminators may lead to overfitting the finite sample.
The first assertion is illustrated in the next figure. The numerical experiment has been set up with classes of fully connected neural networks for the generators and the discriminators (respectively G and D) and n sufficiently large. The depth of the generators is either 2 (blue bars) or 3 (green bars) and the depth of the discriminators ranges from 2 to 5 (from left to right). As expected, it appears clearly that the more flexible the discriminators are (from left to right), the smaller DJS(P⋆,Pˆθ) is. Naturally, this divergence also decreases as the class of generators G becomes richer (at least in a first regime).
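To give a rough idea of what such a setup looks like, here is a minimal sketch of fully connected generators and discriminators with configurable depth, assuming PyTorch (the hidden width, activations and dimensions are illustrative guesses, not the exact configuration behind the figure).

```python
# Fully connected networks with configurable depth, in the spirit of the
# experiment: generators of depth 2 or 3, discriminators of depth 2 to 5.
import torch.nn as nn

def make_mlp(in_dim, out_dim, depth, width=32, final_activation=None):
    """Stack `depth` linear layers of width `width`, with ReLU in between."""
    dims = [in_dim] + [width] * (depth - 1) + [out_dim]
    layers = []
    for i in range(depth):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < depth - 1:
            layers.append(nn.ReLU())
    if final_activation is not None:
        layers.append(final_activation)
    return nn.Sequential(*layers)

latent_dim, data_dim = 8, 2
generators = {depth: make_mlp(latent_dim, data_dim, depth) for depth in (2, 3)}
discriminators = {depth: make_mlp(data_dim, 1, depth, final_activation=nn.Sigmoid())
                  for depth in (2, 3, 4, 5)}
```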
Asymptotic normality
As a second important result, it can be shown that the estimator ˆθ is asymptotically normal with convergence rate √n.
More formally, let us assume that ˉθ ∈ $\operatorname{argmin}_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \mathbb{E}\big[\hat{L}(\theta, \alpha)\big]$ exists and is unique.
Let us also assume some second-order regularity conditions on the models P, G and D,
as well as well-definedness and smoothness of the map $\theta \mapsto \operatorname{argmax}_{\alpha \in \Lambda} \mathbb{E}\big[\hat{L}(\theta, \alpha)\big]$ in a neighborhood of ˉθ.
Then, there exists a covariance matrix Σ such that:
$$\sqrt{n}\,\big(\hat{\theta} - \bar{\theta}\big) \xrightarrow{\ d\ } N(0, \Sigma).$$
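One practical consequence, sketched below under the (strong) additional assumption that an estimate of Σ is available, is the usual construction of asymptotic confidence intervals for the components of ˉθ, exactly as for a standard M-estimator (the function and variable names are hypothetical, for illustration only).

```python
# Asymptotic confidence intervals for the components of theta_bar,
# based on sqrt(n) * (theta_hat - theta_bar) ~ N(0, Sigma).
# Assumes theta_hat and an estimate sigma_hat of Sigma are available
# (estimating Sigma is a separate, non-trivial problem).
import numpy as np
from scipy import stats

def asymptotic_ci(theta_hat, sigma_hat, n, level=0.95):
    """Component-wise confidence intervals, as for a standard M-estimator."""
    z = stats.norm.ppf(0.5 + level / 2)              # e.g. 1.96 for level = 0.95
    half_width = z * np.sqrt(np.diag(sigma_hat) / n)
    return np.stack([theta_hat - half_width, theta_hat + half_width], axis=1)

# Toy usage with made-up numbers (one row per parameter: [lower, upper]):
theta_hat = np.array([0.8, -1.2])
sigma_hat = np.array([[2.0, 0.3], [0.3, 1.5]])
print(asymptotic_ci(theta_hat, sigma_hat, n=5000))
```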
Conclusion
GANs have been statistically analyzed from the estimation point of view. Even though some simplifications were made (known dominating measure μ, uniqueness of some quantities) compared to the empirical setting based on deep neural networks, the theoretical results show the importance of correctly tuning the architecture of the discriminators, and exhibit an asymptotic behavior similar to that of a standard M-estimator.
It remains to study the impact of the architecture of neural nets on the performance of GANs, as well as their behavior in an overparametrized regime. But that’s a different story.
This post is based on
G. Biau, B. Cadre, M. Sangnier and U. Tanielian. 2020. “Some Theoretical Properties of GANs.’’ The Annals of Statistics 48(3): 1539-1566.