Linear-cost unbiased estimator for large crossed random effect models via couplings
In the following we show how to obtain parallelizable, unbiased and computationally cheap estimates for crossed random effects models, with a cost that is linear in the number of datapoints (and parameters), by exploiting couplings.
Crossed random effects models (CREM)
CREMs model a continuous response variable $Y$ as depending on the sum of unknown effects of $K$ categorical predictors. Think of $y_n$ as the rating given to a university course, along with some factors potentially impacting such a score, e.g. student identity, course code, the department teaching it, the professor, etc. The aim of the model is to investigate the effect of each of those factors on the overall score. In its simplest version (i.e. the linear, intercept-only case), the model takes the form:
$$\mathcal{L}(y_n) = f\Big(\mu + \sum_{k=1}^{K} a^{(k)}_{i_k[n]},\ \tau_0^{-1}\Big) \qquad \text{for } n = 1, \dots, N,$$

where $f$ is the response distribution (Gaussian in the basic case), $\tau_0$ is the noise precision, and $i_k[n] \in \{1, \dots, I_k\}$ denotes the level of factor $k$ associated with observation $n$.
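To make the notation concrete, here is a minimal R sketch (not the authors' code) that simulates data from a Gaussian CREM with $K = 2$ factors; all sizes and variances below are arbitrary illustrative choices.

```r
# Minimal simulation from a Gaussian CREM with K = 2 factors.
# All sizes and variances are illustrative choices.
set.seed(1)
N  <- 1000                 # number of observations
I1 <- 50; I2 <- 50         # number of levels of each factor
mu <- 1; tau0 <- 1         # global intercept and noise precision

a1 <- rnorm(I1, 0, 0.5)    # unknown effects of factor 1
a2 <- rnorm(I2, 0, 0.5)    # unknown effects of factor 2

i1 <- sample.int(I1, N, replace = TRUE)  # i_1[n]: level of factor 1
i2 <- sample.int(I2, N, replace = TRUE)  # i_2[n]: level of factor 2

y <- rnorm(N, mean = mu + a1[i1] + a2[i2], sd = 1 / sqrt(tau0))
```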
We are interested in studying how the cost of estimating the unknown effects scales as the number of observations $N$ and the number of parameters $p$ grow to infinity. Our goal is an algorithm whose complexity scales linearly in $N$ and $p$; we call such algorithms "scalable". Both in the frequentist and in the Bayesian literature these models are difficult to estimate: Gao and Owen (2016a) and Gao and Owen (2016b) showed that "vanilla" implementations of GLS and of Gibbs samplers have a computational cost that grows at least as fast as $O(N^{3/2})$. Recent works by Papaspiliopoulos, Roberts, and Zanella (2019) and Ghosh, Hastie, and Owen (2022) proposed, respectively, a collapsed Gibbs sampler and a "backfitting" iterative algorithm whose computational costs are linear in the number of observations; in particular, the MCMC chain induced by the collapsed scheme is proved to have a mixing time that is $O(1)$ under certain asymptotic regimes.
It is possible to further improve MCMC estimates by exploiting couplings: as shown in Jacob, O'Leary, and Atchadé (2020) and Glynn and Rhee (2014), coupling two MCMC chains yields unbiased estimators of posterior quantities, provided that the two coupled chains become exactly equal after a finite number of iterations. Furthermore, the same construction provides theoretical foundations for early stopping of the chains (once they have met) and allows independent replicates to be run in parallel.
The extra computational cost one has to pay is the product of the cost per iteration and the expected number of iterations needed for coalescence. As for the former, many coupling algorithms can be devised whose per-iteration cost is easily computable and linear in the number of observations. As for the expected meeting time, for chains arising from a Gibbs sampling scheme targeting Gaussian distributions it can be shown that it is directly related to the mixing time of the single Markov chain, differing from the latter only by a logarithmic factor up to constants (see the sections below for more details). Hence chains that mix fast also meet in a small number of iterations, and therefore provide unbiased estimates at low computational cost.
Couplings for estimation
Formally, given random variables $X \sim P$ and $Y \sim Q$, a coupling of the two is a pair of random variables $(X, Y)$ on the joint space such that marginally $X \sim P$ and $Y \sim Q$. Clearly, given two marginal distributions, there are infinitely many joint distributions with those marginals. Below are some possible couplings of a $N(1,1)$ (on the x-axis) and a $N(0,1)$ (on the y-axis). Clockwise from top left: maximal independent (i.e. a coupling maximizing the probability of equal draws), maximal reflection, independent (bivariate independent normal) and $W_2$-optimal (maximally correlated draws).
*Figure: Couplings of Gaussian distributions*
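As an illustration, the following R sketch draws one pair from a maximal coupling of two univariate Gaussians via the standard rejection construction (see Jacob, O'Leary, and Atchadé (2020)); the function name `rmax_coupling` is ours.

```r
# One draw from a maximal coupling of P = N(mu1, sd1^2) and
# Q = N(mu2, sd2^2) via the standard rejection construction.
rmax_coupling <- function(mu1, mu2, sd1 = 1, sd2 = 1) {
  x <- rnorm(1, mu1, sd1)
  # Accept x as the draw from Q with probability min(1, q(x) / p(x)):
  if (runif(1) * dnorm(x, mu1, sd1) <= dnorm(x, mu2, sd2)) {
    return(c(X = x, Y = x))            # the two draws coincide
  }
  # Otherwise, sample Y from the residual part of Q by rejection:
  repeat {
    y <- rnorm(1, mu2, sd2)
    if (runif(1) * dnorm(y, mu2, sd2) > dnorm(y, mu1, sd1)) {
      return(c(X = x, Y = y))
    }
  }
}

rmax_coupling(1, 0)   # couples N(1, 1) and N(0, 1) as in the plot above
```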
Couplings can be used to obtain unbiased estimators in MCMC inference, as shown in Jacob, O'Leary, and Atchadé (2020).
Given a target probability distribution $\pi$ and an integrable function $h$, we are interested in estimating
$$\mathbb{E}_\pi[h(\Theta)] = \int h(\theta)\, \pi(d\theta).$$
*Figure: Sample path of a single Markov chain*
Instead of waiting until convergence, one can run two coupled Markov chains $(\Theta^1_t)_{t \ge 0}$ and $(\Theta^2_t)_{t \ge 0}$, which marginally start from some base distribution $\pi_0$ and evolve according to the same kernel $P$, but with some correlation induced so that the chains meet after an almost surely finite number of iterations. In practice, at iteration $t$, instead of sampling $\Theta^1_{t+1} \sim P(\Theta^1_t, \cdot)$ and $\Theta^2_{t+1} \sim P(\Theta^2_t, \cdot)$ independently, we sample the pair from a coupling of these two conditional distributions.
*Figure: Sample paths of coupled Markov chains*
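Schematically, the coupled chains can be run as in the sketch below; `coupled_kernel` and `pi0` are placeholders for a user-supplied coupling of the transition kernel $P$ with itself (one possibility is sketched in the next section) and for the initial distribution. The loop runs for $\max(m, \tau)$ iterations, as required by the estimator below.

```r
# Schematic driver: run two coupled chains until they have met AND at
# least m iterations have been performed, i.e. for max(m, tau) steps.
# `coupled_kernel(t1, t2)` returns the next coupled pair as a list;
# `pi0()` draws from the initial distribution.  If the chains do not
# meet within max_iter iterations, tau is returned as Inf.
run_coupled_chains <- function(coupled_kernel, pi0, m, max_iter = 1e5) {
  theta1 <- pi0(); theta2 <- pi0()
  path1 <- list(theta1); path2 <- list(theta2)
  tau <- Inf; t <- 0
  while ((t < m || t < tau) && t < max_iter) {
    pair <- coupled_kernel(theta1, theta2)
    theta1 <- pair$theta1; theta2 <- pair$theta2
    t <- t + 1
    path1[[t + 1]] <- theta1; path2[[t + 1]] <- theta2
    if (is.infinite(tau) && identical(theta1, theta2)) tau <- t
  }
  list(path1 = path1, path2 = path2, tau = tau)
}
```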
Then, for any fixed $m \ge k$, we can run the coupled chains for $\max(m, \tau)$ iterations, where $\tau$ is the meeting time, and

$$H_{k:m} = \frac{1}{m-k+1}\sum_{l=k}^{m} h(\Theta^1_l) + \sum_{l=k+1}^{\tau} \min\left(1, \frac{l-k}{m-k+1}\right)\big(h(\Theta^1_l) - h(\Theta^2_l)\big) = \mathrm{MCMC}_{k:m} + \mathrm{BC}_{k:m}$$

is an unbiased estimator of $\mathbb{E}_\pi[h(\Theta)]$: the first term is the usual ergodic average, the second a bias correction.
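Given the evaluations of $h$ along the two chains, $H_{k:m}$ is straightforward to compute. Here is a small sketch, under the convention (ours) that `h1[l]` and `h2[l]` hold $h(\Theta^1_l)$ and $h(\Theta^2_l)$ with iterations indexed from 1.

```r
# Unbiased estimator H_{k:m} from the h-evaluations of two coupled
# chains: h1[l] = h(Theta^1_l), h2[l] = h(Theta^2_l) (1-indexed),
# tau = (finite) meeting time, with k <= m <= length(h1).
unbiased_estimate <- function(h1, h2, k, m, tau) {
  mcmc_term <- mean(h1[k:m])              # standard ergodic average
  bc_term <- 0                            # bias-correction term
  if (tau >= k + 1) {
    l <- (k + 1):tau
    w <- pmin(1, (l - k) / (m - k + 1))   # weights min(1, (l-k)/(m-k+1))
    bc_term <- sum(w * (h1[l] - h2[l]))
  }
  mcmc_term + bc_term
}
```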
For more details, have a look at https://sites.google.com/site/pierrejacob/cmclectures.
In order to yield small meeting times, we implement a "two-step" coupling strategy: whenever the chains are "far away" (in a sense that will be made precise later), use a coupling whose aim is to bring the realizations closer to each other; whenever they are "close enough", choose a coupling maximizing the meeting probability. The heuristic behind this construction is that whenever a maximal coupling fails, the components are sampled far apart in the space, reducing the coalescence probability of the next steps.
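A minimal sketch of one such two-step update for a scalar Gaussian full conditional is given below, reusing `rmax_coupling` from above. For equal-variance Gaussians, the $W_2$-optimal coupling amounts to common random numbers, which contracts the distance between the chains; the threshold `delta` and the use of the rejection-based maximal coupling in place of the paper's maximal reflection coupling are our illustrative choices.

```r
# One "two-step" coupled update of a Gaussian full conditional:
# N(m1, s^2) for chain 1 and N(m2, s^2) for chain 2.  Far apart ->
# W2-optimal coupling (common random numbers, contractive);
# close -> maximal coupling to trigger coalescence.
two_step_update <- function(m1, m2, s, delta = 1) {
  if (abs(m1 - m2) > delta * s) {
    z <- rnorm(1)                    # W2-optimal: share the same noise
    c(X = m1 + s * z, Y = m2 + s * z)
  } else {
    rmax_coupling(m1, m2, s, s)      # maximal coupling (sketched above)
  }
}
```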
Bound on coupling time
Consider $(\theta_t)_{t \ge 0} = (\theta^1_t, \theta^2_t)_{t \ge 0}$, two $\pi$-reversible Markov chains arising from a Gibbs sampler targeting $\pi = N(\mu, \Sigma)$. If a "two-step" strategy is implemented, with maximal reflection and $W_2$-optimal couplings, then for every $\delta > 0$ the meeting time $T$ is bounded by

$$\mathbb{E}[T \mid \theta_0] \le 5 + 3\max\left(n^*_\delta,\; T_{\mathrm{rel}}\left[\frac{\ln(T_{\mathrm{rel}})}{2} + C_0 + C_\varepsilon\right](1+\delta)\right),$$

where $T_{\mathrm{rel}}$ is the relaxation time of the single chain and $n^*_\delta$, $C_0$, $C_\varepsilon$ are constants.
Given the above, if one is able to design a single MCMC chain that mixes in, say, $O(1)$ iterations, then the extra cost of an unbiased estimate is nothing more than a logarithmic factor plus a constant.
Simulations
We simulate data from the model under different asymptotic regimes and parameter specifications, and study the behaviour of the meeting times as the dimensionality of the model increases. We implement the "two-step" algorithm, using maximal reflection and $W_2$-optimal couplings.
We consider two asymptotic regimes, both of the "outfill" type, where the number of parameters $p$ and the number of observations $N$ increase together, but at different relative speeds according to the chosen setting.
Consider first a model with $K = 2$ factors and $I_1 = I_2 \in \{50, 100, 250, 500, 750, 1000\}$ possible levels. Suppose that the number of observations grows quadratically with respect to the number of parameters, i.e. $N = O(p^2)$.
*Figure: Average meeting times for Gaussian CREMs*
The simulation results suggest that the expected number of iterations remains $O(1)$ as $N$ increases, while it diverges for the plain vanilla Gibbs sampler (not plotted because of the different scale). It is also clear that the bound closely tracks the estimated average meeting times.
We now consider an even sparser asymptotic regime, where the number of observations grows at the same rate as the number of parameters, i.e. $N / p \approx \alpha$.
*Figure: Average meeting times for Gaussian CREMs*
Lastly, we want to highlight that while the bound presented in the theorem above only applies to Gibbs samplers targeting Gaussian distributions, the methodology remains valid for general target distributions and still yields small meeting times. We report below the average meeting times of chains targeting a CREM with a Laplace response:
$$\mathcal{L}(y_n) = \mathrm{Laplace}\Big(\mu + \sum_{k=1}^{K} a^{(k)}_{i_k[n]}\Big) \qquad \text{for } n = 1, \dots, N.$$
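Simulating from this variant is just as easy; the sketch below reuses `mu`, `a1`, `a2`, `i1`, `i2` from the Gaussian simulation above, and the scale `b` is an illustrative choice. Base R has no Laplace sampler, but a Laplace(0, b) draw equals $b(E_1 - E_2)$ with $E_1, E_2 \sim \mathrm{Exp}(1)$ independent.

```r
# Laplace-response CREM: same linear predictor, Laplace noise.
b <- 1                                        # illustrative scale
y_laplace <- mu + a1[i1] + a2[i2] + b * (rexp(N) - rexp(N))
```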
*Figure: Average meeting times for Laplace CREMs*