Fairness and discrimination, PhD Course, #4 Wasserstein Distances and Optimal Transport
For the fourth course, we will discuss the Wasserstein distance and optimal transport. Last week, we mentioned distances, dissimilarities and divergences. But before talking about Wasserstein, we should mention the Cramér distance.
Cramér and Wasserstein distances
The definition of the Cramér distance, for \(k\geq1\), is \(d_k(F_{A},F_{B})=\left(\int_{\mathbb{R}}\big|F_{A}(x)-F_{B}(x)\big|^k\,dx\right)^{1/k},\) while the Wasserstein distance is (also for \(k\geq1\)) \(W_k(F_{A},F_{B})=\left(\int_0^1\big|F_{A}^{-1}(u)-F_{B}^{-1}(u)\big|^k\,du\right)^{1/k}.\)
If we consider cumulative distribution functions, for the first one (Cramér), we consider some sort of “vertical” distance, while for the second one (Wasserstein), we consider some “horizontal” one,
Obviously, when \(k=1\), the two distances are identical
c1 = function(x) abs(pnorm(x,0,1)-pnorm(x,1,2))   # |F_A(x)-F_B(x)|, the "vertical" distance, between N(0,1) and N(1,2^2)
w1 = function(u) abs(qnorm(u,0,1)-qnorm(u,1,2))   # |F_A^{-1}(u)-F_B^{-1}(u)|, the "horizontal" one
integrate(c1,-Inf,Inf)$value
[1] 1.166631
integrate(w1,0,1)$value
[1] 1.166636
But when \(k>1\), it is no longer the case.
c2 = function(x) (pnorm(x,0,1)-pnorm(x,1,2))^2   # squared "vertical" distance
w2 = function(u) (qnorm(u,0,1)-qnorm(u,1,2))^2   # squared "horizontal" distance
sqrt(integrate(c2,-Inf,Inf)$value)
[1] 0.5167714
sqrt(integrate(w2,0,1)$value)
[1] 1.414214
For instance, we can illustrate this with a simple multinomial distribution and its distance to some binomial one, with parametric inference based on distance minimization, \(\theta^\star=\text{argmin}\{d(p,q_{\theta})\}\) (here the multinomial distribution has parameters \(\boldsymbol{p}=(.5,.1,.4)\), taking values in \(\{0,1,10\}\), while the binomial distribution has probabilities \(\boldsymbol{q}_{\theta}=(1-\theta,\theta)\), taking values in \(\{0,10\}\)).
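To see what this estimator looks like, here is a minimal numerical sketch (our own computation, not the one from the slides), based on the \(k=1\) distance written from the two cumulative distribution functions,
# cdfs of the two distributions
Fp = function(x) .5*(x >= 0) + .1*(x >= 1) + .4*(x >= 10)      # multinomial on {0,1,10}
Fq = function(x, theta) (1-theta)*(x >= 0) + theta*(x >= 10)   # binomial-type on {0,10}
# k = 1 distance: integral of |Fp - Fq| over [0,10]
d1 = function(theta) integrate(function(x) abs(Fp(x) - Fq(x, theta)), 0, 10)$value
optimize(d1, c(0, 1))$minimum   # the optimal theta (close to .4 here)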
One can derive closed-form expressions for both distances in this example (given in the slides). When \(k=1\), the distance is easy to compute when the distributions are ordered, while when \(k=2\), the two distances are no longer equal.
In the Gaussian (and the Bernoulli) case, we can get an expression for the distance (and much more as we will see later on)
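For instance, in dimension one, \(W_2^2\big(\mathcal{N}(\mu_A,\sigma_A^2),\mathcal{N}(\mu_B,\sigma_B^2)\big)=(\mu_A-\mu_B)^2+(\sigma_A-\sigma_B)^2\), which we can double check against the numerical value obtained above for \(\mathcal{N}(0,1)\) and \(\mathcal{N}(1,2^2)\),
sqrt((0-1)^2 + (1-2)^2)   # closed-form W_2, to compare with 1.414214 above
[1] 1.414214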
There are several representations for \(W_2\)
And finally, we can also discuss \(W_{\infty}\)
Wasserstein distances, and optimal transport
The Wasserstein distance can also be written using expected values, when considering random variables instead of distributions, and some best-case scenario, or cheapest transportation cost, \(W_k^k(\mu_{A},\mu_{B})=\inf\big\lbrace\mathbb{E}\big[d(X_{A},X_{B})^k\big]\,:\,X_{A}\sim\mu_{A},\ X_{B}\sim\mu_{B}\big\rbrace,\) which leads to the so-called Kantorovich problem.
An alternative way to look at this problem is to consider a transport map \(\mathcal{T}\), and the associated push-forward measure \(\mathcal{T}_{\#}\mu_{A}\). This is simply the measure defined by \(\mathcal{T}_{\#}\mu_{A}(S)=\mu_{A}\big(\mathcal{T}^{-1}(S)\big)\), for any measurable set \(S\).
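As a tiny illustration (our example, not the one from the slides), the map \(\mathcal{T}=\Phi^{-1}\) pushes the uniform measure on \((0,1)\) forward to the standard Gaussian measure, i.e. \(\mathcal{T}_{\#}\mathcal{U}(0,1)=\mathcal{N}(0,1)\),
u = runif(1e5)
ks.test(qnorm(u), "pnorm")   # the push-forward sample is consistent with N(0,1)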
Of course, such mappings exist.
We can then consider the Monge problem, \(\inf\big\lbrace\mathbb{E}\big[d(X_{A},\mathcal{T}(X_{A}))^k\big]\,:\,\mathcal{T}_{\#}\mu_{A}=\mu_{B}\big\rbrace.\)
And interestingly, those two problems are (somehow) equivalent
Discrete case
If \(\boldsymbol{a}_{A}\in\mathbb{R}_+^{n_{A}}\) and \(\boldsymbol{a}_{B}\in\mathbb{R}_+^{n_{B}}\), define \(U(\boldsymbol{a}_{A},\boldsymbol{a}_{B})=\big\lbrace M\in\mathbb{R}_+^{n_{A}\times n_{B}}:M\boldsymbol{1}_{n_{B}}=\boldsymbol{a}_{A}\text{ and }M^\top\boldsymbol{1}_{n_{A}}=\boldsymbol{a}_{B}\big\rbrace.\) For convenience, let \(U_{n_{A},n_{B}}\) denote \(U\big(\boldsymbol{1}_{n_{A}},\frac{n_{A}}{n_{B}}\boldsymbol{1}_{n_{B}}\big)\) (so that the extreme points of \(U_{n,n}\) are the permutation matrices associated with \(\mathcal{S}_n\)). Let \(C_{i,j}=d(x_i,y_{j})^k\), so that \(W_k^k(\boldsymbol{x},\boldsymbol{y}) = \min_{P\in U_{n_{A},n_{B}}} \big\lbrace \langle P,C\rangle \big\rbrace,\) where \(\langle P,C\rangle = \sum_{i=1}^{n_{A}} \sum_{j=1}^{n_{B}} P_{i,j}C_{i,j},\) and consider an optimal coupling \(P^\star \in \underset{P\in U_{n_{A},n_{B}}}{\text{argmin}} \big\lbrace \langle P,C\rangle \big\rbrace.\) As in the slides, if we have the same sample sizes in the two groups, the optimum is attained at a permutation matrix,
as we can illustrate below (with costs, or distances).
And with different group sizes,
i.e., if we consider real datasets
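Numerically, the discrete problem is a simple linear program. Here is a minimal sketch (a toy example of our own, not the datasets from the slides) using the lpSolve package, with \(n_A=2\) and \(n_B=4\), and masses rescaled to integers so that the marginal constraints are \((2,2)\) and \((1,1,1,1)\),
library(lpSolve)
xA = c(0, 1)                                   # two points in group A
yB = c(.1, .4, .8, 1.2)                        # four points in group B
C = outer(xA, yB, function(u, v) (u - v)^2)    # cost matrix, here d(x_i,y_j)^2
P = lp.transport(C, "min",
                 row.signs = rep("=", 2), row.rhs = c(2, 2),
                 col.signs = rep("=", 4), col.rhs = rep(1, 4))
P$solution    # optimal transport plan P*
P$objval      # optimal total cost <P*, C>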
And as usual, we can consider some penalized version. Define the entropy \(\mathcal{E}(P) = -\sum_{i=1}^{n_{A}} \sum_{j=1}^{n_{B}} P_{i,j}\log P_{i,j},\) or \(\mathcal{E}'(P) = -\sum_{i=1}^{n_{A}} \sum_{j=1}^{n_{B}} P_{i,j}\big[\log P_{i,j}-1\big],\) and then define \(P^\star_\gamma = \underset{P\in U_{n_{A},n_{B}}}{\text{argmin}} \big\lbrace \langle P,C\rangle -\gamma \mathcal{E}(P) \big\rbrace.\) The problem is strictly convex.
Sinkhorn relaxation
This idea is related to the following theorem (stated in the slides).
Consider a simple optimal transportation problem between 6 points and 6 other points,
set.seed(123)
x = (1:6)/7
y = runif(9)
x
[1] 0.14 0.29 0.43 0.57 0.71 0.86
y[1:6]
[1] 0.29 0.79 0.41 0.88 0.94 0.05
library(T4transport)
Wxy = wasserstein(x,y[1:6])
Wxy$plan
that we can visualize below (the first observation of \(\boldsymbol{x}\) is matched with the last one of \(\boldsymbol{y}\), the second observation of \(\boldsymbol{x}\) is matched with the first one of \(\boldsymbol{y}\), etc)
We observe that we simply match according to ranks.
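We can double check with the ranks: since \(\boldsymbol{x}\) is already sorted, the \(j\)-th value of \(\boldsymbol{y}\) is matched with \(x_i\), where \(i\) is its rank,
rank(y[1:6])   # y_6 has rank 1, so it is matched with x_1, etc.
[1] 2 4 3 5 6 1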
But we can also use a penalized version
Sxy = sinkhorn(x, y[1:6], p = 2, lambda = 0.001)
Sxy$plan
here with a very small penalty
or a larger one
Sxy = sinkhorn(x, y[1:6], p = 2, lambda = 0.05)
Sxy$plan
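For completeness, the alternating diagonal scaling behind Sinkhorn's algorithm can be coded in a few lines. Below is a minimal sketch of our own (the regularization strength gamma is not necessarily parametrized the same way as lambda in T4transport),
sinkhorn_plan = function(C, a, b, gamma = .01, niter = 1000){
  K = exp(-C / gamma)                  # Gibbs kernel
  u = rep(1, length(a))
  v = rep(1, length(b))
  for(s in 1:niter){                   # alternating row / column scalings
    u = a / as.vector(K %*% v)
    v = b / as.vector(t(K) %*% u)
  }
  diag(u) %*% K %*% diag(v)            # regularized transport plan
}
C = outer(x, y[1:6], function(u, v) (u - v)^2)
round(sinkhorn_plan(C, rep(1/6, 6), rep(1/6, 6)), 3)   # with small gamma, close to the unregularized plan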
In the discrete case, optimal transport can be related to the Hardy–Littlewood–Pólya inequality, which corresponds to the idea of matching based on ranks (a monotone mapping).
We then have
In the bivariate discrete case, we have
Optimal mapping
We have mentioned that, in the univariate setting, the optimal map is \(\mathcal{T}^\star=F_{{B}}^{-1}\circ F_{{A}},\)
and clearly, \(\mathcal{T}^\star\) is increasing. In the Gaussian case, for example, \(x_{{B}}=\mathcal{T}^\star(x_{{A}})= \mu_{{B}}+\sigma_{{B}}\sigma_{{A}}^{-1} (x_{{A}}-\mu_{{A}}).\) In the multivariate case, we need a more general notion of monotonicity to define an “increasing” mapping \(\mathcal{T}^\star:\mathbb{R}^k\to\mathbb{R}^k\).
In the Gaussian case, for example, we have a linear mapping, \(\boldsymbol{x}_{{B}} = \mathcal{T}^\star(\boldsymbol{x}_{{A}})=\boldsymbol{\mu}_{{B}} + \boldsymbol{A}(\boldsymbol{x}_{{A}}-\boldsymbol{\mu}_{{A}}),\) where \(\boldsymbol{A}\) is a symmetric positive matrix that satisfies \(\boldsymbol{A}\boldsymbol{\Sigma}_{{A}}\boldsymbol{A}=\boldsymbol{\Sigma}_{{B}}\), which has a unique solution given by \(\boldsymbol{A}=\boldsymbol{\Sigma}_{{A}}^{-1/2}\big(\boldsymbol{\Sigma}_{{A}}^{1/2}\boldsymbol{\Sigma}_{{B}}\boldsymbol{\Sigma}_{{A}}^{1/2}\big)^{1/2}\boldsymbol{\Sigma}_{{A}}^{-1/2},\) where \(\boldsymbol{M}^{1/2}\) is the (symmetric positive) square root of the symmetric positive matrix \(\boldsymbol{M}\), computed here via the Schur decomposition. In R, for example, use the expm package,
M = expm::sqrtm(matrix(c(1,1.2,1.2,2),2,2))
M
[,1] [,2]
[1,] 0.8244771 0.5658953
[2,] 0.5658953 1.2960565
M %*% M
[,1] [,2]
[1,] 1.0 1.2
[2,] 1.2 2.0
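As a quick numerical check (with two covariance matrices of our own choosing), we can verify that \(\boldsymbol{A}\boldsymbol{\Sigma}_{{A}}\boldsymbol{A}\) does recover \(\boldsymbol{\Sigma}_{{B}}\),
SA = matrix(c(1, 1.2, 1.2, 2), 2, 2)     # Sigma_A (the matrix above)
SB = matrix(c(2, -.5, -.5, 1), 2, 2)     # Sigma_B (an arbitrary positive matrix)
SA12 = expm::sqrtm(SA)                   # Sigma_A^{1/2}
SA12inv = solve(SA12)                    # Sigma_A^{-1/2}
A = SA12inv %*% expm::sqrtm(SA12 %*% SB %*% SA12) %*% SA12inv
A %*% SA %*% A                           # should return Sigma_B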
Optimal mapping, on real data
To illustrate, it is possible to consider the optimal matching between the heights of \(n\) men and \(n\) women,
Another example (discussed in Optimal Transport for Counterfactual Estimation: A Method for Causal Inference, with a nice R notebook created by Ewen) considers Black and non-Black mothers in the U.S.,
or the joint mapping, in dimension 2
We will spend more time on those functions (and the related concepts) in a few weeks, when discussing barycenters and geodesics… More details are in the slides (online) and in the forthcoming textbook,