Multivariate probit regression using (direct) maximum likelihood estimators
- The Gaussian vector latent structure
As in standard probit models, assume that $Y_{i,j}=\mathbf{1}(Y^{\star}_{i,j}>0)$, with latent variables $Y^{\star}_{i,j}=\mathbf{X}_i\boldsymbol{\beta}_j+\varepsilon_{i,j}$, $j=1,2$, where we can assume that $\boldsymbol{\varepsilon}_i=(\varepsilon_{i,1},\varepsilon_{i,2})$ is a Gaussian random vector, with unit variances and correlation $r$. This assumption can be used to derive the likelihood of a sample $\{(Y_{i,1},Y_{i,2},\mathbf{X}_i),\ i=1,\dots,n\}$.
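To make the link with the code below explicit, write $\Phi_2(\cdot,\cdot;r)$ for the c.d.f. of a standard bivariate Gaussian vector with correlation $r$, and $p_{k\ell,i}=\mathbb{P}(Y_{i,1}=k,Y_{i,2}=\ell)$. The four cell probabilities and the log-likelihood used in the code are then (a reconstruction from the code, with the thresholds absorbed in the intercepts)

$$p_{11,i}=\Phi_2(\mathbf{X}_i\boldsymbol{\beta}_1,\mathbf{X}_i\boldsymbol{\beta}_2;r),\quad p_{10,i}=\Phi(\mathbf{X}_i\boldsymbol{\beta}_1)-p_{11,i},\quad p_{01,i}=\Phi(\mathbf{X}_i\boldsymbol{\beta}_2)-p_{11,i},\quad p_{00,i}=1-p_{11,i}-p_{10,i}-p_{01,i},$$

$$\log\mathcal{L}(\boldsymbol{\beta}_1,\boldsymbol{\beta}_2,r)=\sum_{i=1}^n\;\sum_{k,\ell\in\{0,1\}}\mathbf{1}(Y_{i,1}=k,Y_{i,2}=\ell)\,\log p_{k\ell,i}.$$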
> logV=function(parameter){
+ CORRELATION=parameter[1]
+ BETA=matrix(parameter[2:length(parameter)],ncol(Y),ncol(X))
+ z=cbind(X%*%(BETA[1,]),X%*%(BETA[2,]))
+ sigma=matrix(c(1,CORRELATION,CORRELATION,1),2,2)
+ a11=pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
+ for(i in 2:nrow(z)){a11=c(a11,pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
+ a10=pnorm(z[1,1],sd=sqrt(sigma[1,1]))-pmnorm(z[1,],varcov=sigma)
+ for(i in 2:nrow(z)){a10=c(a10,pnorm(z[i,1],sd=sqrt(sigma[1,1]))-pmnorm(z[i,],varcov=sigma))}
+ a01=pnorm(z[1,2],sd=sqrt(sigma[2,2]))-pmnorm(z[1,],varcov=sigma)
+ for(i in 2:nrow(z)){a01=c(a01,pnorm(z[i,2],sd=sqrt(sigma[2,2]))-pmnorm(z[i,],varcov=sigma))}
+ a00=1-a10-a01-a11
+ -sum(((Y[,1]==1)&(Y[,2]==1))*log(a11) +
+      ((Y[,1]==0)&(Y[,2]==1))*log(a01) +
+      ((Y[,1]==1)&(Y[,2]==0))*log(a10) +
+      ((Y[,1]==0)&(Y[,2]==0))*log(a00) )
+ }
> OPT=optim(fn=logV,par=c(0,1,1,1,1,1,1),method="BFGS")$par

(the code is a bit long since I had trouble working properly with matrices, or more precisely vectorizing my functions, so I used loops… I am sure it is possible to write better code).
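Following up on that remark, here is one possible way to avoid the explicit loops, using apply() over the rows of the matrix of linear predictors. This is only a sketch (logV2 is a name introduced here), written under the same assumptions as above, namely that Y is a two-column matrix of 0/1 responses and X a design matrix including an intercept column:

logV2=function(parameter){
  r=parameter[1]
  BETA=matrix(parameter[2:length(parameter)],ncol(Y),ncol(X))
  z=X%*%t(BETA)                                         # n x 2 matrix of linear predictors
  sigma=matrix(c(1,r,r,1),2,2)
  a11=apply(z,1,function(zi) pmnorm(zi,varcov=sigma))   # P(Y1=1,Y2=1)
  a10=pnorm(z[,1])-a11                                  # P(Y1=1,Y2=0)
  a01=pnorm(z[,2])-a11                                  # P(Y1=0,Y2=1)
  a00=1-a11-a10-a01                                     # P(Y1=0,Y2=0)
  -sum(((Y[,1]==1)&(Y[,2]==1))*log(a11)+
       ((Y[,1]==0)&(Y[,2]==1))*log(a01)+
       ((Y[,1]==1)&(Y[,2]==0))*log(a10)+
       ((Y[,1]==0)&(Y[,2]==0))*log(a00))
}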
It is possible to generate samples (based on that specific model) to check that we can actually derive proper maximum likelihood estimators,
> library(mnormt)
> set.seed(1)
> n=1000
> r=0.5
> X1=runif(n)
> X2=rnorm(n)
> Y1S=1+5*X1
> Y2S=8-5*X1
> RES=rmnorm(n,mean=c(0,0),varcov=matrix(c(1,r,r,1),2,2))
> YS=cbind(Y1S,Y2S)+RES
> Y1=(YS[,1]>quantile(YS[,1],.5))*1
> Y2=(YS[,2]>quantile(YS[,2],.5))*1
> i=1:n
> base=data.frame(i,Y1,Y2,X1,X2,YS)
> head(base)
  i Y1 Y2        X1          X2      Y1S      Y2S
1 1  0  0 0.2655087  0.07730312 3.177587 5.533884
2 2  0  0 0.3721239 -0.29686864 1.935307 5.089524
3 3  1  0 0.5728534 -1.18324224 4.757848 5.172584
4 4  1  0 0.9082078  0.01129269 4.600029 3.878225
5 5  0  1 0.2016819  0.99160104 2.547362 6.743714
6 6  1  0 0.8983897  1.59396745 5.309974 4.421523
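Note that logV uses a response matrix Y and a design matrix X that do not appear in the transcript above; a minimal definition consistent with the code (these two lines are my assumption, not part of the original session) would be

Y=cbind(Y1,Y2)     # two-column matrix of 0/1 responses
X=cbind(1,X1,X2)   # design matrix, with a column of ones for the intercept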
(in the head() output above, the two columns on the right are the latent variables, which cannot be used for estimation since they are theoretically unobservable). Note that the underlying model is a simple one: one of the covariates, X2, is there only to add some noise. First of all, let us look at the marginal regressions,
> reg1=glm(Y1~X1+X2,data=base,family=binomial)
> reg2=glm(Y2~X1+X2,data=base,family=binomial)
> summary(reg1)

Call:
glm(formula = Y1 ~ X1 + X2, family = binomial, data = base)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.90570  -0.50126  -0.00266   0.49162   2.78256

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.291725   0.267149 -16.065   <2e-16 ***
X1           8.656836   0.510153  16.969   <2e-16 ***
X2           0.007375   0.090530   0.081    0.935
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1386.29  on 999  degrees of freedom
Residual deviance:  726.48  on 997  degrees of freedom
AIC: 732.48

Number of Fisher Scoring iterations: 5

> summary(reg2)

Call:
glm(formula = Y2 ~ X1 + X2, family = binomial, data = base)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.74682  -0.51814  -0.00001   0.57969   2.58565

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.91709    0.24399  16.054   <2e-16 ***
X1          -7.89703    0.46277 -17.065   <2e-16 ***
X2           0.18360    0.08758   2.096    0.036 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1386.29  on 999  degrees of freedom
Residual deviance:  777.61  on 997  degrees of freedom
AIC: 783.61

Number of Fisher Scoring iterations: 5
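Keep in mind that family=binomial in glm uses the logit link by default, so the two marginal fits above are logit regressions. If one wants marginal probit fits, directly comparable with the bivariate probit coefficients below, the link can be specified explicitly (reg1p and reg2p are just names for this sketch):

reg1p=glm(Y1~X1+X2,data=base,family=binomial(link="probit"))
reg2p=glm(Y2~X1+X2,data=base,family=binomial(link="probit"))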
Here, the optimization yields,
> OPT=optim(fn=logV,par=c(0,1,1,1,1,1,1),method="BFGS")$par
> OPT[1]
[1] 0.5261382
> matrix(OPT[2:7],2,3)
          [,1]      [,2]       [,3]
[1,] -2.451721  4.908633 0.01600769
[2,]  2.241962 -4.539946 0.10614807

The first component is the estimated correlation (0.526, close to the value r=0.5 used to generate the sample), and the matrix contains the regression coefficients, one row per equation.
Note that the coefficients we obtain are almost identical to the ones given by R's standard procedure,
> library(Zelig)
> REG= zelig(list(mu1=Y1~X1+X2,
+ mu2=Y2~X1+X2,
+ rho=~1),
+ model="bprobit",data=base)
> summary(REG)

Call:
zelig(formula = list(mu1 = Y1 ~ X1 + X2, mu2 = Y2 ~ X1 + X2,
    rho = ~1), model = "bprobit", data = base)

Pearson Residuals:
                 Min        1Q     Median      3Q     Max
probit(mu1) -10.5442 -0.377243  0.0041803 0.36709 8.60398
probit(mu2)  -7.8547 -0.376888  0.0083715 0.42923 5.88264
rhobit(rho) -13.8322 -0.091502 -0.0080544 0.37218 0.85101

Coefficients:
                  Value Std. Error   t value
(Intercept):1 -2.451699   0.135369 -18.11116
(Intercept):2  2.241964   0.125072  17.92536
(Intercept):3  1.169461   0.189771   6.16249
X1:1           4.908617   0.252683  19.42602
X1:2          -4.539951   0.233632 -19.43203
X2:1           0.015992   0.050443   0.31703
X2:2           0.106154   0.049092   2.16235

Number of linear predictors:  3

Names of linear predictors: probit(mu1), probit(mu2), rhobit(rho)

Dispersion Parameter for binom2.rho family:   1

Residual Deviance: 1460.355 on 2993 degrees of freedom

Log-likelihood: -730.1774 on 2993 degrees of freedom

Number of Iterations: 3

> matrix(coefficients(REG)[c(1:2,4:7)],2,3)
          [,1]      [,2]       [,3]
[1,] -2.451699  4.908617 0.01599183
[2,]  2.241964 -4.539951 0.10615443
The third intercept is on the rhobit scale; inverting that link gives the estimated correlation,

> (exp(summary(REG)@coef3[3])-1)/(exp(summary(REG)@coef3[3])+1)
[1] 0.5260951
That procedure works well, and can be extended to ordinal responses (not only binary ones), or to three-dimensional problems; for the latter, the log-likelihood becomes
logV=function(beta){
  BETA=matrix(beta[4:(3+ncol(Y)*ncol(X))],ncol(Y),ncol(X))
  z=cbind(X%*%(BETA[1,]),X%*%(BETA[2,]),X%*%(BETA[3,]))
  r12=beta[1]
  r23=beta[2]
  r31=beta[3]
  s1=s2=s3=1
  sigma=matrix(c(s1^2,r12*s1*s2,r31*s1*s3,
                 r12*s1*s2,s2^2,r23*s2*s3,
                 r31*s1*s3,r23*s2*s3,s3^2),3,3)
  sigma1=matrix(c(s2^2,r23*s2*s3,
                  r23*s2*s3,s3^2),2,2)
  sigma2=matrix(c(s1^2,r31*s1*s3,
                  r31*s1*s3,s3^2),2,2)
  sigma3=matrix(c(s1^2,r12*s1*s2,
                  r12*s1*s2,s2^2),2,2)
  a111=pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
  for(i in 2:nrow(z)){a111=c(a111,pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
  a011=pmnorm(z[1,2:3],varcov=sigma1)-pmnorm(z[1,],varcov=sigma)
  for(i in 2:nrow(z)){a011=c(a011,pmnorm(z[i,2:3],varcov=sigma1)-pmnorm(z[i,],varcov=sigma))}
  a101=pmnorm(z[1,c(1,3)],varcov=sigma2)-pmnorm(z[1,],varcov=sigma)
  for(i in 2:nrow(z)){a101=c(a101,pmnorm(z[i,c(1,3)],varcov=sigma2)-pmnorm(z[i,],varcov=sigma))}
  a110=pmnorm(z[1,1:2],varcov=sigma3)-pmnorm(z[1,],varcov=sigma)
  for(i in 2:nrow(z)){a110=c(a110,pmnorm(z[i,1:2],varcov=sigma3)-pmnorm(z[i,],varcov=sigma))}
  a100=pnorm(z[1,1],sd=s1)-pmnorm(z[1,c(1,2)],varcov=sigma3)-pmnorm(z[1,c(1,3)],varcov=sigma2)+pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
  for(i in 2:nrow(z)){a100=c(a100,pnorm(z[i,1],sd=s1)-pmnorm(z[i,c(1,2)],varcov=sigma3)-pmnorm(z[i,c(1,3)],varcov=sigma2)+pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
  a010=pnorm(z[1,2],sd=s2)-pmnorm(z[1,c(1,2)],varcov=sigma3)-pmnorm(z[1,c(2,3)],varcov=sigma1)+pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
  for(i in 2:nrow(z)){a010=c(a010,pnorm(z[i,2],sd=s2)-pmnorm(z[i,c(1,2)],varcov=sigma3)-pmnorm(z[i,c(2,3)],varcov=sigma1)+pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
  a001=pnorm(z[1,3],sd=s3)-pmnorm(z[1,c(2,3)],varcov=sigma1)-pmnorm(z[1,c(1,3)],varcov=sigma2)+pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
  for(i in 2:nrow(z)){a001=c(a001,pnorm(z[i,3],sd=s3)-pmnorm(z[i,c(2,3)],varcov=sigma1)-pmnorm(z[i,c(1,3)],varcov=sigma2)+pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
  a000=1-a111-a011-a101-a110-a001-a010-a100
  a111[a111<=0]=1e-50
  a110[a110<=0]=1e-50
  a101[a101<=0]=1e-50
  a011[a011<=0]=1e-50
  a100[a100<=0]=1e-50
  a010[a010<=0]=1e-50
  a001[a001<=0]=1e-50
  a000[a000<=0]=1e-50
  -sum(((Y[,1]==0)&(Y[,2]==0)&(Y[,3]==0))*log(a111) +
       ((Y[,1]==1)&(Y[,2]==0)&(Y[,3]==0))*log(a011) +
       ((Y[,1]==0)&(Y[,2]==1)&(Y[,3]==0))*log(a101) +
       ((Y[,1]==0)&(Y[,2]==0)&(Y[,3]==1))*log(a110) +
       ((Y[,1]==1)&(Y[,2]==1)&(Y[,3]==0))*log(a001) +
       ((Y[,1]==1)&(Y[,2]==0)&(Y[,3]==1))*log(a010) +
       ((Y[,1]==0)&(Y[,2]==1)&(Y[,3]==1))*log(a100) +
       ((Y[,1]==1)&(Y[,2]==1)&(Y[,3]==1))*log(a000) )
}
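As a sketch of how this three-dimensional version could be called (assuming Y is now a three-column 0/1 matrix, X still contains an intercept column, and starting from arbitrary values; OPT3 is a name introduced here):

beta0=c(0,0,0, rep(0,ncol(Y)*ncol(X)))       # (r12, r23, r31) followed by the coefficients
OPT3=optim(fn=logV, par=beta0, method="BFGS")$par
OPT3[1:3]                                    # estimated correlations
matrix(OPT3[-(1:3)], ncol(Y), ncol(X))       # estimated coefficient matrix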
A strong assumption in this bivariate model is that the residuals have a Gaussian structure. It is possible to change that assumption
- marginally: for instance, if we use a logistic cumulative distribution function, then we will have a bivariate logit regression
- in terms of dependence structure: it is possible to consider a copula other than the Gaussian one, e.g. Gumbel's copula (also called the bivariate logistic copula), or Clayton's (a sketch for both changes is given after the Clayton example below)
For instance, the Gaussian model above can be rewritten in terms of a generic joint c.d.f. F (and margins Fx and Fy),

> F=function(x,r){pmnorm(x,rep(0,length(x)),
+ varcov=matrix(c(1,r,r,1),2,2))}
> Fx=function(x1){F(c(x1,1e40),0)}
> Fy=function(x2){Fx(x2)}
>
> logVgen=function(parameter){
+ CORRELATION=parameter[1]
+ BETA=matrix(parameter[2:length(parameter)],ncol(Y),ncol(X))
+ z=cbind(X%*%(BETA[1,]),X%*%(BETA[2,]))
+ a11=F(z[1,],r=CORRELATION)
+ for(i in 2:nrow(z)){a11=c(a11,F(z[i,],r=CORRELATION))}
+ a10=Fx(z[1,1])-F(z[1,],r=CORRELATION)
+ for(i in 2:nrow(z)){a10=c(a10,Fx(z[i,1])-F(z[i,],r=CORRELATION))}
+ a01=Fy(z[1,2])-F(z[1,],r=CORRELATION)
+ for(i in 2:nrow(z)){a01=c(a01,Fy(z[i,2])-F(z[i,],r=CORRELATION))}
+ a00=1-a10-a01-a11
+ -sum(((Y[,1]==1)&(Y[,2]==1))*log(a11) +
+      ((Y[,1]==0)&(Y[,2]==1))*log(a01) +
+      ((Y[,1]==1)&(Y[,2]==0))*log(a10) +
+      ((Y[,1]==0)&(Y[,2]==0))*log(a00) )
+ }
>
> beta0=c(0,1,1,1,1,1,1)
> (OPT=optim(fn=logVgen,par=beta0,method="BFGS")$par)
[1]  0.52613820 -2.45172059  2.24196154  4.90863292 -4.53994592  0.01600769
[7]  0.10614807
There were 23 warnings (use warnings() to see them)
so that only F (and the margins) have to be changed. For instance, with Clayton's copula (and Gaussian margins), using the pCopula syntax of recent versions of the copula package,

> library(copula)
> F=function(x,r){pCopula(pnorm(x), claytonCopula(r, dim=2))}
> Fx=function(x1){pnorm(x1)}
> Fy=function(x2){pnorm(x2)}
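And, to sketch the two other changes mentioned above (these lines are my illustration, not from the original post): logistic margins give a bivariate logit regression, and a Gumbel copula changes the dependence structure. Note that the Gumbel parameter must be at least 1, and the Clayton parameter cannot be 0, so they do not live on [-1,1] like a correlation; the starting value and the parametrization used in optim have to be adapted accordingly.

library(copula)
# logistic margins (bivariate logit), Gumbel copula for the dependence structure
F=function(x,r){pCopula(plogis(x), gumbelCopula(r, dim=2))}   # requires r >= 1
Fx=function(x1){plogis(x1)}
Fy=function(x2){plogis(x2)}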
- An application to school tests
Consider the following dataset,
hsb2=read.table("http://freakonometrics.free.fr/hsb2.csv",
  header=TRUE, sep=",")
math_male=hsb2$math[hsb2$female==0]
write_male=hsb2$write[hsb2$female==0]
math_female=hsb2$math[hsb2$female==1]
write_female=hsb2$write[hsb2$female==1]
plot(math_female, write_female, type="p",
  pch=19, col="red", xlab="maths", ylab="writing", cex=.8)
points(math_male, write_male, cex=1.2, col="blue")

with here maths versus writing scores, girls in red and boys in blue. The variables in the dataset are
female : 0: male, 1: female
race   : 1: hispanic, 2: asian, 3: african-amer, 4: white
ses    : 1: low, 2: middle, 3: high
schtyp : type of school, 1: public, 2: private
prog   : type of program, 1: general, 2: academic, 3: vocation
read   : reading score
write  : writing score
math   : math score
science: science score
socst  : social studies score

We can try to understand the correlation between math and writing skills. Covariates can be the sex of the child and his or her reading score. The question is then: are students who are good in math and in writing simply students who read well?
Here the code is simply
> W=hsb2$write>=50
> M=hsb2$math>=50
> base=data.frame(Y1=W,Y2=M,
+ X1=hsb2$female,X2=hsb2$read)
>
> library(Zelig)
> REG= zelig(list(mu1=Y1~X1+X2,
+ mu2=Y2~X1+X2,
+ rho=~1),
+ model="bprobit",data=base)
> summary(REG)

Call:
zelig(formula = list(mu1 = Y1 ~ X1 + X2, mu2 = Y2 ~ X1 + X2,
    rho = ~1), model = "bprobit", data = base)

Pearson Residuals:
                Min        1Q  Median      3Q    Max
probit(mu1) -4.7518 -0.502594 0.15038 0.53038 1.8592
probit(mu2) -3.4243 -0.653537 0.23673 0.67011 2.6072
rhobit(rho) -4.9821  0.010481 0.13500 0.40776 2.9171

Coefficients:
                  Value Std. Error  t value
(Intercept):1 -5.484711   0.787101 -6.96825
(Intercept):2 -4.061384   0.633781 -6.40818
(Intercept):3  1.332187   0.322175  4.13497
X1:1           1.125924   0.233550  4.82092
X1:2           0.167258   0.202498  0.82598
X2:1           0.103997   0.014662  7.09286
X2:2           0.082739   0.012026  6.88017

Number of linear predictors:  3

Names of linear predictors: probit(mu1), probit(mu2), rhobit(rho)

Dispersion Parameter for binom2.rho family:   1

Residual Deviance: 364.51 on 593 degrees of freedom

Log-likelihood: -182.255 on 593 degrees of freedom

Number of Iterations: 3

> (exp(summary(REG)@coef3[3])-1)/(exp(summary(REG)@coef3[3])+1)
[1] 0.5824045

with a remaining correlation among residuals of 0.58. So the sex of the student and his or her reading skill alone cannot explain the correlation between math and writing skills. With our previous code, we have here
> beta0=c((exp(summary(REG)@coef3[3])-1)/(exp(summary(REG)@coef3[3])+1),
+ summary(REG)@coef3[c(1:2,4:7),1])
> beta0
              (Intercept):1 (Intercept):2          X1:1          X1:2
   0.58240446   -5.48471133   -4.06138412    1.12592427    0.16725842
         X2:1          X2:2
   0.10399668    0.08273879
> (OPT=optim(fn=logV,par=beta0,method="BFGS")$par)
              (Intercept):1 (Intercept):2          X1:1          X1:2
    0.5824045    -5.4847113    -4.0613841     1.1259243     0.1672584
         X2:1          X2:2
    0.1039967     0.0827388
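(As before, logV expects the matrices Y and X as global objects; the following two lines, not shown in the original transcript, are a minimal definition consistent with this dataset.)

Y=cbind(base$Y1,base$Y2)*1   # 0/1 indicators: writing >= 50, math >= 50
X=cbind(1,base$X1,base$X2)   # intercept, sex, reading score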
So we obtain (almost) exactly the same estimators. But here I used the estimators given by R as starting values for the optimization procedure. If we change them, hopefully the maximum likelihood estimator is robust to the choice of starting point,
> (OPT=optim(fn=logV,par=beta0/2,method="BFGS")$par)
              (Intercept):1 (Intercept):2          X1:1          X1:2
   0.58233360   -5.49428984   -4.06839571    1.12696594    0.16760347
         X2:1          X2:2
   0.10417767    0.08287409
There were 12 warnings (use warnings() to see them)

So once again, it is possible to optimize a likelihood function numerically, and it works.