Most of the time, when we introduce binomial models, such as the logistic or probit models, we discuss only Bernoulli variables, \(Y_i\in\{0,1\}\), i.e. the following model

\[Y_i\sim\mathcal{B}(p_i)\]

and, in the logistic case,

\[p_i=\frac{\exp(\boldsymbol{X}_i'\boldsymbol{\beta})}{1+\exp(\boldsymbol{X}_i'\boldsymbol{\beta})}\]

Now, what about a real Binomial model, \(Y_i\sim\mathcal{B}(N_i,p_i)\)? Let us generate such a sample,
> set.seed(1)
> n=100
> N=1+rpois(n,5)
> X1=runif(n)
> X2=rexp(n)
> s=X2-X1-2
> p=exp(s)/(1+exp(s))
> vY=NULL
> for(i in 1:n){
+ Y=rbinom(1,prob=p[i],size=N[i])
+ vY=c(vY,Y)
+ }
> db=data.frame(Y=vY,N=N,X1,X2)
> head(db,4)
  Y N        X1         X2
1 0 5 0.6547239 0.76318001
2 1 5 0.3531973 1.57271671
3 3 6 0.2702601 1.83564098
4 1 9 0.9926841 0.03715227
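As an aside, the simulation loop is not really needed: rbinom is vectorized in both size and prob, and with the same seed the single call draws element by element in the same order as the loop, so it should reproduce the same sample. A minimal self-contained sketch (db2 is just my name for the rebuilt data frame):

```r
# vectorized version of the simulation above; with the same RNG
# state, rbinom(n, prob=p, size=N) draws exactly like the loop
set.seed(1)
n=100
N=1+rpois(n,5)
X1=runif(n)
X2=rexp(n)
s=X2-X1-2
p=exp(s)/(1+exp(s))
vY=rbinom(n,prob=p,size=N)
db2=data.frame(Y=vY,N=N,X1,X2)
```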
My first idea was to say that it should be simple, since a Binomial variable is a sum of i.i.d. Bernoulli variables,

\[Y_i=Z_{i,1}+\cdots+Z_{i,N_i}\]

where

\[Z_{i,j}\sim\mathcal{B}(p_i)\]

So we can expand the dataset, with one row per underlying Bernoulli trial,
> vY=vX1=vX2=vN=NULL
> for(i in 1:n){
+ vY=c(vY,c(rep(0,db$N[i]-db$Y[i]),rep(1,db$Y[i])))
+ vX1=c(vX1,rep(db$X1[i],db$N[i]))
+ vX2=c(vX2,rep(db$X2[i],db$N[i]))
+ }
> largedb=data.frame(Z=vY,X1=vX1,X2=vX2)
> head(largedb,16)
   Z        X1       X2
1  0 0.6547239 0.763180
2  0 0.6547239 0.763180
3  0 0.6547239 0.763180
4  0 0.6547239 0.763180
5  0 0.6547239 0.763180
6  0 0.3531973 1.572717
7  0 0.3531973 1.572717
8  0 0.3531973 1.572717
9  0 0.3531973 1.572717
10 1 0.3531973 1.572717
11 0 0.2702601 1.835641
12 0 0.2702601 1.835641
13 0 0.2702601 1.835641
14 1 0.2702601 1.835641
15 1 0.2702601 1.835641
16 1 0.2702601 1.835641
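The expansion can also be written without the loop, using rep for the covariates and mapply for the 0/1 column. A sketch (self-contained, so it re-simulates the data; largedb2 is my name for the result, built with the same per-observation order, zeros first, then ones):

```r
# rebuild db as simulated above, so the chunk runs on its own
set.seed(1)
n=100; N=1+rpois(n,5); X1=runif(n); X2=rexp(n)
s=X2-X1-2
db=data.frame(Y=rbinom(n,prob=exp(s)/(1+exp(s)),size=N),N=N,X1,X2)

# one row per Bernoulli trial: N[i]-Y[i] zeros, then Y[i] ones
Z=unlist(mapply(function(y,m) c(rep(0,m-y),rep(1,y)), db$Y, db$N))
largedb2=data.frame(Z=Z,X1=rep(db$X1,db$N),X2=rep(db$X2,db$N))
```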
Then, we can run a standard Bernoulli regression on those pseudo-observations,
> reg1=glm(Z~X1+X2,family=binomial,data=largedb)
But actually, if you look around on the internet, you can see (e.g. in Alan Agresti’s R_web.pdf chapter) that it is possible to run – directly – a binomial regression, using the following syntax,
> reg2=glm(Y/N~X1+X2,family=binomial,weights=N,data=db)
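Note that glm also accepts a two-column response, cbind(successes, failures), which avoids the weights argument entirely and gives exactly the same fit as the weighted Y/N form. A self-contained sketch (it re-simulates the data; reg3 is my name for this third fit):

```r
# rebuild db as simulated above, so the chunk runs on its own
set.seed(1)
n=100; N=1+rpois(n,5); X1=runif(n); X2=rexp(n)
s=X2-X1-2; p=exp(s)/(1+exp(s))
db=data.frame(Y=rbinom(n,prob=p,size=N),N=N,X1,X2)

# two-column response: first column = number of 1s, second = number of 0s
reg3=glm(cbind(Y,N-Y)~X1+X2,family=binomial,data=db)
reg2=glm(Y/N~X1+X2,family=binomial,weights=N,data=db)
all.equal(coef(reg2),coef(reg3))
```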
I was a bit scared because, a few weeks ago, I tried two techniques to run a regression on contingency tables, and the outputs were different (in the standard errors, actually, not the estimates). Here, we have the same thing,
> coefficients(summary(reg1))
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -2.0547550  0.2875231 -7.146399 8.908380e-13
X1          -0.9159275  0.3970303 -2.306946 2.105781e-02
X2           1.0564059  0.1360305  7.765952 8.103448e-15
> coefficients(summary(reg2))
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -2.0547550  0.2875234 -7.146392 8.908817e-13
X1          -0.9159275  0.3970313 -2.306941 2.105813e-02
X2           1.0564059  0.1360310  7.765923 8.105285e-15
almost, if we take into account the fact that the numerical algorithm might be run with different starting points. But the good thing is that – theoretically – the output should be exactly the same, simply because here, we solve the same first order conditions! The likelihood in the first case was

\[\mathcal{L}(\boldsymbol{\beta})=\prod_{i=1}^{n}\prod_{j=1}^{N_i}p_i^{Z_{i,j}}(1-p_i)^{1-Z_{i,j}}\]

which can be simplified as

\[\mathcal{L}(\boldsymbol{\beta})=\prod_{i=1}^{n}p_i^{Y_i}(1-p_i)^{N_i-Y_i}\]
which is – up to the binomial coefficients \(\binom{N_i}{Y_i}\), which do not depend on \(\boldsymbol{\beta}\) – what we have in the second case. The use of the weights argument ensures that the estimated variances are equal, too.
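One can check this numerically: the coefficient vectors agree up to the convergence tolerance of the iterative fitting algorithm, and the two log-likelihoods should differ only by the constant sum of the log binomial coefficients. A self-contained sketch (it rebuilds the data and both regressions):

```r
# rebuild the data and both regressions, so the chunk runs on its own
set.seed(1)
n=100; N=1+rpois(n,5); X1=runif(n); X2=rexp(n)
s=X2-X1-2; p=exp(s)/(1+exp(s))
db=data.frame(Y=rbinom(n,prob=p,size=N),N=N,X1,X2)
Z=unlist(mapply(function(y,m) c(rep(0,m-y),rep(1,y)), db$Y, db$N))
largedb=data.frame(Z=Z,X1=rep(db$X1,db$N),X2=rep(db$X2,db$N))
reg1=glm(Z~X1+X2,family=binomial,data=largedb)
reg2=glm(Y/N~X1+X2,family=binomial,weights=N,data=db)

# same first order conditions, so (numerically) the same estimates
all.equal(coef(reg1),coef(reg2),tolerance=1e-5)
# the Binomial likelihood adds the constant sum of log choose(N,Y);
# these two numbers should be (almost) identical
as.numeric(logLik(reg1))+sum(lchoose(db$N,db$Y))
as.numeric(logLik(reg2))
```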