Binomial regression model
Most of the time, when we introduce binomial models, such as the logistic or probit models, we discuss only Bernoulli variables, $y_i\in\{0,1\}$. This year (actually also the year before), I discuss extensions to multinomial regressions, where $\boldsymbol{y}_i$ takes values on some simplex. The multinomial logistic model was mentioned here. The idea is to consider, for instance with three possible classes $\{A,B,C\}$ (class $A$ being the reference), the following model
$$\mathbb{P}(Y=B\mid\boldsymbol{X}=\boldsymbol{x})=\frac{e^{\boldsymbol{x}^{\top}\boldsymbol{\beta}_B}}{1+e^{\boldsymbol{x}^{\top}\boldsymbol{\beta}_B}+e^{\boldsymbol{x}^{\top}\boldsymbol{\beta}_C}}$$
and
$$\mathbb{P}(Y=C\mid\boldsymbol{X}=\boldsymbol{x})=\frac{e^{\boldsymbol{x}^{\top}\boldsymbol{\beta}_C}}{1+e^{\boldsymbol{x}^{\top}\boldsymbol{\beta}_B}+e^{\boldsymbol{x}^{\top}\boldsymbol{\beta}_C}}$$
so that $\mathbb{P}(Y=A\mid\boldsymbol{X}=\boldsymbol{x})$ is one minus the sum of the two probabilities above.
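Such a model can be fitted in R; here is a minimal sketch, assuming the nnet package is installed (the simulated data and variable names below are just for illustration; multinom() takes the first factor level as the reference class),

> library(nnet)
> set.seed(2)
> y=factor(sample(c("A","B","C"),100,replace=TRUE))
> x=rnorm(100)
> multinom(y~x)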
Now, what about a real Binomial model, $Y_i\sim\mathcal{B}(N_i,p_i)$, where the $N_i$'s are known? How should we run such a regression model? Consider the following dataset
> set.seed(1)
> n=100
> N=1+rpois(n,5)
> X1=runif(n)
> X2=rexp(n)
> s=X2-X1-2
> p=exp(s)/(1+exp(s))
> vY=NULL
> for(i in 1:n){
+ Y=rbinom(1,prob=p[i],size=N[i])
+ vY=c(vY,Y)
+ }
> db=data.frame(Y=vY,N=N,X1,X2)
> head(db,4)
  Y N        X1         X2
1 0 5 0.6547239 0.76318001
2 1 5 0.3531973 1.57271671
3 3 6 0.2702601 1.83564098
4 1 9 0.9926841 0.03715227
My first idea was to say that it should be simple since $Y\sim\mathcal{B}(N,p)$ if (and only if) $Y=Z_1+\cdots+Z_N$, where the $Z_j$'s are i.i.d. $\mathcal{B}(p)$ (i.e. Bernoulli) random variables. So, a natural idea is to generate the dataset containing the $Z_j$'s
> vY=vX1=vX2=vN=NULL
> for(i in 1:n){
+ vY=c(vY,c(rep(0,db$N[i]-db$Y[i]),rep(1,db$Y[i])))
+ vX1=c(vX1,rep(db$X1[i],db$N[i]))
+ vX2=c(vX2,rep(db$X2[i],db$N[i]))
+ }
> largedb=data.frame(Z=vY,X1=vX1,X2=vX2)
> head(largedb,16)
   Z        X1       X2
1  0 0.6547239 0.763180
2  0 0.6547239 0.763180
3  0 0.6547239 0.763180
4  0 0.6547239 0.763180
5  0 0.6547239 0.763180
6  0 0.3531973 1.572717
7  0 0.3531973 1.572717
8  0 0.3531973 1.572717
9  0 0.3531973 1.572717
10 1 0.3531973 1.572717
11 0 0.2702601 1.835641
12 0 0.2702601 1.835641
13 0 0.2702601 1.835641
14 1 0.2702601 1.835641
15 1 0.2702601 1.835641
16 1 0.2702601 1.835641
Then, we run a standard Bernoulli regression on those $Z_j$'s
> reg1=glm(Z~X1+X2,family=binomial,data=largedb)
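As a quick sanity check, the expanded dataset should contain one row per trial, and the total number of ones should match the aggregated counts (the comments below state what should be expected, rather than actual output),

> nrow(largedb)   # should be equal to sum(db$N)
> sum(largedb$Z)  # should be equal to sum(db$Y)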
But actually, if you look around on the internet, you can see (e.g. in Alan Agresti's R_web.pdf chapter) that it is possible to run – directly – a binomial regression, using the following syntax,
> reg2=glm(Y/N~X1+X2,family=binomial,weights=N,data=db)
I was a bit scared because, a few weeks ago, I tried two techniques to run a regression on contingency tables, and the outputs were different (on the standard errors actually, not on the point estimates). Here, we have the same thing
> coefficients(summary(reg1))
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -2.0547550  0.2875231 -7.146399 8.908380e-13
X1          -0.9159275  0.3970303 -2.306946 2.105781e-02
X2           1.0564059  0.1360305  7.765952 8.103448e-15
> coefficients(summary(reg2))
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -2.0547550  0.2875234 -7.146392 8.908817e-13
X1          -0.9159275  0.3970313 -2.306941 2.105813e-02
X2           1.0564059  0.1360310  7.765923 8.105285e-15
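Note that there is a third, equivalent syntax in R, where the response is a two-column matrix with the counts of successes and failures (see ?glm); it should return exactly the same fit as reg2,

> reg3=glm(cbind(Y,N-Y)~X1+X2,family=binomial,data=db)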
Almost the same thing, if we take into account the fact that the numerical algorithm might be run with different starting points. But the good thing is that – theoretically – the output should be exactly the same, simply because here, we solve the same first order conditions! The likelihood in the first case was
$$\mathcal{L}(\boldsymbol{\beta})=\prod_{i=1}^{n}\prod_{j=1}^{N_i}p_i^{Z_{i,j}}(1-p_i)^{1-Z_{i,j}},\quad\text{where }p_i=\frac{e^{\boldsymbol{x}_i^{\top}\boldsymbol{\beta}}}{1+e^{\boldsymbol{x}_i^{\top}\boldsymbol{\beta}}}$$
which will be simplified, since $Y_i=Z_{i,1}+\cdots+Z_{i,N_i}$, as
$$\mathcal{L}(\boldsymbol{\beta})=\prod_{i=1}^{n}p_i^{Y_i}(1-p_i)^{N_i-Y_i}$$
which is, up to the binomial coefficients $\binom{N_i}{Y_i}$ (that do not depend on $\boldsymbol{\beta}$), what we have in the second case. The use of the weights will ensure that the variances (and hence the standard errors) are equal too.
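To illustrate that numerically, one can compare the two log-likelihoods; a possible sketch (assuming, as I believe, that R's logLik() includes the binomial coefficients for the weighted binomial fit, but not for the Bernoulli one, so that the difference should be the sum of the log binomial coefficients) is

> as.numeric(logLik(reg2))-as.numeric(logLik(reg1))
> sum(lchoose(db$N,db$Y))   # the two quantities should match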