How to: Binomial regression models in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
We’ve been doing a lot of work recently with multinomial logit choice models; that is, trying to predict how choices are made between multiple options. It’s a fairly new topic for me so I’m trying to get the hang of both the theory and the practical bits. But learning multinomial modelling before binomial modelling (the choice between two options) is like trying to run before you can walk. (Normal linear regression would be crawling.)
Binomial models are easy to do in R. Just feed your independent and response variables into the glm
function and specify the “binomial” regression family.
# Create some data n <- 500 x1 <- runif(n,0,100) x2 <- runif(n,0,100) y <- (x2 - x1 + rnorm(n,sd=20)) < 0 # Fit a binomial regression model model <- glm(y ~ x1 + x2, family="binomial")
Without getting into the theory, this model estimates the logit z as a linear function of the independent variables.
This value can then be used to calculate the probability of a given outcome via the logistic function:
Normally with a regression model in R, you can simply predict new values using the predict
function. The problem with a binomial model is that the model estimates the probability of success or failure. So, for a given set of data points, if the probability of success was 0.5, you would expect the predict
function to give TRUE
half the time and FALSE
the other half.
This is kind of what happens in R but it doesn’t predict these TRUE
/FALSE
outcomes directly, so you need to know how to convert the prediction outcome into the actual binary response variable. If you use the default predict
arguments, the model gives you the z values. One (wrong) approach is to interpret z less than 0 as success or failure depending on how you’ve coded your model. But what you really want to do is estimate the probability, then draw a random number and see if that’s above or below the threshold value. R’s predict
function can do this but it will take a couple lines of code:
# Create some new data n2 <- 100 new.df <- data.frame(x1 = runif(n2, 0, 100), x2 = runif(n2,0,100)) # Predict the probabilities # predict(model, new.df) will just give you the logit (z) values # We want the probabilities so use the "response" type probs <- predict(model, new.df, "response") # Draw some random values between 0 and 1 draws <- runif(n2) # Check if draw value is less than threshold probability results <- (draws < probs)
So if you want to predict the results again, simply rerun the last two lines. New random draws will mean new outcomes.
To illustrate all this, I’ve put together a quick ggplot picture to show the original data (small points) and the modelled results (larger points). The colours represent the success/failure outcomes. (ggplot doesn’t want to add the legends for some reason.)
Next time I’ll look at the more complicated multinomial case.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.