Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Motivation
One of the most common comments I hear is that logistic regression (also called Binomial regression) is some kind of “advanced magic”, “machine learning”, “artificial intelligence” or “big data”. This is not true.
In this post, I will show you how logistic regression works and why it is not as complex as some people think.
< section id="explanation" class="level2">Explanation
Logistic regression consists in transforming the data and then using linear regression (or ordinary least squares) repeated multiple times until achieving convergence.
The differences with the classic linear model are:
- It uses a binary dependent variable (i.e., heads or tails when we flip a coin, which we can express as 0/1). The linear model takes a continuous variable (i.e., the temperature in a city) as the dependent variable.
- It uses an exponential function to transform the data before fitting a regression.
- It returns estimated probabilities (i.e., 0.42, 0.91, …, 0.47) that we can convert to 0/1 by using a threshold (i.e., all the values over 0.5 become 1).
Before we continue, and in case you want to read more, the Logistic regression is a part of the Generalized Linear Models (GLM) family. This family includes other models like Gaussian (i.e., “normal” or “classic” regression), Poisson, Gamma, and others that have the exponential function in their formula. Some books such as Casella and Berger refer to them as the “exponential family”.
< section id="worked-example" class="level2">Worked example
In base R, we can use the glm
function to fit a logistic regression model. Let’s use the mtcars
dataset to predict if a car has automatic transmission (am
, coded as 0 for automatic and 1 for manual) as a function of the miles per gallon (mpg
).
R commands follow the previous explanation about a general family of models. We will use the family
argument to specify the type of model we want to fit, which in this case is binomial
.
coef(glm(am ~ mpg, data = mtcars, family = binomial))
(Intercept) mpg -6.6035267 0.3070282
Leaving the statistical significance aside, the second estimated coefficient (0.307) indicates that more miles per gallon are associated with a higher probability of having a manual transmission.
We can predict the probability of having a manual transmission for the cars in the dataset.
am_pred <- predict(glm(am ~ mpg, data = mtcars, family = binomial), type = "response") head(am_pred)
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive 0.4610951 0.4610951 0.5978984 0.4917199 Hornet Sportabout Valiant 0.2969009 0.2599331
To convert this probability to a binary outcome, we can use a threshold of 0.6 (arbitrary).
am_pred_binary <- ifelse(am_pred > 0.6, 1, 0) head(am_pred_binary)
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive 0 0 0 0 Hornet Sportabout Valiant 0 0
This can be compared with the actual data.
head(mtcars$am)
[1] 1 1 1 0 0 0
For a better comparison, we can use a confusion matrix, which shows the number of true positives, true negatives, false positives, and false negatives.
table(am_pred_binary, mtcars$am)
am_pred_binary 0 1 0 18 7 1 1 6
This matrix shows that the proposed model is not good at predicting the transmission type of the cars in the dataset. There are eight cases (out of 32) where the model predicted the incorrect transmission type, which can be fixed by adding more variables to the model, but that would be for another post.
It is possible to obtain the same coefficients from the glm()
function by transforming the data following the Binomial regression “recipe” and then using lm()
repeated times until reaching convergence. However, this is very unpractical and just for explanatory purposes.
# original variables y <- mtcars$am x <- mtcars$mpg # apply the recipe: define the new variables mu <- (y + 0.5) / 2 eta <- log(mu / (1 - mu)) z <- eta + (y - mu) / mu # iterate with initial values for the difference and the sum of sq residuals dif <- 1 rss1 <- 1 tol <- 1e-10 while (abs(dif) > tol) { fit <- lm(z ~ x, weights = mu) eta <- z - fit$residuals mu <- exp(eta) / (1 + exp(eta)) z <- eta + (y - mu) / mu res <- y - mu rss2 <- sum(res^2) dif <- rss2 - rss1 rss1 <- rss2 } coef(lm(z ~ x, weights = mu))
(Intercept) x -6.6035267 0.3070282
That’s it, in a few lines we wrote a straightforward code to fit a logistic regression. This is not magic, machine learning, or artificial intelligence. It is just a linear model repeated multiple times and it is tractable.
< section id="if-you-liked-this-post" class="level2">If you liked this post
I wrote the The Hitchhiker’s Guide to Linear Models to covers regressions starting with high school math and then all the explanations until reaching Binomial, Poisson and Tobit models. You can get it for free or paying a suggested price of 10 USD.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.