
Logistic Regression & Factors in R

[This article was first published on mickeymousemodels, and kindly contributed to R-bloggers.]
Factors are R’s enumerated type. Suppose you define the variable cities — a vector of strings — whose possible values are "New York", "Paris", "London" and "Beijing". Instead of representing each city as a string of characters, you might prefer to define an encoding, e.g. {1="New York", 2="Paris", 3="London", 4="Beijing"}, and store a vector of integers. Behold the factor:

cities <- c(rep("New York", 3), "Paris", "London", rep("Beijing", 2))
cities
# Returns "New York" "New York" "New York" "Paris" "London" "Beijing" "Beijing"
cities.factorized <- as.factor(cities)
cities.factorized  # Prints essentially the same values, plus a list of levels
as.integer(cities.factorized)
# Returns 3 3 3 4 2 1 1 -- levels are numbered alphabetically by default:
# 1="Beijing", 2="London", 3="New York", 4="Paris"
Internally, R is using those integers to represent our cities. So what? Until recently I thought factors were useless, but I changed my mind when I realized that a single factor can hold a large set of disjoint indicator variables. Here’s an example of a logistic regression made simple using factors:

n <- 1000
df <- data.frame(x = runif(n, 0, 5))
ProbabilityOfYGivenX <- function(x) {
  return((sin((x - 3) ^ 2) + 1.5) / 3)
}
df$y <- runif(n) < ProbabilityOfYGivenX(df$x)
df <- df[order(df$x), ]

dev.new(height=6, width=8)
plot(ProbabilityOfYGivenX, main="Some Hypothetical Data", xlab="x", ylab="y",
     type="l", xlim=c(0, 5), ylim=c(-0.1, 1.1), lwd=2)
points(df$x, df$y, col=rgb(0, 100, 0, 20, maxColorValue=255), pch=16, cex=1.5)
legend("topleft", "Pr[Y=1|X]", bty="n", lty=1)
legend("topright", "Observed Data", bty="n", pch=16,
       col=rgb(0, 100, 0, 100, maxColorValue=255), pt.cex=1.5)
savePlot("logit_data.png")

The fun part is trying to recover Pr[Y=1|X], given only the observed data. This can be somewhat tricky, since Pr[Y|X] curves all over the place. One straightforward solution is to chop x up into a bunch of indicator variables: one that is true iff x falls below its 5th percentile, another that is true iff x lies between the 5th and 10th percentiles, and so on. This is pleasantly easy using factors and the cut function:

# Define a factor that chops x into 20 disjoint buckets
df$f <- cut(df$x, breaks=c(-Inf, quantile(df$x, probs=((1:19)/20)), Inf))
# By construction we have (n / 20) = 50 observations in each bucket
summary(df$f)

logit.model <- glm(y ~ f, family=binomial("logit"), data=df)
df$y_hat <- predict(logit.model, newdata=df, type="response")

dev.new(height=6, width=8)
plot(ProbabilityOfYGivenX, main="Fun with Factors", xlab="x", ylab="y",
     type="l", xlim=c(0, 5), ylim=c(-0.1, 1.1), lwd=2)
points(df$x, df$y, col=rgb(0, 100, 0, 20, maxColorValue=255), pch=16, cex=1.5)
lines(df$x, df$y_hat, col="darkred", lty=2, lwd=1.5)
legend("topleft", "Pr[Y=1|X]", bty="n", lty=1)
legend("topright", "Observed Data", bty="n", pch=16,
       col=rgb(0, 100, 0, 100, maxColorValue=255), pt.cex=1.5)
legend("top", "Estimate of Pr[Y=1|X]", bty="n", lty=2, col="darkred")
savePlot("logit_data_and_estimate.png")
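Incidentally, what glm is doing with the factor f is expanding it into indicator columns via a model matrix. Here's a minimal sketch of that expansion — a toy two-bucket example, separate from the analysis above, so the output is easy to inspect:

```r
# Toy data: chop 10 values into 2 equal buckets at the median
set.seed(1)  # for reproducibility
x <- runif(10, 0, 5)
f <- cut(x, breaks=c(-Inf, median(x), Inf))

# model.matrix expands the factor into an intercept column plus one
# 0/1 indicator column per non-baseline level (here, just one)
mm <- model.matrix(~ f)
head(mm)
```

With 20 buckets, the same expansion produces an intercept plus 19 indicator columns, which is exactly the "bunch of disjoint indicator variables" described earlier.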

Et voilà. Nothing too fancy — this is essentially the same as calculating average values of y for each bucket of x, which you can do directly with by(df$y, df$f, mean). Logistic regression gives you the option of getting much fancier: you could, for example, regress y on some combination of continuous variables and factors, but that’s a story for another time.
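To verify that equivalence, here's a self-contained sketch (re-simulating the data, since the original draws are random): the fitted probabilities from the factor-only logistic regression coincide with the per-bucket means of y.

```r
set.seed(42)  # arbitrary seed, for reproducibility
n <- 1000
df <- data.frame(x = runif(n, 0, 5))
df$y <- runif(n) < (sin((df$x - 3) ^ 2) + 1.5) / 3

# Same construction as before: 20 equal-count buckets
df$f <- cut(df$x, breaks=c(-Inf, quantile(df$x, probs=((1:19)/20)), Inf))
logit.model <- glm(y ~ f, family=binomial("logit"), data=df)
df$y_hat <- predict(logit.model, newdata=df, type="response")

# Per-bucket averages of y, computed directly
bucket.means <- tapply(df$y, df$f, mean)

# Each observation's fitted probability equals the mean of y in its
# bucket, up to the glm convergence tolerance
max(abs(df$y_hat - bucket.means[df$f]))  # essentially zero
```

This is just the usual saturated-model fact: with only a full set of disjoint indicators on the right-hand side, maximum likelihood reproduces the empirical frequency within each cell.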
