[This article was first published on mickeymousemodels, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Factors are R’s enumerated type. Suppose you define the variable cities — a vector of strings — whose possible values are “New York,” “Paris,” “London” and “Beijing.” Instead of representing each city as a string of characters, you might prefer to define an encoding, eg {1=”New York”, 2=”Paris”, 3=”London”, 4=”Beijing”}, and store a vector of integers. Behold the factor:Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
cities <- c(rep("New York", 3), "Paris", "London", rep("Beijing", 2))
cities # Returns "New York" "New York" "New York" "Paris" "London" "Beijing" "Beijing"
cities.factorized <- as.factor(cities)
cities.factorized # Returns essentially the same as above
as.integer(cities.factorized) # Returns 3 3 3 4 2 1 1
Internally, R is using those integers to represent our cities. So what? Until recently I thought factors were useless, but I changed my mind when I realized that a single factor can hold a large set of disjoint indicator variables. Here’s an example of a logistic regression made simple using factors:
n <- 1000
df <- data.frame(x = runif(n, 0, 5))
ProbabilityOfYGivenX <- function(x) {
return((sin((x - 3) ^ 2) + 1.5) / 3)
}
df$y <- runif(n) < ProbabilityOfYGivenX(df$x)
df <- df[order(df$x), ]
dev.new(height=6, width=8)
plot(ProbabilityOfYGivenX, main="Some Hypothetical Data", xlab="x", ylab="y", type="l", xlim=c(0, 5), ylim=c(-0.1, 1.1), lwd=2)
points(df$x, df$y, col=rgb(0, 100, 0, 20, maxColorValue=255), pch=16, cex=1.5)
legend("topleft", "Pr[Y=1|X]", bty="n", lty=1)
legend("topright", "Observed Data", bty="n", pch=16, col=rgb(0, 100, 0, 100, maxColorValue=255), pt.cex=1.5)
savePlot("logit_data.png")
The fun part is trying to recover Pr[Y=1|X], given only the observed data. This can be somewhat tricky, since Pr[Y|X] curves all over the place. One straightforward solution is to chop x up into a bunch of indicator variables: one that is true iff x falls below its 5th percentile, another iff x lies between the 5th and 10th percentiles, and so on. This is pleasantly easy using factors and the cut function:
# Define a factor that chops x into 20 disjoint buckets
df$f <- cut(df$x, breaks=c(-Inf, quantile(df$x, probs=((1:19)/20)), Inf))
# By construction we have (n / 20) = 50 observations in each bucket
summary(df$f)
logit.model <- glm(y ~ f, family=binomial("logit"), data=df)
df$y_hat <- predict(logit.model, newdata=df, type="response")
dev.new(height=6, width=8)
plot(ProbabilityOfYGivenX, main="Fun with Factors", xlab="x", ylab="y", type="l", xlim=c(0, 5), ylim=c(-0.1, 1.1), lwd=2)
points(df$x, df$y, col=rgb(0, 100, 0, 20, maxColorValue=255), pch=16, cex=1.5)
lines(df$x, df$y_hat, col="darkred", lty=2, lwd=1.5)
legend("topleft", "Pr[Y=1|X]", bty="n", lty=1)
legend("topright", "Observed Data", bty="n", pch=16, col=rgb(0, 100, 0, 100, maxColorValue=255), pt.cex=1.5)
legend("top", "Estimate of Pr[Y=1|X]", bty="n", lty=2, col="darkred")
savePlot("logit_data_and_estimate.png")
Et voilà. Nothing too fancy — this is essentially the same as calculating average values of y for each bucket of x, which you can do directly with by(df$y, df$f, mean). Logistic regression gives you the option of getting much fancier: you could, for example, regress y on some combination of continuous variables and factors, but that’s a story for another time.
To leave a comment for the author, please follow the link and comment on their blog: mickeymousemodels.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.