Performance Measures for Feature Selection
In a recent post, I discussed performance measures for model selection. This time, I write about a related topic: performance measures that are suitable for selecting models when performing feature selection. Since feature selection is concerned with reducing the number of independent variables (features), suitable performance measures evaluate the trade-off between the number of features, $p$, and the fit of the model.
Performance measures for regression
Mean squared error (MSE) and $R^2$ are unsuited for comparing models during feature selection: on the training data, a model whose set of features is a superset of another model's features never performs worse according to these measures, so they always favor the larger model. By using the adjusted $R^2$ or Mallows's $C_p$ statistic, it is possible to consider both performance and the number of features.
Adjusted R squared
Given estimates of the outcome $\hat{Y}$ and observed outcomes $Y$, the coefficient of determination can be defined as the square of Pearson's correlation coefficient $\rho$:
$$R^2 = \rho^2_{\hat{Y}, Y} = \left( \frac{\operatorname{Cov}(\hat{Y}, Y)}{\sigma_{\hat{Y}} \, \sigma_Y} \right)^2.$$
For models with an intercept, $R^2$ lies in the range $[0, 1]$. The adjusted R squared corrects $R^2$ for the residual degrees of freedom of the model, $n - p - 1$:
$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
The intuition behind $R^2_{\text{adj}}$ is the following:
- $R^2_{\text{adj}}$ increases if the numerator $(1 - R^2)(n - 1)$ decreases, that is, if $R^2$ is large
- $R^2_{\text{adj}}$ increases if the denominator $n - p - 1$ increases, that is, if $p$ is small
When adding features to a model, $R^2_{\text{adj}}$ therefore only increases if the added predictors improve $R^2$ sufficiently to offset the loss in degrees of freedom.
Adjusted R squared in R
The adjusted R squared can be directly obtained from the summary method of an lm object:
set.seed(1501)
N <- 50
y <- rnorm(N)
set.seed(1001)
y.hat <- y + runif(N, -1, 1)  # noisy 'predictions' of y
df.low <- data.frame(Y = y, Y_Hat = y.hat)
model <- lm(Y ~ Y_Hat, data = df.low)
adj.r.squared <- summary(model)$adj.r.squared
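To connect this value back to the definitions above, here is a minimal sketch (reusing model and df.low from the code above) that recomputes the adjusted $R^2$ by hand; the result should agree with summary(model)$adj.r.squared up to floating-point precision.

n <- nrow(df.low)
p <- length(coef(model)) - 1  # number of features, excluding the intercept
# R^2 as the squared Pearson correlation between fitted and observed values
r.squared <- cor(fitted(model), df.low$Y)^2
# adjusted R^2 according to the formula above
adj.manual <- 1 - (1 - r.squared) * (n - 1) / (n - p - 1)
all.equal(adj.manual, summary(model)$adj.r.squared)  # should be TRUE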
Mallows's Cp
Mallows's $C_p$ statistic can be used to assess the fit of least-squares models during feature selection. For Gaussian linear models, it is equivalent to the Akaike Information Criterion. Models with a good fit obtain small values of $C_p$ that are close to the number of model parameters, $p + 1$.
The $C_p$ statistic takes a value of about $p + 1$ for an ideal model, where $p$ is the number of independent variables. If $C_p$ is substantially greater than $p + 1$, the model lacks fit (i.e. it is biased because relevant variables are missing). If $C_p$ is close to (or below) $p + 1$, the model does not exhibit noticeable bias; among such models, the one with the fewest features is preferred. Assume that there are $k$ available features and you are evaluating a model with $p$ features; the $C_p$ statistic is then defined as:
$$C_p = \frac{SS_{\text{res}}}{MSE_k} - N + 2(p + 1)$$
where $SS_{\text{res}}$ is the residual sum of squares of the model with $p$ features, $N$ is the number of observations, and $MSE_k$ is the mean squared error of the full model using all $k$ features.
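To see why an unbiased model attains a value near $p + 1$: if the model with $p$ features is correctly specified, then $E[SS_{\text{res}}] = (N - p - 1)\,\sigma^2$ and $MSE_k$ estimates $\sigma^2$, so that

$$E[C_p] \approx \frac{(N - p - 1)\,\sigma^2}{\sigma^2} - N + 2(p + 1) = p + 1.$$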
Computing the Cp statistic in R
In R, you can simply define a custom function for calculating the Cp statistic:
cp <- function(model.subset, model.full) {
  N <- nrow(model.subset$model)
  # number of model parameters, including the intercept (i.e. number of features + 1)
  p <- length(model.subset$coefficients)
  # residual sum of squares of the candidate model
  rss.subset <- sum(residuals(model.subset)^2)
  # mean squared error of the full model: residual sum of squares / residual degrees of freedom
  mse.full <- sum(residuals(model.full)^2) / df.residual(model.full)
  rss.subset / mse.full - N + 2 * p
}
To demonstrate the use of this function, we will use the airquality data set and create three models that predict Temp from growing subsets of the remaining features:
data(airquality)
ozone <- subset(na.omit(airquality),
                select = c("Ozone", "Solar.R", "Wind", "Temp"))
m1 <- lm(Temp ~ Ozone, data = ozone)
m2 <- lm(Temp ~ Ozone + Wind, data = ozone)
m.full <- lm(Temp ~ Ozone + Wind + Solar.R, data = ozone)
We can now determine the Cp statistic for the three models:
c1 <- cp(m1, m.full)
c2 <- cp(m2, m.full)
c3 <- cp(m.full, m.full)
print(paste0("Cps for the three models: ",
             paste(round(c(c1, c2, c3), 3), collapse = ", ")))
Note that the full model always obtains $C_p = p + 1$ by construction, because its own residual sum of squares defines $MSE_k$. For the smaller models, the one whose $C_p$ is closest to (and not substantially larger than) its own $p + 1$ offers the best trade-off between fit and the number of features.
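For larger feature sets, computing $C_p$ for every candidate subset by hand quickly becomes tedious. Assuming the leaps package is installed, the following sketch shows how best-subset selection reports $C_p$ (together with the adjusted $R^2$) for the best model of each size; the exact values depend on your data.

library(leaps)  # assumed to be installed
# best-subset selection for predicting Temp from the remaining features
subsets <- regsubsets(Temp ~ Ozone + Wind + Solar.R, data = ozone)
subset.summary <- summary(subsets)
subset.summary$cp      # Cp of the best model for each number of features
subset.summary$adjr2   # adjusted R^2 of the same models
which.min(subset.summary$cp)  # size of the model with the smallest Cp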
Performance measures for classification
The Akaike information criterion (AIC) can be used for both regression and classification. It is defined as
$$\text{AIC} = 2p - 2 \ln(\hat{L})$$
where $\hat{L}$ is the maximum of the likelihood function and $p$ is the number of estimated parameters. A desirable model minimizes the AIC because this is the model that achieves the best fit (high $\hat{L}$) with the smallest possible number of parameters (low $p$).
Calculating the AIC for generalized linear models
For regression models, the AIC is directly available from the summary function for glm objects:
m1 <- glm(Temp ~ Ozone, data = ozone)
m2 <- glm(Temp ~ Ozone + Wind, data = ozone)
m.full <- glm(Temp ~ Ozone + Wind + Solar.R, data = ozone)
aics <- c(summary(m1)$aic, summary(m2)$aic, summary(m.full)$aic)
print(paste0("AICs are: ", paste(aics, collapse = ", ")))
The model with the lowest AIC offers the best trade-off between fit and complexity: if adding a feature increases the AIC, the gain in fit does not justify the additional parameter.
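Since the AIC applies to classification models as well, here is a minimal sketch, assuming we dichotomize the temperature at its median to obtain a (hypothetical) binary outcome, showing that the same workflow carries over to logistic regression:

# hypothetical binary outcome: is the temperature above the median?
ozone$HighTemp <- as.integer(ozone$Temp > median(ozone$Temp))
glm1 <- glm(HighTemp ~ Ozone, data = ozone, family = binomial)
glm2 <- glm(HighTemp ~ Ozone + Wind + Solar.R, data = ozone, family = binomial)
# the model with the lower AIC offers the better fit-complexity trade-off
c(summary(glm1)$aic, summary(glm2)$aic)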
Calculating the AIC more generally
More generally, the AIC can be calculated using the AIC function for all models for which a log-likelihood is defined:
aics <- c(AIC(m1), AIC(m2), AIC(m.full))
# for glm objects, these values agree with those obtained from summary()
print(paste0("AICs are: ", paste(aics, collapse = ", ")))
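To relate these values back to the definition above, the AIC can also be computed by hand from the log-likelihood; note that for Gaussian models, R counts the estimated error variance as an additional parameter:

ll <- logLik(m1)
p <- attr(ll, "df")  # number of estimated parameters (coefficients plus the error variance)
aic.manual <- 2 * p - 2 * as.numeric(ll)
all.equal(aic.manual, AIC(m1))  # should be TRUE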
Alternatives to these performance measures
Rather than computing performance measures that take model complexity into account, you could also evaluate model performance on an independent test set (e.g. using cross-validation) to prevent overfitting.
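As a minimal sketch of this idea (reusing the ozone data from above and a simple random train/test split rather than full cross-validation), one could compare the test-set MSE of two candidate feature sets:

set.seed(42)
# random 70/30 train/test split of the ozone data
train.idx <- sample(nrow(ozone), size = floor(0.7 * nrow(ozone)))
train <- ozone[train.idx, ]
test <- ozone[-train.idx, ]
# fit two candidate models on the training data only
fit.small <- lm(Temp ~ Ozone, data = train)
fit.large <- lm(Temp ~ Ozone + Wind + Solar.R, data = train)
# compare mean squared errors on the held-out test set
mse <- function(model, data) mean((predict(model, newdata = data) - data$Temp)^2)
c(small = mse(fit.small, test), large = mse(fit.large, test))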