By Xiaotong Ding (Claire), With Greg Page
A practical tool that enables a modeler to remove non-informative predictors during the variable selection process of data modeling
In this article, we will introduce a powerful function called nearZeroVar(). This function, which comes from the caret package, is a practical tool that enables a modeler to remove non-informative predictors during the variable selection process of data modeling.
For starters, nearZeroVar() identifies constant predictors, that is, predictors with only one unique value across samples. In addition, it diagnoses predictors as having “near-zero variance” when they possess very few unique values relative to the number of samples and when the ratio of the frequency of the most common value to the frequency of the second most common value is large.
Regardless of the modeling process being used, or of the specific purpose for a particular model, the removal of non-informative predictors is a good idea. Leaving such variables in a model only adds extra complexity, without any corresponding payoff in model accuracy or quality.
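To make these definitions concrete, here is a minimal sketch on a small made-up data frame. The data frame `toy` and its column names are hypothetical and are not part of the Hawaii dataset; the only caret call used is nearZeroVar() with saveMetrics = TRUE.

```r
library(caret)

# Hypothetical toy data: 100 rows, four predictors with different behavior
toy <- data.frame(
  constant   = rep(1, 100),                    # a single unique value
  rare_flag  = c(rep("t", 98), rep("f", 2)),   # heavily imbalanced: 98 vs. 2
  balanced   = rep(c("a", "b"), times = 50),   # evenly split: 50 vs. 50
  continuous = seq(0.5, 50, by = 0.5)          # 100 distinct numeric values
)

# One row of metrics per predictor
toy_metrics <- nearZeroVar(toy, saveMetrics = TRUE)
toy_metrics
```

With the default thresholds, constant should come back with zeroVar = TRUE, and rare_flag should be flagged as near-zero variance (98/2 = 49 exceeds 19, and 2 percent unique is below 10), while balanced and continuous should not be flagged.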
For this analysis, we will use the dataset hawaii.csv, which contains information about Airbnb rentals from Hawaii. In the code cell below, the dataset is read into R, and blank cells are converted to NA values.
    library(dplyr)
    library(caret)
    options(scipen=999) #display decimal values, rather than scientific notation
    data = read.csv("/Users/xiaotongding/Desktop/Page BA-WritingProject/hawaii.csv")
    dim(data)
    ## [1] 21523 74
    data[data==""] <- NA
    nzv_vals <- nearZeroVar(data, saveMetrics = TRUE)
    dim(nzv_vals)
    ## [1] 74 4

The code chunk shown above generates a dataframe with 74 rows (one for each variable in the dataset) and four columns. If saveMetrics is set to FALSE (the default), the function instead returns the column positions of the zero-variance or near-zero-variance predictors.
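As an illustration of the saveMetrics = FALSE behavior, the sketch below uses the returned positions to drop flagged columns. The data frame `toy` is hypothetical, and the guard around the subsetting step is a design choice of this sketch rather than something the article prescribes.

```r
library(caret)

# Hypothetical toy data (not the Hawaii dataset)
toy <- data.frame(
  constant  = rep(1, 100),
  rare_flag = c(rep("t", 98), rep("f", 2)),
  balanced  = rep(c("a", "b"), times = 50)
)

# Default call (saveMetrics = FALSE): integer positions of flagged columns
flagged_idx <- nearZeroVar(toy)

# names = TRUE returns the column names instead of the positions
flagged_names <- nearZeroVar(toy, names = TRUE)

# Drop the flagged columns. The length check matters: with no flagged
# columns, toy[, -integer(0)] would select zero columns, not all of them.
toy_reduced <- if (length(flagged_idx) > 0) {
  toy[, -flagged_idx, drop = FALSE]
} else {
  toy
}
names(toy_reduced)
```

Keeping drop = FALSE ensures the result stays a data frame even when only one column survives.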
    nzv_sorted <- arrange(nzv_vals, desc(freqRatio))
    head(nzv_sorted)

Variable | freqRatio (dbl) | percentUnique (dbl) | zeroVar (lgl) | nzv (lgl) |
---|---|---|---|---|
has_availability | 21522.000000 | 0.009292385 | FALSE | TRUE |
calculated_host_listings_count_shared_rooms | 521.634146 | 0.032523347 | FALSE | TRUE |
host_has_profile_pic | 282.184211 | 0.009292385 | FALSE | TRUE |
number_of_reviews_l30d | 26.545337 | 0.046461924 | FALSE | TRUE |
calculated_host_listings_count_private_rooms | 13.440804 | 0.097570041 | FALSE | FALSE |
room_type | 9.244102 | 0.018584770 | FALSE | FALSE |
The first column, freqRatio, tells us the ratio of frequencies for the most common value over the second most common value for that variable. To see how this is calculated, let’s look at the freqRatio for host_has_profile_pic (282.184):
    table(sort(data$host_has_profile_pic, decreasing=TRUE))
    ##
    ##     f     t
    ##    76 21446

In the entire dataset, there are 76 ‘f’ values, and 21446 ‘t’ values. The frequency ratio of the most common outcome to the second-most common outcome, therefore, is 21446/76, or 282.1842.

The second column, percentUnique, indicates the percentage of unique data points out of the total number of data points. To illustrate how this is determined, let’s examine the ‘license’ variable, which has a percentUnique value of 45.384007806. The length of the output from the unique() function, generated below, indicates that license contains 9768 distinct values throughout the entire dataset (most likely, some are repeated because a single individual may own multiple Airbnb properties).

    length(unique(data$license))
    ## [1] 9768

By dividing the number of unique values by the number of observations, and then multiplying by 100, we arrive back at the percentUnique value shown above:

    length(unique(data$license)) / nrow(data) * 100
    ## [1] 45.38401

For predictive modeling with numeric input features, it can be okay to have 100 percent uniqueness, as numeric values exist along a continuous spectrum. Imagine, for example, a medical dataset with the weights of 250 patients, all taken to 5 decimal places of precision. It is quite possible that no two patients’ weights would be identical, yet weight could still carry predictive value in a model focused on patient health outcomes.

For non-numeric data, however, 100 percent uniqueness means that the variable will not have any predictive power in a model. If every customer in a bank lending dataset has a unique address, for example, then the ‘customer address’ variable cannot offer us any general insights about default likelihood.

The third column, zeroVar, is a vector of logicals (TRUE or FALSE) that indicates whether each predictor has only one distinct value. Such variables will not yield any predictive power, regardless of their data type.

The fourth column, nzv, is also a vector of logical values, for which TRUE values indicate that the variable is a near-zero variance predictor. For a variable to be flagged as such, it must meet two conditions: (1) its frequency ratio must exceed the freqCut threshold used by the function; AND (2) its percentUnique value must fall below the uniqueCut threshold used by the function. By default, freqCut is set to 95/5 (or 19, if expressed as an integer value), and uniqueCut is set to 10.

Let’s take a look at the variables with the 10 highest frequency ratios:

    head(nzv_sorted, 10)

Variable | freqRatio (dbl) | percentUnique (dbl) | zeroVar (lgl) | nzv (lgl) |
---|---|---|---|---|
has_availability | 21522.000000 | 0.009292385 | FALSE | TRUE |
calculated_host_listings_count_shared_rooms | 521.634146 | 0.032523347 | FALSE | TRUE |
host_has_profile_pic | 282.184211 | 0.009292385 | FALSE | TRUE |
number_of_reviews_l30d | 26.545337 | 0.046461924 | FALSE | TRUE |
calculated_host_listings_count_private_rooms | 13.440804 | 0.097570041 | FALSE | FALSE |
room_type | 9.244102 | 0.018584770 | FALSE | FALSE |
review_scores_checkin | 7.764874 | 0.041815732 | FALSE | FALSE |
review_scores_location | 7.632574 | 0.041815732 | FALSE | FALSE |
maximum_nights_avg_ntm | 7.083577 | 6.095804488 | FALSE | FALSE |
minimum_maximum_nights | 7.018508 | 0.715513637 | FALSE | FALSE |
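The two conditions described above can be sketched in base R for a single vector. This is an illustrative re-implementation, not caret's source code: the helper name `nzv_check` is hypothetical, and it assumes that missing values are excluded from the frequency table and from the unique count while the denominator remains the full vector length. That assumption appears consistent with the host_has_profile_pic figures reported in the article. The single NA in the rebuilt vector is inferred from the difference between the 21523 rows and the 21522 tabulated values.

```r
# Hypothetical helper: recompute the metrics for one vector and apply the rule
nzv_check <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  counts <- counts[counts > 0]                 # ignore unused factor levels
  n_unique <- length(counts)

  # Most common count over second most common count;
  # set to 0 when fewer than two distinct values exist (an assumption)
  freq_ratio <- if (n_unique < 2) 0 else as.numeric(counts[1] / counts[2])
  percent_unique <- 100 * n_unique / length(x)
  zero_var <- n_unique <= 1

  list(
    freqRatio     = freq_ratio,
    percentUnique = percent_unique,
    zeroVar       = zero_var,
    # "<" follows the article's wording ("fall below"); caret's own
    # comparison may treat exact equality differently
    nzv = zero_var || (freq_ratio > freqCut && percent_unique < uniqueCut)
  )
}

# Rebuild host_has_profile_pic from the reported counts:
# 21446 "t", 76 "f", plus one inferred NA to reach 21523 rows
pic <- c(rep("t", 21446), rep("f", 76), NA)
pic_metrics <- nzv_check(pic)
str(pic_metrics)
```

The recomputed freqRatio is 21446/76 and the recomputed percentUnique is 100 * 2 / 21523, which can be compared against the host_has_profile_pic row in the table.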
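To see how the thresholds affect the result, we can rerun the function with uniqueCut lowered from its default of 10 to 0.04:

    nzv_vals2 <- nearZeroVar(data, saveMetrics = TRUE, uniqueCut = 0.04)
    nzv_sorted2 <- arrange(nzv_vals2, desc(freqRatio))
    head(nzv_sorted2, 10)

Variable | freqRatio (dbl) | percentUnique (dbl) | zeroVar (lgl) | nzv (lgl) |
---|---|---|---|---|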
has_availability | 21522.000000 | 0.009292385 | FALSE | TRUE |
calculated_host_listings_count_shared_rooms | 521.634146 | 0.032523347 | FALSE | TRUE |
host_has_profile_pic | 282.184211 | 0.009292385 | FALSE | TRUE |
number_of_reviews_l30d | 26.545337 | 0.046461924 | FALSE | FALSE |
calculated_host_listings_count_private_rooms | 13.440804 | 0.097570041 | FALSE | FALSE |
room_type | 9.244102 | 0.018584770 | FALSE | FALSE |
review_scores_checkin | 7.764874 | 0.041815732 | FALSE | FALSE |
review_scores_location | 7.632574 | 0.041815732 | FALSE | FALSE |
maximum_nights_avg_ntm | 7.083577 | 6.095804488 | FALSE | FALSE |
minimum_maximum_nights | 7.018508 | 0.715513637 | FALSE | FALSE |
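Compared with the default run, the only change among these ten rows is number_of_reviews_l30d, whose nzv flag flips from TRUE to FALSE. The short check below works through that row by hand, using the values copied from the tables; the variable names are illustrative only.

```r
# Values copied from the number_of_reviews_l30d row in the tables above
freq_ratio     <- 26.545337
percent_unique <- 0.046461924
freq_cut       <- 95 / 5   # default freqCut, i.e. 19

# Condition 1 (freqRatio above freqCut) holds in both runs: 26.5 > 19.
# Condition 2 (percentUnique below uniqueCut) holds for 10 but not for 0.04.
flag_default <- freq_ratio > freq_cut && percent_unique < 10
flag_strict  <- freq_ratio > freq_cut && percent_unique < 0.04

c(default = flag_default, uniqueCut_0.04 = flag_strict)
```

The other three flagged variables have percentUnique values of roughly 0.009 and 0.033, both under 0.04, so they stay flagged in the second run.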
Function With Special Talent from ‘caret’ package in R — NearZeroVar() was first posted on September 4, 2021 at 8:54 am.