Cats are great. Perhaps Hadley Wickham and Lionel Henry think so too given the wonderful choice of name for their purrr package. Hadley Wickham has also created a superb package called forcats, likely an abbreviation of “for categoricals” but wittingly cat-themed, which is very, very useful to the data scientist.
In the data science profession, one spends a vast amount of time getting data into a clean, tidy format (aka preprocessing) for subsequent statistical analysis and modelling. If I were to estimate how much of my work life is spent doing this, I would say 70-80% at least. This is often the main difference between university and real world application. The datasets used in learning are nice and clean whereas those you encounter in the real world are far from it. I really enjoy data preprocessing and the problems that arise. I have found a handful of functions from the forcats package to be incredibly useful when working with categorical data, or factors, in R.
Prepare some demonstration data
A simple dataset is created using the following code. The sales variable has 7 factor levels and roughly 50 NAs; the exact count varies from run to run because the values are resampled with replacement and no seed is set.
library(dplyr)   # Also load up dplyr so we can use tibble(), mutate() and the pipe operator: %>%
library(forcats)

# tibble() replaces the deprecated data_frame(); TRUE is spelled out instead of T
df <- tibble(
  sales = factor(rep(c("Online", "Post", "Web", "Call Centre", "Inbound Phone",
                       "Outbound Phone", "Field Sales", NA), 50)),
  buy = sample(c(0, 1), 400, replace = TRUE)
) %>%
  mutate(sales = sample(sales, size = length(sales), replace = TRUE))

table(df$sales)
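Because the rows are resampled at random, the counts differ each time the code is run. Here is a minimal base R sketch, assuming an arbitrary seed of 42 and a hypothetical vector name sales_demo (neither is part of the original code), which makes the draw repeatable and shows the NA count that table() hides by default:

```r
# Assumption: the seed value 42 is arbitrary and not taken from the original post.
set.seed(42)

channels <- c("Online", "Post", "Web", "Call Centre",
              "Inbound Phone", "Outbound Phone", "Field Sales", NA)

# Equivalent construction in base R only: 400 draws with replacement.
sales_demo <- factor(sample(rep(channels, 50), size = 400, replace = TRUE))

nlevels(sales_demo)                 # NA is not counted as a factor level
sum(is.na(sales_demo))              # number of missing values in this draw
table(sales_demo, useNA = "ifany")  # include the NA count in the tabulation
```

Setting a seed before building df in the same way would make the counts quoted in the rest of the post reproducible.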
Dealing with missing data (NAs)
Missing values are a common occurrence in real-world data. In an R data frame, they need to be considered when performing statistical computations on continuous data such as calculating the mean, median, variance, standard deviation, etc. (for example, by passing na.rm = TRUE). In addition, when you have a dataset from which you want to build a model, you may need to treat NAs in order to retain particular variables and/or prevent the loss of data due to incomplete cases. Common strategies involve imputing the mean or median (continuous) or, if categorical, the mode. However, in some cases this is not appropriate and you may need to set NAs as an explicit factor level. With forcats::fct_explicit_na() we can achieve this in a one-liner. Let’s try it out on the sales variable in our test data:
df$sales <- fct_explicit_na(df$sales)
table(df$sales)
The NA values are now replaced by an explicit factor level, (Missing). The new level can be named whatever you like by passing a name to the na_level argument when the function is first applied (run on a factor whose NAs have already been replaced, it changes nothing):
df$sales <- fct_explicit_na(df$sales, na_level = "My New Level")
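A note for readers on a recent forcats release: as far as I am aware, fct_explicit_na() was deprecated in forcats 1.0.0 in favour of fct_na_value_to_level(). The sketch below, on a small hypothetical factor f, picks whichever function the installed version provides; the level name "Unknown" is only an illustration.

```r
library(forcats)

# Hypothetical toy factor with two missing values.
f <- factor(c("a", "b", NA, "a", NA))

# Assumption: forcats >= 1.0.0 provides fct_na_value_to_level();
# older versions only have fct_explicit_na().
if (utils::packageVersion("forcats") >= "1.0.0") {
  f_explicit <- fct_na_value_to_level(f, level = "Unknown")
} else {
  f_explicit <- fct_explicit_na(f, na_level = "Unknown")
}

levels(f_explicit)  # the new level is listed alongside the original ones
table(f_explicit)   # the former NAs are now counted under "Unknown"
anyNA(f_explicit)   # no missing values remain
```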
Synonymous factor levels
Sometimes a categorical variable may have two or more factor levels that refer to the same group. There may be subtle differences in spelling, such as an upper case versus a lower case leading letter (GroupA vs. groupA), for example. In this situation, one can use forcats::fct_collapse() to collapse the synonymous levels into one. In our test data, let’s assume that Web and Online refer to the same sales channel and we want to combine both into a factor level called Online:
df$sales <- fct_collapse(df$sales, Online = c("Online", "Web"))
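The GroupA vs. groupA case mentioned above can be sketched the same way. The factor grp below is a hypothetical toy example, not part of the test data:

```r
library(forcats)

# Hypothetical factor in which "GroupA" and "groupA" mean the same group.
grp <- factor(c("GroupA", "groupA", "GroupB", "groupA"))

# New level name on the left, the old levels it absorbs on the right.
grp_clean <- fct_collapse(grp, GroupA = c("GroupA", "groupA"))

levels(grp_clean)  # only GroupA and GroupB remain
table(grp_clean)   # the three GroupA/groupA observations are counted together
```

Levels that are not named in the call, such as GroupB here, are left untouched.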
Lumping infrequent factor levels into one level
Another situation can occur whereby one wants to analyse or model using groups of data where large sample sizes are necessary for statistical significance. Imagine a factor variable with 20 levels, of which just 5 account for over 90% of the observations in a dataset. You might drop the observations belonging to the other 15 levels entirely, but the loss of data should be avoided if possible. Alternatively, you could lump the infrequent levels into one level which covers them all, preserving the rest of the attributes associated with those observations and enabling you to tidy up the group levels. By default, forcats::fct_lump() lumps the least frequent levels into a new level called Other while ensuring that the count associated with this level remains the lowest of the levels. The user can also specify how many of the most frequent levels to keep by using the n argument. In the sample drawn for this post, the level Outbound Phone was the least frequent, so it is lumped into the new level Other (your own run may differ, since the data are sampled at random):
df$sales <- fct_lump(df$sales)
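To illustrate the n argument, here is a small sketch on a hypothetical, deliberately skewed factor (the levels a to e and their counts are made up for the example):

```r
library(forcats)

# Hypothetical skewed factor: two common levels and three rare ones.
f <- factor(rep(c("a", "b", "c", "d", "e"), times = c(50, 30, 3, 2, 1)))

# Keep the 2 most frequent levels; everything else is lumped into "Other".
lumped <- fct_lump(f, n = 2)

levels(lumped)  # "a" "b" "Other"
table(lumped)   # a = 50, b = 30, Other = 6
```

No observations are dropped: the 6 rows from c, d and e are still present, just under a shared label.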
Reordering bars in a ggplot2 barplot
The tidyverse is a suite of packages which contains dplyr, ggplot2 and forcats among others. The packages make up a wonderful ecosystem of tools for a data scientist and I would like to demonstrate how one more function, forcats::fct_infreq(), can be used with ggplot2 during the exploratory data analysis/data presentation stage. Reordering factors in a barplot can be somewhat tricky but is made easy using the forcats package:
library(ggplot2)
ggplot(df, aes(x = fct_infreq(sales))) +
  geom_bar()
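What fct_infreq() does to the level order can be checked without drawing a plot. A minimal sketch on a hypothetical toy factor, which also uses forcats::fct_rev() to flip the order for bars that increase from left to right:

```r
library(forcats)

# Hypothetical factor: "c" occurs 3 times, "a" twice, "b" once.
f <- factor(c("b", "a", "c", "a", "c", "c"))

levels(f)                       # default order of the levels
levels(fct_infreq(f))           # most frequent first: "c" "a" "b"
levels(fct_rev(fct_infreq(f)))  # least frequent first: "b" "a" "c"
```

Since ggplot2 draws the bars in level order, wrapping the variable in fct_rev(fct_infreq(...)) inside aes() should give an ascending barplot.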
The factor levels are now sorted by decreasing frequency and (Missing) and Other are included. We have lost none of the original data by carrying out some quick preprocessing using forcats. Happy days!