Eight R Tidyverse tips for everyday data engineering
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Tidyverse is a collection of R packages, primarily for data engineering and analytics. These packages are ggplot2, purrr, tibble, dplyr, tidyr, stringr, readr, and forcats. And all combine the same language, design and “grammar” structures.
1. Use pipe “%>%” for nesting functions within the pipe
Piping (or chaining) is a great way to link the data manipulation functions without storing intermediate results.
# inner piping inside mutate airlines %>% mutate(name_short = name %>% str_to_upper() %>% str_replace_all (" (INC|CO)\\.?$", "") %>% str_replace_all (" AIR ?(LINES|WAYS)?( CORPORATION)?$", "") %>% str_to_title() %>% str_replace_all("\\bUs\b", "US") ) %>% mutate(FullName_length = nchar(name)) %>% select (name_short, FullName_length) %>% arrange(desc(FullName_length))
This code is clearly readable, understandable and easy to curate, despite the heavy use of REGEX and str_replace_all function. If we would replace the inner piping with normal function wrapping, the mutate for “name_short” would be fairly unreadable.
# without inner piping airlines %>% mutate(name_short = str_replace_all( str_to_title( str_replace_all(str_replace_all(str_to_upper(name)," (INC|CO)\\.?$", "")," AIR ?(LINES|WAYS)?( CORPORATION)?$", "")),"\\bUs\b","US")) %>% mutate(FullName_length = nchar(name)) %>% select (name_short, FullName_length) %>% arrange(desc(FullName_length))
Both code snippets return the same result and use essentially the same functions. Whereas, the first one intelligently does this, using pipes.
2. Operations across multiple columns at once
Tidyverse provides a handful of great functions for operating across multiple columns simultaneously. Across is a function, that makes it easy to apply the same transformation over numerous columns in summarise() and mutate() functions.
Across accepts two arguments; a) array of columns and b) function or list of functions to be applied to selected columns.
# Simple across with two selected columns flights %>% mutate(across(c(dep_delay, arr_delay), abs)) # Using function where to select columns flights %>% summarise(across(where(is.integer), n_distinct)) # Using where and purr-style lamba function flights %>% summarise(across(where(is.integer), ~ sum(.x, na.rm = TRUE))) # using list of functions, defining column names and removing NA flights %>% group_by(carrier) %>% summarise(across(ends_with("time"), list(AVG = mean, SD = sd, GrandTotal= ~ sum(is.na(.x))), na.rm=TRUE, .names = "{.col}.{.fn}")) %>% ungroup()
3. Case statement to create a column based on a condition
Creating case statements is often a required task and case_when() function enables simple, fast transformation. A new column is simply added on top of other tidyverse functions; such as group_by and count.
flights %>% group_by(carrier) %>% mutate(new_classification = case_when( (origin == "EWR") & (dep_delay <= 0) ~ "EWR with negative delay", (origin == "EWR") & (dep_delay > 0) ~ "EWR with positive delay", (origin == "JFK") ~ "Stats for JFK Airport", (origin == "LGA") & (air_time <= 220) ~ "La Guardia under 6 hours flights", TRUE ~ "La Guardia above 6 hours flights" )) %>% count(new_classification) %>% ungroup()
4. Using transmute
This function executes same functions as mutate — create new column. But transmute() adds new variable(s) and drop existing ones, where mutate() preserves the existing ones.
library(lubridate) # Combination of mutate and select flights %>% mutate(date = make_date(year, month, day), carrier, tailnum) %>% select(date, carrier, tailnum) # using transmute flights %>% transmute(date = make_date(year, month, day), carrier, tailnum)
The example above returns essentially the same result set, whereas transmute() is cleaner and easier to read. If one would remove the select() function from the first example, one would get the complete dataset, along with a new column.
5. Lumping levels
Lumping together levels for a given factor variable is an extremely powerful function when there are many levels but one is interested in only the top most frequent levels.
Given the vector of 15 letters and their frequency, one can immediately see, that out of 15 different letters, roughly 80% of the observations fall into 5 letters.
# using transmute x <- factor(rep(LETTERS[1:15], times = c(20,15,23,2,4,3,1,1,1,5,2,8,3,1,1))) x %>% table()
And to achieve this, one can use fct_lump_n() function or fct_lump_min().
x %>% fct_lump_n(5) %>% table() # or alternatively # x %>% fct_lump_min(5) %>% table() ## result: > A B C J L Other > 20 15 23 5 8 19
Now displaying the long tail of different levels with low frequency is grouped into “other” bucket, great for visualising data.
# using ggplot to show top 5 frequent carriers and "other" group flights %>% mutate(name = fct_lump_n(carrier, 5)) %>% count(name) %>% mutate(name = fct_reorder(name, n)) %>% ggplot(aes(x=name, y=n)) + geom_col()
6. Generating all possible combinations
Generating all the possible combinations out of a set of variables is not your everyday scenario. But Tidyverse offers the function called crossing() or expand() to do the job.
Following example will create all possible combinations for variables age, status, values and temperature.
# pseudo sample crossing( age = c(30,40,50,60,70), status = c("New", "Used"), values = c("0-100EUR", "101-200E", "201-300", "301-400"), temperature = c(30,35,34) )
Working with flights dataset, let’s create all possible combinations for couple of selected variables. One would use function expand().
# generating all possible combinations flights %>% expand(origin, dest, dep_time, carrier)
But finding all existing and unique combinations
# getting all possible combinations that are present in dataset using crossing flights %>% select(origin, dest, dep_time, carrier) %>% crossing() # > 97,946 more rows # getting all possible combinations using expand and nesting flights %>% expand(nesting(origin, dest, dep_time,carrier)) # > 97,946 more rows
7. Reshaping data with pivot and spread
Another important task that will usually occur with data preparation is data reshaping.
First function one must (!) understand and be comfortable work with is pivot_wider(). This function takes the values from one variable (in this case variable: origin) and transpose (reshape / pivot) the data (dep_time and arr_time) for given statistics (average of values).
#pivot_wider flights %>% group_by(carrier) %>% select(origin, dep_time, arr_time) %>% pivot_wider( names_from = origin, values_from = c(dep_time,arr_time), values_fn = ~mean(.x, na.rm = TRUE), names_glue = "{origin}_{.value}" ) #check calculation for carrier UA and origin EWR using summarise flights %>% filter(carrier == 'UA' & origin == 'EWR') %>% group_by(carrier) %>% summarise( avg_dep_time = mean(dep_time, na.rm = TRUE) )
Contra to pivot_wider() is the function pivot_longer(). This function does the exact opposite. Wider function adds more columns and reduces the number of rows, longer function reduces the number of columns and creates new rows of data.
Let’s persist the dataframe from pivot_wider example and name the dataframe flights_wider.
#create and persist dataframe called: flights_wider flights_wider <- flights %>% group_by(carrier) %>% select(origin, dep_time, arr_time) %>% pivot_wider( names_from = origin, values_from = c(dep_time,arr_time), values_fn = ~mean(.x, na.rm = TRUE), names_glue = "{origin}_{.value}" ) # excluding carrier flights_wider %>% pivot_longer( !carrier, names_to = "origin", values_to = "time", values_drop_na = TRUE ) #or with cols parameter and defining the pattern for column selection flights_wider %>% pivot_longer( cols = ends_with("time"), names_to = "origin", values_to = "time", values_drop_na = TRUE )
8. Adding a running ID to your dataframe
The last tip is the easiest and yet can save tons of time. Adding a running ID to each row of the dataframe. This ID will create “uniqueness” overall attributes (columns) for a given row. This can save you a lot of filtering and writing code, whereas, one can say, give me a row with ID = 42.
# Associate ID with every row in dataset flights %>% mutate(running_id = row_number()) # or using tibble row_to_column function flights2 <- tibble::rowid_to_column(flights, ID)
This list is by no means the ultimate list and one could add many honourable mentions. But despite all that, the list has been curated over a longer period of time and this cherry-picking is based on work and the use of Tidyverse.
As always, the code is available at my Github repository.
Happy Tidyversing and stay healthy!
p.s.: Article was originally published on Medium.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.