Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The dplyr
package is an essential tool for manipulating data in R. The “Introduction to dplyr” vignette gives a good overview of the common dplyr functions (list taken from the vignette itself):
filter()
to select cases based on their values.arrange()
to reorder the cases.select()
andrename()
to select variables based on their names.mutate()
andtransmute()
to add new variables that are functions of existing variables.summarise()
to condense multiple values to a single value.sample_n()
andsample_frac()
to take random samples.
The “Two-table verbs” vignette gives a good introduction to using dplyr
function for joining two tables together. The “Window functions” vignette talks about, well, window functions, which are defined as functions which take n values and return n values (as opposed to aggregation functions, which take n values but return one value). The window functions mentioned there are (list taken from the vignette itself):
- Ranking and ordering functions:
row_number()
,min_rank()
,dense_rank()
,cume_dist()
,percent_rank()
, andntile()
. These functions all take a vector to order by, and return various types of ranks. - Offsets:
lead()
andlag()
allow you to access the previous and next values in a vector, making it easy to compute differences and trends. - Cumulative aggregates:
cumsum()
,cummin()
,cummax()
(from base R), andcumall()
,cumany()
, andcummean()
(from dplyr).
However, dplyr
comes with several other functions that are not mentioned in the vignettes (or at least, not at length). In this post I’ll talk about some of them. (For the full list of dplyr
functions, see the reference manual.)
- Counting functions:
n()
andn_distinct()
- If-else functions:
if_else()
andcase_when()
- Comparison functions:
between()
andnear()
- Selecting specific elements based on position:
nth()
,first()
andlast()
- Selecting specific rows based on position/value:
slice()
andtop_n()
- Utilities:
coalesce()
andpull()
To illustrate these functions, I will use the flights dataset in the nycflights13
package.
library(tidyverse) library(nycflights13) data(flights)
Counting functions: n()
and n_distinct()
Ok, these aren’t exactly lesser known functions but they’re certainly useful. n()
counts the number of rows in each group. For example, to count the number of flights in each month:
flights %>% group_by(month) %>% summarize(count = n()) # # A tibble: 12 x 2 # month count # <int> <int> # 1 1 27004 # 2 2 24951 # 3 3 28834 # 4 4 28330 # 5 5 28796 # 6 6 28243 # 7 7 29425 # 8 8 29327 # 9 9 27574 # 10 10 28889 # 11 11 27268 # 12 12 28135
n_distinct()
counts the number of unique values in each group. To count the number of distinct values of day
in the dataset:
flights %>% summarize(cnt = n_distinct(day)) # # A tibble: 1 x 1 # cnt # <int> # 1 31
as we expect, since the longest month only has 31 days. We can also count the number of unique sets of values across columns. To count the number of distinct (month, day)
values:
flights %>% summarize(cnt = n_distinct(month,day)) # # A tibble: 1 x 1 # cnt # <int> # 1 365
as expected (number of days in a year).
If-else functions: if_else()
and case_when()
if_else()
returns a value which depends on whether a given condition is true or not. It works like the base ifelse()
function, except that it is stricter in that the returned value must be of the same type, whether the given condition is true or not. if_else()
is commonly used to create new columns. For example, we could create a new column that is “United” if a flight was from United Airlines, “Other” otherwise:
flights %>% transmute(carrier = carrier, isUA = if_else(carrier == "UA", "United", "Otherwise")) %>% head() # # A tibble: 6 x 2 # carrier isUA # <chr> <chr> # 1 UA United # 2 UA United # 3 AA Otherwise # 4 B6 Otherwise # 5 DL Otherwise # 6 UA United
case_when()
is like if_else()
except that we have more than 2 possible values for the output. It is “the R equivalent of the SQL CASE WHEN statement”. If none of the cases match, an NA is returned. As with if_else()
, the returned values for all the cases must be of the same type.
Here is some code to return “United” if carrier is UA, “JetBlue” if it is B6, and “Other” for all other carriers (notice that the syntax within case_when()
is a little different from the usual, and how we use TRUE
to catch all the cases that don’t meet the first two conditions):
flights %>% transmute(carrier = carrier, newCol = case_when( carrier == "UA" ~ "United", carrier == "B6" ~ "JetBlue", TRUE ~ "Otherwise" )) %>% head() # # A tibble: 6 x 2 # carrier newCol # <chr> <chr> # 1 UA United # 2 UA United # 3 AA Otherwise # 4 B6 JetBlue # 5 DL Otherwise # 6 UA United
Comparison functions: between()
and near()
between(x, left, right)
is a shortcut for x >= left & x <= right
that has an efficient implementation. It makes for more concise code too, at the expense of being unable to tell if the endpoints are included or not unless one is familiar with the function.
near()
is a safer way than using ==
to compare if two vectors of floating point numbers are equal element-wise. This is because it uses tolerance which is based on the machine’s double precision. Here is the example provided in near()
‘s documentation:
sqrt(2) ^ 2 == 2 # [1] FALSE near(sqrt(2) ^ 2, 2) # [1] TRUE
Selecting specific elements based on position: nth()
, first()
and last()
nth(x, n)
returns the nth element in the vector x
. An optional vector to determine the order can be passed to the function as well. If n
is negative, we count from the end of x
(e.g. -2 will return the second last value in x
). first()
and last()
are obvious special cases of nth()
.
What’s interesting is that these functions are actually wrappers around [[
. This means that they can be used to pull out the nth element in a list as well!
Selecting specific rows based on position/value: slice()
and top_n()
slice()
allows us to select certain rows based on their ordinal position (i.e. row number). This works especially well with grouped dataframes. For example, if we want the first row for each date:
flights %>% group_by(month, day) %>% slice(1) %>% select(month, day, dep_time) # # A tibble: 365 x 3 # # Groups: month, day [365] # month day dep_time # <int> <int> <int> # 1 1 1 517 # 2 1 2 42 # 3 1 3 32 # 4 1 4 25 # 5 1 5 14 # 6 1 6 16 # 7 1 7 49 # 8 1 8 454 # 9 1 9 2 # 10 1 10 3 # # … with 355 more rows
A simple modification gives us the first two rows for each date:
flights %>% group_by(month, day) %>% slice(1:2) %>% select(month, day, dep_time) # # A tibble: 730 x 3 # # Groups: month, day [365] # month day dep_time # <int> <int> <int> # 1 1 1 517 # 2 1 1 533 # 3 1 2 42 # 4 1 2 126 # 5 1 3 32 # 6 1 3 50 # 7 1 4 25 # 8 1 4 106 # 9 1 5 14 # 10 1 5 37 # # … with 720 more rows
In contrast to slice()
which picks out rows by their position (within the group), top_n()
picks out rows by their value in a pre-specified column (within the group). For example, if we wanted to see the carriers for the top 3 longest departure delays for each day, we could use this code:
flights %>% group_by(month, day) %>% top_n(3, dep_delay) %>% select(month, day, dep_delay, carrier) # # A tibble: 1,108 x 4 # # Groups: month, day [365] # month day dep_delay carrier # <int> <int> <dbl> <chr> # 1 1 1 853 MQ # 2 1 1 290 EV # 3 1 1 379 EV # 4 1 2 334 UA # 5 1 2 337 AA # 6 1 2 379 UA # 7 1 3 268 DL # 8 1 3 252 B6 # 9 1 3 291 9E # 10 1 4 208 B6 # # … with 1,098 more rows
Notice that while we get the top 3 rows in dep_delay
for each day, the rows are not sorted according to the dep_delay
column. To get the bottom three rows instead, use a negative number:
flights %>% group_by(month, day) %>% top_n(-3, dep_delay) %>% select(month, day, dep_delay, carrier) # # A tibble: 1,504 x 4 # # Groups: month, day [365] # month day dep_delay carrier # <int> <int> <dbl> <chr> # 1 1 1 -15 MQ # 2 1 1 -14 F9 # 3 1 1 -15 AA # 4 1 2 -13 AA # 5 1 2 -13 UA # 6 1 2 -12 AA # 7 1 2 -12 9E # 8 1 3 -12 EV # 9 1 3 -12 EV # 10 1 3 -13 B6 # # … with 1,494 more rows
Notice how Jan 2nd has 4 entries, not 3? That’s because there was a tie for 3rd place. top_n()
either takes all rows with a value or none of them. Another gotcha: if the column that you are sorting against is not specified, it defaults to the last variable in the data frame, NOT to ordinal position/row number.
Utilities: coalesce()
and pull()
From the documnetation: “Given a set of vectors, coalesce()
finds the first non-missing value at each position. This is inspired by the SQL COALESCE function which does the same thing for NULLs.”
I haven’t had to use this function myself so far, but I imagine it could be useful when you have a number of different data sources reporting the same thing but with different missing values in each source. Here is a contrived example:
x <- c(NA, NA, 3) y <- c(1, NA, 4) z <- c(NA, 2, NA) coalesce(x, y, z) # [1] 1 2 3
Notice that the third value in y
showing a 4 is ignored.
pull()
works like [[
: it allows you to pull out a particular column (as a vector). The variable you pass to the function can be a variable name, a positive integer (giving the column position when counting from the left) or a negative integer (giving the column position when counting from the right).
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.