Piping data.tables
Like a devoted plumber, modern R loves pipes. The magrittr pipe has a long history and its fair share of detractors, but with the implementation of the native pipe operator released in May 2021 it’s clear that chaining operations is now part of R vernacular.
So it’s no surprise that people often wonder how to use pipes with data.table, as one participant asked during the recent data.table tutorial at LatinR 2023. The surprising answer is that data.table has supported pipelines since its inception in 2006. Furthermore, you can easily use either the magrittr or the native pipe.
The data.table “pipe”
Instead of passing data to functions, data.table syntax is all about operating inside the [ operator (see footnote 1).
DT[rows, columns, by]
Where DT is a data.table object, the rows argument is used for filtering and joining operations, the columns argument can summarise and mutate, and the by argument defines the groups to which to apply these operations.
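As a rough, self-contained illustration of the three arguments, here is a minimal sketch. The toy table and its values are made up for this example and are not the real penguins data; the result names (chinstrap, overall, by_species, flipper_centered) are hypothetical too.

```r
library(data.table)

# Hypothetical toy data, not the real penguins dataset
toy <- data.table(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Chinstrap"),
  sex     = c("male", "female", "male", "female"),
  flipper_length_mm = c(181, 192, 196, 188)
)

# rows: keep only the rows that match a condition
chinstrap <- toy[species == "Chinstrap"]

# columns: summarise the whole table into a single row
overall <- toy[, .(mean_flipper_length = mean(flipper_length_mm))]

# columns + by: the same summary, computed within each group
by_species <- toy[, .(mean_flipper_length = mean(flipper_length_mm)), by = species]

# columns with := : add a column by reference (the "mutate" case), here per group
toy[, flipper_centered := flipper_length_mm - mean(flipper_length_mm), by = species]
```

Note that := modifies toy in place, while the other three calls return new data.tables.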
So, to get only Chinstrap penguins from the penguins dataset, instead of using base::subset() or dplyr::filter() you would do:
penguins_chinstrap <- penguins[species == "Chinstrap"]
Or, to get the mean flipper length of these penguins for each island and sex, you could summarise the data like this:
penguins_chinstrap[, .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
But because the output of the first operation is a data.table, you can add another [ operator after the first to chain both operations:
penguins[species == "Chinstrap"][, .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
I usually call this the ][ pipe.
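To make the equivalence concrete, here is a minimal sketch comparing the two-step version with the chained one, and then tacking a third step onto the chain. The toy table is made up for illustration, and the object names are hypothetical.

```r
library(data.table)

# Hypothetical toy data, not the real penguins dataset
toy <- data.table(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Chinstrap"),
  sex     = c("male", "female", "male", "female"),
  flipper_length_mm = c(181, 192, 196, 188)
)

# Two separate steps with an intermediate object
step1 <- toy[species == "Chinstrap"]
two_steps <- step1[, .(mean_flipper_length = mean(flipper_length_mm)), by = sex]

# The same two operations chained with the ][ pipe
chained <- toy[species == "Chinstrap"][, .(mean_flipper_length = mean(flipper_length_mm)), by = sex]

# Each step returns a data.table, so a third step can simply be appended
sorted <- chained[order(-mean_flipper_length)]

# Compare the two-step and chained results
print(all.equal(two_steps, chained))
```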
You might have noticed that for just two operations, this line of code is already too long, so for even moderately long chains it’s usually advisable to put each operation on its own line. There’s some controversy on how to break the ][ pipe into lines and indent it. One option is to add a new line after the second [, which has the advantage of actually writing the ][ pipe explicitly.
penguins[species == "Chinstrap"][
  , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
A second option is to add the new line before the end of the first operation like so:
penguins[species == "Chinstrap"
         ][ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
Personally, I don’t like this syntax very much. No matter how you slice it, you always get what feel to me like incomplete lines. Also, RStudio doesn’t correctly indent the second syntax automatically.
Alternatively, the ][ pipe can go on its own line like so:
penguins[species == "Chinstrap"
][
  , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)
]
This is indented correctly by RStudio and has the advantage of making it easy to comment out each individual step:
penguins[species == "Chinstrap"
][
  # , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)
]
data.table and magrittr
Until the introduction of the native pipe, I used to write long data.table pipelines using magrittr. To do this, I took advantage of the . placeholder which, within a magrittr pipe, refers to the result of the previous step.
library(magrittr)
penguins[species == "Chinstrap"] %>%
  .[ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
I really like this syntax as it’s very clean. Each line of code is a complete operation without dangling parts and it’s easy to comment out single steps.
The only downside is that the dot here has two meanings: as the placeholder for the previous result in .[, and as an alias for list in .(mean_flipper_length = mean(flipper_length_mm)). It’s not a huge issue, though, since I tend to read .[ as a single entity, but it can trip up some people.
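If the double meaning bothers you, one way around it is to spell out list() so that the placeholder is the only dot left. The following is a minimal sketch with made-up toy data (not the real penguins dataset) and hypothetical object names, showing both spellings side by side.

```r
library(data.table)
library(magrittr)

# Hypothetical toy data, not the real penguins dataset
toy <- data.table(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Chinstrap"),
  sex     = c("male", "female", "male", "female"),
  flipper_length_mm = c(181, 192, 196, 188)
)

# Both meanings of the dot: .[ is the magrittr placeholder, .( is data.table's alias for list(
with_dot <- toy[species == "Chinstrap"] %>%
  .[, .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex)]

# Spelling out list() leaves the placeholder as the only dot in the expression
with_list <- toy[species == "Chinstrap"] %>%
  .[, list(mean_flipper_length = mean(flipper_length_mm)), by = list(sex)]

# Compare the two spellings
print(all.equal(with_dot, with_list))
```

The trade-off is a few more characters per step in exchange for a dot that only ever means one thing.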
Using the native pipe
At first, the native pipe had no placeholder and didn’t support chaining into [, so the above syntax wasn’t directly applicable. But you could cheat by creating an alias for [ and using that alias as a regular function. So this works:
DT <- `[`
penguins[species == "Chinstrap"] |>
  DT( , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island))
This worked so well that data.table officially added the DT() function (currently only in the development version), so if you’re using the latest development version you don’t even need the first line (see footnote 2).
This syntax is fine, but I don’t like that I need to write one more character, and the closing ) can get confusing because it piles onto the closing ) that you usually have in the by argument.
From R 4.3.0 onwards, the native pipe supports using the _ placeholder as the head of an extraction such as _[ on the right-hand side of the pipe. So now the magrittr syntax can be directly translated to
penguins[species == "Chinstrap"] |>
  _[ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
I like this syntax even more than the original magrittr one because it solves the double meaning problem and operations get hugged by a pair of brackets.
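The same pattern extends to longer chains, since every step is just another _[ call. Here is a minimal three-step sketch, assuming R >= 4.3.0 and using made-up toy data rather than the real penguins dataset; the object names are hypothetical.

```r
library(data.table)

# Hypothetical toy data, not the real penguins dataset
toy <- data.table(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Chinstrap"),
  sex     = c("male", "female", "male", "female"),
  flipper_length_mm = c(181, 192, 196, 188)
)

# Requires R >= 4.3.0: each step is a complete _[ ... ] call on its own line
result <- toy[species == "Chinstrap"] |>
  _[, .(mean_flipper_length = mean(flipper_length_mm)), by = sex] |>
  _[order(-mean_flipper_length)]

print(result)
```

Any single step can be commented out without leaving a dangling bracket, as long as the line before it doesn’t end up with a trailing |>.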
The four pipes of data.table
So, there you are, 4 different ways you can pipe your data.tables.
Use the ][ pipe if you want your code to have minimal dependencies and work in older versions of R. Use the %>% pipe if you want your code to work in older versions of R and don’t mind the extra dependency. Use either version of the |> pipe if you want minimal dependencies and don’t mind depending on a recent R (R >= 4.1.0 for the pipe itself, R >= 4.3.0 for the _[ form).
penguins[species == "Chinstrap"][
  , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]

penguins[species == "Chinstrap"] %>%
  .[ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]

penguins[species == "Chinstrap"] |>
  DT( , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island))

penguins[species == "Chinstrap"] |>
  _[ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
Image by storyset on Freepik
Footnotes
1. Technically, the [ operator is itself a function, but the syntax is not function-like.
2. This function also allows the user to apply data.table syntax to data.frames and tibbles while retaining the class of the output.