tidyfast
Author(s): Tyson S. Barrett, Mark Fairbanks, Ivan Leung, Indrajeet Patil
Maintainer: Tyson S. Barrett (t.barrett88@gmail.com)
The goal of tidyfast is to provide fast and efficient alternatives to some tidyr (and a few dplyr) functions using data.table under the hood. Each has the prefix dt_ to allow for autocomplete in IDEs such as RStudio. These functions should complement some of the current functionality in dtplyr (but notably do not use the lazy_dt() framework of dtplyr). This package imports data.table and cpp11 (no other dependencies). These are, in essence, translations from a more tidyverse grammar to data.table. Most functions herein are in places where, in my opinion, the data.table syntax is not obvious or clear. As such, these functions can translate a simple function call into the fast, efficient, and concise syntax of data.table.
Relationship with data.table
tidyfast was designed to be an extension to and translation of data.table. As such, there are three main ways tidyfast is related to data.table.
- This package is built directly on data.table, using direct calls to [.data.table and other functions under the hood.
- It only relies on two packages, cpp11 and data.table, both stable packages that are unlikely to have breaking changes often. This follows the data.table principle of few dependencies.
- It was designed to also show how others can use data.table within their own package to create functions that flexibly call data.table in complex ways.
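To make the last point a bit more concrete, here is a minimal sketch of a wrapper that accepts bare column names and hands the work to [.data.table. The function dt_count_sketch() and the small cars table are hypothetical and exist only for illustration; this is not code from tidyfast, and the package's own functions may be written differently.

```r
library(data.table)

# Hypothetical helper (not part of tidyfast): count rows per group,
# accepting bare (unquoted) column names through `...`.
dt_count_sketch <- function(dt_, ...) {
  # Capture the unevaluated arguments and turn them into column-name strings
  by_cols <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  # Coerce so that plain data.frames are accepted as well
  dt <- as.data.table(dt_)
  # Delegate the grouping and counting to `[.data.table`
  dt[, .(n = .N), keyby = by_cols]
}

# Made-up example data
cars <- data.table(
  cyl = c(4, 4, 6, 8, 8, 8),
  mpg = c(30, 28, 21, 15, 14, 16)
)

res <- dt_count_sketch(cars, cyl)
res
```

In an actual package, data.table would also need to be declared under Imports and imported in the NAMESPACE so that [.data.table is dispatched correctly.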
Overview
As shown on the tidyfast GitHub page, tidyfast has several functions that have the prefix dt_. A few notable functions from the package are shown below.
library(tidyfast)
library(data.table)
library(magrittr)
dt_fill
Filling NAs is a useful operation, but tidyr::fill() can become too slow, especially when filling within many, many groups. dt_fill() is useful for this and can be used in a few different ways.
x = 1:10
dt_with_nas <- data.table(
  x = x,
  y = shift(x, 2L),
  z = shift(x, -2L),
  a = sample(c(rep(NA, 10), x), 10),
  id = sample(1:3, 10, replace = TRUE)
)

# Original
dt_with_nas

        x     y     z     a    id
    <int> <int> <int> <int> <int>
 1:     1    NA     3    NA     1
 2:     2    NA     4    NA     3
 3:     3     1     5    NA     2
 4:     4     2     6     6     3
 5:     5     3     7    10     1
 6:     6     4     8     9     3
 7:     7     5     9     1     3
 8:     8     6    10     2     2
 9:     9     7    NA    NA     2
10:    10     8    NA     3     1
# Default direction (down), with immutable = FALSE
dt_fill(dt_with_nas, y, z, a, immutable = FALSE)

        x     y     z     a    id
    <int> <int> <int> <int> <int>
 1:     1    NA     3    NA     1
 2:     2    NA     4    NA     3
 3:     3     1     5    NA     2
 4:     4     2     6     6     3
 5:     5     3     7    10     1
 6:     6     4     8     9     3
 7:     7     5     9     1     3
 8:     8     6    10     2     2
 9:     9     7    10     2     2
10:    10     8    10     3     1
# by id variable called `id`
dt_fill(dt_with_nas, y, z, a, id = list(id))

        x     y     z     a    id
    <int> <int> <int> <int> <int>
 1:     1    NA     3    NA     1
 2:     2    NA     4    NA     3
 3:     3     1     5    NA     2
 4:     4     2     6     6     3
 5:     5     3     7    10     1
 6:     6     4     8     9     3
 7:     7     5     9     1     3
 8:     8     6    10     2     2
 9:     9     7    10     2     2
10:    10     8    10     3     1
# both down and then up filling by group
dt_fill(dt_with_nas, y, z, a, id = list(id), .direction = "downup")

        x     y     z     a    id
    <int> <int> <int> <int> <int>
 1:     1     3     3    10     1
 2:     2     2     4     6     3
 3:     3     1     5     2     2
 4:     4     2     6     6     3
 5:     5     3     7    10     1
 6:     6     4     8     9     3
 7:     7     5     9     1     3
 8:     8     6    10     2     2
 9:     9     7    10     2     2
10:    10     8    10     3     1
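For comparison, a grouped fill can also be written directly in data.table with nafill(). The sketch below is not necessarily how dt_fill() works internally; it only illustrates the kind of syntax the wrapper saves you from writing. The table toy and its columns are made up for this example, and nafill() only handles numeric columns.

```r
library(data.table)

# Made-up table (not the one above) with gaps in `val`
toy <- data.table(
  id  = c(1L, 1L, 1L, 2L, 2L, 2L),
  val = c(NA, 5L, NA, 7L, NA, NA)
)

# Down fill within each id: last observation carried forward.
# copy() keeps `toy` itself unchanged, since := modifies by reference.
down <- copy(toy)[, val := nafill(val, type = "locf"), by = id]

# Down, then up: carry forward first, then back-fill what is still missing
downup <- copy(toy)[
  , val := nafill(nafill(val, type = "locf"), type = "nocb"),
  by = id
]

down
downup
```

The leading NA in group 1 survives the down fill because there is nothing before it to carry forward; the second pass with type = "nocb" is what fills it.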
dt_nest
Nesting data can be useful for a number of reasons. The dt_nest() function takes a data.table and ID variables and nests the remaining columns into a list column of data.tables, as shown below.
dt <- data.table(
  x = rnorm(1e5),
  y = runif(1e5),
  grp = sample(1L:5L, 1e5, replace = TRUE),
  nested1 = lapply(1:10, sample, 10, replace = TRUE),
  nested2 = lapply(c("thing1", "thing2"), sample, 10, replace = TRUE),
  id = 1:1e5
)

nested <- dt_nest(dt, grp)
nested

Key: <grp>
     grp                  data
   <int>                <list>
1:     1 <data.table[19947x5]>
2:     2 <data.table[19981x5]>
3:     3 <data.table[20083x5]>
4:     4 <data.table[19929x5]>
5:     5 <data.table[20060x5]>
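A similar nested structure can be built in plain data.table by wrapping .SD in a list for each group. This is a sketch of the underlying idiom on a small made-up table (small and its columns are hypothetical), not a claim about the exact code inside dt_nest().

```r
library(data.table)

# Made-up table for illustration
small <- data.table(
  grp = c(1L, 1L, 2L, 2L, 2L),
  x   = 1:5,
  y   = letters[1:5]
)

# One row per group; `data` is a list column holding the remaining columns
nested_small <- small[, .(data = list(.SD)), keyby = grp]
nested_small

# Going back: pull each nested table out again, row-bound by group
flat <- nested_small[, data[[1]], by = grp]
flat
```

The list(.SD) call is exactly the sort of syntax that is easy to forget, which is where a function with a descriptive name helps.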
dt_pivot_longer and dt_pivot_wider
The last example for this brief post is pivoting. In my opinion, the pivot syntax is easy to remember and use, so it is nice to have that syntax with the performance of melt() and dcast(). Although the syntax doesn't have the full functionality of tidyr's pivot functions, it can do most things you need when reshaping data.
billboard <- tidyr::billboard

longer <- billboard %>%
  dt_pivot_longer(
    cols = c(-artist, -track, -date.entered),
    names_to = "week",
    values_to = "rank"
  )
Warning in melt.data.table(data = dt_, id.vars = id_vars, measure.vars = cols, : 'measure.vars' [wk1, wk2, wk3, wk4, ...] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'double'. All measure variables not of type 'double' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.
longer
                 artist                   track date.entered   week  rank
                 <char>                  <char>       <Date> <char> <num>
    1:            2 Pac Baby Don't Cry (Keep...   2000-02-26    wk1    87
    2:          2Ge+her The Hardest Part Of ...   2000-09-02    wk1    91
    3:     3 Doors Down              Kryptonite   2000-04-08    wk1    81
    4:     3 Doors Down                   Loser   2000-10-21    wk1    76
    5:         504 Boyz           Wobble Wobble   2000-04-15    wk1    57
   ---
24088:      Yankee Grey    Another Nine Minutes   2000-04-29   wk76    NA
24089: Yearwood, Trisha         Real Live Woman   2000-04-01   wk76    NA
24090:  Ying Yang Twins Whistle While You Tw...   2000-03-18   wk76    NA
24091:    Zombie Nation           Kernkraft 400   2000-09-02   wk76    NA
24092:  matchbox twenty                    Bent   2000-04-29   wk76    NA
We can also take that long data set and turn it wide again.
wider <- longer %>%
  dt_pivot_wider(
    names_from = week,
    values_from = rank
  )

wider[, .(artist, track, wk1, wk2)]

               artist                   track   wk1   wk2
               <char>                  <char> <num> <num>
  1:            2 Pac Baby Don't Cry (Keep...    87    82
  2:          2Ge+her The Hardest Part Of ...    91    87
  3:     3 Doors Down              Kryptonite    81    70
  4:     3 Doors Down                   Loser    76    76
  5:         504 Boyz           Wobble Wobble    57    34
 ---
313:      Yankee Grey    Another Nine Minutes    86    83
314: Yearwood, Trisha         Real Live Woman    85    83
315:  Ying Yang Twins Whistle While You Tw...    95    94
316:    Zombie Nation           Kernkraft 400    99    99
317:  matchbox twenty                    Bent    60    37
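For readers curious about the melt() and dcast() calls that the pivot functions stand in for, here is a rough sketch on a tiny made-up table (chart and its values are hypothetical, not taken from billboard). It shows the general shape of the data.table syntax rather than the exact calls tidyfast makes.

```r
library(data.table)

# Made-up wide table: one row per id, one column per week
chart <- data.table(
  id  = 1:2,
  wk1 = c(10, 20),
  wk2 = c(11, NA)
)

# Wide to long: roughly what a pivot_longer-style call expresses
chart_long <- melt(
  chart,
  id.vars       = "id",
  variable.name = "week",
  value.name    = "rank"
)

# Long to wide again: roughly what a pivot_wider-style call expresses
chart_wide <- dcast(chart_long, id ~ week, value.var = "rank")

chart_long
chart_wide
```

The formula interface of dcast() and the id.vars/measure.vars split in melt() are powerful, but the names_to/values_to vocabulary is arguably easier to recall.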