The newest version of data.table has hit CRAN, and there are lots of great new features.
Among them are a %notin% function and a new let function that can be used instead of := (I wasn’t too fussed about this originally, but I have tried it a few times today and may well adopt it, although I do like that := really stands out in my code when assigning or updating variables).
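As a rough sketch of those two additions, here they are on a made-up toy table (the data and column names are purely illustrative, and this assumes data.table 1.15.0 or later):

```r
library(data.table)  # assumes data.table >= 1.15.0

# toy data, invented for illustration
dt <- data.table(id = 1:5, grp = c("a", "b", "c", "a", "b"))

# %notin% is the negation of %in%: keep rows whose grp is not "a"
not_a <- dt[grp %notin% "a"]

# let() adds or updates columns by reference, just like the functional form of `:=`
dt[, let(double_id = id * 2L, is_c = grp == "c")]

not_a
dt
```

Both new columns are added to dt in place, so there is no need to reassign the result.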
The big feature is the new programming interface. I have blogged about programming on data.table before, but things have moved on.
In my packages currently, I use get to retrieve variable names (that, and a rather tortuous method of grabbing the original names, setting them to something else, then switching them back at the end). I no longer need to do this, which is particularly handy, as my {spccharter} package has been sitting dormant while awaiting this new programming approach.
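To make the difference concrete, here is a minimal sketch of the two styles side by side. The function names total_get and total_env and the toy data are my own inventions for illustration, not code from any package:

```r
library(data.table)  # env needs data.table >= 1.15.0

# toy data, invented for illustration
dt <- data.table(grp = c("a", "a", "b"), mass = c(10, 20, 5))

# older idiom: look the column up by its name at run time with get()
total_get <- function(.DT, col) {
  .DT[, .(total = sum(get(col))), by = grp]
}

# env idiom: the string in `col` is substituted into j as a column name
total_env <- function(.DT, col) {
  .DT[, .(total = sum(col)), by = grp, env = list(col = col)]
}

res_get <- total_get(dt, "mass")
res_env <- total_env(dt, "mass")
res_env
```

Both versions should return the same totals; the env version simply rewrites the expression before data.table evaluates it.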
A few examples of making it work – first of all, a handy descending sort function, as I find myself doing this a lot.
library(data.table)
library(dplyr) # we'll mimic some of the dplyr examples later on
library(palmerpenguins) # about time I used this, I suppose
I’ll be honest, I was a bit lost with how to approach this, but Jan Gorecki saw my post and gave me a nudge in the right direction (not for the first time – thanks Jan!)
For posterity, here is my first attempt, which worked when supplying a quoted variable, but not an unquoted one.
sorted2 <- function(.DT = DT, x) {
  res <- .DT[, .N, .(V1 = x), env = list(x = x)][order(-N)]
  setnames(res, "V1", x)
  res
}
All I needed was to wrap x in substitute() inside the env argument, giving env = list(x = substitute(x)) at the end of the first line of my function.
This is how it should have been:
descending_sort <- function(.DT, x) {
  .DT[, .N, x, env = list(x = substitute(x))][order(-N)]
}
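For anyone wondering what substitute() is doing there, here is a small base R sketch. The helper name capture is made up for illustration:

```r
# capture() is a hypothetical helper, not part of data.table
capture <- function(x) substitute(x)

unquoted <- capture(species)    # the symbol `species`, never evaluated
quoted   <- capture("species")  # just the string "species"

is.name(unquoted)
is.character(quoted)
```

As far as I understand it, env treats character strings as column names by default, which would explain why the quoted form worked in my first attempt while the unquoted form needed substitute().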
Let’s test it, and, for the first time on this blog, I’ll use PalmerPenguins for an example
pingu <- setDT(copy(palmerpenguins::penguins)) # I ain't typing "penguins" over and over
names(pingu) # have avoided this dataset for years so need a reminder of what's actually in it
descending_sort(pingu, species)
#      species     N
#       <fctr> <int>
#1:     Adelie   152
#2:     Gentoo   124
#3:  Chinstrap    68
That works for one variable. Here is a function that groups by any number of variables and sorts the counts in descending order:
descending_group_sort <- function(.DT, ...) {
  vars <- eval(substitute(alist(...)), envir = parent.frame())
  .DT[, .N, by = vars, env = list(vars = substitute(vars))][order(-N)]
}

descending_group_sort(pingu, flipper_length_mm, body_mass_g)
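To see what the dots capture hands to env, here is a base R sketch with a made-up helper, capture_dots. It leaves out the envir = parent.frame() argument, so it is a simplification rather than a copy of the function above:

```r
# capture_dots() is a hypothetical helper for illustration only
capture_dots <- function(...) {
  # substitute() expands `...` into the unevaluated arguments,
  # and alist() returns them as a list without evaluating them
  eval(substitute(alist(...)))
}

vars <- capture_dots(flipper_length_mm, body_mass_g)

length(vars)
vapply(vars, is.name, logical(1))
vapply(vars, as.character, character(1))
```

The result is a plain list of column symbols, which env then splices into the by argument.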
(You’ll have to trust me, I’m not pasting 306 rows into this post)
Now let’s nick some examples from dplyr, and mimic them with our new data.table functionality:
## dplyr examples
var_summary <- function(data, var) {
  data %>%
    summarise(n = n(), min = min({{ var }}), max = max({{ var }}))
}

mtcars %>%
  group_by(cyl) %>%
  var_summary(mpg)
# A tibble: 3 × 4
#    cyl     n   min   max
#  <dbl> <int> <dbl> <dbl>
#1     4    11  21.4  33.9
#2     6     7  17.8  21.4
#3     8    14  10.4  19.2
And here’s the same with the new release of data.table:
mtc <- setDT(copy(mtcars)) # copy and turn it into a data.table
setorder(mtc, cyl) # ensure the order matches the dplyr results

var_summary_dt <- function(data, var, grp) {
  data[, .(n = .N, min = min(var), max = max(var)),
       .(grp),
       env = list(var = substitute(var), grp = substitute(grp))]
}

var_summary_dt(mtc, mpg, cyl)
#     cyl     n   min   max
#   <num> <int> <num> <num>
#1:     4    11  21.4  33.9
#2:     6     7  17.8  21.4
#3:     8    14  10.4  19.2
Looks good to me!
Again, all we had to do was add in the calls to substitute in the env argument.
Here are some further dplyr examples – a summary function for one or more variables from the starwars dataset:
my_summarise <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    summarise(mass = mean(mass, na.rm = TRUE),
              height = mean(height, na.rm = TRUE))
}

starwars %>% my_summarise(homeworld) # too many rows to print here
starwars %>% my_summarise(sex, gender)
# A tibble: 6 × 4
# Groups: sex [5]
#  sex            gender      mass height
#  <chr>          <chr>      <dbl>  <dbl>
#1 female         feminine    54.7   172.
#2 hermaphroditic masculine 1358     175
#3 male           masculine   80.2   179.
#4 none           feminine   NaN      96
#5 none           masculine   69.8   140
#6 NA             NA          81     175
And the data.table equivalent
starwars_dt <- setDT(copy(starwars))

my_summarise_dt <- function(.dt, ...) {
  vars <- eval(substitute(alist(...)), envir = parent.frame())
  .dt[, lapply(.SD, mean, na.rm = TRUE),
      .SDcols = c("mass", "height"),
      by = vars,
      env = list(vars = substitute(vars))][]
}
Let’s try it:
my_summarise_dt(.dt = starwars_dt, homeworld) # too many rows to print
my_summarise_dt(.dt = starwars_dt, sex, gender) # same as dplyr
#              sex    gender       mass   height
#           <char>    <char>      <num>    <num>
#1:           male masculine   80.21905 179.1228
#2:           none masculine   69.75000 140.0000
#3:         female  feminine   54.68889 171.5714
#4: hermaphroditic masculine 1358.00000 175.0000
#5:           <NA>      <NA>   81.00000 175.0000
#6:           none  feminine        NaN  96.0000
Disclaimer
I’m not sure about the use of eval(substitute(alist(...)), envir = parent.frame()).
It works, but it may not be what the data.table devs intended (feel free to put me right), and there may be unintended consequences that have not yet smacked me in the face. That said, data.table’s error handling is so freakishly accurate (like, in the room watching over your shoulder) that any issues should be easy enough to solve.
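If the dots capture ever does misbehave, one fallback that avoids it entirely is to pass the grouping columns as a character vector straight to by, which data.table has supported for a long time. This is only a sketch: my_summarise_chr and the toy data are invented for illustration, and it gives up the unquoted column names.

```r
library(data.table)

# toy data, invented for illustration
dt <- data.table(
  sex    = c("f", "f", "m", "m"),
  gender = c("x", "x", "y", "y"),
  mass   = c(50, 60, 80, NA),
  height = c(160, 170, 180, 190)
)

# my_summarise_chr() is a hypothetical variant taking quoted column names;
# note it would clash if the table itself had a column called "vars"
my_summarise_chr <- function(.dt, vars) {
  .dt[, lapply(.SD, mean, na.rm = TRUE),
      by = vars,
      .SDcols = c("mass", "height")]
}

res <- my_summarise_chr(dt, c("sex", "gender"))
res
```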
There’s a lot more to this new approach, but I usually only need to work with column names, rather than creating super flexible functions, so this little bit of knowledge is more than enough to keep me going for now.
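As a small taste of the "lot more", env can substitute function names as well as column names. A minimal sketch, assuming both are supplied as strings (summarise_with and the toy data are made up for illustration):

```r
library(data.table)  # env needs data.table >= 1.15.0

# toy data, invented for illustration
dt <- data.table(mass = c(10, 20, 5))

# summarise_with() is a hypothetical helper: `col` and `fun` are strings,
# and env swaps them into j as a column name and a function name
summarise_with <- function(.DT, col, fun) {
  .DT[, .(value = fun(col)), env = list(col = col, fun = fun)]
}

res <- summarise_with(dt, "mass", "max")
res
```

Here j is rewritten to .(value = max(mass)) before it is evaluated.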
I will update as I figure more things out.
In the meantime, you should check out the new version of data.table, and find your favourite new feature.