new programming with data.table

[This article was first published on Data By John, and kindly contributed to R-bloggers.]

The newest version of data.table has hit CRAN, and there are lots of great new features.

Among them are a %notin% operator and a new let function that can be used instead of :=. I wasn’t too fussed about let originally, but I have tried it a few times today and may well adopt it, although I do like that := really stands out in my code when assigning or updating variables.
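
As a quick illustration of those two additions, here is a minimal sketch. It assumes an installed data.table recent enough to include both %notin% and let; the toy table and its column names are made up for the example.

```r
library(data.table)

# Toy data, invented purely for illustration
dt <- data.table(id = 1:5, grp = c("a", "b", "c", "a", "b"))

# %notin% is the negation of %in%: keep rows whose grp is not "a"
kept <- dt[grp %notin% "a"]

# let() used in place of the functional form of `:=`;
# both columns are added to dt by reference
dt[, let(double_id = id * 2L, not_c = grp %notin% "c")]

print(kept)
print(dt)
```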

The big feature is the new programming interface. I have blogged about programming on data.table before, but things have moved on.

In my packages currently, I use get to retrieve variable names (that, and a rather tortuous method of grabbing the original names, setting them to something else, then switching them back at the end). I no longer need to do this, which is particularly handy, as my {spccharter} package has been sitting dormant while awaiting this new programming approach.
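
To make the contrast concrete, here is a small sketch of the old get approach next to the env argument. The helpers mean_old and mean_new are hypothetical names for this example, not code from my packages, and it assumes a data.table version that supports env.

```r
library(data.table)

dt <- as.data.table(mtcars)

# Older style: look the column up by name with get()
mean_old <- function(DT, col) {
  DT[, mean(get(col))]
}

# New interface: a character string supplied through env is
# substituted into the expression as a column symbol
mean_new <- function(DT, col) {
  DT[, mean(col), env = list(col = col)]
}

mean_old(dt, "mpg")
mean_new(dt, "mpg")
```

Both calls should return the same value; the second simply lets data.table do the substitution rather than relying on get.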

A few examples of making it work – first of all, a handy descending sort function, as I find myself doing this a lot.

library(data.table) 
library(dplyr) # we'll mimic some of the dplyr examples later on
library(palmerpenguins) # about time I used this, I suppose

I’ll be honest, I was a bit lost with how to approach this, but Jan Gorecki saw my post and gave me a nudge in the right direction (not for the first time – thanks Jan!)

For posterity, here is my first attempt, which worked when supplying a quoted variable, but not an unquoted one.

sorted2 <- function(.DT = DT, x) {
  res <- .DT[,.N, .(V1 = x),
             env = list(x = x)][order(-N)]
  setnames(res, "V1", x)
  res
}

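The env argument was already there. All I needed was to wrap x in substitute() inside it, so that it reads env = list(x = substitute(x)). That captures the unquoted column name instead of trying to evaluate it.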

This is how it should have been:

descending_sort <- function(.DT, x) {
  .DT[, .N, x, env = list(x = substitute(x))][order(-N)]
}
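
For anyone wondering what substitute is doing here, this is a minimal sketch using mtcars so it runs without extra packages. count_desc is a hypothetical stand-in for the function above, and show_arg exists only to display what gets captured. As far as I can tell, env turns character strings into symbols too, so the quoted form should also work.

```r
library(data.table)

# substitute() returns the expression the caller typed, unevaluated
show_arg <- function(x) substitute(x)
class(show_arg(cyl))    # a symbol ("name")
class(show_arg("cyl"))  # a character string

# Hypothetical stand-in for the descending sort function
count_desc <- function(.DT, x) {
  .DT[, .N, by = x, env = list(x = substitute(x))][order(-N)]
}

dt <- as.data.table(mtcars)
unquoted <- count_desc(dt, cyl)    # unquoted column name
quoted   <- count_desc(dt, "cyl")  # quoted column name
print(unquoted)
```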

Let’s test it and, for the first time on this blog, I’ll use the palmerpenguins data as an example.

pingu <- setDT(copy(palmerpenguins::penguins))  # I ain't typing "penguins" over and over
names(pingu) # have avoided this dataset for years so need a reminder of what's actually in it
descending_sort(pingu, species)

#     species     N
#      <fctr> <int>
#1:    Adelie   152
#2:    Gentoo   124
#3: Chinstrap    68

That works for one variable. Here is a function that groups by any number of variables and sorts the counts in descending order:

descending_group_sort <- function(.DT, ...) {

  vars <-  eval(substitute(alist(...)),
                envir = parent.frame())
  .DT[,
      .N,
      by = vars,
      env = list(vars = substitute(vars))
      ][order(-N)]
}

descending_group_sort(pingu, flipper_length_mm, body_mass_g)

(You’ll have to trust me, I’m not pasting 306 rows into this post)

Now let’s nick some examples from dplyr, and mimic them with our new data.table functionality:

## dplyr examples

var_summary <- function(data, var) {
  data %>%
    summarise(n = n(),
              min = min({{ var }}),  # embrace the argument so dplyr sees the column
              max = max({{ var }}))
}

mtcars %>%
  group_by(cyl) %>%
  var_summary(mpg)

# A tibble: 3 × 4
#    cyl     n   min   max
#  <dbl> <int> <dbl> <dbl>
#1     4    11  21.4  33.9
#2     6     7  17.8  21.4
#3     8    14  10.4  19.2

And here’s the same with the new release of data.table:

mtc <- setDT(copy(mtcars)) # copy and turn it into a data.table
setorder(mtc, cyl) # ensure the order matches the dplyr results

var_summary_dt <- function(data, var, grp) {
  data[, .(n = .N,
           min = min(var),
           max = max(var)),
       .(grp),
       env = list(var = substitute(var),
                  grp = substitute(grp))]
}

var_summary_dt(mtc, mpg, cyl)

#    cyl     n   min   max
#   <num> <int> <num> <num>
#1:     4    11  21.4  33.9
#2:     6     7  17.8  21.4
#3:     8    14  10.4  19.2

Looks good to me!
Again, all we had to do was wrap each argument in substitute() inside env.
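
The same pattern can also be driven entirely by character strings, including the name of the summary function, because env converts strings into symbols. A minimal sketch, where summarise_with and its arguments are hypothetical names made up for this example:

```r
library(data.table)

dt <- as.data.table(mtcars)

# var, grp and fun are all character strings; env substitutes each
# one into the expression as a symbol before it is evaluated
summarise_with <- function(DT, var, grp, fun) {
  DT[, .(value = fun(var)),
     by = grp,
     env = list(var = var, grp = grp, fun = fun)]
}

res <- summarise_with(dt, "mpg", "cyl", "max")
print(res)
```

This is handy when column names arrive as strings from elsewhere, such as function arguments in a package.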

Here are some further dplyr examples - a summary function for one or more variables from the starwars dataset:

my_summarise <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    summarise(mass = mean(mass, na.rm = TRUE),
              height = mean(height, na.rm = TRUE))
}

starwars %>% my_summarise(homeworld) # too many rows to print here
starwars %>% my_summarise(sex, gender)

# A tibble: 6 × 4
# Groups:   sex [5]
# sex            gender      mass height
# <chr>          <chr>      <dbl>  <dbl>
#1 female         feminine    54.7   172.
#2 hermaphroditic masculine 1358     175 
#3 male           masculine   80.2   179.
#4 none           feminine   NaN      96 
#5 none           masculine   69.8   140 
#6 NA             NA          81     175 

And the data.table equivalent

starwars_dt <- setDT(copy(starwars))

my_summarise_dt <- function(.dt, ...) {

  vars <-  eval(substitute(alist(...)),
                envir = parent.frame())

  .dt[, lapply(.SD, mean, na.rm = TRUE),
      .SDcols = c("mass", "height"),
      by = vars,
      env = list(vars = substitute(vars))][]
}

Let’s try it:

my_summarise_dt(.dt = starwars_dt, homeworld) # too many rows to print
my_summarise_dt(.dt = starwars_dt, sex, gender) # same as dplyr
#              sex    gender       mass   height
#           <char>    <char>      <num>    <num>
#1:           male masculine   80.21905 179.1228
#2:           none masculine   69.75000 140.0000
#3:         female  feminine   54.68889 171.5714
#4: hermaphroditic masculine 1358.00000 175.0000
#5:           <NA>      <NA>   81.00000 175.0000
#6:           none  feminine        NaN  96.0000

Disclaimer

I’m not sure about the use of eval(substitute(alist(...)), envir = parent.frame()).

It works, but it may not be what the data.table devs intended (feel free to put me right), and there may be some unintended consequences that have not yet smacked me in the face. That said, data.table’s error handling is so freakishly accurate (like, in the room watching over your shoulder) that any issues should be easy enough to solve.
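
One possible alternative, offered as a sketch rather than as the recommended way: capture the dots with substitute(list(...)) and drop the leading list symbol, which avoids the eval call altogether. count_by is a hypothetical name. It relies on the same behaviour as the functions above, namely that a list passed through env becomes a list(...) call.

```r
library(data.table)

count_by <- function(.DT, ...) {
  # substitute(list(...)) gives the call list(a, b, ...);
  # dropping the first element leaves a plain list of column symbols
  vars <- as.list(substitute(list(...)))[-1L]
  .DT[, .N, by = vars, env = list(vars = vars)][order(-N)]
}

dt <- as.data.table(mtcars)
res <- count_by(dt, cyl, gear)
print(res)
```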

There’s a lot more to this new approach, but I usually only need to work with column names, rather than creating super flexible functions, so this little bit of knowledge is more than enough to keep me going for now.
I will update as I figure more things out. In the meantime, you should check out the new version of data.table, and find your favourite new feature.
