Site icon R-bloggers

A quirk when using data.table?

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I recently came across this quirk in using data.table that I don’t really have a clean solution for. I outline the issue below as well as my current way around it. Appreciate any better solutions!

The problem surfaces quite generally, but I’ll illustrate it by trying to achieve the following task: write a function that takes a data table and a column name, and returns the data table with the data in that column scrambled.

The function below was my first attempt:

library(data.table)

scramble_col <- function(input_dt, colname) {
  input_dt[[colname]] <- sample(input_dt[[colname]])
  input_dt
}

The code snippet below shows that it seems to work:

input_dt <- data.table(x = 1:5)
set.seed(1)
input_dt <- scramble_col(input_dt, "x")
input_dt
#    x
# 1: 1
# 2: 4
# 3: 3
# 4: 5
# 5: 2

However, when I tried to add a new column that is a replica of the x column, I get a strange warning!

input_dt[, y := x]  # gives warning

There are few webpages out there that try to explain what’s going on with this warning, but I haven’t had time to fully digest what is going on. My high-level takeaway is that the assignment in the line input_dt[[colname]] <- sample(input_dt[[colname]]) is problematic.

This was my second attempt:

scramble_col <- function(input_dt, colname) {
  input_dt[, c(colname) := sample(get(colname))]
}

This version works well: it doesn’t throw the warning when I added a second column.

input_dt <- data.table(x = 1:5)
set.seed(1)
input_dt <- scramble_col(input_dt, "x")
input_dt
#    x
# 1: 1
# 2: 4
# 3: 3
# 4: 5
# 5: 2
input_dt[, y := x]
input_dt
#    x y
# 1: 1 1
# 2: 4 4
# 3: 3 3
# 4: 5 5
# 5: 2 2

However, the function does not work for a particular input: when the column name is colname! When I run the following code, I get an error message.

input_dt <- data.table(colname = 1:5)
set.seed(1)
input_dt <- scramble_col(input_dt, "colname") # error

The function below was my workaround and I think it works for all inputs, but it seems a bit inelegant:

scramble_col <- function(input_dt, colname) {
  new_col <- sample(input_dt[[colname]])
  input_dt[, c(colname) := new_col]
}

Would love to hear if anyone has a better solution for this task, or if you have some insight into what is going on with the warning/error above.

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.