Site icon R-bloggers

Data.Table by Example – Part 2

[This article was first published on R – Mathew Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In part one, I provided an initial walk through of some nice features that are available within the data.table package. In particular, we saw how to filter data and get a count of rows by the date.

dat = fread("rows.csv")
names(dat) <- gsub(" ", "_", names(dat))
dat[1:3]

Let us now add a few columns to our dataset on reported crimes in the city of Chicago. There are many ways to do do this but they involve the use of the := operator. Since data.table updates values by reference, we do not need to save the results as another variable. This is a very desirable feature.

dat[, c("value1", "value2", "value3") := sample(1:50, nrow(dat), replace=TRUE)]

dat[, `:=`(value1 = sample(1:50, nrow(dat), replace=TRUE),
           value2 = sample(1:50, nrow(dat), replace=TRUE),
           value3 = sample(1:50, nrow(dat), replace=TRUE))]

You can also just use the traditional base R solution for adding new columns as data.tables are also data frames.

dat$value1 <- sample(1:50, nrow(dat), replace=TRUE)  
dat$value2 <- sample(1:50, nrow(dat), replace=TRUE)  
dat$value3 <- sample(1:50, nrow(dat), replace=TRUE)  

In any case, we now have three new columns with randomly selected values between 1 and 50. We can now look to summarize these values and see how they differ across the primary arrest type and other categorical variables.

dat[, .(mean = mean(value1, na.rm = TRUE),
        median = median(value1, na.rm = TRUE),
        min = min(value1, na.rm = TRUE),
        max = max(value1, na.rm = TRUE))]

dat[Primary_Type=="PROSTITUTION", 
                  .(mean = mean(value1, na.rm = TRUE),
                    median = median(value1, na.rm = TRUE),
                    min = min(value1, na.rm = TRUE),
                    max = max(value1, na.rm = TRUE))]

The above code allows us to get the mean, median, min, and max values from the value1 column for all rows and for when the primary type is prostitution. To get these summaries across multiple columns, we can use lapply, .SD, and .SDcols.

dat[, lapply(.SD, mean, na.rm=TRUE), .SDcols=c("value1", "value2", "value3")] 

dat[, lapply(.SD, mean, na.rm=TRUE), by=Primary_Type, .SDcols=c("value1", "value2", "value3")][1:4] 

But wait, that only provides me with the mean of a single value column. If you want to apply multiple functions to columns, the user is required to write a function that can then be used within lapply.

my.summary <- function(x){
   c(mean = mean(x), 
     median = median(x),
     min = min(x),
     max = max(x))
}

dat[, sapply(.SD, my.summary), .SDcols=c("value1", "value2", "value3")]

Perfect! As you can see, the syntax is concise and is very easy to work with.

In the next part, I’ll cover a few advanced features like get() and {}.


To leave a comment for the author, please follow the link and comment on their blog: R – Mathew Analytics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.