Data table exercises: keys and subsetting

Han de Vries

6 years ago

[This article was first published on R-exercises, and kindly contributed to R-bloggers.]

The data.table package is a popular R package for fast selections, aggregations and joins on large data sets. It is well documented through several vignettes, and even has its own interactive course, offered by DataCamp. For those who want to build up some mileage with data.table, there’s good news: in the coming weeks, we’ll dive into the package with several exercise sets. We’ll start with the first set today, focusing on creating data.tables, defining keys and subsetting. Before proceeding, make sure you have installed the data.table package from CRAN and studied the vignettes.

Answers to the exercises are available here. For the other (upcoming) exercise sets on data.table, check back next week here. If there are particular topics or problems related to data.table that you’d like to see included in subsequent exercise sets, please post them as a comment below.

Exercise 1
Setup: Read the wine quality dataset from the UCI repository as a data.table (available for download from: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv) into an object named df. To demonstrate the speed of data.table, we’re going to make this dataset much bigger, with:
df
Check that the resulting data.table has 4.8 million rows and 12 variables.
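
The statement that enlarges df is cut off in this copy of the post. As a minimal sketch of the setup, assuming the enlargement simply repeats every row 1,000 times (which would be consistent with the 4,898 rows of the white-wine file and the row count mentioned above), the same steps can be tried on a tiny inline stand-in for the CSV. The objects csv_text, small and big are hypothetical names used only for this illustration.

```r
library(data.table)

# Tiny stand-in for the wine file: semicolon-separated, header names with spaces.
csv_text <- "fixed acidity;pH;quality
7.0;3.00;6
6.3;3.30;6
8.1;3.26;7"

# fread() returns a data.table; the separator is given explicitly here.
small <- fread(text = csv_text, sep = ";")

# Assumed enlargement step: repeat every row 1000 times.
big <- small[rep(seq_len(nrow(small)), times = 1000)]

dim(big)             # number of rows and columns
is.data.table(big)   # confirms the class
```

For the real exercise, the URL above can be passed to fread() directly instead of the text argument.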

Exercise 2
Check if df contains any keys. If no keys are present, create a key for the quality variable. Confirm that the key has been set.
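
The functions involved here can be tried on a small table first. The sketch below uses a hypothetical five-row table dt rather than the wine data; haskey(), key() and setkey() are the data.table functions for inspecting and setting keys.

```r
library(data.table)

# Hypothetical toy table standing in for df.
dt <- data.table(quality = c(6L, 9L, 7L, 9L, 5L),
                 pH      = c(3.00, 3.30, 3.26, 3.10, 3.40))

haskey(dt)   # is any key set?
key(dt)      # name(s) of the key column(s), or NULL

# setkey() sorts the table by quality by reference and marks it as the key.
setkey(dt, quality)

haskey(dt)
key(dt)
```

Note that setkey() physically reorders the rows, so the table is sorted by the key column afterwards.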

Exercise 3
Create a new data.table df2, containing the subset of df with quality equal to 9.
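
As a sketch of what keyed subsetting looks like, assuming a toy table dt with an integer quality column (the id column is added only to make the selected rows visible): once a key is set, a value passed inside .() in the first argument is looked up on the key with binary search.

```r
library(data.table)

# Hypothetical toy table; id records the original row order.
dt <- data.table(quality = c(6L, 9L, 7L, 9L, 5L), id = 1:5)
setkey(dt, quality)

# Keyed subset: rows whose key column equals 9.
sub9 <- dt[.(9L)]

# The same rows selected with an ordinary logical condition.
same <- dt[quality == 9L]

sub9
```

The value is written as 9L so that its type matches the integer key column.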

Exercise 4
Remove the key from df, and repeat exercise 3. How much slower is this?
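
A minimal sketch of removing a key and timing both variants, using a hypothetical simulated table with one million rows rather than the wine data. setkey(dt, NULL) drops the key and system.time() records the run-time; the timings themselves depend on the machine and are not shown.

```r
library(data.table)

set.seed(1)
# Hypothetical simulated data: one integer column with values 3 to 9.
dt <- data.table(quality = sample(3:9, 1e6, replace = TRUE))

setkey(dt, quality)
t_keyed <- system.time(a <- dt[.(9L)])          # binary search on the key

setkey(dt, NULL)                                # remove the key again
haskey(dt)

t_unkeyed <- system.time(b <- dt[quality == 9L])  # subset without a key

rbind(keyed = t_keyed, unkeyed = t_unkeyed)
```

Recent versions of data.table may build a secondary index automatically for == and %in% subsets, so the measured gap can be smaller than on the version the exercise was written for.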

Exercise 5
Create a new data.table df2, containing the subset of df with quality equal to 7, 8 or 9. Do this first without setting a key, then with a key, and compare the run-times.
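
Selecting several key values works the same way as selecting one. The sketch below, on a hypothetical toy table, shows %in% for the unkeyed case and a vector inside .() for the keyed case; nomatch = 0L drops key values that do not occur in the table.

```r
library(data.table)

# Hypothetical toy table; id records the original row order.
dt <- data.table(quality = c(6L, 9L, 7L, 8L, 5L, 7L), id = 1:6)

# Without a key: logical condition with %in%.
no_key <- dt[quality %in% 7:9]

# With a key: pass all wanted values at once.
setkey(dt, quality)
with_key <- dt[.(7:9), nomatch = 0L]

with_key
```

Wrapping each call in system.time() on the full df gives the run-time comparison asked for.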

Exercise 6
Create a new data.table df3 containing the subset of observations from df with: fixed acidity < 8 and residual sugar < 5 and pH < 3. Do this first without setting keys, then with keys, and compare the run-times. Explain why the differences are small.
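
Two of these variable names contain a space, so they need backticks when used in an expression. A minimal sketch of the syntax, assuming the columns keep their header names from the CSV and using a hypothetical four-row table:

```r
library(data.table)

# Hypothetical toy table with the same column names as the wine data.
dt <- data.table(`fixed acidity`  = c(7.0, 8.5, 6.3, 7.9),
                 `residual sugar` = c(1.6, 2.0, 6.9, 4.9),
                 pH               = c(2.90, 2.80, 3.30, 2.95))

# Backticks allow non-syntactic column names inside the subset expression.
sub <- dt[`fixed acidity` < 8 & `residual sugar` < 5 & pH < 3]

sub
```

A multi-column key for the timing comparison could be set with setkeyv(dt, c("fixed acidity", "residual sugar", "pH")), which takes the column names as a character vector.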

Exercise 7
Take a bootstrap sample (i.e., with replacement) of the full df data.table without keys, and record the run-time. Then, convert df to a regular data frame, and repeat. What is the difference in speed? Is there any (speed) benefit in creating a new variable id equal to the row number, creating a key for this variable, and using this key to select the bootstrap sample?
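
The three variants can be sketched on a hypothetical simulated table (100,000 rows, columns id and x) before running them on the full df. Inside a data.table subset, .N is the number of rows, so sample(.N, .N, replace = TRUE) draws a bootstrap sample of row numbers. Timings depend on the machine and are not shown.

```r
library(data.table)

set.seed(42)
# Hypothetical simulated data; id equals the row number.
dt <- data.table(id = 1:1e5, x = rnorm(1e5))

# 1. Bootstrap directly on the data.table, no key.
t_dt <- system.time(boot_dt <- dt[sample(.N, .N, replace = TRUE)])

# 2. Same on a regular data frame.
dfr <- as.data.frame(dt)
t_df <- system.time(
  boot_df <- dfr[sample(nrow(dfr), nrow(dfr), replace = TRUE), ]
)

# 3. Key on id, then look the sampled ids up on the key.
setkey(dt, id)
idx <- sample(nrow(dt), nrow(dt), replace = TRUE)
t_key <- system.time(boot_key <- dt[.(idx)])

rbind(dt = t_dt, df = t_df, keyed = t_key)
```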
