Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Answers to the exercises are available here. For the other (upcoming) exercise sets on data.table, check back next week here. If there are any particular topics or problems related to data.table that you'd like to see included in subsequent exercise sets, please post them in a comment below.
Exercise 1
Setup: Read the wine quality dataset from the UCI repository (available for download from: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv) as a data.table into an object named df. To demonstrate the speed of data.table, we're going to make this dataset much bigger, with:
df
Check that the resulting data.table has 4.8 mln. rows and 12 variables.
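The enlargement code is not reproduced in full above, so what follows is only a sketch of one way to blow up a data.table: repeat every row a fixed number of times. The helper name make_big and the factor of 1000 are assumptions, not taken from the original post (the white-wine file has 4,898 rows, so a factor of 1000 lands close to the row count mentioned above).

```r
library(data.table)

# Hypothetical helper: stack `times` copies of all rows of `dt`.
make_big <- function(dt, times = 1000L) {
  dt[rep(seq_len(nrow(dt)), times)]
}

# Intended use (needs network access, so it is left commented out here):
# url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
# df  <- make_big(fread(url))
# dim(df)

# Offline illustration on a tiny stand-in table.
small <- data.table(a = 1:3, b = c("x", "y", "z"))
big   <- make_big(small, times = 4L)
dim(big)
```

fread() normally detects the semicolon separator of this file on its own; if it does not, passing sep = ";" should help.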
Exercise 2
Check whether df has any keys set. If no keys are present, create a key for the quality variable. Confirm that the key has been set.
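As a hedged illustration of the functions involved, here is a minimal sketch on a small made-up table (dt and its values are assumptions, only the column name quality mirrors the exercise):

```r
library(data.table)

# Small stand-in table, not the wine data.
dt <- data.table(quality = c(6L, 9L, 5L, 9L, 7L),
                 pH      = c(3.0, 3.3, 3.1, 2.9, 3.2))

haskey(dt)           # FALSE while no key is set
key(dt)              # NULL while no key is set

setkey(dt, quality)  # sorts dt by quality in place and marks it as the key

haskey(dt)
key(dt)
```

Note that setkey() reorders the rows by reference, so no assignment is needed.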
Exercise 3
Create a new data.table df2, containing the subset of df with quality equal to 9.
Exercise 4
Remove the key from df, and repeat exercise 3. How much slower is this?
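One way to set up such a comparison is sketched below on simulated data; the table dt, its size and the object names are assumptions for illustration. The keyed version uses the .() syntax, which looks the value up by binary search on the key, while the unkeyed version compares every row against 9.

```r
library(data.table)

set.seed(1)
n  <- 1e6
dt <- data.table(quality = sample(3:9, n, replace = TRUE), x = runif(n))

# With a key: binary search on the sorted quality column.
setkey(dt, quality)
t_key <- system.time(sub_key <- dt[.(9L)])

# Without a key: setkey(dt, NULL) removes it, then a plain logical subset.
setkey(dt, NULL)
t_scan <- system.time(sub_scan <- dt[quality == 9L])

c(keyed = t_key[["elapsed"]], scan = t_scan[["elapsed"]])
```

Single runs of system.time() are noisy, so repeating each subset a few times gives a fairer picture.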
Exercise 5
Create a new data.table df2, containing the subset of df with quality equal to 7, 8 or 9. Do this first without setting keys, then with keys set, and compare the run-times.
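The two styles of selecting several values can be sketched as follows, again on simulated data (dt and the object names are assumptions). With a key, a vector of values can be passed inside .(); nomatch = 0L drops values that do not occur in the key instead of returning NA rows.

```r
library(data.table)

set.seed(2)
n  <- 1e6
dt <- data.table(quality = sample(3:9, n, replace = TRUE), x = runif(n))

# Without a key: membership test on the column.
t_scan <- system.time(sub_scan <- dt[quality %in% 7:9])

# With a key: one lookup per requested value.
setkey(dt, quality)
t_key <- system.time(sub_key <- dt[.(7:9), nomatch = 0L])

c(scan = t_scan[["elapsed"]], keyed = t_key[["elapsed"]])
```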
Exercise 6
Create a new data.table df3 containing the subset of observations from df with fixed acidity < 8, residual sugar < 5 and pH < 3. Do this first without setting keys, then with keys set, and compare the run-times. Explain why the differences are small.
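Because two of these column names contain a space, they have to be wrapped in backticks inside the subset expression. A minimal sketch of that syntax on simulated values (the ranges used to generate them are arbitrary assumptions):

```r
library(data.table)

set.seed(3)
n  <- 1e5
dt <- data.table(
  `fixed acidity`  = runif(n, 4, 14),
  `residual sugar` = runif(n, 0, 20),
  pH               = runif(n, 2.7, 3.8)
)

# Backticks let non-syntactic column names appear in the i expression.
sub <- dt[`fixed acidity` < 8 & `residual sugar` < 5 & pH < 3]
nrow(sub)
```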
Exercise 7
Take a bootstrap sample (i.e., with replacement) of the full df data.table without keys, and record the run-time. Then convert df to a regular data frame and repeat. What is the difference in speed? Is there any (speed) benefit in creating a new variable id equal to the row number, creating a key for this variable, and using this key to select the bootstrap sample?
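The three variants can be laid out as in the sketch below, which uses simulated data and the same row indices for each so that only the selection mechanism differs (dt, dfr, idx and the timing object names are assumptions for illustration):

```r
library(data.table)

set.seed(42)
n   <- 1e5
dt  <- data.table(quality = sample(3:9, n, replace = TRUE), x = runif(n))
idx <- sample(n, n, replace = TRUE)   # row numbers drawn with replacement

# 1. data.table without keys: select rows by position.
t_dt <- system.time(boot_dt <- dt[idx])

# 2. Regular data frame: same positional selection.
dfr  <- as.data.frame(dt)
t_df <- system.time(boot_df <- dfr[idx, ])

# 3. Keyed id column: look the sampled ids up through the key.
dt[, id := .I]
setkey(dt, id)
t_key <- system.time(boot_key <- dt[.(idx)])

c(data.table = t_dt[["elapsed"]],
  data.frame = t_df[["elapsed"]],
  keyed      = t_key[["elapsed"]])
```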