Big Data: Wrangling 4.6M Rows with dtplyr (the NEW data.table backend for dplyr)
Wrangling Big Data is one of the best features of the R programming language, which boasts a Big Data Ecosystem containing fast in-memory tools (e.g. data.table) and distributed computational tools (sparklyr). With the NEW dtplyr package, data scientists with dplyr experience gain the benefits of a data.table backend. We saw a 3X speed boost for dplyr!

We'll go over the pros and cons and what you need to know to get up and running, using a real-world example of Fannie Mae loan performance data that, when combined, is 4.6M rows by 55 columns - not super huge, but enough to show off the new and improved dtplyr interface to the data.table package. We'll end with a Time Study showing a 3X Speed Boost and Learning Recommendations to build your expertise fast.
Like this article? Here are more just like it!
If you like this article, we have more just like it in our Machine Learning Section of the Business Science Learning Hub.
1.0 The 30-Second Summary
We reviewed the latest advance in big data - the NEW dtplyr package, which is an interface to the high-performance data.table library.
Pros

- A 3X speed boost on the data joining and wrangling operations on a 4.6M row data set. The data wrangling operations were performed in 6 seconds with dtplyr vs 18 seconds with dplyr.
- Performs inplace operations (:=), which vastly accelerate big data computations (see the grouped time series lead() operation in the Section 4.7 tutorial).
- Shows the data.table translation (this is really cool!)
Cons

- For pure speed, you will need to learn all of data.table's features, including managing keys for fast lookups.
- In most cases, data.table will be faster than dtplyr because of overhead in the dtplyr translation process. However, we saw the difference to be very minimal.
- dtplyr is currently in experimental status - testers wanted, file issues and requests here
What Should You Learn?

Just starting out? Our recommendation is to learn dplyr first, then learn data.table, using dtplyr to bridge the gap.

- Begin with dplyr, which has easy-to-learn syntax and works well for datasets of 1M+ rows.
- Learn data.table as you become comfortable in R. data.table is great for pure speed on data sets of 50M+ rows. It has a different "bracketed" syntax that is streamlined but more complex for beginners. However, it has features like fast keyed subsetting and optimization for rolling joins that are outside the scope of this article.
- Use dtplyr as a translation tool to help bridge the gap between dplyr and data.table.

At a bare minimum - learning dplyr is essential. Learn more about a system for learning dplyr in the Conclusions and Recommendations.
2.0 Big Data Ecosystem
R has an amazing ecosystem of tools designed for wrangling Big Data. The 3 most popular tools are dplyr, data.table, and sparklyr. We've trained hundreds of students on big data, and our students' most common Big Data question is, "Which tool should I use, and when?"

Big Data: Data Wrangling Tools By Dataset Size

Source: Business Science Learning Lab 13: Wrangling 4.6M Rows (375 MB) of Financial Data with data.table

The "Big Data: Data Wrangling Tools by Dataset Size" graphic comes from Business Science's Learning Lab 13: Wrangling 4.6M Rows (375 MB) of Financial Data with data.table, where we taught students how to use data.table using Fannie Mae's financial data set. The graphic provides rough guidelines on when to use which tool by dataset row size.
- dplyr (website) - Used for in-memory calculations. Syntax design and execution emphasize readability over performance. Very good in most situations.
- data.table (website) - Used for higher in-memory performance. Modifies data inplace for huge speed gains. Easily wrangles data in the range of 10M-50M+ rows.
- sparklyr (website) - Distributes work across nodes (clusters) and performs work in parallel. Best used on big data (100M+ rows).
3.0 Enter dtplyr: Boost dplyr with data.table backend
We now have a 4th tool that boosts dplyr by using data.table as its backend. The good news is that if you are already familiar with dplyr, you don't need to learn much to get the gains of data.table!

dtplyr: Bridging the Big Data Gap

The dtplyr package is a new front-end that wraps the high-performance data.table R package. I say new, but dtplyr has actually been around for over 2 years. However, the implementation recently underwent a complete overhaul, vastly improving the functionality. Let's check out the goals of the package from the dtplyr website: https://dtplyr.tidyverse.org/.

dtplyr for Big Data

Here's what you need to know:
- Goal: Increase the speed of working with big data when using dplyr syntax.
- Implementation: The dtplyr package enables the user to write dplyr code. Internally, the package translates the code to data.table syntax. When run, the user gains the faster performance of data.table while still writing the more readable dplyr code (see the small illustration after this list).
- Dev Status: The package is still experimental. This means that developers are still in the process of testing the package out, reporting bugs, and improving it via feature requests.
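To make the translation idea concrete, here's a tiny illustration (a sketch using the built-in iris data set, not the case study data): we wrap a data frame with lazy_dt(), write ordinary dplyr code, and inspect the data.table code that dtplyr generates.

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Wrap a data frame as a lazy data.table step
iris_dt <- lazy_dt(iris)

# Write ordinary dplyr code...
result <- iris_dt %>%
    group_by(Species) %>%
    summarise(mean_petal_length = mean(Petal.Length))

# ...and inspect the data.table translation that dtplyr generates
show_query(result)
#> roughly: `_DT1`[, .(mean_petal_length = mean(Petal.Length)), keyby = .(Species)]

# Nothing is computed until we collect the result
as_tibble(result)
```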
4.0 Case Study – Wrangling 4.6M Rows (375MB) of Financial Data
Let's try out the new and improved dtplyr + data.table combination on a large-ish data set.
4.1 Bad Loans Cost Millions (and Data Sets are MASSIVE)
Loan defaults cost organizations millions. Further, the data sets are massive. This is a task where data.table and dtplyr will be needed as part of the preprocessing steps prior to building a Machine Learning model.
4.2 Fannie Mae Data Set
The data used in the tutorial can be downloaded from Fannie Mae’s website. We will just be using the 2018 Q1 Acquisition and Performance data set.
A few quick points:
- The 2018 Q1 Performance Data Set we will use is 4.6M rows, enough to send Excel to a grinding halt, crashing your computer in the process.
- For dplyr, it's actually do-able at 4.6M rows. However, if we were to process the full 25GB, we'd definitely want to use data.table to speed things up.
- We'll do a series of common data manipulation operations, including joins and a grouped time series calculation, to determine which loans become delinquent in the next 3 months.
4.3 Install and Load Libraries
In this tutorial, we'll use the latest development version of dtplyr, installed using devtools. All other packages used can be installed with install.packages().

Next, we'll load the following libraries with library() (a setup sketch follows the list):
- data.table: High-performance data wrangling
- dtplyr: Interface between dplyr and data.table
- tidyverse: Loads dplyr and several other useful R packages
- vroom: Fast reading of delimited files (e.g. CSV) with vroom()
- tictoc: Simple timing operations
- knitr: Use the kable() function for nice HTML tables
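Here's a minimal setup sketch. The devtools::install_github("tidyverse/dtplyr") call assumes you want the development version straight from GitHub; the CRAN release can be installed with install.packages("dtplyr") instead.

```r
# Install the development version of dtplyr
# install.packages("devtools")
# devtools::install_github("tidyverse/dtplyr")

# Load the libraries used in this tutorial
library(data.table)   # High-performance data wrangling
library(dtplyr)       # Interface between dplyr and data.table
library(tidyverse)    # Loads dplyr and several other useful packages
library(vroom)        # Fast reading of delimited files
library(tictoc)       # Simple timing operations
library(knitr)        # kable() for nice HTML tables
```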
4.4 Read the Data
We'll read the data. The column types are pre-specified to assist in the loading process. The vroom() function does the heavy lifting.
First, I’ll setup the paths to the two files I’ll be reading:
- Acquisitions_2018Q1.txt – Meta-data about each loan
- Performance_2018Q1.txt – Time series data set with loan performance characteristics over time
For me, the files are stored in a folder called 2019-08-15-dtplyr. Your paths may be different depending on where the files are stored.
Read the Loan Acquisition Data
Note that we specify the column names and types to improve the speed of reading the data.
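Here's a sketch of the acquisitions read. The full column specification has 25 entries and is abbreviated below; the pipe delimiter and missing header row are assumptions about how the raw Fannie Mae text files are laid out.

```r
# Paths to the raw files (adjust to wherever your files are stored)
path_acq  <- "2019-08-15-dtplyr/Acquisitions_2018Q1.txt"
path_perf <- "2019-08-15-dtplyr/Performance_2018Q1.txt"

# Column names for the acquisitions file (abbreviated - the real file has 25)
acq_col_names <- c(
    "loan_id", "original_channel", "seller_name",
    "original_interest_rate", "original_upb", "original_loan_term"
    # ...plus the remaining 19 column names
)

acquisitions_tbl <- vroom(
    path_acq,
    delim     = "|",              # pipe-delimited raw file (assumption)
    col_names = acq_col_names,    # no header row in the raw file (assumption)
    col_types = cols(
        .default               = col_character(),
        original_interest_rate = col_double(),
        original_upb           = col_double()
    ),
    na        = c("", "NA", "NULL")
)
```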
The loan acquisition data contains information about the owner of the loan.
loan_id | original_channel | seller_name | original_interest_rate | original_upb | original_loan_term | original_date | first_pay_date | original_ltv | original_cltv | number_of_borrowers | original_dti | original_borrower_credit_score | first_time_home_buyer | loan_purpose | property_type | number_of_units | occupancy_status | property_state | zip | primary_mortgage_insurance_percent | product_type | original_coborrower_credit_score | mortgage_insurance_type | relocation_mortgage_indicator |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100001040173 | R | QUICKEN LOANS INC. | 4.250 | 453000 | 360 | 2018-01-01 | 2018-03-01 | 65 | 65 | 1 | 28 | 791 | N | C | PU | 1 | P | OH | 430 | NA | FRM | NA | NA | N |
100002370993 | C | WELLS FARGO BANK, N.A. | 4.250 | 266000 | 360 | 2018-01-01 | 2018-03-01 | 80 | 80 | 2 | 41 | 736 | N | R | PU | 1 | P | IN | 467 | NA | FRM | 793 | NA | N |
100005405807 | R | PMTT4 | 3.990 | 233000 | 360 | 2017-12-01 | 2018-01-01 | 79 | 79 | 2 | 48 | 696 | N | R | SF | 1 | P | CA | 936 | NA | FRM | 665 | NA | N |
100008071646 | R | OTHER | 4.250 | 184000 | 360 | 2018-01-01 | 2018-03-01 | 80 | 80 | 1 | 48 | 767 | Y | P | PU | 1 | P | FL | 336 | NA | FRM | NA | NA | N |
100010739040 | R | OTHER | 4.250 | 242000 | 360 | 2018-02-01 | 2018-04-01 | 49 | 49 | 1 | 22 | 727 | N | R | SF | 1 | P | CA | 906 | NA | FRM | NA | NA | N |
100012691523 | R | OTHER | 5.375 | 180000 | 360 | 2018-01-01 | 2018-03-01 | 80 | 80 | 1 | 14 | 690 | N | C | PU | 1 | P | OK | 730 | NA | FRM | NA | NA | N |
Get the size of the acquisitions data set: 426K rows by 25 columns. Not that bad, but this is meta-data for the loan. The dataset we are worried about is the next one.
Read the Loan Performance Data
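The performance file is read the same way (again a sketch with an abbreviated column list; the numeric column types shown are the ones used later in the tutorial).

```r
# Column names for the performance file (abbreviated - the real file has 31)
perf_col_names <- c(
    "loan_id", "monthly_reporting_period", "servicer_name",
    "current_interest_rate", "current_upb", "loan_age"
    # ...plus the remaining 25 column names (including current_loan_delinquency_status)
)

performance_tbl <- vroom(
    path_perf,
    delim     = "|",
    col_names = perf_col_names,
    col_types = cols(
        .default                        = col_character(),
        current_interest_rate           = col_double(),
        current_upb                     = col_double(),
        loan_age                        = col_double(),
        current_loan_delinquency_status = col_double()
    ),
    na        = c("", "NA", "NULL")
)
```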
Let’s inspect the data. We can see that this is a time series where each “Loan ID” and “Monthly Reporting Period” go together.
loan_id | monthly_reporting_period | servicer_name | current_interest_rate | current_upb | loan_age | remaining_months_to_legal_maturity | adj_remaining_months_to_maturity | maturity_date | msa | current_loan_delinquency_status | modification_flag | zero_balance_code | zero_balance_effective_date | last_paid_installment_date | foreclosed_after | disposition_date | foreclosure_costs | prop_preservation_and_repair_costs | asset_recovery_costs | misc_holding_expenses | holding_taxes | net_sale_proceeds | credit_enhancement_proceeds | repurchase_make_whole_proceeds | other_foreclosure_proceeds | non_interest_bearing_upb | principal_forgiveness_upb | repurchase_make_whole_proceeds_flag | foreclosure_principal_write_off_amount | servicing_activity_indicator |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100001040173 | 2018-02-01 | QUICKEN LOANS INC. | 4.25 | NA | 0 | 360 | 360 | 2048-02-01 | 18140 | 0 | N | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | N | ||
100001040173 | 2018-03-01 | 4.25 | NA | 1 | 359 | 359 | 2048-02-01 | 18140 | 0 | N | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | N | |||
100001040173 | 2018-04-01 | 4.25 | NA | 2 | 358 | 358 | 2048-02-01 | 18140 | 0 | N | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | N | |||
100001040173 | 2018-05-01 | 4.25 | NA | 3 | 357 | 357 | 2048-02-01 | 18140 | 0 | N | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | N | |||
100001040173 | 2018-06-01 | 4.25 | NA | 4 | 356 | 356 | 2048-02-01 | 18140 | 0 | N | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | N | |||
100001040173 | 2018-07-01 | 4.25 | NA | 5 | 355 | 355 | 2048-02-01 | 18140 | 0 | N | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | N |
Let’s check out the data size. We can see it’s 4.6M rows by 31 columns! Just a typical financial time series (seriously).
4.5 Convert Tibbles to dtplyr Steps

Next, we'll use the lazy_dt() function to convert the tibbles to dtplyr steps.

We can check the class() to see what we are working with.

The returned object is the first step in a dtplyr sequence.
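A minimal sketch of the conversion, assuming the tibbles from the read step are named acquisitions_tbl and performance_tbl:

```r
# Convert the tibbles to lazy dtplyr steps (no computation happens yet)
acquisitions_dt <- lazy_dt(acquisitions_tbl)
performance_dt  <- lazy_dt(performance_tbl)

class(acquisitions_dt)
#> e.g. "dtplyr_step_first" "dtplyr_step"
```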
Key Point:

- We are going to set up operations using a sequence of steps.
- The operations will not be fully evaluated until we convert to a data.table or tibble, depending on our desired output.
4.6 Join the Data Sets
Our first data manipulation operation is a join. We are going to use the left_join() function from dplyr. Let's see what happens.
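A sketch of the join, assuming we want to add the acquisition meta-data onto the big performance time series (which table sits on the left is an assumption here):

```r
# Left join the acquisition meta-data onto the performance time series
combined_dt <- performance_dt %>%
    left_join(acquisitions_dt, by = "loan_id")

class(combined_dt)
```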
The output of the joining operation is a new step sequence, this time a dtplyr_step_subset.

Next, let's examine what happens when we print combined_dt to the console.
Key Points:

- The important piece is the data.table translation code, which we can see in the output: Call: _DT2[_DT1, on = .(loan_id)]
- Note that we haven't executed the data manipulation operation yet. dtplyr smartly gives us a glimpse of what the operation will look like, which is really cool.
4.7 Wrangle the Data
We'll do a sequence of data wrangling operations (a sketch of the pipeline follows the list):

- Select the specific columns we want to keep.
- Arrange by loan_id and monthly_reporting_period. This is needed to keep groups together and in the right time-stamp order.
- Group by loan_id and mutate to calculate whether or not loans become delinquent in the next 3 months.
- Filter rows with NA values from the newly created column (these aren't needed).
- Reorder the columns to put the new calculated column first.
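Here's a sketch of that pipeline, continuing from combined_dt. The selected columns are an illustrative subset; the new-column logic mirrors the data.table translation shown in the Key Point below.

```r
final_output_dt <- combined_dt %>%

    # 1. Keep only the columns we need (illustrative subset)
    select(loan_id, monthly_reporting_period,
           current_loan_delinquency_status, current_upb, loan_age) %>%

    # 2. Keep each loan's records together and in time-stamp order
    arrange(loan_id, monthly_reporting_period) %>%

    # 3. For each loan, flag whether it falls 1+ months behind within the next 3 months
    group_by(loan_id) %>%
    mutate(gt_1mo_behind_in_3mo =
               lead(current_loan_delinquency_status, n = 3) >= 1) %>%

    # 4. Drop rows where the 3-month look-ahead is not available
    filter(!is.na(gt_1mo_behind_in_3mo)) %>%

    # 5. Put the new calculated column first
    select(gt_1mo_behind_in_3mo, everything())
```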
The final output is a dtplyr_step_group, which is just a sequence of steps.

If we print the final_output_dt object, we can see the data.table translation is pretty intense.
Key Point:

- The most important piece is that dtplyr correctly converted the grouped mutation to an inplace calculation, which is data.table speak for a super-fast calculation that makes no copies of the data. Here's the inplace calculation code from the dtplyr translation: [, :=(gt_1mo_behind_in_3mo = lead(current_loan_delinquency_status, n = 3) >= 1), keyby = .(loan_id)]
4.8 Collecting The Data
Note that up until now, nothing has been done to process the data - we've just created a recipe for data wrangling. We still need to tell dtplyr to execute the data wrangling operations.

To implement all of the steps and convert the dtplyr sequence to a tibble, we just call as_tibble().
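For example (a sketch, continuing with the final_output_dt object from above):

```r
# Execute the accumulated data.table steps and return a tibble
final_output_tbl <- as_tibble(final_output_dt)
final_output_tbl

# Alternatively, collect the result as a data.table
final_output_datatable <- as.data.table(final_output_dt)
```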
Key Point:
- Calling the as_tibble() function tells dtplyr to execute the data.table wrangling operations.
5.0 The 3X Speedup - Time Comparisons
Finally, let's compare the performance of dplyr vs dtplyr vs data.table. We can see a nice 3X speed boost!
5.1 Time using dplyr
5.2 Time using dtplyr
5.3 Time using data.table
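The timing chunks themselves aren't reproduced here, but each section follows the same tictoc pattern. A minimal sketch of the dtplyr version (assuming the lazy step objects from Section 4) looks like this; the dplyr timing runs the same verbs directly on the tibbles, and the data.table timing uses hand-written data.table code.

```r
tic("dtplyr + data.table backend")

performance_dt %>%
    left_join(acquisitions_dt, by = "loan_id") %>%
    arrange(loan_id, monthly_reporting_period) %>%
    group_by(loan_id) %>%
    mutate(gt_1mo_behind_in_3mo =
               lead(current_loan_delinquency_status, n = 3) >= 1) %>%
    filter(!is.na(gt_1mo_behind_in_3mo)) %>%
    as_tibble()   # collecting here is what triggers the data.table execution

toc()
#> dtplyr + data.table backend: ~6 sec elapsed (figure from this study; your hardware will vary)
```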
6.0 Conclusions and Learning Recommendations
For Big Data wrangling, the dtplyr package represents a huge opportunity for data scientists to leverage the speed of data.table with the readability of dplyr. We saw an impressive 3X Speedup going from dplyr to dtplyr for wrangling a 4.6M row data set. This just scratches the surface of the potential, and I'm looking forward to seeing dtplyr mature, which will help bridge the gap between the two groups of data scientists using dplyr and data.table.
For new data scientists coming from other tools like Excel, my hope is that you see the awesome potential of learning R for data analysis and data science. The Big Data capabilities represent a massive opportunity for you to bring data science at scale to your organization.
You just need to learn how to go from normal data to Big Data.
My recommendation is to start by learning dplyr - the popular data manipulation library that makes reading and writing R code very easy to understand.

Once you get to an intermediate level, learn data.table. This is where you gain the benefits of scaling data science to Big Data. The data.table package has a steeper learning curve, but learning it will help you leverage its full performance and scalability.
If you need to learn dplyr as fast as possible, I recommend beginning with our Data Science Foundations DS4B 101-R Course. The 101 Course is available as part of the 3-Course R-Track Bundle, a complete learning system designed to transform you from beginner to advanced in under 6 months. You will learn everything you need to become an expert data scientist.
7.0 Additional Big Data Guidelines
I find that students have an easier time picking a tool based on dataset row size (e.g. "I have 10M rows, what should I use?"). With that said, there are 2 factors that will influence which tools you need to use:

- Are you performing grouped and iterative operations? Performance even on normal data sets can become an issue if you have a lot of groups or if the calculation is iterative. A particular source of pain in the financial realm is rolling (window) calculations, which are both grouped and iterative within groups. In these situations, use high-performance C++ functions (e.g. rolling functions from the roll package or the RcppRoll package), as sketched below.
- Do you have sufficient RAM? Once you begin working with gigabytes of data, you start to run out of memory (RAM). In these situations, you will need to work in chunks and parallelize operations. You can do this with sparklyr, which will perform some operations in parallel and distribute work across nodes.
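To illustrate the first point, here's a small hypothetical sketch of a grouped rolling calculation using RcppRoll (the rolling mean of the unpaid balance and the 3-month window are made up for the example):

```r
library(dplyr)
library(RcppRoll)

# Hypothetical grouped rolling calculation: 3-month right-aligned rolling
# mean of the unpaid balance, computed within each loan
performance_rolling_tbl <- performance_tbl %>%
    group_by(loan_id) %>%
    arrange(monthly_reporting_period, .by_group = TRUE) %>%
    mutate(upb_roll_3mo = roll_meanr(current_upb, n = 3, fill = NA)) %>%
    ungroup()
```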
8.0 Recognizing the Developers
I'd like to take a quick moment to thank the developers of data.table and dplyr. Without these two packages, Business Science probably would not exist. Thank you.

- Matt Dowle - Creator of data.table in R. Check out the R data.table website.
- Hadley Wickham - Creator of dplyr and dtplyr. Check out the dplyr website and the new dtplyr website.
9.0 Coming Soon - Expert Shiny Apps Course!
I’m very excited to announce that Business Science has an Expert Shiny Course - Coming soon! Head over to Business Science University and create a free account. I will update you with the details shortly.