Since my initial post on parallel processing with `multidplyr`, there have been some recent changes in the tidy eco-system: namely the package `tidyquant`, which brings financial analysis to the `tidyverse`. The `tidyquant` package drastically increases the amount of tidy financial data we have access to and reduces the amount of code needed to get financial data into the tidy format. The `multidplyr` package adds parallel processing capability to improve the speed at which analysis can be scaled. I seriously think these two packages were made for each other. I’ll go through the same example used previously, updated with the new `tidyquant` functionality.
Table of Contents
- Parallel Processing Applications in Financial Analysis
- Prerequisites
- Workflow
- Real World Example
- Conclusion
- Recap
- Further Reading
Parallel Processing Applications in Financial Analysis
Collecting financial data in tidy format was overly difficult. Getting from `xts` to `tibble` was a pain, and there’s some amazing `tidyverse` functionality that cannot be used without the `tibble` (tidy data frame) format. That all changed with `tidyquant`. There’s a wide range of free data sources, and the `tidyquant` package makes it super simple to get financial and economic data in tidy format (more on this in a minute).
There’s one caveat to collecting data at scale: it takes time. Getting data from the internet (historical stock prices, financial statements, real-time statistics) for 500+ stocks can take anywhere from several minutes to 20+ minutes depending on the data size and number of stocks. `tidyquant` makes it easier to get and analyze data in the correct format, but we need a new tool to speed up the process. Enter `multidplyr`.
The `multidplyr` package makes it super simple to parallelize code. It works perfectly in the `tidyverse`, and, by the associative property, works perfectly with `tidyquant`.
The example in this post shows off the new `tq_get()` function for getting stock prices at scale. However, we can get and scale much more than just stock prices. `tq_get()` has the following data retrieval options:
- `get = "stock.index"`: This retrieves the entire list of stocks in an index. 18 indexes are available. Use `tq_get_stock_index_options()` to see the full list.
- `get = "stock.prices"`: This retrieves historical stock prices over a time period specified by `to` and `from`. This is the default option, and the one we use in this post.
- `get = "key.ratios"`: This retrieves the key ratios from Morningstar, which are historical values over the past 10 years. There are 89 key ratios ranging from valuation, to growth and profitability, to efficiency. This is a great place to chart business performance over time and to compare the key ratios by company. Great for financial analysis!
- `get = "key.stats"`: This retrieves real-time key stats from Yahoo Finance, which consist of bid, ask, day’s high, day’s low, change, current P/E valuation, current market cap, and many more up-to-the-minute stats on a stock. This is a great place for the day trader to work since all of the data is accurate as of the second you download it.
- `get = "financials"`: This retrieves the annual and quarterly financial statement data from Google Finance. Great for financial analysis!
- `get = "economic.data"`: This retrieves economic data from the FRED database by FRED code. As of the blog post date, there are 429,000 US and international time-series data sets from 80 sources. All you need is the FRED symbol, such as “CPIAUCSL” for CPI. Visit the FRED website to learn more.
- Other `get` options: `"metal.prices"`, `"exchange.rates"`, `"dividends"`, and `"splits"`. There’s lots of data you can get using `tq_get()`!
The point I want to make is that ANY OF THESE GET OPTIONS CAN BE SCALED USING THE PROCESS I USE NEXT!!!
Prerequisites
The `multidplyr` package is not available on CRAN, but you can install it using `devtools`. Also, install the development version of `tidyquant`, which has added functionality that will be available on CRAN soon with v0.3.0.
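For reference, an install sketch using `devtools::install_github()` (the repository paths are assumptions; check each package’s documentation for the canonical locations):

```r
# install.packages("devtools") # if devtools is not already installed

# multidplyr is GitHub-only (repo path is an assumption)
devtools::install_github("hadley/multidplyr")

# Development version of tidyquant (repo path is an assumption)
devtools::install_github("business-science/tidyquant")
```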
For those following along in R, you’ll need to load the following packages:
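A minimal setup based on the packages used throughout this post:

```r
library(multidplyr) # parallel processing for dplyr workflows
library(tidyquant)  # tidy financial data (also attaches quantmod, xts, etc.)
library(tidyverse)  # dplyr, purrr, tidyr, tibble, and friends
```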
I also recommend the open-source RStudio IDE, which makes R Programming easy and efficient.
Workflow
The `multidplyr` workflow can be broken down into five basic steps, shown in Figure 1. The five steps are implemented in the Processing in Parallel section below.
Figure 1: multidplyr Workflow
Essentially, you start with some data set that you need to do things to multiple times. Your situation generally falls into one of two types:
- It could be a really large data set that you want to split up into several small data sets and perform the same thing on each.
- It could be one data set that you want to perform multiple things on (e.g. apply many models).
The good news is both situations follow the same basic workflow. The toughest part is getting your data in the format needed to process using the workflow. Don’t worry, we’ll go through a real world example shortly so you can see how this is accomplished.
Real World Example
We’ll go through the `multidplyr` workflow using a real world example that I routinely use: collecting stock prices from the inter-web. Other uses include applying modeling functions over grouped data sets, applying many models to the same data set, and processing text (e.g. getting n-grams on large corpora). Basically, anything with a loop.
Prep-Work
In preparation for collecting stock prices, we need two things:
- A list of stocks
- A function to get stock prices from a stock symbol
Let’s see how tidyquant makes this easy. First, getting a stock index used to be a pain:
Before tidyquant (Don’t use this):
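Something along these lines was required, scraping and cleaning the list yourself (a sketch; the URL, CSS selector, and column names are assumptions and may change):

```r
library(rvest)

# Scrape the S&P 500 constituents table from Wikipedia, then clean it
# into a tibble by hand (selector and column names are assumptions)
sp_500 <- read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies") %>%
    html_node(".wikitable") %>%
    html_table() %>%
    as_tibble() %>%
    rename(symbol = `Ticker symbol`)
```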
Now with tidyquant (Use this!):
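With `tidyquant`, the same list is a single call:

```r
# Get the list of stocks in the S&P 500 index as a tibble
sp_500 <- tq_get("SP500", get = "stock.index")
sp_500
```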
Second, getting stock prices in tidy format used to be a pain:
Before tidyquant (Don’t use this):
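Previously, `quantmod::getSymbols()` returned an `xts` object that had to be coerced into a tibble by hand. A sketch (the date range is an assumption):

```r
library(quantmod)

# getSymbols() returns an xts object...
aapl_xts <- getSymbols("AAPL", from = "2007-01-01", to = "2017-01-01",
                       auto.assign = FALSE)

# ...which then needs manual conversion to a tidy tibble
aapl_tbl <- aapl_xts %>%
    as.data.frame() %>%
    rownames_to_column(var = "date") %>%
    as_tibble() %>%
    mutate(date = as.Date(date))
```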
Now with tidyquant (Use this!):
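With `tidyquant`, one call returns the prices already in tidy format (the date range is an assumption, matching the ten-year window used later):

```r
# Historical stock prices as a tidy tibble, no conversion needed
aapl <- tq_get("AAPL", get = "stock.prices",
               from = "2007-01-01", to = "2017-01-01")
aapl
```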
Note that you can replace `"stock.prices"` with `"key.ratios"`, `"key.stats"`, `"financials"`, etc. to get other financial data for a stock symbol. These options can be scaled as well!
Processing in Series
The next computation is the routine that we wish to parallelize, but first we’ll time the script running on one processor, looping in series. We are collecting ten years of historical daily stock prices for each of the 500+ stocks. This is now a simple chaining operation with `tidyquant`, which accepts a tibble of stocks with the symbols in the first column.
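A sketch of the series run, timed with `proc.time()` (the ten-year date range is an assumption):

```r
from <- "2007-01-01"
to   <- "2017-01-01"

start <- proc.time() # start the clock

# tq_get() maps over every symbol in the sp_500 tibble, one at a time
sp_500_processed_in_series <- sp_500 %>%
    tq_get(get = "stock.prices", from = from, to = to)

time_elapsed_series <- proc.time() - start # stop the clock
time_elapsed_series
```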
The result, `sp_500_processed_in_series`, is a `tibble` (tidy data frame) with the stock prices for the 500+ stocks.
And, let’s see how long it took when processing in series. The processing time is the time elapsed in seconds. Converted to minutes, this is approximately 3.71 minutes.
Processing in Parallel
We just collected ten years of daily stock prices for over 500 stocks in about 3.71 minutes. Let’s parallelize the computation to get an improvement. We will follow the five steps shown in Figure 1, plus an optional Step 0.
Step 0: Get Number of Cores (Optional)
Prior to starting, you may want to determine how many cores your machine has. An easy way to do this is using `parallel::detectCores()`. This will be used to determine the number of groups to split the data into in the next step.
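For example:

```r
library(parallel)

cl <- detectCores()
cl # 8 on the machine used in this post
```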
Step 1: Add Groups
Let’s add groups to `sp_500`. The groups are needed to divide the data across your `cl` cores. For me, this is 8 cores. We create a `group` vector, which is a sequential vector of `1:cl` (1 to 8) repeated the length of the number of rows in `sp_500`. We then add the `group` vector to the `sp_500` tibble using the `dplyr::bind_cols()` function.
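A sketch of the grouping step:

```r
# Sequential vector 1:cl recycled to the number of rows in sp_500
group <- rep(1:cl, length.out = nrow(sp_500))

# Add the group column to the front of the sp_500 tibble
sp_500 <- bind_cols(tibble(group), sp_500)
sp_500
```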
Step 2: Create Clusters
Use the `create_cluster()` function from the `multidplyr` package. Think of a cluster as a work environment on a core. The code below establishes a work environment on each of the 8 cores.
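Creating the cluster is one line:

```r
cluster <- create_cluster(cores = cl)
cluster
```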
Step 3: Partition by Group
Next is partitioning. Think of partitioning as sending a subset of the initial `tibble` to each of the clusters. The result is a partitioned data frame (`party_df`), which we explore next. Use the `partition()` function from the `multidplyr` package to split the `sp_500` tibble by group and send each group to a different cluster.
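A sketch of the partitioning step:

```r
by_group <- sp_500 %>%
    partition(group, cluster = cluster)
by_group
```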
The result, `by_group`, looks similar to our original `tibble`, but it is a `party_df`, which is very different. The key is to notice that there are 8 `Shards`. Each `Shard` has between 63 and 64 rows, which evenly splits our data among each shard. Now that our `tibble` has been partitioned into a `party_df`, we are ready to move on to setting up the clusters.
Step 4: Setup Clusters
The clusters have a local, bare-bones R work environment, which doesn’t work for the vast majority of cases. Code typically depends on libraries, functions, expressions, variables, and/or data that are not available in base R. Fortunately, there is a way to add these items to the clusters. Let’s see how.
For our computation, we are going to need to add the `tidyquant` library and our variables `to` and `from` to each cluster. We do this by using the `cluster_library()` and `cluster_assign_value()` functions, respectively.
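A sketch of the cluster setup, using the two functions named above:

```r
# Attach the tidyquant library on each cluster node
by_group %>%
    cluster_library("tidyquant")

# Assign the from and to variables on each cluster node
by_group %>%
    cluster_assign_value("from", from)
by_group %>%
    cluster_assign_value("to", to)
```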
We can verify that the libraries are loaded using the `cluster_eval()` function.
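For example:

```r
# Inspect the search path on each node; tidyquant should be attached
cluster_eval(by_group, search())
```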
Step 5: Run Parallelized Code
Now that we have our clusters and partitions set up and everything looks good, we can run the parallelized code. The code chunk is a little bit different than before because we need to use `purrr` to map `tq_get()` to each stock symbol (see the sketch after this list):
- Instead of starting with the `sp_500` tibble, we start with the `by_group` `party_df`.
- This is how we scale with `purrr`: we use a combination of `dplyr::mutate()` and `purrr::map()` to map our `tq_get()` function to the stocks.
- We combine the results at the end using the `multidplyr::collect()` function. The result is a nested tibble.
- We unnest with `tidyr::unnest()` to return the same single-level tibble as before.
- Finally, we use `dplyr::arrange()` to arrange the stocks in the same order as previous. The `collect()` function returns the shards (cluster data groups) bound in whatever order finishes first, so you’ll typically want to re-arrange.
Let’s check out the results.
And, let’s see how long it took when processing in parallel.
The processing time is approximately 0.66 minutes, which is 5.6X faster! Note that it’s not a full 8X faster because of transmission time as data is sent to and from the nodes. With that said, the speed will approach 8X improvement as calculations become longer since the transmission time is fixed whereas the computation time is variable.
Conclusion
Parallelizing code can drastically improve speed on multi-core machines. It makes the most sense in situations involving many iterative computations. On an 8 core machine, processing time significantly improves. It will not be quite 8X faster, but the longer the computation, the closer the speed gets to the full 8X improvement. For a computation that takes almost four minutes under normal conditions, we improved the processing speed by over 5X through parallel processing!
Recap
The focus of this post was to show how you can implement `tidyquant` with the `multidplyr` parallel-processing package for parallel processing financial applications. We worked through the five main steps in the `multidplyr` workflow using the new `tidyquant` package, which makes it much easier to get financial data in tidy format. Keep in mind that you can use any `tq_get` “get” option at scale.
Further Reading
- Tidyquant Vignettes: This tutorial just scratches the surface of `tidyquant`. The vignettes explain much, much more!
- Multidplyr on GitHub: The vignette explains the `multidplyr` workflow using the `flights` data set from the `nycflights13` package.