Parallel Processing Baseball Data with R and mlbgameday
Just In Time For Baseball
The mlbgameday package has just reached the milestone of version 0.1.0.
The package is designed to facilitate extract, transform, and load (ETL) for MLBAM “Gameday” data, and it is optimized for parallel processing of data that may be larger than memory. Other packages in the R universe were built to perform statistics and visualizations on these data, but mlbgameday is concerned primarily with data collection. More uses of these data can be found in the pitchRx, openWAR, and baseballr packages.
Install from CRAN
install.packages("mlbgameday")
Parallel Processing
The package’s internal functions are optimized to work with the doParallel package. By default, R uses a single core of the CPU. The doParallel package lets us use several cores, which execute tasks simultaneously. For a standard regular season covering all teams, the package has to process more than 2,400 individual files, which, depending on your system, can take quite some time. Parallel processing can make this several times faster, depending on how many processor cores we choose to use.
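As a minimal sketch of what registering a parallel backend looks like, the snippet below uses the standard doParallel and foreach pattern. The `foreach` loop is only a placeholder workload standing in for per-file processing; it is not the package's internal code, and the choice of leaving one core free is an assumption rather than a requirement.

```r
library(doParallel)  # also attaches foreach and parallel

# Assumption: leave one core free for the rest of the system,
# and fall back to a single worker if the core count is unknown.
available <- parallel::detectCores()
n_cores <- if (is.na(available)) 1L else max(1L, available - 1L)

# Start the workers and register them as the foreach backend.
cl <- makeCluster(n_cores)
registerDoParallel(cl)

# Placeholder workload: each iteration may run on a different worker.
squares <- foreach(i = 1:8, .combine = c) %dopar% {
  i^2
}

# Always release the workers when finished.
stopCluster(cl)
```

Once a backend is registered this way, any code that relies on `%dopar%` can use it; the data-collection call would take the place of the placeholder loop.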
Non Parallel
Although the package is optimized for parallel processing, it will also work without registering a parallel backend. When querying only a single day’s data, a parallel backend may not provide much additional performance. However, a parallel backend is recommended for larger data sets, where it can make the process several times faster.
We can download and subset a small amount of data. In the example below, we’ll look for Jake Arrieta’s no-hitter in 2016.
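The subsetting step can be sketched in base R, assuming the downloaded data arrives as separate at-bat and pitch tables that share key columns. The table layout, the column names (`num`, `url`, `pitcher_name`, `pitch_type`), and every value below are hypothetical placeholders, not real Gameday data or the package's documented schema.

```r
# Toy stand-ins for the at-bat and pitch tables; all values are placeholders.
atbat <- data.frame(
  num = c(1L, 2L, 3L),
  url = "game-1",
  pitcher_name = c("Pitcher A", "Pitcher A", "Pitcher B"),
  stringsAsFactors = FALSE
)
pitch <- data.frame(
  num = c(1L, 1L, 2L, 3L),
  url = "game-1",
  pitch_type = c("FF", "SL", "FF", "CU"),
  stringsAsFactors = FALSE
)

# Attach the at-bat context to each pitch via the shared key columns.
pitches <- merge(pitch, atbat, by = c("num", "url"))

# Keep only the rows thrown by the pitcher of interest.
one_pitcher <- subset(pitches, pitcher_name == "Pitcher A")
```

With real data, the two toy data frames would be replaced by the tables the package returns for the requested date, and the filter value by the pitcher's name.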