Lee E. Edlefsen – Scalable Data Analysis in R (useR! 2011)
The RevoScaleR package isn’t open source, but it is free for academic users.
Collecting and storing data has outpaced our ability to analyze it. Can R cope with this challenge? The RevoScaleR package, part of Revolution R Enterprise, provides data management and data analysis functions that use multiple cores and are designed to scale.
Scalability
What is scalability? It spans everything from a small in-memory data.frame to multi-terabyte data sets distributed across space and even time. The key to the problem is being able to process more data than can fit into memory at any one time: data is processed in chunks.
Two main problems: capacity (the data does not fit in memory) and speed (the computation is too slow). Most commonly used statistical software tools can’t handle large data, and we still think in terms of “small data sets”.
High performance analytics = HPC + Data
- HPC is CPU centric: lots of processing on small amounts of data.
- HPA is data centric: less processing per amount of data. It needs efficient threading and data management. The key to this is data chunking.
Example
- Initialization task: total = 0, count = 0;
- Process data tasks: for each block of x, total = total + sum(x), count = count + length(x);
- Update results: combine the totals and counts from all blocks;
- Process results: mean = total / count.
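The four steps above can be sketched in plain R (this is an illustration of the pattern, not RevoScaleR code; the data and chunk size are made up):

```r
x <- rnorm(1e6)                                  # pretend x is too big for memory
blocks <- split(x, ceiling(seq_along(x) / 1e5))  # process in chunks of 100,000

total <- 0; count <- 0                  # initialization task
for (chunk in blocks) {                 # process data tasks, one block at a time
  total <- total + sum(chunk)
  count <- count + length(chunk)
}
# update results: total and count now combine all blocks
mean_x <- total / count                 # process results
all.equal(mean_x, mean(x))              # TRUE
```

Because each block contributes only a partial sum and count, the blocks could just as well be processed by different cores or different machines and combined at the end.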
ScaleR
ScaleR can process data from a variety of formats. It uses its own optimized format (XDF) that is suitable for chunking. XDF format:
- data is stored in blocks of rows
- the header is at the end of the file
- allows sequential reads
- essentially unlimited in size
- efficient disk space usage.
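In RevoScaleR itself, working with XDF might look like the following hedged sketch (rxImport and rxSummary are the package’s import and summary functions; the file names and the rowsPerRead value here are illustrative assumptions, not from the talk):

```r
library(RevoScaleR)  # proprietary; ships with Revolution R Enterprise

# Convert a large CSV into the block-structured XDF format.
# File names are made up for illustration.
rxImport(inData = "flights.csv", outFile = "flights.xdf",
         rowsPerRead = 500000)   # rows stored per block

# Summary statistics are computed block by block,
# so all rows never need to be in memory at once.
rxSummary(~ ArrDelay, data = "flights.xdf")
```

The block-of-rows layout is what makes the chunked initialize/process/combine pattern from the example above map directly onto the file format.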