Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The RevoScaleR package for the R language provides scalable, distributed and parallel computation, and is available with Microsoft R Server (and in-database R Services). It solves many of the limitations that the R language faces when run from a client machine. The RevoScaleR package addresses several of these issues:
- memory-based data access model -> a dataset can be larger than the available RAM
- lack of parallel computation -> offers distributed and parallel computation
- data movement -> no more need for data movement, thanks to the ability to set the computational context
- duplication costs -> with the computational context set and different R versions (Open, Client or Server), data resides in one place, which makes maintenance cheaper and removes the need for duplicates in different locations
- governance and provenance -> RevoScaleR offers oversight of both through settings and additional services in R Server
- hybrid topologies and agile development -> a combination of on-premises + cloud + client allows hybrid environment development for faster time to production
Before continuing, make sure you have the RevoScaleR package installed in your R environment. Before checking which computational functions are available within this package, let us run the following:
RevoInfo <- packageVersion("RevoScaleR")
RevoInfo
to see the version of the RevoScaleR package. In this case it is:
[1] ‘9.0.1’
Now we will run a command to get the list of all functions:
revoScaleR_objects <- ls("package:RevoScaleR")
revoScaleR_objects
Here is the list:
All RevoScaleR functions have the prefix rx or Rx, which makes it much easier to distinguish them from functions in other, similar packages, for example rxKmeans and kmeans.
find("rxKmeans")
find("kmeans")
The results show the name of the package in which each function lives:
> find("rxKmeans")
[1] "package:RevoScaleR"
> find("kmeans")
[1] "package:stats"
The output of the revoScaleR_objects object shows 200 computational functions, but I will focus on only a couple of them.
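To check the prefix claim and count the listed objects yourself, a small sketch along these lines may help. The helper name has_rx_prefix is hypothetical (it is not part of RevoScaleR), and the second half only runs if the package is installed.

```r
# Hypothetical helper: TRUE for names that start with "rx" or "Rx".
has_rx_prefix <- function(object_names) {
  grepl("^[Rr]x", object_names)
}

# Works on any character vector of names.
print(has_rx_prefix(c("rxKmeans", "RxXdfData", "kmeans")))

# Only query RevoScaleR when the package is actually installed.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)
  revoScaleR_objects <- ls("package:RevoScaleR")
  print(length(revoScaleR_objects))                              # number of listed objects
  print(revoScaleR_objects[!has_rx_prefix(revoScaleR_objects)])  # names without the prefix, if any
}
```

The last line prints any names that do not follow the prefix convention, so the claim can be verified against your installed version.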
The RevoScaleR package and its computational functions were designed for parallel computation without memory limitations, mainly because the package introduced its own file format, called XDF. The eXternal Data Frame was designed for fast processing of smaller chunks of data, and it gains its efficiency when reading and writing XDF data by loading chunks of data into RAM one at a time, and only what is needed. Done this way, the size of RAM is no longer a limit, and computations run much faster (because the algorithms are written in C++, which is faster than the originals, which were written in an interpreted language).
A data scientist still makes a single R call, but R uses the DistributedR component to determine how many cores, sockets and threads are available, then launches a smaller portion of the load into each thread and analyzes the data a bit at a time. With XDF, data is retrieved many times, but the file is 5-10 times smaller (as I have already shown in previous blog posts when comparing it to *.txt or *.csv files), and the data is written and stored in the XDF file the same way as it was extracted from memory. This enables faster computations, because no parsing of data chunks is required and because the way the data is stored minimizes its retrieval time.
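The chunk-at-a-time idea can be illustrated in plain base R. This is only a sketch of the general technique, not how RevoScaleR or the XDF format is implemented: the function chunked_mean, its parameters and the small demo file are all made up for illustration.

```r
# Hypothetical helper: mean of one CSV column, reading a fixed number of rows at a time,
# so that only one chunk is held in memory at any moment.
chunked_mean <- function(path, column, rows_per_chunk = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # First line holds the column names.
  header <- scan(text = readLines(con, n = 1L), what = "", sep = ",", quiet = TRUE)
  total <- 0
  n <- 0L
  repeat {
    lines <- readLines(con, n = rows_per_chunk)
    if (length(lines) == 0L) break            # end of file
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    values <- chunk[[column]]
    total <- total + sum(values, na.rm = TRUE)  # running sum
    n <- n + sum(!is.na(values))                # running count of valid values
  }
  total / n
}

# Small demo file (made-up data) to compare against the in-memory result.
tmp <- tempfile(fileext = ".csv")
demo <- data.frame(DAY_OF_WEEK = rep(1:7, length.out = 2500))
write.csv(demo, tmp, row.names = FALSE)

chunk_result <- chunked_mean(tmp, "DAY_OF_WEEK", rows_per_chunk = 1000L)
full_result <- mean(demo$DAY_OF_WEEK)
print(c(chunked = chunk_result, in_memory = full_result))
unlink(tmp)
```

Only running totals are kept between chunks, which is why the approach does not depend on the whole file fitting into RAM.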
Preparing and storing or importing your data into XDF is an important part of achieving faster computation times. Download some sample data from the Revolution Analytics blog. I will be using some AirOnTime data, a CSV file from here.
With the help of the following functions, I will import the file from CSV into XDF format.
rxTextToXdf() – for importing data into .xdf format from a delimited text file or CSV.
rxDataStepXdf() – for transforming and subsetting the variables and/or rows of the data for exploration and analysis.
setwd("C:/Users/Documents/33")
rxTextToXdf(inFile = "airOT201201.csv", outFile = "airOT201201.xdf", stringsAsFactors = TRUE, rowsPerRead = 200000)
rxGetInfo("airOT201201.xdf", getVarInfo = TRUE, numRows = 20)
rxSummary(~DAY_OF_WEEK, data="airOT201201.xdf")
#or for the whole dataset
rxSummary(~., data="airOT201201.xdf")
Rows Read: 200000, Total Rows Processed: 200000, Total Chunk Time: 0.007 seconds
Rows Read: 200000, Total Rows Processed: 400000, Total Chunk Time: 0.002 seconds
Rows Read: 86133, Total Rows Processed: 486133, Total Chunk Time: 0.002 seconds
Computation time: 0.018 seconds.
Call:
rxSummary(formula = ~DAY_OF_WEEK, data = "airOT201201.xdf")
Summary Statistics Results for: ~DAY_OF_WEEK
Data: "airOT201201.xdf" (RxXdfData Data Source)
File name: airOT201201.xdf
Number of valid observations: 486133
 Name        Mean     StdDev   Min Max ValidObs MissingObs
 DAY_OF_WEEK 3.852806 2.064557 1   7   486133   0
#histogram
rxHistogram(~DAY_OF_WEEK, data="airOT201201.xdf")
Rows Read: 200000, Total Rows Processed: 200000, Total Chunk Time: 0.007 seconds
Rows Read: 200000, Total Rows Processed: 400000, Total Chunk Time: 0.004 seconds
Rows Read: 86133, Total Rows Processed: 486133, Total Chunk Time: Less than .001 seconds
Computation time: 0.019 seconds.
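Subsetting was mentioned earlier but not shown. As a rough sketch, assuming the XDF file created above exists and that rxDataStep() is available in your RevoScaleR version, one might keep a few columns and only the rows for a single day of the week. The output file name airOT201201_subset.xdf and the filter DAY_OF_WEEK == 1 are purely illustrative choices.

```r
# Only the input file and the column names come from the post; the rest is illustrative.
xdf_in  <- "airOT201201.xdf"
xdf_out <- "airOT201201_subset.xdf"   # hypothetical output file
keep    <- c("DAY_OF_WEEK", "DEP_DELAY_NEW", "DISTANCE_GROUP")

# Run only when RevoScaleR is installed and the input file is present.
if (requireNamespace("RevoScaleR", quietly = TRUE) && file.exists(xdf_in)) {
  library(RevoScaleR)
  rxDataStep(inData = xdf_in, outFile = xdf_out,
             varsToKeep = keep,                 # columns written to the new file
             rowSelection = DAY_OF_WEEK == 1,   # keep rows for one day of the week
             overwrite = TRUE)
  print(rxGetInfo(xdf_out, getVarInfo = TRUE))  # inspect the resulting file
}
```

Because the result is written to a new XDF file, the original import stays untouched and the smaller file can be used for quicker exploration.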
Several algorithms for prediction are available (and many more in addition), for example a decision tree:
Air_DTree <- rxDTree(DEP_DELAY_NEW ~ DAY_OF_WEEK + ACTUAL_ELAPSED_TIME + DISTANCE_GROUP, maxDepth = 3, minBucket = 30000, data = "airOT201201.xdf")
Visualizing the tree data:
plotcp(rxAddInheritance(Air_DTree))
plot(rxAddInheritance(Air_DTree))
text(rxAddInheritance(Air_DTree))
or you can use the RevoTreeView package, which is even smarter:
library(RevoTreeView)
plot(createTreeView(Air_DTree))
we can visualize the tree:
Of course, pruning and checking for over-fitting must also be done.
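As a hedged sketch of the pruning step, assuming the Air_DTree object fitted above is in your session, that its cptable contains a cross-validated error column named xerror, and that prune.rxDTree() is available in your RevoScaleR version, one could pick the complexity parameter with the lowest cross-validated error and prune to it. The helper pick_cp is hypothetical.

```r
# Hypothetical helper: choose the CP value with the smallest cross-validated error.
pick_cp <- function(cptable) {
  cptable[which.min(cptable[, "xerror"]), "CP"]
}

# Run only when RevoScaleR is installed and the tree has been fitted.
if (requireNamespace("RevoScaleR", quietly = TRUE) && exists("Air_DTree")) {
  library(RevoScaleR)
  print(Air_DTree$cptable)                                     # complexity table of the fitted tree
  best_cp <- pick_cp(Air_DTree$cptable)
  Air_DTree_pruned <- prune.rxDTree(Air_DTree, cp = best_cp)   # prune back to the chosen CP
  print(Air_DTree_pruned)
}
```

With maxDepth = 3 and minBucket = 30000 the tree is already small, so pruning may change little here; the same steps matter more for deeper trees.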
When comparing, for example, rxDTree to the original function, the performance is much better in favor of the RevoScaleR version. And if you are able to use the RevoScaleR package for computations on larger datasets, or where your client machine might be a bottleneck, use this package. It will surely make your life easier.
Happy R-SQLing.