A simple Big Data analysis using the RevoScaleR package in Revolution R
This post from Stephen Weller is part of a series from members of the Revolution Analytics Engineering team. Learn more about the RevoScaleR package, available free to academics as part of Revolution R Enterprise — ed.
The RevoScaleR package, installed with Revolution R Enterprise, offers parallel external memory algorithms that help R break through memory and performance limitations.
RevoScaleR contains:
- The .xdf data file format, designed for fast processing of blocks of data, and
- A growing number of external memory implementations of the statistical algorithms most commonly used with large data sets
Here is a sample RevoScaleR analysis that uses a subset of the airline on-time data reported each month to the U.S. Department of Transportation (DOT) and Bureau of Transportation Statistics (BTS) by the 16 U.S. air carriers. This data contains three columns: two numeric variables, ArrDelay and CRSDepTime, and a categorical variable, DayOfWeek. It is located in the SampleData folder of the RevoScaleR package, so you can easily run this example in your Revolution R Enterprise session.
First, import the sample airline data from a comma-delimited text file into an .xdf file. During the import, we convert the string variable to a (categorical) factor variable by setting stringsAsFactors = TRUE.
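The import step might look roughly like the sketch below. It assumes the sample file is named AirlineDemoSmall.csv, that missing values in it are coded as "M", and that the output file is called airline.xdf (the name used by the model call later in this post). The file name and the missing-value code are assumptions, so check them against your installation. The guard simply skips the code when RevoScaleR is not available.

```r
# Sketch only: requires Revolution R Enterprise with the RevoScaleR package.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)

  # Folder holding the sample data shipped with RevoScaleR
  sampleDataDir <- rxGetOption("sampleDataDir")

  # Assumed name of the comma-delimited sample file
  inputFile <- file.path(sampleDataDir, "AirlineDemoSmall.csv")

  # Import to .xdf, converting string columns (DayOfWeek) to factors
  rxImport(inData = inputFile,
           outFile = "airline.xdf",
           stringsAsFactors = TRUE,
           missingValueString = "M",  # assumed missing-value code
           overwrite = TRUE)

  # Inspect the variables in the new .xdf file
  print(rxGetInfo("airline.xdf", getVarInfo = TRUE))
}
```

After the import, DayOfWeek should appear as a factor in the variable information, which is what the Weekend transform in the model relies on.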
Next, we estimate a linear model of arrival delay with rxLinMod(). We use the RevoScaleR F() function, which tells rxLinMod() to treat a variable as a factor variable. We also create a new variable, Weekend, on the fly through the transforms argument:
test.linmod.fit <- rxLinMod(ArrDelay ~ F(Weekend) : F(CRSDepTime),
    transforms = list(Weekend = (DayOfWeek == "Saturday") | (DayOfWeek == "Sunday")),
    cube = TRUE, data = "airline.xdf")
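To make the two derived variables concrete, here is a small base R sketch on a made-up four-row data frame (the rows are illustrative, not taken from the airline data). The Weekend expression is the same one passed to transforms. The hourly binning is only an approximation of what F(CRSDepTime) is assumed to do, namely one level per whole hour. That behaviour is an assumption about F(), not something this post states.

```r
# Made-up rows for illustration; CRSDepTime is in decimal hours
flights <- data.frame(
  DayOfWeek  = factor(c("Monday", "Saturday", "Sunday", "Friday")),
  CRSDepTime = c(6.5, 23.25, 0.75, 17.0)
)

# Same logical expression as in the transforms argument
flights$Weekend <- (flights$DayOfWeek == "Saturday") |
                   (flights$DayOfWeek == "Sunday")

# Base R stand-in for F(CRSDepTime): one factor level per whole hour (assumed)
flights$DepHour <- factor(floor(flights$CRSDepTime), levels = 0:23)

print(flights)
```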
The 'test.linmod.fit$countDF' component contains the group means and cell counts. Since the independent variables in our regression were all categorical, the group means are the same as the coefficients. We can do a quick check by taking the sum of the differences, which should be zero up to rounding error:
linModDF <- test.linmod.fit$countDF
sum(linModDF$ArrDelay - coef(test.linmod.fit))
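The same identity can be checked in base R without RevoScaleR. The sketch below uses simulated data (not the airline data) and lm() in place of rxLinMod(). When the only regressor is a fully crossed categorical cell indicator and there is no intercept, each coefficient is the mean of the response in that cell. The variable names here are hypothetical.

```r
set.seed(42)
n <- 1000

# Simulated stand-in for the airline data
demo <- data.frame(
  weekend = sample(c(FALSE, TRUE), n, replace = TRUE),
  hour    = sample(0:3, n, replace = TRUE)
)
demo$delay <- rnorm(n, mean = 10, sd = 30)

# One factor level per (weekend, hour) cell
demo$cell <- interaction(demo$weekend, demo$hour)

# No-intercept model on the cell indicator: one coefficient per cell
fit <- lm(delay ~ cell - 1, data = demo)

# Plain group means, in the same level order as the coefficients
cellMeans <- tapply(demo$delay, demo$cell, mean)

# Sum of differences; zero up to floating-point error
print(sum(as.vector(cellMeans) - unname(coef(fit))))
```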
The output from our linear model estimation includes standard errors of the coefficient estimates. We can use these to create confidence bounds around the estimated coefficients. Let's add them as additional variables in our data frame:
linModDF$coef.std.error <- as.vector(test.linmod.fit$coef.std.error)
linModDF$lowerConfBound <- linModDF$ArrDelay - 2*linModDF$coef.std.error
linModDF$upperConfBound <- linModDF$ArrDelay + 2*linModDF$coef.std.error
A line plot of the estimates with their confidence bounds is informative: it clearly shows that our estimates of arrival delays in the early hours of the morning are not very precise because of the small number of observations.
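The effect of cell size on precision can be reproduced with a small base R sketch. The data are simulated, with deliberately few flights before 6 a.m., and the standard error is computed per cell as sd / sqrt(n) rather than from a pooled model residual, so this illustrates the idea and does not reproduce the rxLinMod() output. The column names mirror those used above.

```r
set.seed(1)

# Simulated flights: 5 per hour before 6 a.m., 500 per hour afterwards
hours  <- 0:23
counts <- ifelse(hours < 6, 5, 500)
flights <- data.frame(hour = rep(hours, times = counts))
flights$delay <- rnorm(nrow(flights), mean = 10, sd = 30)

# Cell means and per-cell standard errors
cellMean <- tapply(flights$delay, flights$hour, mean)
cellSE   <- tapply(flights$delay, flights$hour,
                   function(x) sd(x) / sqrt(length(x)))

# Bounds at plus or minus two standard errors
bounds <- data.frame(
  hour           = hours,
  ArrDelay       = as.vector(cellMean),
  lowerConfBound = as.vector(cellMean - 2 * cellSE),
  upperConfBound = as.vector(cellMean + 2 * cellSE)
)

# Estimate (solid) with lower and upper bounds (dashed)
matplot(bounds$hour,
        as.matrix(bounds[, c("lowerConfBound", "ArrDelay", "upperConfBound")]),
        type = "l", lty = c(2, 1, 2), col = c("blue", "red", "blue"),
        xlab = "Scheduled departure hour",
        ylab = "Simulated arrival delay")
```

The band between the dashed lines is much wider for the sparse early hours than for the rest of the day, which is the pattern described above.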