dplyrXdf 0.90 now available
by Hong Ooi, Sr. Data Scientist, Microsoft
Version 0.90 of the dplyrXdf package has just been released. dplyrXdf is a package that brings dplyr pipelines and data transformation verbs to Microsoft R Server’s xdf files. This version includes several changes, mostly to address performance and efficiency concerns, which I’ll detail below.
The .outFile argument
All dplyrXdf verbs now support a special argument .outFile, which determines how the output data is handled. If you don’t specify a value for this argument, the data will be saved to a tbl_xdf which will be managed by dplyrXdf. This supports the default behaviour, whereby data files are automatically created and deleted inside a pipeline. There are two other options for .outFile:
- If you specify .outFile = NULL, the data will be returned in memory as a data frame.
- If .outFile is a character string giving a file name, the data will be saved to an xdf file at that location, and a persistent xdf data source will be returned.
This should improve the efficiency of pipelines with large datasets, by reducing the amount of I/O. Previously, to save the output of a pipeline, you had to call the persist verb at the end:
xdf %>% filter(...) %>% mutate(...) %>% persist("final/output.xdf")
In this example, mutate would save a temporary xdf file in dplyrXdf’s working directory, and persist would then copy that file to the final output location. Now, you can save the output directly to the final location as follows:
xdf %>% filter(...) %>% mutate(..., .outFile="final/output.xdf")
This omits a redundant file save and copy, thus speeding things up.
The persist verb remains available, for situations where you have already run a pipeline and want to save its output after the fact.
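As a rough illustration of the three .outFile modes described above, the following sketch runs the same filter step three ways. It assumes a machine with Microsoft R Server and dplyrXdf installed, and uses the AirlineDemoSmall.xdf sample file that ships with RevoScaleR; the output file name monday.xdf is made up for this example.

```r
library(dplyrXdf)  # needs RevoScaleR, i.e. Microsoft R Server

# Sample xdf file shipped with RevoScaleR
sampDir <- system.file("sampleData", package = "RevoScaleR")
airline <- RxXdfData(file.path(sampDir, "AirlineDemoSmall.xdf"))

# 1. No .outFile: the result is a tbl_xdf whose file is managed by dplyrXdf
tbl <- airline %>% filter(DayOfWeek == "Monday")

# 2. .outFile = NULL: the result comes back in memory as a data frame
df <- airline %>% filter(DayOfWeek == "Monday", .outFile = NULL)

# 3. .outFile = file name: the result is written to that xdf file and a
#    persistent xdf data source is returned (file name is an assumption)
out <- file.path(tempdir(), "monday.xdf")
xdf <- airline %>% filter(DayOfWeek == "Monday", .outFile = out)
```

The second form is only sensible when the filtered result is small enough to fit comfortably in memory.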
Setting the dplyrXdf working directory
By default, dplyrXdf will save the data files it creates into the R working directory. On some systems, this may be located on a drive or filesystem that is relatively small; this is rarely an issue with open source R, but can be problematic when working with large xdf files. You can now change the location of the xdf tbl directory with the setXdfTblDir function:
# set the tbl directory to a network drive (on Windows)
setXdfTblDir("n:/Rtemp")
Similarly, you can view the location of the current xdf tbl directory with getXdfTblDir.
For best performance, you should avoid setting the xdf tbl directory to a remote location/network drive unless you have a fast network connection.
Extraction operators
Sometimes it’s useful to be able to extract variables from an Xdf file. With a data frame, you can do this with the $ and [[ operators: for example iris$Species and iris[["Species"]] both return the Species column (as a vector) from the iris dataset. This update to dplyrXdf implements the same functionality for Xdf files:
sampDir <- system.file("sampleData", package="RevoScaleR")
airline <- RxXdfData(file.path(sampDir, "AirlineDemoSmall.xdf"))
ArrDelay <- airline$ArrDelay
head(ArrDelay)
## [1] 6 -8 -2 1 -2 -14
By default, the entire column is returned, so you should be careful using these operators when you have very large Xdf files.
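One way to picture what these operators do is to look at how $ and [[ methods can be written for any file-backed object. The sketch below is a toy analogue in base R, not dplyrXdf's actual implementation: the csv_source class and its CSV backing file are hypothetical stand-ins for an Xdf data source.

```r
# Hypothetical toy class: an object that stores only the path to a CSV file.
csv_source <- function(path) {
  structure(list(path = path), class = "csv_source")
}

# `[[` method: read the file and return the requested column as a vector.
# .subset2() fetches the stored path without dispatching on our own methods.
`[[.csv_source` <- function(x, name) {
  path <- .subset2(x, "path")
  read.csv(path, stringsAsFactors = FALSE)[[name]]
}

# `$` method: R passes the name as a character string, so delegate to `[[`.
`$.csv_source` <- function(x, name) {
  x[[name]]
}

# Usage: write iris to a temporary CSV and extract one column from the file.
path <- tempfile(fileext = ".csv")
write.csv(iris, path, row.names = FALSE)
src <- csv_source(path)

sepal <- src$Sepal.Length      # same as src[["Sepal.Length"]]
length(sepal)                  # one value per row in the file
```

Because the method returns the full column as an ordinary vector, its memory cost grows with the number of rows in the file, which is the reason for the caution above.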
The subset verb
In dplyr, subsetting data is handled by two verbs: filter for subsetting by rows, and select for subsetting by columns. This is fine for data frames, where everything runs in memory; and for SQL databases, where the hard work is done by the database. For Xdf files, however, this is suboptimal, as each verb translates into a separate I/O step where the data is read from disk, subsetted, then written out again. This can waste a lot of time with large datasets.
You can get around this by using the .rxArgs argument in a verb to pass commands directly to the underlying RevoScaleR functions. For example, filter(xdf, ..., .rxArgs=list(varsToKeep=...)) would subset by rows, and simultaneously use the varsToKeep parameter to tell rxDataStep to subset by columns. But this is inelegant. It would be better if there were a verb that could natively subset in both dimensions, without having to rely on workarounds.
As it turns out, base R has a subset generic which (as the name says) performs subsetting on both rows and columns. You’ve probably used it with data frames:
subset(iris, Species == "setosa", c(Sepal.Length, Sepal.Width))
## Source: local data frame [50 x 2]
##
##    Sepal.Length Sepal.Width
##           (dbl)       (dbl)
## 1           5.1         3.5
## 2           4.9         3.0
## 3           4.7         3.2
## 4           4.6         3.1
## 5           5.0         3.6
## 6           5.4         3.9
## ..          ...         ...
Here, the first argument to subset specifies the rows, and the second argument the columns to return. The subset method for Xdf files works along the same lines:
airSubset <- subset(airline, DayOfWeek == "Monday", c(ArrDelay, CRSDepTime))
head(airSubset)
## Source: local data frame [6 x 2]
##
##   ArrDelay CRSDepTime
##      (int)      (dbl)
## 1        6   9.666666
## 2       -8  19.916666
## 3       -2  13.750000
## 4        1  11.750000
## 5       -2   6.416667
## 6      -14  13.833333
You can also use the same helper functions to choose columns as you would with select:
airSubset2 <- subset(airline, , starts_with("A"))
names(airSubset2)
## [1] "ArrDelay"
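For comparison, the two arguments of base R's subset correspond to the row and column positions of ordinary data frame indexing. The following base R sketch (using the built-in iris data rather than an Xdf file) puts the one-step call next to an explicit two-step version.

```r
# One step: rows and columns chosen in a single call.
oneStep <- subset(iris, Species == "setosa", c(Sepal.Length, Sepal.Width))

# Two steps: first filter the rows, then pick the columns.
rowsOnly <- iris[iris$Species == "setosa", ]
twoStep <- rowsOnly[, c("Sepal.Length", "Sepal.Width")]

# Both approaches give the same 50 x 2 result.
dim(oneStep)
all.equal(oneStep, twoStep)
```

With an in-memory data frame the intermediate rowsOnly object is cheap. With an Xdf file, each step would mean a separate pass over the data on disk, which is the cost the subset verb avoids.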
Other changes
In addition to the above, version 0.90 includes the following changes:
- The persist verb now uses the base R functions file.copy and file.rename to copy/move a file, which should improve performance considerably on large datasets.
- The code for two-table verbs has been extensively rewritten, and should be much more reliable than before.
- The documentation, including the vignettes, has been significantly revised.
- Unit testing infrastructure has been added, utilising the testthat package.
- Several bugs have been fixed, some found with the aid of the aforementioned unit testing.
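The difference between copying and moving a file with the base R functions mentioned in the first item can be sketched as follows. This is only an illustration with a hypothetical helper, saveOutput, and small temporary files standing in for xdf files; it is not the code that persist uses.

```r
# Hypothetical helper: copy or move a file to a destination path.
saveOutput <- function(src, dest, move = FALSE) {
  if (move) {
    # file.rename moves the file; the source no longer exists afterwards.
    ok <- file.rename(src, dest)
  } else {
    # file.copy leaves the source in place and writes a second copy.
    ok <- file.copy(src, dest, overwrite = TRUE)
  }
  if (!ok) stop("could not write ", dest)
  invisible(dest)
}

# Usage with small temporary files standing in for xdf files.
src <- tempfile(fileext = ".xdf")
writeLines("placeholder contents", src)

copied <- saveOutput(src, tempfile(fileext = ".xdf"))
file.exists(src)     # source is still there after a copy

moved <- saveOutput(src, tempfile(fileext = ".xdf"), move = TRUE)
file.exists(src)     # source is gone after a move
```

Note that file.rename can fail when the source and destination are on different filesystems, so a real implementation would likely need to fall back to a copy in that case.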
The latest version of the dplyrXdf package is available on GitHub at the link below.
GitHub: dplyrXdf