Updated dplyrXdf package brings data munging with pipes to Xdf files
by Hong Ooi, Sr. Data Scientist, Microsoft
I’m pleased to announce the release of version 0.62 of the dplyrXdf package, a backend to dplyr that allows the use of pipeline syntax with Microsoft R Server’s Xdf files. This update adds a new verb (persist), fills some holes in support for dplyr verbs, and fixes various bugs.
The persist verb
A side-effect of dplyrXdf handling file management is that passing the output from one pipeline into subsequent pipelines can have unexpected results. Consider the following example:
# pipeline 1
output1 <- flightsXdf %>%
    mutate(delay=(arr_delay + dep_delay)/2)

# use the output from pipeline 1
output2 <- output1 %>%
    group_by(carrier) %>%
    summarise(delay=mean(delay))

# reuse the output from pipeline 1 -- WRONG
output3 <- output1 %>%
    group_by(dest) %>%
    summarise(delay=mean(delay))
The problem with this code is that the second pipeline will overwrite or delete its input, so the third pipeline will fail. This is consistent with dplyrXdf’s philosophy of only saving the most recent output of a pipeline, where a pipeline is defined as all operations starting from a raw xdf file. However, in this case it isn’t what’s desired.
Similarly, dplyrXdf stores its output files in R’s temporary directory, so when you close your R session, these files will be deleted. This saves you having to manually delete files that are no longer in use, but it means that you must copy the output of your pipeline to a permanent location if you want to keep it around.
The new persist verb is meant to address these issues. It saves a pipeline’s output to a permanent location and also resets the status of the pipeline, so that subsequent operations will know not to overwrite the data.
# pipeline 1 -- use persist to save the data to the working directory
output1 <- flightsXdf %>%
    mutate(delay=(arr_delay + dep_delay)/2) %>%
    persist("output1.xdf")

# use the output from pipeline 1
output2 <- output1 %>%
    group_by(carrier) %>%
    summarise(delay=mean(delay))

# reuse the output from pipeline 1 -- this works as expected
output3 <- output1 %>%
    group_by(dest) %>%
    summarise(delay=mean(delay))
Specify levels in a factorise call
You can now specify the levels for a factor created by factorise, using the standard name=value syntax:
factorise(data, x=c("a", "b", "c"))
This will convert the variable x into a factor with levels a, b and c. Any values that don’t match the given levels will be turned into NAs. If x is already a factor, its levels will be changed to match those specified.
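The level-matching behaviour described above can be illustrated with base R's factor function on an ordinary in-memory vector. This is only an analogy, on the assumption that factorise follows the same conventions. It does not call dplyrXdf, and the sample values are made up for the illustration.

```r
# Base R analogue only: factor() on an in-memory vector, not factorise() on an Xdf file.
x <- c("a", "b", "d", "c", "a")          # hypothetical data; "d" is not a requested level

# Restrict the factor to the levels a, b and c
f <- factor(x, levels = c("a", "b", "c"))
print(levels(f))                         # the levels are exactly those requested
print(which(is.na(f)))                   # position of the value that matched no level

# Re-levelling an existing factor: values outside the new levels also become NA
g  <- factor(c("a", "b", "c", "d"))
g2 <- factor(g, levels = c("a", "b", "c"))
print(g2)
```

The point to take away is that unmatched values are silently converted to NA rather than raising an error, so it is worth checking for unexpected NAs after restricting the levels.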
Support for semi_join and anti_join
The semi_join and anti_join verbs have been implemented. As these types of joins aren’t internally supported by rxMerge, they are done using a combination of other verbs:
# same as semi_join(a, b, by="x")
# select everything in 'a' that matches a value of 'x' in 'b'
semi <- inner_join(a, select(b, x) %>% distinct, by="x")

# same as anti_join(a, b, by="x")
# select everything in 'a' that doesn't match a value of 'x' in 'b'
anti <- left_join(a, transmute(b, x, .ones=rep(1, .rxNumRows)) %>% distinct, by="x") %>%
    filter(is.na(.ones))
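To see why this combination gives semi-join and anti-join results, the same strategy can be sketched on small in-memory data frames using base R's merge. This is an illustration of the idea only, not the dplyrXdf implementation. The data frames a and b and their contents are hypothetical, and merge stands in for inner_join and left_join.

```r
# Illustration with base R data frames; not the dplyrXdf code path.
a <- data.frame(x = c(1, 2, 3, 4), y = c("p", "q", "r", "s"))
b <- data.frame(x = c(2, 2, 4, 5))       # note the duplicated key 2

# Distinct join keys from b, so that duplicates in b cannot duplicate rows of a
keys <- unique(b["x"])

# Semi-join: an inner join against the distinct keys keeps the matching rows of a
semi <- merge(a, keys, by = "x")

# Anti-join: left join against keys carrying a marker column,
# then keep the rows where the marker is missing (no match in b)
keys$.ones <- 1
left <- merge(a, keys, by = "x", all.x = TRUE)
anti <- left[is.na(left$.ones), c("x", "y")]

print(semi)
print(anti)
```

Taking the distinct keys first matters: without it, the duplicated value in b would make the inner join return the matching row of a twice, which a semi-join must not do.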
Support unnamed argument for do and doXdf
You can now use unnamed arguments with do and doXdf, like the native dplyr::do. In both cases, the output has to be coercible to a data frame (again, like dplyr::do).
# example of unnamed argument to do
do_unnamed <- flightsXdf %>%
    group_by(carrier) %>%
    do(data.frame(quantile=sprintf("%d%%", seq(0, 100, by=25)),
                  quant_arr=quantile(.$arr_delay, na.rm=TRUE),
                  quant_dep=quantile(.$dep_delay, na.rm=TRUE)))

# example of unnamed argument to doXdf
do_unnamedXdf <- flightsXdf %>%
    group_by(carrier) %>%
    doXdf(rxSummary(~ arr_delay, .)$sDataFrame)
Miscellaneous bug fixes and improvements
A number of bug fixes have been implemented. In particular, joining tables on factor variables should now work even when the factor levels in the two tables aren’t exactly the same. The mutate_each, summarise_each, count and tally verbs have also been verified to work correctly for Xdf files.
If you encounter any bugs or issues with dplyrXdf, please contact me at [email protected].