Updated dplyrXdf package brings data munging with pipes to Xdf files

Hong Ooi

6 years ago

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]


by Hong Ooi, Sr. Data Scientist, Microsoft

I’m pleased to announce the release of version 0.62 of the dplyrXdf package, a backend to dplyr that allows the use of pipeline syntax with Microsoft R Server’s Xdf files. This update adds a new verb (persist), fills some holes in support for dplyr verbs, and fixes various bugs.

The `persist` verb

A side-effect of dplyrXdf handling file management is that passing the output from one pipeline into subsequent pipelines can have unexpected results. Consider the following example:

# pipeline 1
output1 <- flightsXdf %>%
    mutate(delay=(arr_delay + dep_delay)/2)

# use the output from pipeline 1
output2 <- output1 %>%
    group_by(carrier) %>%
    summarise(delay=mean(delay))

# reuse the output from pipeline 1 -- WRONG
output3 <- output1 %>%
    group_by(dest) %>%
    summarise(delay=mean(delay))

The problem with this code is that the second pipeline will overwrite or delete its input (the file behind output1), so the third pipeline, which reads the same file, will fail. This is consistent with dplyrXdf’s philosophy of saving only the most recent output of a pipeline, where a pipeline is defined as all operations starting from a raw Xdf file. In this case, however, that isn’t what’s desired.

Similarly, dplyrXdf stores its output files in R’s temporary directory, so these files are deleted when you close your R session. This saves you from having to manually delete files that are no longer in use, but it means that you must copy the output of your pipeline to a permanent location if you want to keep it.

The new persist verb is meant to address these issues. It saves a pipeline’s output to a permanent location and also resets the status of the pipeline, so that subsequent operations will know not to overwrite the data.

# pipeline 1 -- use persist to save the data to the working directory
output1 <- flightsXdf %>%
    mutate(delay=(arr_delay + dep_delay)/2) %>%
    persist("output1.xdf")

# use the output from pipeline 1
output2 <- output1 %>%
    group_by(carrier) %>%
    summarise(delay=mean(delay))

# reuse the output from pipeline 1 -- this works as expected
output3 <- output1 %>%
    group_by(dest) %>%
    summarise(delay=mean(delay))
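
For comparison, the manual copy step that persist saves you from can be sketched in base R. This is only an illustration of the temporary-directory behaviour described above, not dplyrXdf code: a CSV file stands in for an Xdf file, and `permanent_dir` and the file names are hypothetical.

```r
# Stand-in for a pipeline output: a file written to R's temporary directory.
# (dplyrXdf would write an Xdf file; a CSV is used here so the sketch runs anywhere.)
tmp_out <- tempfile(pattern = "pipeline_output_", fileext = ".csv")
write.csv(head(mtcars), tmp_out, row.names = FALSE)

# The file lives under tempdir(), which R removes when the session ends.
in_tempdir <- normalizePath(dirname(tmp_out)) == normalizePath(tempdir())

# To keep the data, copy it somewhere permanent before closing R.
# 'permanent_dir' is a hypothetical location; here it is the working directory.
permanent_dir <- getwd()
kept <- file.path(permanent_dir, "output1_copy.csv")
copied <- file.copy(tmp_out, kept, overwrite = TRUE)
```

With persist, the copy and the bookkeeping happen in one step at the end of the pipeline.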

Specify levels in a `factorise` call

You can now specify the levels for a factor created by factorise, using the standard name=value syntax:

factorise(data, x=c("a", "b", "c"))

This will convert the variable x into a factor with levels a, b and c. Any values that don’t match the given levels will be turned into NAs. If x is already a factor, its levels will be changed to match those specified.
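
The level-matching rules described above mirror what base R's `factor` does for in-memory vectors. The sketch below uses `factor` on ordinary vectors as an analogy; it does not call factorise, and the variable names are made up for illustration.

```r
# A character vector containing a value ("d") outside the requested levels
x <- c("a", "b", "d", "c", "a")

# Only "a", "b" and "c" are valid levels; "d" has no match and becomes NA
f <- factor(x, levels = c("a", "b", "c"))

# An existing factor can be re-levelled the same way:
# the levels take the order given, and unmatched values become NA
g <- factor(c("a", "b", "c", "d"))
g2 <- factor(g, levels = c("c", "b", "a"))
```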

Support for `semi_join` and `anti_join`

The semi_join and anti_join verbs have been implemented. As these types of joins aren’t internally supported by rxMerge, they are done using a combination of other verbs:

# same as semi_join(a, b, by="x")
# select everything in 'a' that matches a value of 'x' in 'b'
semi <- inner_join(a,
                   select(b, x) %>% distinct,
                   by="x")

# same as anti_join(a, b, by="x")
# select everything in 'a' that doesn't match a value of 'x' in 'b'
anti <- left_join(a,
                  transmute(b, x, .ones=rep(1, .rxNumRows)) %>% distinct,
                  by="x") %>% filter(is.na(.ones))
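
The same constructions can be checked on small in-memory data frames with plain dplyr. This is a sketch under the assumption that the Xdf versions behave like their data frame counterparts; the tables `a` and `b` are invented for the example, a constant `.ones = 1` replaces the RevoScaleR-specific `.rxNumRows`, and the helper column is dropped at the end so the result lines up with `anti_join`.

```r
library(dplyr)

# Two small example tables sharing the key column 'x'
a <- data.frame(x = c(1, 2, 2, 3, 4),
                y = c("p", "q", "r", "s", "t"),
                stringsAsFactors = FALSE)
b <- data.frame(x = c(2, 2, 4, 5),
                z = c(10, 20, 30, 40))

# Semi join: keep rows of 'a' whose key appears in 'b'.
# distinct() stops duplicate keys in 'b' from duplicating rows of 'a'.
semi_manual <- inner_join(a, b %>% select(x) %>% distinct(), by = "x")
semi_native <- semi_join(a, b, by = "x")

# Anti join: flag every key in 'b', left join, then keep the unflagged rows.
anti_manual <- left_join(a,
                         b %>% transmute(x, .ones = 1) %>% distinct(),
                         by = "x") %>%
    filter(is.na(.ones)) %>%
    select(-.ones)
anti_native <- anti_join(a, b, by = "x")
```

Both manual versions return the same rows of `a` as the native verbs. The `distinct` call matters: without it, the key `x = 2`, which appears twice in `b`, would duplicate the matching rows of `a`.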

Support for unnamed arguments in `do` and `doXdf`

You can now pass an unnamed argument to do and doXdf, as with the native dplyr::do. In both cases, the output must be coercible to a data frame (again, as with dplyr::do).

# example of unnamed argument to do
do_unnamed <- flightsXdf %>%
    group_by(carrier) %>%
    do(data.frame(quantile=sprintf("%d%%", seq(0, 100, by=25)),
                  quant_arr=quantile(.$arr_delay, na.rm=TRUE),
                  quant_dep=quantile(.$dep_delay, na.rm=TRUE)))

# example of unnamed argument to doXdf
do_unnamedXdf <- flightsXdf %>%
    group_by(carrier) %>%
    doXdf(rxSummary(~ arr_delay, .)$sDataFrame)
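
To make the distinction concrete, here is how named and unnamed arguments differ in the native dplyr::do on an in-memory data frame. The sketch uses the built-in `mtcars` data rather than an Xdf file, and assumes the dplyrXdf versions follow the same convention.

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

# Unnamed argument: the expression must return a data frame for each group;
# the per-group results are row-bound, with the grouping column prepended.
unnamed <- by_cyl %>%
    do(data.frame(quantile = sprintf("%d%%", seq(0, 100, by = 25)),
                  mpg = quantile(.$mpg, na.rm = TRUE)))

# Named argument: each group's result can be any object,
# and is stored in a list column with the given name.
named <- by_cyl %>%
    do(fit = lm(mpg ~ wt, data = .))
```

The unnamed form yields one row per quantile per group, while the named form yields one row per group with the fitted models held in the `fit` list column.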

Miscellaneous bug fixes and improvements

A number of bug fixes have been implemented. In particular, joining tables on factor variables should now work even when the factor levels in the two tables aren’t exactly the same. The mutate_each, summarise_each, count and tally verbs have also been verified to work correctly for Xdf files.

If you encounter any bugs or issues with dplyrXdf, please contact me at hongooi@microsoft.com.
