Why to use the replyr R package
Recently I noticed that the R package sparklyr had the following odd behavior:
suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'
sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))
dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA
This means user code or user analyses that depend on one of dim(), ncol(), or nrow() can break. nrow() used to return something other than NA, so older work may not be reproducible.
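For instance, a simple guard that is fine on a local data.frame stops working once nrow() returns NA, because NA is not a legal condition in an if(). A minimal sketch (hypothetical user code, not from the original analysis):

# hypothetical guard clause that breaks when nrow() returns NA
d_local <- data.frame(x = 1:2)
if (nrow(d_local) > 0) {
  head(d_local)   # runs, since nrow() is 2 on the local copy
}
# with the Spark handle d from above, nrow(d) is NA, so the same guard
# becomes if (NA > 0) { ... } and stops with an error along the lines of
# "missing value where TRUE/FALSE needed"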
In fact, where I actually noticed this was deep in the debugging of a client project (not in a trivial example such as the one above).
Tron: fights for the users.
In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users.
The reasons given for the change appear to be that “tibble::truncate uses nrow()” and that “print.tbl_spark is too slow since dbplyr started using tibble as the default way of printing records”.
A little digging gets us to this:
The above might make sense if tibble and dbplyr were the only users of dim(), ncol(), or nrow().
Frankly, if I call nrow() I expect to learn the number of rows in a table.
The suggestion is for all user code to adapt to use sdf_dim(), sdf_ncol(), and sdf_nrow() (instead of tibble adapting). Even if practical (there are already a lot of existing sparklyr analyses), this prohibits the writing of generic dplyr code that works the same over local data, databases, and Spark (by generic code, we mean code that does not check the data source type and adapt). The situation is possibly even worse for non-sparklyr dbplyr users (i.e., databases such as PostgreSQL), as I don’t see any obvious convenient “no, please really calculate the number of rows for me” option (other than “d %>% tally %>% pull”).
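For completeness, this is roughly what that fallback looks like in practice; a sketch against the two-row example table d from above, assuming the remote backend supports collecting the single count column with pull():

# sketch: force an actual row count on a remote table
d %>% dplyr::tally() %>% dplyr::pull()
#> [1] 2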
I admit, calling nrow() against an arbitrary query can be expensive. However, I am usually calling nrow() on physical tables (not on arbitrary dplyr queries or pipelines). Physical tables often deliberately carry explicit meta-data to make it possible for nrow() to be a cheap operation.
Allowing the user to write reliable generic code that works against many dplyr data sources is the purpose of our replyr package. Being able to use the same code in many places increases the value of the code (without user-facing complexity) and allows one to rehearse procedures in-memory before trying databases or Spark. Below are the functions replyr supplies for examining the size of tables:
library("replyr") packageVersion("replyr") #> [1] '0.5.4' replyr_hasrows(d) #> [1] TRUE replyr_dim(d) #> [1] 2 1 replyr_ncol(d) #> [1] 1 replyr_nrow(d) #> [1] 2 spark_disconnect(sc)
Note: the above only works properly in the development version of replyr, as I only found out about the issue and made the fix recently.
replyr_hasrows() was added as I found that in many projects the primary use of nrow() was to determine whether there was any data in a table. The idea is: user code uses the replyr functions, and the replyr functions deal with the complexities of the different data sources. This also gives us a central place to collect patches and fixes as we run into future problems. replyr accretes functionality as our group runs into different use cases (and we try to put use cases first, prior to other design considerations).
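As an illustration of that idea, generic user code can test replyr_hasrows() instead of nrow() and run unchanged over local frames, databases, and Spark. A small sketch (hypothetical helper; shown on a local data.frame, but the same call works on a remote handle such as d before disconnecting):

# hypothetical generic helper: same code for local, database, and Spark sources
report_size <- function(tbl) {
  if (replyr_hasrows(tbl)) {
    cat("table has", replyr_nrow(tbl), "row(s)\n")
  } else {
    cat("table is empty\n")
  }
}
report_size(data.frame(x = 1:2))
#> table has 2 row(s)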
The point of replyr is to provide re-usable workarounds for design choices made far away from our influence.