Why to use the replyr R package
Recently I noticed that the R package sparklyr had the following odd behavior:
suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'
sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))
dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA
This means user code or user analyses that depend on one of dim(), ncol(), or nrow() can break. nrow() used to return something other than NA, so older work may not be reproducible.
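For instance, a simple guard that is fine on a local data.frame stops working once nrow() returns NA, because NA is not a legal condition in an if(). A minimal sketch (hypothetical user code, not from the original analysis):

# hypothetical guard clause that breaks when nrow() returns NA
d_local <- data.frame(x = 1:2)
if (nrow(d_local) > 0) {
  head(d_local)   # runs, since nrow() is 2 on the local copy
}
# with the Spark handle d from above, nrow(d) is NA, so the same guard
# becomes if (NA > 0) { ... } and stops with an error along the lines of
# "missing value where TRUE/FALSE needed"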
In fact, where I actually noticed this was deep in the debugging of a client project (not in a trivial example such as the one above).
Tron: fights for the users.
In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users.
The reasons given for the change appear to be that “tibble::truncate uses nrow()” and that “print.tbl_spark is too slow since dbplyr started using tibble as the default way of printing records”.
A little digging gets us to this:
The above might make sense if tibble and dbplyr were the only users of dim(), ncol(), or nrow().
Frankly, if I call nrow() I expect to learn the number of rows in a table.
The suggestion is for all user code to adapt to use sdf_dim(), sdf_ncol(), and sdf_nrow() (instead of tibble adapting). Even if practical (there are already a lot of existing sparklyr analyses), this prohibits the writing of generic dplyr code that works the same over local data, databases, and Spark (by generic code, we mean code that does not check the data source type and adapt). The situation is possibly even worse for non-sparklyr dbplyr users (i.e., databases such as PostgreSQL), as I don’t see any obvious convenient “no, please really calculate the number of rows for me” option (other than “d %>% tally %>% pull”).
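For completeness, this is roughly what that fallback looks like in practice; a sketch against the two-row example table d from above, assuming the remote backend supports collecting the single count column with pull():

# sketch: force an actual row count on a remote table
d %>% dplyr::tally() %>% dplyr::pull()
#> [1] 2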
I admit, calling nrow() against an arbitrary query can be expensive. However, I am usually calling nrow() on physical tables (not on arbitrary dplyr queries or pipelines). Physical tables often deliberately carry explicit meta-data to make it possible for nrow() to be a cheap operation.
Allowing the user to write reliable generic code that works against many dplyr data sources is the purpose of our replyr package. Being able to use the same code in many places increases the value of the code (without user-facing complexity) and allows one to rehearse procedures in-memory before trying databases or Spark. Below are the functions replyr supplies for examining the size of tables:
library("replyr") packageVersion("replyr") #> [1] '0.5.4' replyr_hasrows(d) #> [1] TRUE replyr_dim(d) #> [1] 2 1 replyr_ncol(d) #> [1] 1 replyr_nrow(d) #> [1] 2 spark_disconnect(sc)
Note: the above only works properly in the development version of replyr, as I only found out about the issue and made the fix recently.
replyr_hasrows() was added as I found that in many projects the primary use of nrow() was to determine whether there was any data in a table. The idea is: user code uses the replyr functions, and the replyr functions deal with the complexities of the different data sources. This also gives us a central place to collect patches and fixes as we run into future problems. replyr accretes functionality as our group runs into different use cases (and we try to put use cases first, prior to other design considerations).
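As an illustration of that idea, generic user code can test replyr_hasrows() instead of nrow() and run unchanged over local frames, databases, and Spark. A small sketch (hypothetical helper; shown on a local data.frame, but the same call works on a remote handle such as d before disconnecting):

# hypothetical generic helper: same code for local, database, and Spark sources
report_size <- function(tbl) {
  if (replyr_hasrows(tbl)) {
    cat("table has", replyr_nrow(tbl), "row(s)\n")
  } else {
    cat("table is empty\n")
  }
}
report_size(data.frame(x = 1:2))
#> table has 2 row(s)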
The point of replyr is to provide re-usable workarounds for design choices made far away from our influence.