
seplyr 0.5.8 Now Available on CRAN

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

We are pleased to announce that seplyr version 0.5.8 is now available on CRAN.

seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part-time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself.

Tools such as seplyr, wrapr or rlang are needed when you (the data scientist temporarily working on a programming sub-task) do not know the names of the columns you want your code to be working with. These are situations where you expect the column names to be made available later, in additional variables or parameters.

For example: suppose we have the following data where for two rows (identified by the “id” column) we have two measurements each (identified by the column names “measurement1” and “measurement2”).

library("wrapr")

d <- build_frame(
   'id', 'measurement1', 'measurement2' |
   1   , 'a'           , 10             |
   2   , 'b'           , 20             )

print(d)

#   id measurement1 measurement2
# 1  1            a           10
# 2  2            b           20

Further suppose we wished to have each measurement in its own row (which is often required, such as when using the ggplot2 package to produce plots). In this case we need a tool to convert the data format. If we are doing this as part of an ad-hoc analysis (i.e. we can look at the data and find the column names at the time of coding) we can use tidyr to perform the conversion:

library("tidyr")

gather(d,
       key = value_came_from_column,
       value = value_was,
       measurement1, measurement2)

#   id value_came_from_column value_was
# 1  1           measurement1         a
# 2  2           measurement1         b
# 3  1           measurement2        10
# 4  2           measurement2        20

Notice, however, that all column names are specified in gather() without quotes. The names are taken from unexecuted versions of the actual source code of the arguments to gather(). This is somewhat convenient for the analyst (they can skip writing a few quote marks), but it imposes a severe limitation on the script writer or programmer (they have problems taking the names of columns from other sources).
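To make the "unexecuted source code" point concrete, here is a minimal base R sketch of how a function can capture the text of its argument instead of the argument's value. The helper show_arg() is hypothetical and written only for illustration; it is not part of tidyr, and gather() itself uses more elaborate machinery.

```r
# Hypothetical helper (not part of tidyr): returns the source text of its
# argument without ever evaluating that argument.
show_arg <- function(x) {
  deparse(substitute(x))
}

# No variable called measurement1 exists here; nothing is evaluated,
# so no error is raised and the name itself is what gets captured.
captured <- show_arg(measurement1)
print(captured)
# [1] "measurement1"

# When the column name is stored in a variable, the variable's own name
# is captured, not the string it holds.
col_name <- "measurement1"
captured_var <- show_arg(col_name)
print(captured_var)
# [1] "col_name"
```

The second call shows the script writer's problem: passing a variable that holds a column name hands the function the variable's name rather than its contents.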

seplyr now supplies a standard, value-oriented interface for gather(). With seplyr we can write code such as the following:

library("seplyr")

gather_se(d,
  key = "value_came_from_column",
  value = "value_was",
  columns = c("measurement1", "measurement2"))
  
#   id value_came_from_column value_was
# 1  1           measurement1         a
# 2  2           measurement1         b
# 3  1           measurement2        10
# 4  2           measurement2        20

This sort of interface is handy when the names of the columns are coming from elsewhere, in variables. Here is an example of that situation:

# pretend these assignments are done elsewhere
# by somebody else
key_col_name <- "value_came_from_column"
value_col_name <- "value_was"
value_columns <- c("measurement1", "measurement2")

# we can use the above values with
# code such as this
gather_se(d,
          key = key_col_name,
          value = value_col_name,
          columns = value_columns)

#   id value_came_from_column value_was
# 1  1           measurement1         a
# 2  2           measurement1         b
# 3  1           measurement2        10
# 4  2           measurement2        20
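The same idea extends to function parameters. The following is a small sketch, assuming seplyr is installed; the wrapper name measurements_to_rows() and its default argument values are hypothetical and chosen only for this illustration.

```r
library("seplyr")

# Hypothetical wrapper: every column name arrives as an ordinary
# character value, so callers need no special quoting notation.
measurements_to_rows <- function(data,
                                 measurement_columns,
                                 key_name = "value_came_from_column",
                                 value_name = "value_was") {
  gather_se(data,
            key = key_name,
            value = value_name,
            columns = measurement_columns)
}

# The example data again, built with base R to keep the sketch self-contained.
d <- data.frame(id = c(1, 2),
                measurement1 = c("a", "b"),
                measurement2 = c(10, 20),
                stringsAsFactors = FALSE)

long <- measurements_to_rows(d, c("measurement1", "measurement2"))
print(long)
```

Because the arguments are plain values, the wrapper passes them straight through to gather_se() with no conversion step.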

There are ways to use gather() with “to be named later” column names directly, but they are not simple, as they needlessly force the user to master a number of internal implementation details of rlang and dplyr. From the documentation and “help(gather)” we can deduce at least 3 related “pure tidyeval/rlang” solutions for programming over gather():

# possibly the solution hinted at in help(gather)
gather(d,
       key = !!key_col_name,
       value = !!value_col_name,
       dplyr::one_of(value_columns))

# concise rlang solution
gather(d,
       key = !!key_col_name,
       value = !!value_col_name,
       !!!value_columns)

# fully qualified rlang solution
gather(d,
       key = !!rlang::sym(key_col_name),
       value = !!rlang::sym(value_col_name),
       !!!rlang::syms(value_columns))

In all cases the user must prepare and convert values for use. What this really shows is that gather() does not conveniently accept parametric columns (column names supplied by variables or parameters), but will accept a work-around if the user re-codes the column names in some way (some combination of quoting and de-quoting). With “gather_se()” the tool expects to take values, and the user does not have to make special arrangements (or remember special notation) to supply them.
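As a rough illustration of what "expects to take values" means, here is a base R sketch of a function whose column arguments are all ordinary character values. The function gather_by_name() is hypothetical and is not how seplyr is implemented; it only shows that such an interface needs no quoting or de-quoting on the caller's side.

```r
# Hypothetical, simplified stand-in for a value-oriented gather.
# key, value: names (strings) of the two new columns.
# columns:    character vector naming the measurement columns to stack.
gather_by_name <- function(d, key, value, columns) {
  id_columns <- setdiff(colnames(d), columns)
  n <- nrow(d)
  # Repeat the identifying columns once per measurement column.
  out <- d[rep(seq_len(n), times = length(columns)), id_columns, drop = FALSE]
  # Record which column each value came from.
  out[[key]] <- rep(columns, each = n)
  # Stack the measurement values; mixed types are coerced by unlist().
  out[[value]] <- unlist(lapply(columns, function(col) d[[col]]))
  rownames(out) <- NULL
  out
}

d <- data.frame(id = c(1, 2),
                measurement1 = c("a", "b"),
                measurement2 = c(10, 20),
                stringsAsFactors = FALSE)

value_columns <- c("measurement1", "measurement2")
sketch_result <- gather_by_name(d,
                                key = "value_came_from_column",
                                value = "value_was",
                                columns = value_columns)
print(sketch_result)
```

The caller passes value_columns exactly as it is stored, which is the convenience the value-oriented interface is meant to provide.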

Our advice for analysts is:

In addition to wrapping a number of dplyr functions and tidyr::gather()/tidyr::spread(), seplyr 0.5.8 now also wraps tidyr::complete() (thanks to a contribution from Richard Layton).

We hope you try seplyr out both in your work and in your teaching.
