padr::pad now does group padding

[This article was first published on That’s so Random, and kindly contributed to R-bloggers.]

A few weeks ago padr was introduced on CRAN, allowing you to quickly get datetime data ready for analysis. If you missed it, see the introduction blog or vignette("padr") for a general overview. In v0.2.0 the pad function is extended with a group argument, which makes your life a lot easier when you want to do padding within groups.

In the Examples of padr in v0.1.0 I showed that padding over multiple groups could be done by using padr in conjunction with dplyr and tidyr.

library(dplyr)
library(padr)
# padding a data.frame on group level
day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month')
x_df_grp <- data.frame(grp  = rep(LETTERS[1:3], each = 4),
                       y    = runif(12, 10, 20) %>% round(0),
                       date = sample(day_var, 12, TRUE)) %>%
 arrange(grp, date)

x_df_grp %>% group_by(grp) %>% do(pad(.)) %>% ungroup %>%
  tidyr::fill(grp) 

I quite quickly realized this is an unsatisfactory solution for two reasons:

1) It is a hassle. It is the goal of padr to make datetime preparation as swift and pain-free as possible. Having to manually fill your grouping variable(s) after padding is not exactly in line with that goal.

2) It does not work when one or both of the start_val and end_val arguments are specified. The start and/or the end of a group's time series are then no longer bounded by an original observation, so the padded rows there have no value for the grouping variable(s). Forward filling with tidyr::fill will then fill the grouping variable(s) incorrectly.
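The second problem can be reproduced without padr. The sketch below hand-builds a data frame shaped like the output of per-group padding with an early start_val (the values and the name padded_like are made up for illustration), then forward fills the grouping variable with tidyr::fill.

```r
# Made-up data mimicking per-group padding that starts before the first
# observation of each group: the padded rows have NA for grp and y.
padded_like <- data.frame(
  date = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01",   # group A
                   "2016-01-01", "2016-02-01")),               # group B
  grp  = c(NA, NA, "A", NA, "B"),
  y    = c(NA, NA, 12, NA, 15),
  stringsAsFactors = FALSE
)

# Forward fill, as in the v0.1.0 workaround.
filled <- tidyr::fill(padded_like, grp)

# Rows 1 and 2 stay NA because nothing precedes them, and row 4, which
# belongs to group B, takes the value "A" from the row above it.
print(filled)
```

Filling upward instead would repair this particular case, but it fails in the same way as soon as end_val extends a series past its last observation.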

It was therefore necessary to extend pad so that the grouping variable(s) no longer contain missing values after padding. The group argument takes a character vector with the column name(s) of the variable(s) to group by. Padding is done within each group formed by a unique combination of the grouping variables; with a single grouping variable, the groups are simply its distinct values. The padding of the datetime variable itself is exactly the same as before this addition. However, the returned data frame no longer has missing values for the grouping variables on the padded rows.
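To make the idea concrete, here is a minimal base R sketch of padding within groups. It is not padr's implementation: pad_by_group is a hypothetical helper, it assumes the datetime column is named date, and it only pads between the first and last observation of each group.

```r
# Hypothetical helper, not part of padr: pad every group to a complete
# sequence between its own first and last observation.
pad_by_group <- function(df, group, by = "month") {
  pieces <- lapply(split(df, df[[group]]), function(g) {
    # All dates this group should contain.
    full <- data.frame(date = seq(min(g$date), max(g$date), by = by))
    # The grouping value is known for every row, including padded ones.
    full[[group]] <- g[[group]][1]
    # Left join the observations; padded rows get NA for the other columns.
    merge(full, g, by = c("date", group), all.x = TRUE)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

obs <- data.frame(
  grp  = c("A", "A", "B", "B"),
  date = as.Date(c("2016-01-01", "2016-04-01", "2016-02-01", "2016-03-01")),
  y    = c(10, 13, 20, 21),
  stringsAsFactors = FALSE
)

padded <- pad_by_group(obs, "grp")
print(padded)
```

Because the grouping value is attached when the complete sequence is built, rather than filled in afterwards, no padded row can end up with a missing or wrong group.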

The new version of the section in the Examples of padr is:

day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month')
x_df_grp <- data.frame(grp1 = rep(LETTERS[1:3], each = 4),
                       grp2 = letters[1:2],
                       y    = runif(12, 10, 20) %>% round(0),
                       date = sample(day_var, 12, TRUE)) %>%
 arrange(grp1, grp2, date)

# pad by one grouping var
x_df_grp %>% pad(group = 'grp1')

# pad by two grouping vars
x_df_grp %>% pad(group = c('grp1', 'grp2'))

Bug fixes

Besides the additional argument there were two bug fixes in this version:

v0.2.1

Right before posting this blog, Doug Friedman found out that in v0.2.0 the by argument no longer functioned. This bug was fixed in the patch release v0.2.1.

I hope you (still) enjoy working with padr. Let me know if you find a bug or have ideas for improvement.
