Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A few weeks ago padr was introduced on CRAN, allowing you to quickly get datetime data ready for analysis. If you missed it, see the introduction blog or vignette("padr") for a general introduction. In v0.2.0 the pad function is extended with a group argument, which makes your life a lot easier when you want to do padding within groups.
In the Examples of padr in v0.1.0 I showed that padding over multiple groups could be done by using padr in conjunction with dplyr and tidyr.
library(dplyr)
library(padr)
# padding a data.frame on group level
day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month')
x_df_grp <- data.frame(grp  = rep(LETTERS[1:3], each = 4),
                       y    = runif(12, 10, 20) %>% round(0),
                       date = sample(day_var, 12, TRUE)) %>%
  arrange(grp, date)
x_df_grp %>% group_by(grp) %>% do(pad(.)) %>% ungroup %>% tidyr::fill(grp)
I quite quickly realized this is an unsatisfactory solution for two reasons:
1) It is a hassle. The goal of padr is to make datetime preparation as swift and pain-free as possible. Having to manually fill your grouping variable(s) after padding does not fit that goal.
2) It does not work when one or both of the start_val and end_val arguments are specified. The start and/or the end of a group's time series is then no longer bounded by an original observation, so those rows have no value for the grouping variable(s). Forward filling with tidyr::fill will then fill the grouping variable(s) incorrectly.
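To make the second point concrete, here is a minimal sketch of the failure. The data frame is built by hand to mimic what padded output looks like when start_val lies before the first observation of each group; the values are made up for illustration and are not actual pad output. Only tidyr::fill is real API here.

```r
library(tidyr)

# Hand-built stand-in for padded output (an assumption for illustration):
# rows 1 and 4 are the rows that a start_val of 2016-01-01 would add in
# front of groups A and B. Padded rows carry NA in the other columns.
padded <- data.frame(
  date = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01",
                   "2016-01-01", "2016-02-01", "2016-03-01")),
  grp  = c(NA, "A", "A", NA, "B", "B"),
  y    = c(NA, 12, 15, NA, 11, 18),
  stringsAsFactors = FALSE
)

# Forward fill, as in the v0.1.0 workaround
filled <- fill(padded, grp)
filled$grp
```

Row 1 stays NA because there is nothing above it to carry forward, and row 4, which belongs to group B, is labelled A.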
It was therefore necessary to expand pad, so that the grouping variable(s) no longer contain missing values after padding. The group argument takes a character vector with the column name(s) of the variables to group by. Padding is done within each of the groups formed by the unique combinations of the grouping variables. With a single grouping variable, these are simply its distinct values. The result of the date padding is exactly the same as before this addition (meaning the datetime variable does not change). However, the returned data frame no longer has missing values for the grouping variables on the padded rows.
The new version of the section in the Examples of padr is:
day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month')
x_df_grp <- data.frame(grp1 = rep(LETTERS[1:3], each = 4),
                       grp2 = letters[1:2],
                       y    = runif(12, 10, 20) %>% round(0),
                       date = sample(day_var, 12, TRUE)) %>%
  arrange(grp1, grp2, date)
# pad by one grouping var
x_df_grp %>% pad(group = 'grp1')
# pad by two groups vars
x_df_grp %>% pad(group = c('grp1', 'grp2'))
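The group argument also covers the start_val case that broke the fill workaround. A small sketch, assuming monthly data and a start_val one month before the earliest observation; the data frame and the chosen start_val are illustrative only:

```r
library(padr)

# Two groups with monthly observations; the values are made up
x <- data.frame(
  grp  = c("A", "A", "B", "B"),
  date = as.Date(c("2016-01-01", "2016-03-01", "2016-02-01", "2016-04-01")),
  y    = c(12, 15, 11, 18),
  stringsAsFactors = FALSE
)

# Pad within each group, starting before any original observation
padded <- pad(x, group = "grp", start_val = as.Date("2015-12-01"))

# The grouping column should have no missing values on the padded rows
anyNA(padded$grp)
```

Because pad knows which group it is padding, there is no need for a fill step afterwards, and the rows added before a group's first observation get the right label.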
Bug fixes
Besides the additional argument there were two bug fixes in this version:
- pad no longer breaks when the datetime variable contains only one value. It returns x with a warning if start_val and end_val are NULL, and pads properly when one or both are specified.
- The fill_ functions now throw a meaningful error when you forget to specify at least one column to apply the function to.
v0.2.1
Right before posting this blog, Doug Friedman found out that in v0.2.0 the by argument no longer functioned. This bug was fixed in the patch release v0.2.1.
I hope you (still) enjoy working with padr. Let me know if you find a bug or have ideas for improvement.