Rule Your Data with Tidy Validation Reports. Design
Prologue
Some time ago I had a task to write data validation code. As for most R practitioners, this led to an exploration of existing solutions. I was looking for a package with the following features:
- A relatively small amount of time should be spent learning it before comfortable usage. Preferably, it should be built with the tidyverse in mind.
- It should be quite flexible in terms of the types of validation rules.
- The package should offer functionality for both validations (with a relatively simple output format) and assertions (with relatively flexible behaviour).
- Pipe-friendliness.
- Validating only data frames would be enough.
After devoting a couple of days to research, I didn't find any package fully (subjectively) meeting my needs (for a composed list look here). So I decided to write one myself. More precisely, it turned into not one but two packages: ruler and keyholder, which powers some of ruler's functionality.
This post is a rather long story about the key moments of ruler's design process. To learn other aspects, see its README (for a relatively brief introduction) or vignettes (for a more thorough description of package capabilities).
Overview
In my mind, the whole process of data validation should be performed in the following steps:
- Create conditions (rules) for data to meet.
- Expose data to them and obtain some kind of unified report as a result.
- Act based on the report.
The design process went through a slightly different sequence of definition steps:
- Validation result.
- Rules definition.
- Exposing process.
- Act after exposure.
Of course, there was switching between these items in order to ensure they would work well together, but I feel this order was decisive for the end result.
suppressMessages(library(dplyr))
suppressMessages(library(purrr))
library(ruler)
Validation result
Dplyr data units
I started with an attempt at a simple and clear formulation of validation: it is a process of checking whether something satisfies certain conditions. As it was enough to validate only data frames, that "something" should be thought of as parts of a data frame, which I will call data units. Certain conditions might be represented as functions, which I will call rules; a rule is associated with some data unit and returns TRUE, if the condition is satisfied, and FALSE otherwise.
I decided to make the dplyr package the default tool for creating rules, basically because it satisfies most of the conditions I had in mind. I also tend to use it for interactive validation of data frames, as, I am sure, do many other R users. Its pipe-friendliness brings another important feature: interactive code can be transformed into a function just by replacing the initial data frame variable with a dot (.). This creates a functional sequence, "a function which applies the entire chain of right-hand sides in turn to its input".
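As a small illustration of this transformation (the name validate_nrow_low is made up for the example):

# Interactive check of the number of rows
mtcars %>% summarise(nrow_low = n() >= 15)

# The same check as a reusable functional sequence:
# the initial `mtcars` is replaced with a dot
validate_nrow_low <- . %>% summarise(nrow_low = n() >= 15)
mtcars %>% validate_nrow_low()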
dplyr offers a set of tools for operating with the following data units (see comments):
is_integerish <- function(x) {all(x == as.integer(x))}
z_score <- function(x) {abs(x - mean(x)) / sd(x)}

mtcars_tbl <- mtcars %>% as_tibble()

# Data frame as a whole
validate_data <- . %>%
  summarise(nrow_low = n() >= 15, nrow_up = n() <= 20)
mtcars_tbl %>% validate_data()
## # A tibble: 1 x 2
##   nrow_low nrow_up
##      <lgl>   <lgl>
## 1     TRUE   FALSE

# Group as a whole
validate_groups <- . %>%
  group_by(vs, am) %>%
  summarise(vs_am_low = n() >= 7) %>%
  ungroup()
mtcars_tbl %>% validate_groups()
## # A tibble: 4 x 3
##      vs    am vs_am_low
##   <dbl> <dbl>     <lgl>
## 1     0     0      TRUE
## 2     0     1     FALSE
## 3     1     0      TRUE
## 4     1     1      TRUE

# Column as a whole
validate_columns <- . %>%
  summarise_if(is_integerish, funs(is_enough_sum = sum(.) >= 14))
mtcars_tbl %>% validate_columns()
## # A tibble: 1 x 6
##   cyl_is_enough_sum hp_is_enough_sum vs_is_enough_sum am_is_enough_sum
##               <lgl>            <lgl>            <lgl>            <lgl>
## 1              TRUE             TRUE             TRUE            FALSE
## # ... with 2 more variables: gear_is_enough_sum <lgl>,
## #   carb_is_enough_sum <lgl>

# Row as a whole
validate_rows <- . %>%
  filter(vs == 1) %>%
  transmute(is_enough_sum = rowSums(.) >= 200)
mtcars_tbl %>% validate_rows()
## # A tibble: 14 x 1
##   is_enough_sum
##           <lgl>
## 1          TRUE
## 2          TRUE
## 3          TRUE
## 4          TRUE
## 5          TRUE
## # ... with 9 more rows

# Cell
validate_cells <- . %>%
  transmute_if(is.numeric, funs(is_out = z_score(.) > 1)) %>%
  slice(-(1:5))
mtcars_tbl %>% validate_cells()
## # A tibble: 27 x 11
##   mpg_is_out cyl_is_out disp_is_out hp_is_out drat_is_out wt_is_out
##        <lgl>      <lgl>       <lgl>     <lgl>       <lgl>     <lgl>
## 1      FALSE      FALSE       FALSE     FALSE        TRUE     FALSE
## 2      FALSE       TRUE        TRUE      TRUE       FALSE     FALSE
## 3      FALSE       TRUE       FALSE      TRUE       FALSE     FALSE
## 4      FALSE       TRUE       FALSE     FALSE       FALSE     FALSE
## 5      FALSE      FALSE       FALSE     FALSE       FALSE     FALSE
## # ... with 22 more rows, and 5 more variables: qsec_is_out <lgl>,
## #   vs_is_out <lgl>, am_is_out <lgl>, gear_is_out <lgl>, carb_is_out <lgl>
Tidy data validation report
Having settled on this dplyr-based structure, I noticed the following points.
In order to use dplyr as a tool for creating rules, there should be one extra level of abstraction for the whole functional sequence. It is not a single rule but rather several rules. In other words, it is a function that answers multiple questions about one type of data unit. I decided to call this a rule pack or simply pack.
In order to identify whether some data unit obeys some rule, one needs to describe the data unit, the rule and the result of validation. Descriptions of the last two are simple: for a rule it is a combination of pack and rule names (which should always be defined), and for a validation result it is the value TRUE or FALSE.
The description of a data unit is trickier. After some thought, I decided that the most balanced way to do it is with two variables:
- var (character) which represents the variable name of the data unit:
  - Value ".all" is reserved for "all columns as a whole".
  - A value equal to some column name indicates the column of the data unit.
  - A value not equal to any column name indicates the name of a group: it is created by uniting (with a delimiter) the group levels of the grouping columns.
- id (integer) which represents the row index of the data unit:
  - Value 0 is reserved for "all rows as a whole".
  - A value not equal to 0 indicates the row index of the data unit.
Combinations of these variables describe all mentioned data units:
- var == '.all' and id == 0: Data as a whole.
- var != '.all' and id == 0: Group (var shouldn't be an actual column name) or column (var should be an actual column name) as a whole.
- var == '.all' and id != 0: Row as a whole.
- var != '.all' and id != 0: Described cell.
With this knowledge in mind, I decided that the tidy data validation report should be a tibble with the following columns:
- pack: Pack name.
- rule: Rule name inside the pack.
- var: Variable name of the data unit.
- id: Row index of the data unit.
- value: Whether the described data unit obeys the rule.
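For concreteness, here is a hand-made example of such a report (all values are invented for illustration):

# A hand-made tidy validation report (illustrative values only)
tibble::tribble(
  ~pack,     ~rule,      ~var,   ~id, ~value,
  "my_data", "nrow_low", ".all",  0L, TRUE,  # data as a whole
  "my_col",  "is_pos",   "mpg",   0L, TRUE,  # column 'mpg' as a whole
  "my_cell", "is_out",   "mpg",   3L, FALSE  # cell in column 'mpg', row 3
)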
Exposure
Using only the described report as validation output would be enough if only information about breakers (data units which do not obey their respective rules) were of interest. However, reproducibility is a big deal in the R community, and keeping information about the call can be helpful for future use.
This idea led to the creation of another object in ruler called packs info. It is also a tibble which contains all information about the exposure call:
- name: Name of the rule pack. This column is used to match the column pack in the tidy report.
- type: Name of the pack type. Indicates which data unit the pack checks.
- fun: List of the actually used rule pack functions.
- remove_obeyers: Value of the convenience argument of the future expose function. It indicates whether rows about obeyers (data units that obey a certain rule) were removed from the report after applying the pack.
To fully represent a validation, the two described tibbles should be returned together. So the actual validation result was decided to be an exposure, which is basically an S3 class list with the two tibbles packs_info and report. This data structure is fairly easy to understand and use. For example, exposures can be bound together, which is useful for combining several validation results. Also its elements are regular tibbles which can be filtered, summarised, joined, etc.
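For instance, combining two validation results can be sketched by hand with regular dplyr verbs (exposure_1 and exposure_2 are hypothetical, assumed to be already existing exposures):

# A manual sketch of combining two exposures element-wise
combined_packs_info <- bind_rows(exposure_1$packs_info, exposure_2$packs_info)
combined_report     <- bind_rows(exposure_1$report, exposure_2$report)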
Rules definition
Interpretation of dplyr output
I was willing to use pure dplyr for creating rule packs, i.e. without extra knowledge of the data unit to be validated. However, I found this impossible without running into annoying edge cases. The problem with this approach is that all dplyr outputs are tibbles with similar structures. The only differentiating features are:
- summarise without grouping returns a tibble with one row and user-defined column names.
- summarise with grouping returns a tibble with as many rows as there are summarised groups. Its columns consist of the grouping ones and the user-defined ones.
- transmute returns a tibble with as many rows as the input data frame and user-defined column names.
- Scoped variants of summarise and transmute differ from the regular ones in the mechanism of creating columns: they apply all supplied functions to all chosen columns. The resulting names are "the shortest ... needed to uniquely identify the output". It means that:
  - In case of one function they are column names.
  - In case of more than one function and one column they are function names.
  - In case of more than one column and more than one function they are combinations of column and function names, pasted with the character _ (which, unfortunately, is hardcoded). To force this behaviour in the previous cases, both columns and functions should be named inside the helper functions vars and funs. To match output columns with a combination of validated column and rule, this option is preferred. However, a different separator between column and function names is needed, as the character _ is frequently used in column names. (This naming behaviour is demonstrated right after this list.)
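A quick demonstration of that naming behaviour (relying on the scoped verbs of dplyr 0.7, the version used throughout this post):

# One function over several columns: output names are just
# the column names (vs, am)
mtcars_tbl %>% summarise_at(vars(vs, am), funs(mean))

# Several columns and several functions: names are pasted with '_'
# (vs_mean, am_mean, vs_max, am_max)
mtcars_tbl %>% summarise_at(vars(vs, am), funs(mean, max))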
The first attempt was to use the following algorithm to interpret the output (that is, to identify the validated data unit):

- If there is at least one non-logical column then groups are validated. The reason is that in most cases grouping columns are character or factor ones. This already introduces an edge case with logical grouping columns.
- The combination of whether the number of rows equals 1 (n_rows_one) and whether all column names contain the name separator (all_contain_sep) is used to make the interpretation:
  - If n_rows_one == TRUE and all_contain_sep == FALSE then data is validated.
  - If n_rows_one == TRUE and all_contain_sep == TRUE then columns are validated.
  - If n_rows_one == FALSE and all_contain_sep == FALSE then rows are validated. This introduces an edge case when the output has one row which is intended to be validated: it will be interpreted as 'data as a whole'.
  - If n_rows_one == FALSE and all_contain_sep == TRUE then cells are validated. This also has an edge case when the output has one row in which cells are intended to be validated: it will be interpreted as 'columns as a whole'.
Despite having edge cases, this algorithm is good at guessing the validated data unit, which can be useful for interactive use. Its important prerequisite is to have a simple way of forcing extended naming in scoped dplyr verbs with a custom, rarely used separator.
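For illustration, here is a rough sketch of this guessing logic (guess_data_unit is a made-up function, not ruler's actual implementation):

# Hypothetical sketch of the interpretation algorithm described above
guess_data_unit <- function(output, sep = "._.") {
  # Non-logical columns are assumed to be grouping columns
  if (!all(map_lgl(output, is.logical))) {
    return("group")
  }
  n_rows_one <- nrow(output) == 1
  all_contain_sep <- all(grepl(sep, colnames(output), fixed = TRUE))
  if (n_rows_one && !all_contain_sep) {
    "data"
  } else if (n_rows_one && all_contain_sep) {
    "column"
  } else if (!n_rows_one && !all_contain_sep) {
    "row"
  } else {
    "cell"
  }
}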
Pack creation
Research into a pure dplyr-style way of creating rule packs left no choice but to create a mechanism for supplying information about the data unit of interest along with the pack functions. It consists of the following important principles.
Use ruler's function rules() instead of funs(). Its goals are to force usage of full naming in scoped dplyr verbs as much as possible and to impute missing rule names (as every rule should be named for the validation report). rules() is just a wrapper for funs() but with the extra functionality of naming every output element and adding a prefix to those names (which will be used as a part of the separator between column and rule names). By default the prefix is the string ._. , chosen for its, hopefully, rare usage inside column names and for its symbolism (it is the Morse code of the letter 'R').
funs(mean, sd)
## <fun_calls>
## $ mean: mean(.)
## $ sd  : sd(.)

rules(mean, sd)
## <fun_calls>
## $ ._.rule..1: mean(.)
## $ ._.rule..2: sd(.)

rules(mean, sd, .prefix = "___")
## <fun_calls>
## $ ___rule..1: mean(.)
## $ ___rule..2: sd(.)

rules(fn_1 = mean, fn_2 = sd)
## <fun_calls>
## $ ._.fn_1: mean(.)
## $ ._.fn_2: sd(.)
Note that in case of using only one column in a scoped verb, it should be named within dplyr::vars in order to force full naming.
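For example (a sketch of my understanding of the naming mechanics):

# Naming the single column inside vars() forces the full
# 'column + separator + rule' naming (here vs._.is_enough_sum)
mtcars_tbl %>% summarise_at(vars(vs = vs), rules(is_enough_sum = sum(.) >= 14))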
Use functions supported by keyholder to build rule packs. One of the main features I was going to implement is the possibility of validating only a subset of all possible data units, for example only the last two rows (or columns) of a data frame. There is no problem with columns: they can be specified with summarise_at. However, the default way of specifying rows is by subsetting the data frame, after which all information about the original row positions is lost. To solve this, I needed a mechanism for tracking rows as invisibly for the user as possible. This led to the creation of the keyholder package (which is also on CRAN now). To learn the details, go to its site or read my previous post.
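The gist of that tracking can be sketched as follows (assuming keyholder's use_id() and keys() helpers, which attach and retrieve row keys):

library(keyholder)

mtcars_tbl %>%
  use_id() %>%     # attach original row numbers as keys
  slice(20:25) %>% # keyed versions of verbs update keys along with data
  keys()           # keys still identify original rows 20-25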
Use specific rule pack wrappers for certain data units. Their goal is to create S3 classes for rule packs in order to carry information about the data unit of interest through the exposing process. All of them always return a list with the supplied functions but with a changed class attribute (with additional group_vars and group_sep attributes for group_packs()). Note that packs might be named inside these functions, which is recommended. If not, names will be imputed during the exposing process. Also note that the supplied functions are not checked to be correct in terms of validating the specified data unit. This is done during exposure (the exposing process).
# Data unit. Rule pack is manually named 'my_data'
my_data_packs <- data_packs(my_data = validate_data)
map(my_data_packs, class)
## $my_data
## [1] "data_pack" "rule_pack" "fseq"      "function"

# Group unit. Need to supply grouping variables explicitly
my_group_packs <- group_packs(validate_groups, .group_vars = c("vs", "am"))
map(my_group_packs, class)
## [[1]]
## [1] "group_pack" "rule_pack"  "fseq"       "function"

# Column unit. Need to be rewritten using `rules`
my_col_packs <- col_packs(
  my_col = . %>%
    summarise_if(is_integerish, rules(is_enough_sum = sum(.) >= 14))
)
map(my_col_packs, class)
## $my_col
## [1] "col_pack"  "rule_pack" "fseq"      "function"

# Row unit. One can supply several rule packs
my_row_packs <- row_packs(
  my_row_1 = validate_rows,
  my_row_2 = . %>% transmute(is_vs_one = vs == 1)
)
map(my_row_packs, class)
## $my_row_1
## [1] "row_pack"  "rule_pack" "fseq"      "function"
##
## $my_row_2
## [1] "row_pack"  "rule_pack" "fseq"      "function"

# Cell unit. Also needs to be rewritten using `rules`.
my_cell_packs <- cell_packs(
  my_cell = . %>%
    transmute_if(is.numeric, rules(is_out = z_score(.) > 1)) %>%
    slice(-(1:5))
)
map(my_cell_packs, class)
## $my_cell
## [1] "cell_pack" "rule_pack" "fseq"      "function"
Exposing process
After sorting things out with the formats of validation result and rule packs, it was time to combine them in ruler's main function: expose(). I had the following requirements:
- It should be insertable inside common %>% pipelines as smoothly and flexibly as possible. The two main examples are validating a data frame before performing some operations with it and actually obtaining the results of validation.
- There should be a possibility of applying expose sequentially with different rule packs. In this case the exposure (validation report) from the first call should be updated with the new one. In other words, the result should be as if those rule packs were supplied to expose in one call.
These requirements led to the following main design property of expose: it never modifies the content of the input data frame but possibly creates or updates the attribute exposure with a validation report. To access the validation data there are the wrappers get_exposure(), get_report() and get_packs_info(). The whole exposing process can be described as follows (a small check of the attribute behaviour follows the list):
- Apply all supplied rule packs to a version of the input data frame keyed with keyholder::use_id.
- Impute names of rule packs based on the possibly present exposure (from a previous use of expose) and the validated data units.
- Bind the possibly present exposure with the new ones and create/update the attribute exposure with the result.
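A small check of that design property (assuming get_exposure() simply reads the attribute):

exposed_tbl <- mtcars_tbl %>% expose(my_data_packs)

# The validation result lives in the 'exposure' attribute...
identical(get_exposure(exposed_tbl), attr(exposed_tbl, "exposure"))

# ...while the dimensions (and content) of the data stay untouched
identical(dim(exposed_tbl), dim(mtcars_tbl))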
Also it was decided (for flexibility and convenience) to add the following arguments to expose:
- .rule_sep. A regular expression used to delimit column and function names in the output of scoped dplyr verbs. By default it is the string ._. possibly surrounded by punctuation characters. This is done to account for dplyr's hardcoded use of _ in scoped verbs. Note that .rule_sep should take into account the separator used in rules().
- .remove_obeyers. A logical argument indicating whether to automatically remove elements which obey rules from the tidy validation report. It can be very useful because the usual result of interest is a handful of rule breakers. Without the possibility of setting .remove_obeyers to TRUE (which is the default), the validation report would grow unnecessarily big.
- .guess. By default expose guesses the type of an unsupported rule pack with the algorithm described before. In order to write strict and robust code this can be set to FALSE, in which case an error will be thrown after detecting an unfamiliar pack type.
Some examples:
mtcars_tbl %>%
  expose(my_data_packs, my_col_packs) %>%
  get_exposure()
##   Exposure
##
## Packs info:
## # A tibble: 2 x 4
##      name      type             fun remove_obeyers
##     <chr>     <chr>          <list>          <lgl>
## 1 my_data data_pack <S3: data_pack>           TRUE
## 2  my_col  col_pack  <S3: col_pack>           TRUE
##
## Tidy data validation report:
## # A tibble: 2 x 5
##      pack          rule   var    id value
##     <chr>         <chr> <chr> <int> <lgl>
## 1 my_data       nrow_up  .all     0 FALSE
## 2  my_col is_enough_sum    am     0 FALSE

# Note that `id` starts from 6 as rows 1:5 were removed from validating
mtcars_tbl %>%
  expose(my_cell_packs, .remove_obeyers = FALSE) %>%
  get_exposure()
##   Exposure
##
## Packs info:
## # A tibble: 1 x 4
##      name      type             fun remove_obeyers
##     <chr>     <chr>          <list>          <lgl>
## 1 my_cell cell_pack <S3: cell_pack>          FALSE
##
## Tidy data validation report:
## # A tibble: 297 x 5
##      pack   rule   var    id value
##     <chr>  <chr> <chr> <int> <lgl>
## 1 my_cell is_out   mpg     6 FALSE
## 2 my_cell is_out   mpg     7 FALSE
## 3 my_cell is_out   mpg     8 FALSE
## 4 my_cell is_out   mpg     9 FALSE
## 5 my_cell is_out   mpg    10 FALSE
## # ... with 292 more rows

# Note name imputation and guessing
mtcars_tbl %>%
  expose(my_data_packs, .remove_obeyers = FALSE) %>%
  expose(validate_rows) %>%
  get_exposure()
##   Exposure
##
## Packs info:
## # A tibble: 2 x 4
##          name      type             fun remove_obeyers
##         <chr>     <chr>          <list>          <lgl>
## 1     my_data data_pack <S3: data_pack>          FALSE
## 2 row_pack..1  row_pack  <S3: row_pack>           TRUE
##
## Tidy data validation report:
## # A tibble: 3 x 5
##          pack          rule   var    id value
##         <chr>         <chr> <chr> <int> <lgl>
## 1     my_data      nrow_low  .all     0  TRUE
## 2     my_data       nrow_up  .all     0 FALSE
## 3 row_pack..1 is_enough_sum  .all    19 FALSE
Act after exposure
After creating a data frame with the attribute exposure, it is pretty straightforward to design how to perform any action. This is implemented in the function act_after_exposure with the following arguments:
- .tbl which should be the result of using expose().
- .trigger: a function which takes .tbl as argument and returns TRUE if some action needs to be performed.
- .actor: a function which takes .tbl as argument and performs the action.
Basically act_after_exposure() does the following:
- Check that .tbl has a proper exposure attribute.
- Compute whether to perform the intended action by computing .trigger(.tbl).
- If the trigger results in TRUE then .actor(.tbl) is returned. Otherwise .tbl is returned.
It is a good idea for .actor to be doing one of two things:
- Making side effects. For example throwing an error (if the condition in .trigger is met), printing some information and so on. In this case it should return .tbl to be used properly inside a pipe.
- Changing .tbl based on the exposure information. In this case it should return the imputed version of .tbl.

(A sketch of a custom trigger and actor pair is given at the end of this section.)
As the main use case, ruler has the function assert_any_breaker. It is a wrapper for act_after_exposure with .trigger checking for the presence of any breaker in the exposure and .actor being a notifier about it.
mtcars_tbl %>%
  expose(my_data_packs) %>%
  assert_any_breaker()
##   Breakers report
## Tidy data validation report:
## # A tibble: 1 x 5
##      pack    rule   var    id value
##     <chr>   <chr> <chr> <int> <lgl>
## 1 my_data nrow_up  .all     0 FALSE
## Error: assert_any_breaker: Some breakers found in exposure.
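For custom behaviour one can supply their own pair of functions. Here is a sketch of the side-effect option mentioned earlier (the names any_breaker and print_breakers are made up for this example):

# Trigger: does the tidy report contain any breaker?
any_breaker <- function(.tbl) {
  any(!get_report(.tbl)$value)
}

# Actor: print the breakers as a side effect and return `.tbl`
# so that the pipe can continue
print_breakers <- function(.tbl) {
  print(filter(get_report(.tbl), !value))
  .tbl
}

mtcars_tbl %>%
  expose(my_data_packs) %>%
  act_after_exposure(.trigger = any_breaker, .actor = print_breakers)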
Conclusions
- The design process of a package deserves its own story.
- Package ruler offers tools for dplyr-style exploration and validation of data frame like objects. With its help, validation is done with three commands/steps, each designed for a specific purpose.
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
##
## locale:
##  [1] LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8
##  [5] LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8
##  [7] LC_PAPER=ru_UA.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base
##
## other attached packages:
## [1] bindrcpp_0.2 ruler_0.1.0  purrr_0.2.4  dplyr_0.7.4
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13     knitr_1.17       bindr_0.1        magrittr_1.5
##  [5] tidyselect_0.2.3 R6_2.2.2         rlang_0.1.4      stringr_1.2.0
##  [9] tools_3.4.2      htmltools_0.3.6  yaml_2.1.14      rprojroot_1.2
## [13] digest_0.6.12    assertthat_0.2.0 tibble_1.3.4     bookdown_0.5
## [17] tidyr_0.7.2      glue_1.2.0       evaluate_0.10.1  rmarkdown_1.7
## [21] blogdown_0.2     stringi_1.1.5    compiler_3.4.2   keyholder_0.1.1
## [25] backports_1.1.1  pkgconfig_2.0.1