Easy data validation with the validate package

[This article was first published on R – Mark van der Loo, and kindly contributed to R-bloggers.]

The validate package is our attempt to make checking data against domain knowledge as easy as possible. Here is an example.

library(magrittr)
library(validate)

iris %>% check_that(
  Sepal.Width > 0.5 * Sepal.Length
  , mean(Sepal.Width) > 0
  , if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
) %>% summary()

#  rule items passes fails nNA error warning                                              expression
# 1   V1   150     66    84   0 FALSE   FALSE                        Sepal.Width > 0.5 * Sepal.Length
# 2   V2     1      1     0   0 FALSE   FALSE                                   mean(Sepal.Width) > 0
# 3   V3   150     84    66   0 FALSE   FALSE !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10

The summary gives an overview of the number of items checked for each rule. For an aggregated test, such as the rule on the mean of a variable, only one item is tested: the whole Sepal.Width column. The other rules are tested on each record in iris. The summary also reports the number of items that pass, fail, or could not be evaluated because of missing values. Note that the conditional rule is reported in its equivalent logical form: if (A) B is shown as !(A) | B.
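To see how these counts arise, here is a minimal sketch on a small made-up data frame. The data frame `d` and its columns `x` and `y` are hypothetical and not part of the original example. The record-wise rule should yield one item per row, with the missing value counted under nNA, while the aggregate rule should yield a single item.

```r
library(validate)

# Hypothetical toy data: three records, one missing value in x
d <- data.frame(x = c(1, -1, NA), y = c(2, 2, 2))

# One record-wise rule and one aggregate rule
out <- check_that(d, x > 0, mean(y) > 1)
s <- summary(out)

# Record-wise rule: 3 items (1 pass, 1 fail, 1 NA); aggregate rule: 1 item
s[, c("items", "passes", "fails", "nNA")]
```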

In validate, data validation rules are considered objects of computation that may be stored, read, manipulated and investigated. The validator object supports such activities so validation rules can be reused.

v <-  validator(
  ratio = Sepal.Width > 0.5 * Sepal.Length
  , mean = mean(Sepal.Width) > 0
  , cnd = if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
  )
v

# Object of class 'validator' with 3 elements:
#  ratio: Sepal.Width > 0.5 * Sepal.Length
#  mean : mean(Sepal.Width) > 0
#  cnd  : !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10
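As a rough sketch of what such reuse could look like, a validator can be inspected and subsetted like other R objects, and rules can be read back from a plain-text file. The temporary file and the names `v_sub`, `rule_file`, `v_file` and `cf_file` are illustrative assumptions; the calls `names()`, `length()`, `[`, `variables()` and the `.file` argument of `validator()` are used here as described in the package documentation.

```r
library(validate)

v <- validator(
  ratio = Sepal.Width > 0.5 * Sepal.Length
  , mean = mean(Sepal.Width) > 0
  , cnd = if (Sepal.Width > 0.5 * Sepal.Length) Sepal.Length > 10
)

names(v)      # rule names
length(v)     # number of rules
variables(v)  # variables the rules refer to

# Keep only the first two rules
v_sub <- v[1:2]

# Store two rules in a text file (hypothetical temporary file) and read them back
rule_file <- tempfile(fileext = ".R")
writeLines(c(
  "Sepal.Width > 0.5 * Sepal.Length",
  "mean(Sepal.Width) > 0"
), rule_file)
v_file <- validator(.file = rule_file)

# Rules read from file can be applied like any other validator
cf_file <- confront(iris, v_file)
```

Keeping rules in a separate file like this is one way to share the same rule set between scripts without copying the expressions around.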

We can confront the iris data set with this validator. The results are stored in a validation object.

cf <- confront(iris, v)
cf

# Object of class 'validation'
# Call:
#     confront(x = iris, dat = v)
#
# Confrontations: 3
# With fails    : 2
# Warnings      : 0
# Errors        : 0

barplot(cf, main = "iris")
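The printed overview only gives totals. A minimal sketch of how the record-level outcomes might be pulled out with `values()` follows. It uses a validator holding only the record-wise ratio rule, so that the result is a single logical matrix with one row per record. The names `v_rec`, `cf_rec`, `ok` and `failing` are illustrative and do not appear in the original example.

```r
library(validate)

v_rec <- validator(ratio = Sepal.Width > 0.5 * Sepal.Length)
cf_rec <- confront(iris, v_rec)

# Logical matrix: one row per record, one column per rule
vals <- values(cf_rec)
ok <- vals[, 1]

sum(ok)    # number of records satisfying the rule
sum(!ok)   # number of records violating the rule

# Records that fail the rule, for closer inspection
failing <- iris[!ok, ]
head(failing)
```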

These are just the basics of what can be done with this package.
