Easy data validation with the validate package

[This article was first published on R – Mark van der Loo, and kindly contributed to R-bloggers.]

The validate package is our attempt to make checking data against domain knowledge as easy as possible. Here is an example.

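library(magrittr)  # provides the %>% pipe
library(validate)

iris %>% check_that(
  Sepal.Width > 0.5 * Sepal.Length,                        # checked per record
  mean(Sepal.Width) > 0,                                   # checked once, on the whole column
  if (Sepal.Width > 0.5 * Sepal.Length) Sepal.Length > 10  # conditional rule, per record
) %>% summary()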

#   rule items passes fails nNA error warning                                              expression
# 1   V1   150     66    84   0 FALSE   FALSE                        Sepal.Width > 0.5 * Sepal.Length
# 2   V2     1      1     0   0 FALSE   FALSE                                   mean(Sepal.Width) > 0
# 3   V3   150     84    66   0 FALSE   FALSE !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10

The summary gives an overview of the number of items checked for each rule. For an aggregated test, such as the one on the mean of a variable, only one item is tested: the whole Sepal.Width column. The other rules are tested on each record in iris. The summary also reports how many items pass, fail, or could not be evaluated because of missing values (nNA).
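To make the items and nNA columns concrete, here is a minimal sketch that blanks out a few values before checking. The copy `dat`, the choice of three missing values and the `na.rm = TRUE` argument are assumptions made for illustration; they are not part of the example above.

```r
library(validate)

# Hypothetical variant of iris with three missing widths (illustration only)
dat <- iris
dat$Sepal.Width[1:3] <- NA

out <- check_that(
  dat,
  Sepal.Width > 0.5 * Sepal.Length,    # record-wise rule: one item per row
  mean(Sepal.Width, na.rm = TRUE) > 0  # aggregate rule: one item in total
)

s <- summary(out)
s[, c("items", "passes", "fails", "nNA")]
```

The record-wise rule should report 150 items, three of them counted under nNA, while the aggregate rule reports a single item.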

In validate, data validation rules are considered objects of computation that may be stored, read, manipulated and investigated. The validator object supports such activities so validation rules can be reused.

v <- validator(
  ratio = Sepal.Width > 0.5 * Sepal.Length,
  mean  = mean(Sepal.Width) > 0,
  cnd   = if (Sepal.Width > 0.5 * Sepal.Length) Sepal.Length > 10
)
v

# Object of class 'validator' with 3 elements:
#  ratio: Sepal.Width > 0.5 * Sepal.Length
#  mean : mean(Sepal.Width) > 0
#  cnd  : !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10
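The printed cnd rule shows that the if statement has been rewritten as a logical implication: "if A then B" holds for a record exactly when A is false or B is true. A small base R sketch of that equivalence follows; the helper names `A`, `B` and `implied` are chosen here for illustration.

```r
# Condition and consequence of the 'cnd' rule, evaluated per record
A <- iris$Sepal.Width > 0.5 * iris$Sepal.Length
B <- iris$Sepal.Length > 10

# "if (A) B" written as a logical implication
implied <- !A | B

# No iris flower has Sepal.Length > 10, so the rule passes only where A is FALSE
table(implied)
all(implied == !A)
```

This is why, in the summary shown earlier, the passes of the conditional rule equal the fails of the first rule.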

We can confront the iris data set with this validator. The results are stored in a validation object.

cf <- confront(iris, v)
cf

# Object of class 'validation'
# Call:
#     confront(x = iris, dat = v)
#
# Confrontations: 3
# With fails    : 2
# Warnings      : 0
# Errors        : 0

barplot(cf, main = "iris")

[Figure: validate-iris]

These are just the basics of what can be done with this package.
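One small step beyond these basics is finding out which records are behind the fails. The sketch below assumes that `values()` returns a logical matrix with one column per rule when all rules are record-wise (it is unwrapped defensively in case a list comes back). The object names `v_rec`, `cf_rec` and `failing` are chosen here for illustration.

```r
library(validate)

# Only record-wise rules, so every rule yields one result per row of iris
v_rec <- validator(
  ratio = Sepal.Width > 0.5 * Sepal.Length,
  cnd   = if (Sepal.Width > 0.5 * Sepal.Length) Sepal.Length > 10
)
cf_rec <- confront(iris, v_rec)

# Per-record results; unwrap if they come back as a list (assumption)
vals <- values(cf_rec)
if (is.list(vals)) vals <- vals[[1]]

# Records that violate the 'ratio' rule
failing <- iris[!vals[, "ratio"], ]
head(failing)
```

The number of rows in `failing` should match the fails reported for the ratio rule by `summary(cf_rec)`.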

  • If this post got you interested, you can go through our introductory vignette
  • Some theory on data validation can be found here
  • We'd love to hear your suggestions, opinions, and bug reports here
  • An introduction on how to retrieve and store rules from text files can be found in a second vignette
  • GitHub repo, CRAN page
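As a brief sketch of the text-file workflow mentioned in the list above, rules can be written to a plain text file, one expression per line, and read back with the `.file` argument of `validator()`. The temporary file and the two rules are made up for illustration; the vignette remains the reference for the supported file formats.

```r
library(validate)

# Write two rules to a temporary file (hypothetical file, illustration only)
rule_file <- tempfile(fileext = ".R")
writeLines(c(
  "Sepal.Width > 0.5 * Sepal.Length",
  "mean(Sepal.Width) > 0"
), rule_file)

# Read the rules back into a validator and apply them to iris
v_file <- validator(.file = rule_file)
summary(confront(iris, v_file))
```

Keeping rules in a file like this lets the same rule set be applied to several data sets without repeating the expressions in code.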
