Easy data validation with the validate package
The validate package is our attempt to make checking data against domain knowledge as easy as possible. Here is an example.
    library(magrittr)
    library(validate)

    iris %>% check_that(
        Sepal.Width > 0.5 * Sepal.Length
      , mean(Sepal.Width) > 0
      , if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
    ) %>% summary()
    #   rule items passes fails nNA error warning expression
    # 1   V1   150     66    84   0 FALSE   FALSE Sepal.Width > 0.5 * Sepal.Length
    # 2   V2     1      1     0   0 FALSE   FALSE mean(Sepal.Width) > 0
    # 3   V3   150     84    66   0 FALSE   FALSE !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10
The summary gives an overview of the number of items checked by each rule. For an aggregated test, such as the one on the mean of a variable, only one item is tested: the whole Sepal.Width column. The other rules are tested on each record in iris. The summary also reports how many items pass, how many fail, and how many could not be evaluated because of missing values (nNA). Note that the conditional rule is reported in its equivalent logical form: "if A then B" becomes !(A) | B.
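To make the notion of an "item" concrete, the counts can be recomputed by hand. The following is a minimal sketch in base R only; it does not use validate and is not how the package works internally. The variable names (ratio_ok, mean_ok, cnd_ok) are ours, chosen for illustration.

```r
# Base R only: recompute the checks by hand to see what an "item" is.
data(iris)

# Record-wise rule: one TRUE/FALSE per row, so there are 150 items.
ratio_ok <- iris$Sepal.Width > 0.5 * iris$Sepal.Length
length(ratio_ok)                   # number of items checked
table(ratio_ok, useNA = "ifany")   # passes (TRUE), fails (FALSE), missing (NA)

# Aggregate rule: the whole column collapses to a single TRUE/FALSE item.
mean_ok <- mean(iris$Sepal.Width) > 0
length(mean_ok)

# Conditional rule: "if A then B" is evaluated as "!A | B".
cnd_ok <- !(iris$Sepal.Width > 0.5 * iris$Sepal.Length) | iris$Sepal.Length > 10
sum(cnd_ok)    # passes
sum(!cnd_ok)   # fails
```

The counts printed here should correspond to the passes and fails columns of the summary above.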
In validate, data validation rules are considered objects of computation that may be stored, read, manipulated and investigated. The validator object supports such activities so validation rules can be reused.
    v <- validator(
        ratio = Sepal.Width > 0.5 * Sepal.Length
      , mean  = mean(Sepal.Width) > 0
      , cnd   = if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
    )
    v
    # Object of class 'validator' with 3 elements:
    #  ratio: Sepal.Width > 0.5 * Sepal.Length
    #  mean : mean(Sepal.Width) > 0
    #  cnd  : !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10
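As a rough illustration of treating rules as objects, the sketch below inspects a validator, selects a subset of its rules, and reads rules back from a plain text file. It assumes a reasonably recent version of validate; the temporary file and the names v_rec, rule_file and v_file are hypothetical, and details may differ between package versions.

```r
library(validate)

v <- validator(
    ratio = Sepal.Width > 0.5 * Sepal.Length
  , mean  = mean(Sepal.Width) > 0
  , cnd   = if (Sepal.Width > 0.5 * Sepal.Length) Sepal.Length > 10
)

# Investigate: rule names, number of rules, variables referred to.
names(v)
length(v)
variables(v)

# Manipulate: keep only the two record-wise rules (first and third).
v_rec <- v[c(1, 3)]

# Store and read: write rules to a text file, then build a validator from it.
rule_file <- tempfile(fileext = ".txt")   # hypothetical location
writeLines(c(
    "# rules for the iris data"
  , "Sepal.Width > 0.5 * Sepal.Length"
  , "mean(Sepal.Width) > 0"
), rule_file)
v_file <- validator(.file = rule_file)
length(v_file)
```

Keeping rules in a file separates domain knowledge from analysis scripts, so the same rule set can be applied to several data sets.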
We can confront the iris data set with this validator. The results are stored in a validation object.
    cf <- confront(iris, v)
    cf
    # Object of class 'validation'
    # Call:
    #     confront(x = iris, dat = v)
    #
    # Confrontations: 3
    # With fails    : 2
    # Warnings      : 0
    # Errors        : 0
    barplot(cf,main="iris")
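Beyond printing and plotting, a validation object can be queried for its results. The sketch below assumes a recent version of validate, where summary() returns a data frame with one row per rule and values() returns the raw outcomes. To get a single logical matrix it uses only the two record-wise rules; v_rec, cf_rec and out are names introduced here for illustration.

```r
library(validate)

# Record-wise rules only, so every rule yields one result per record.
v_rec <- validator(
    ratio = Sepal.Width > 0.5 * Sepal.Length
  , cnd   = if (Sepal.Width > 0.5 * Sepal.Length) Sepal.Length > 10
)
cf_rec <- confront(iris, v_rec)

# Per-rule overview, as in the first example.
summary(cf_rec)

# Raw outcomes: one row per record, one column per rule.
out <- values(cf_rec)
dim(out)

# Records that fail the first rule (ratio).
failing <- iris[!out[, 1], ]
head(failing)
```

Extracting the failing records this way is a convenient starting point for follow-up, such as correcting or flagging suspect values.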
These are just the basics of what can be done with this package.
- If this post got you interested, you can go through our introductory vignette
- Some theory on data validation can be found here
- We'd love to hear your suggestions, opinions, and bug reports here
- An introduction on how to retrieve and store rules from text files can be found in a second vignette
- Github repo, CRAN page