Easy data validation with the validate package
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The validate package is our attempt to make checking data against domain knowledge as easy as possible. Here is an example.
library(magrittr)
library(validate)

iris %>% check_that(
  Sepal.Width > 0.5 * Sepal.Length
  , mean(Sepal.Width) > 0
  , if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
) %>% summary()
#   rule items passes fails nNA error warning expression
# 1   V1   150     66    84   0 FALSE   FALSE Sepal.Width > 0.5 * Sepal.Length
# 2   V2     1      1     0   0 FALSE   FALSE mean(Sepal.Width) > 0
# 3   V3   150     84    66   0 FALSE   FALSE !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10
The summary gives an overview of the number of items checked. For an aggregated test, such as the one where we test the mean of a variable, only one item is tested: the whole Sepal.Width column. The other rules are tested on each record in iris. Furthermore, the number of items that pass, fail, or could not be evaluated because of missingness is reported.
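To make the item counts concrete, here is a minimal sketch on a toy data frame. The data frame d and its columns width and length are made up for illustration and are not part of the package. It assumes the default behaviour in which a comparison involving a missing value is counted under nNA.

```r
library(validate)

# Hypothetical toy data: one missing width and one non-positive width
d <- data.frame(
  width  = c(1.2, NA, -0.5),
  length = c(2.0, 3.0, 4.0)
)

# One record-wise rule (three items) and one aggregate rule (one item)
out <- check_that(d, width > 0, mean(length) > 0)
s <- summary(out)

# Look only at the counting columns
s[, c("items", "passes", "fails", "nNA")]
```

Under these assumptions the first rule should report three items with one pass, one fail and one missing, while the aggregate rule reports a single item.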
In validate, data validation rules are considered objects of computation that may be stored, read, manipulated and investigated. The validator object supports such activities, so validation rules can be reused.
v <- validator(
  ratio = Sepal.Width > 0.5 * Sepal.Length
  , mean = mean(Sepal.Width) > 0
  , cnd = if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
)
v
# Object of class 'validator' with 3 elements:
#  ratio: Sepal.Width > 0.5 * Sepal.Length
#  mean : mean(Sepal.Width) > 0
#  cnd  : !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10
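As a rough illustration of what "manipulated and investigated" can look like, the sketch below queries a rule set and takes a subset of it. It assumes a recent version of validate in which validator objects support names(), length(), variables() and subsetting with square brackets. The name v_ratio is chosen here for illustration only.

```r
library(validate)

v <- validator(
  ratio = Sepal.Width > 0.5 * Sepal.Length
  , mean = mean(Sepal.Width) > 0
)

# Investigate the rule set
names(v)      # rule names
length(v)     # number of rules
variables(v)  # variables that occur in the rules

# Select a subset of the rules and reuse it on the data
v_ratio <- v["ratio"]
summary(confront(iris, v_ratio))
```

Keeping rules in an object like this means the same set can be applied to several data sets without retyping the expressions.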
We can confront the iris data set with this validator. The results are stored in a validation object.
cf <- confront(iris, v)
cf
# Object of class 'validation'
# Call:
#     confront(x = iris, dat = v)
#
# Confrontations: 3
# With fails    : 2
# Warnings      : 0
# Errors        : 0

barplot(cf,main="iris")
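Besides printing and plotting, the validation object can be queried programmatically. The following is a small sketch, assuming that summary() returns a data frame with a fails column (as in the output shown earlier) and that values() returns the raw results per rule. The variable names s and failing are illustrative.

```r
library(validate)

v <- validator(
  ratio = Sepal.Width > 0.5 * Sepal.Length
  , mean = mean(Sepal.Width) > 0
)
cf <- confront(iris, v)

# Summary as a data frame: keep only the rules with at least one failure
s <- summary(cf)
failing <- s[s$fails > 0, ]
failing

# Raw results; the shape depends on whether the rules are
# record-wise or aggregate, so inspect it before relying on it
str(values(cf))
```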
These are just the basics of what can be done with this package.
- If this post got you interested, you can go through our introductory vignette
- Some theory on data validation can be found here
- We'd love to hear your suggestions, opinions, and bug reports here
- An introduction on how to retrieve and store rules from text files can be found in a second vignette
- GitHub repo, CRAN page
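As a quick taste of rules kept in text files, here is a hedged sketch that writes two rules to a temporary file and reads them back. It assumes a version of validate in which validator() accepts a .file argument pointing to a file with one rule per line; the argument name may differ in older releases, so check the vignette for your installed version. The file name rules_file is illustrative.

```r
library(validate)

# Write two rules to a temporary text file, one rule per line
rules_file <- tempfile(fileext = ".R")
writeLines(c(
  "Sepal.Width > 0.5 * Sepal.Length",
  "mean(Sepal.Width) > 0"
), rules_file)

# Read the rules back into a validator object (assumes the .file argument)
v <- validator(.file = rules_file)
v

# Use the rules exactly as if they had been typed in
summary(confront(iris, v))
```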