Improving data quality with deducorrect
[This article was first published on yaRb, and kindly contributed to R-bloggers.]
Does your raw numerical data suffer from typos? Sign errors? Variable swaps? Rounding errors? You may be able to fix all that with the deducorrect package. Today, we (that is, Edwin de Jonge, Sander Scholtus and myself) uploaded the 1.0-0 release to CRAN.
The deducorrect package implements methods to solve common errors in numerical data records. To detect errors, you first have to define the rules that your data must obey. For example, suppose you have a data.frame with three columns: cost, profit, and turnover, subject to the rules that all values must be positive, the balance cost + profit = turnover must hold, and the profit-to-turnover ratio may not exceed 0.6 (some kind of sanity check). The rules can be defined as follows:
E <- editmatrix(c(
    "cost > 0",
    "profit > 0",
    "turnover > 0",
    "cost + profit == turnover",
    "0.6*turnover >= profit"
))
Here, the editmatrix function from the editrules package was used to create an object of class editmatrix, which holds all the information about the restrictions.
Now let’s look at some simple data.
dat <- data.frame(
    cost     = c(-100, 325, 326),
    profit   = c( 150, 457, 475),
    turnover = c( 250, 800, 800)
)
Obviously, every record contains some error. In the first record "cost" is wrongly negative, the second appears to have a typo in the "profit" value and the third record has a rounding error in one of the variables.
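To make "every record contains some error" concrete, here is a minimal base-R sketch that evaluates the five rules against each record. It deliberately does not use editrules or deducorrect; the names `rules` and `violated` are only illustrative, and the rules are written as plain R expressions rather than as an editmatrix.

```r
# Base-R sketch (assumption: no editrules/deducorrect installed).
dat <- data.frame(
  cost     = c(-100, 325, 326),
  profit   = c( 150, 457, 475),
  turnover = c( 250, 800, 800)
)

# The same five rules as in E, written as plain R expressions.
rules <- c(
  "cost > 0",
  "profit > 0",
  "turnover > 0",
  "cost + profit == turnover",
  "0.6*turnover >= profit"
)

# violated[i, j] is TRUE when record i breaks rule j.
violated <- sapply(rules, function(r) !eval(parse(text = r), envir = dat))
rownames(violated) <- seq_len(nrow(dat))
print(violated)
```

In this check every record fails the balance rule, and the first record additionally fails "cost > 0".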
Using functions from the deducorrect package, such errors can be repaired:
> (dat <- correctSigns(E,dat)$corrected)
cost profit turnover
1 100 150 250
2 325 457 800
3 326 475 800
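One way to picture a sign repair is as a small search: flip the sign of one variable at a time and keep a flip only if the record then satisfies the rules. The sketch below is a simplified illustration under that assumption, not the algorithm the package actually uses; `flip_one_sign` is a hypothetical helper, and it hard-codes the balance and positivity rules from this example.

```r
# Hypothetical helper, not part of deducorrect: try flipping the sign of one
# variable at a time and keep the first flip that satisfies the rules.
flip_one_sign <- function(rec) {
  balanced <- function(r) r[["cost"]] + r[["profit"]] == r[["turnover"]]
  if (balanced(rec)) return(rec)          # nothing to repair
  for (v in names(rec)) {
    candidate <- rec
    candidate[[v]] <- -candidate[[v]]     # flip one sign
    if (balanced(candidate) && all(candidate > 0)) return(candidate)
  }
  rec  # no single sign flip helps; leave the record unchanged
}

rec <- c(cost = -100, profit = 150, turnover = 250)
fixed <- flip_one_sign(rec)
print(fixed)
```

For the first record, flipping the sign of cost gives 100 + 150 = 250, so the flip is accepted. A record whose error is not a sign error comes back unchanged.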
The sign error disappeared in the first record. Now let's fix the typo:
> (dat <- correctTypos(E,dat)$corrected)
cost profit turnover
1 100 150 250
2 325 475 800
3 326 475 800
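Why does the second record count as a typo? The balance rule implies a value for profit, and the observed value differs from it only by two swapped digits. The base-R sketch below illustrates that reasoning; `is_adjacent_swap` is a hypothetical helper written for this post's example and only a rough stand-in for whatever distance measure the package applies internally.

```r
# Hypothetical helper: TRUE when `observed` and `implied` have the same number
# of digits and differ only by a swap of two adjacent digits.
is_adjacent_swap <- function(observed, implied) {
  a <- strsplit(as.character(observed), "")[[1]]
  b <- strsplit(as.character(implied), "")[[1]]
  if (length(a) != length(b)) return(FALSE)
  d <- which(a != b)                       # positions where the digits differ
  length(d) == 2 && diff(d) == 1 &&
    a[d[1]] == b[d[2]] && a[d[2]] == b[d[1]]
}

cost <- 325; profit <- 457; turnover <- 800
implied_profit <- turnover - cost          # value the balance rule requires
print(implied_profit)
print(is_adjacent_swap(profit, implied_profit))
```

Here the implied profit is 475 and the observed 457 is an adjacent-digit swap of it, which is why replacing 457 by 475 is a plausible repair. The third record (326 where the balance implies 325) does not fit this pattern, so it is left alone at this step.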