Control totals of a data.frame
[This article was first published on triKnowBits, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
When you are conducting a business analysis project with a data extract from the company’s internal system, professional risk management suggests you make sure you are not missing any records or double counting any records. But you certainly don’t want to look at every record. Yikes!Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Auditors solve this predicament with control totals. When the sums of key fields and the numbers of records match known values, usually from some well-established “production report,” it can be assumed your data “reconciles.” *
What does it mean to calculate “control totals” of a general data.frame?
For numeric columns it’s obvious to use sum, the only question being what to do with missing values. For non-numeric columns, “sum” is not well-defined. You could say NA, but IMO it would be more informative to show the number of non-NA values.
Hmm, maybe this has already been done before.
A quick internet search for “summarise data.frame”, “total data.frame” and variants led to hits with two shortcomings (see for example Cookbook for R, gdata, and dplyr):
- You can calculate sum, or a version of count, but not both
- You must know the names of the columns in advance.
Something more “automatic” would apply to any data.frame without having to know column names or data types ahead of time. Something easy. How hard can that be??
Here is my one-line function called controlTotals:
controlTotals <- function(x, na.rm = FALSE)
sapply(x, function(x) if (is.numeric(x)) sum(as.numeric(x), na.rm = na.rm) else isum(!is.na(x)))
sapply(x, function(x) if (is.numeric(x)) sum(as.numeric(x), na.rm = na.rm) else isum(!is.na(x)))
In typical business use cases, the most important fields to check should not have missing values, so na.rm=FALSE by default. In that case NA’s in a column are detected by
- numeric: NA returned
- non-numeric: the number of non-NA values will be less than the number of rows.
- numeric: the sum of all non-NA values (or NA if all are NA)
- non-numeric: always the number of non-NA values.
> extract <- data.frame(a = c(1:3), b = c(1, NA, 3),
c = letters[1:3], d = c(“high”, NA, “low”))
> controlTotals(extract)
a b c d
6 NA 3 2
> controlTotals(extract, TRUE)
a b c d
6 4 3 2
c = letters[1:3], d = c(“high”, NA, “low”))
> controlTotals(extract)
a b c d
6 NA 3 2
> controlTotals(extract, TRUE)
a b c d
6 4 3 2
In the first, default case, column b’s control total is NA and column d’s control total is less than the number of records. Aha! there’s something missing in both columns! Better check that out.
- R-technicality: In the definition of controlTotals you may be wondering why as.numeric is called after is.numeric has already been established. That’s because integer’s are numeric, and summing integers can result in an overflow if the field contains large values, as when whole numbers are used for “record ID’s.” R recovers gracefully with only a warning in those situations, but error, recovery and warning are easily avoided with this extra step. BTW, this extra step was found in the code for gdata‘s nobs method (kudos Gregory Warnes et. al.)
* P&C actuaries note: reconciliation is a professional concern not restricted to Schedule P
sessionInfo()
R version 3.2.4 (2016-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] roxygen2_4.1.1 devtools_1.8.0
loaded via a namespace (and not attached):
[1] magrittr_1.5 rversions_1.0.1 tools_3.2.3 curl_0.9 Rcpp_0.12.3
[6] memoise_0.2.1 xml2_0.1.1 stringi_1.0-1 git2r_0.10.1 stringr_1.0.0
[11] digest_0.6.9
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] roxygen2_4.1.1 devtools_1.8.0
loaded via a namespace (and not attached):
[1] magrittr_1.5 rversions_1.0.1 tools_3.2.3 curl_0.9 Rcpp_0.12.3
[6] memoise_0.2.1 xml2_0.1.1 stringi_1.0-1 git2r_0.10.1 stringr_1.0.0
[11] digest_0.6.9
To leave a comment for the author, please follow the link and comment on their blog: triKnowBits.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.