Site icon R-bloggers

Control totals of a data.frame

[This article was first published on triKnowBits, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
When you are conducting a business analysis project with a data extract from the company’s internal system, professional risk management suggests you make sure you are not missing any records or double counting any records. But you certainly don’t want to look at every record. Yikes!

Auditors solve this predicament with control totals. When the sums of key fields and the numbers of records match known values, usually from some well-established “production report,” it can be assumed your data “reconciles.” *

What does it mean to calculate “control totals” of a general data.frame?
For numeric columns it’s obvious to use sum, the only question being what to do with missing values. For non-numeric columns, “sum” is not well-defined. You could say NA, but IMO it would be more informative to show the number of non-NA values.

Hmm, maybe this has already been done before.

A quick internet search for “summarise data.frame”, “total data.frame” and variants led to hits with two shortcomings (see for example Cookbook for R, gdata, and dplyr):
Something more “automatic” would apply to any data.frame without having to know column names or data types ahead of time. Something easy. How hard can that be??

Here is my one-line function called controlTotals:
controlTotals <- function(x, na.rm = FALSE)
  sapply(x, function(x) if (is.numeric(x)) sum(as.numeric(x), na.rm = na.rm) else isum(!is.na(x)))

In typical business use cases, the most important fields to check should not have missing values, so na.rm=FALSE by default. In that case NA’s in a column are detected by
If NA’s in the data are acceptable, then set na.rm=TRUE and the values returned will be
Here is an example.
> extract <- data.frame(a = c(1:3), b = c(1, NA, 3),
                        c = letters[1:3], d = c(“high”, NA, “low”))
> controlTotals(extract)
 a  b  c  d
 6 NA  3  2
> controlTotals(extract, TRUE)
a b c d
6 4 3 2

In the first, default case, column b’s control total is NA and column d’s control total is less than the number of records. Aha! there’s something missing in both columns! Better check that out.


* P&C actuaries note: reconciliation is a professional concern not restricted to Schedule P

sessionInfo()

R version 3.2.4 (2016-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] roxygen2_4.1.1 devtools_1.8.0

loaded via a namespace (and not attached):
[1] magrittr_1.5 rversions_1.0.1 tools_3.2.3 curl_0.9 Rcpp_0.12.3
[6] memoise_0.2.1 xml2_0.1.1 stringi_1.0-1 git2r_0.10.1 stringr_1.0.0
[11] digest_0.6.9

To leave a comment for the author, please follow the link and comment on their blog: triKnowBits.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.