Peculiar behaviour of the sum function
[This article was first published on StaTEAstics., and kindly contributed to R-bloggers].
The sum function in R differs from other summary statistics functions such as mean and median. The first difference is that sum is a primitive function, whereas the others are not (although you can call mean through .Internal). This leads to several inconsistencies and unexpected behaviours.
(1) Inconsistency in arguments
For example, the arguments are inconsistent. Both mean and median take an argument named x, whereas sum has the signature sum(..., na.rm = FALSE) and operates on every argument that is not matched to na.rm. This becomes a problem when you want to write a function that switches between summary functions, such as:
do.call(myFUN, list(x = x))
Here myFUN can be any statistical summary function. I first ran into the problem when writing a function that wraps several summary statistics so that I could switch between them as required. The real difficulty appears when additional arguments have to be passed, such as the weights w in weighted.mean. I wrote the following call and naively hoped it would work:
do.call(myFUN, list(x = x, w = w))
It turns out that this line works fine for every summary statistic except sum, where the weights are summed together with the data. My current solution is simply to use the switch function, which is not my favourite.
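The behaviour described above can be reproduced with a small sketch. The vectors x and w are made-up data, and sum_x is a hypothetical wrapper (not from the original post) that is one possible alternative to switch: it gives sum a named x argument and silently discards everything else.

```r
x <- c(1, 2, 3)
w <- c(0.5, 0.25, 0.25)

# weighted.mean matches both x and w by name
do.call(weighted.mean, list(x = x, w = w))  # 1.75

# sum has the signature sum(..., na.rm = FALSE), so x and w both land in `...`
do.call(sum, list(x = x, w = w))            # 7, i.e. sum(x) + sum(w)

# Hypothetical wrapper: names the data argument and ignores extra arguments
sum_x <- function(x, ..., na.rm = FALSE) sum(x, na.rm = na.rm)
do.call(sum_x, list(x = x, w = w))          # 6
```

Whether silently dropping w is acceptable depends on the use case; a stricter wrapper could raise an error instead.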
(2) Inconsistency in output
Another inconsistency arises in how NAs are treated. For the mean, median and weighted.mean summaries, if all the observations are NA and na.rm = TRUE, then either NA or NaN is returned.
mean(rep(NA, 10), na.rm = TRUE)
median(rep(NA, 10), na.rm = TRUE)
The sum function, by contrast, returns zero. It puzzles me how you get zero when NA stands for “not available”; this is like creating something out of nothing. It is a problem for me because, when summing multiple time series with missing values, I want the function to remove NAs and compute the sum wherever there is partial data, but to return NA rather than zero when there is no data at all.
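A short sketch of where the zero comes from; all_missing and partial are made-up names for illustration. Once na.rm = TRUE has removed every NA, nothing is left, and R defines the sum of an empty vector as zero.

```r
all_missing <- rep(NA, 10)
partial     <- c(1, NA, 3)

sum(partial, na.rm = TRUE)      # 4: the partial data are summed as wanted
sum(all_missing, na.rm = TRUE)  # 0: every value was dropped before summing

# The same result as summing an empty vector directly
sum(numeric(0))                 # 0
```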
Nevertheless, a simple solution exists, thanks to the active R community. This post on R-help addresses the problem and solves it in an elegant manner:
sum(x, na.rm = any(!is.na(x)))
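As an illustration of the time series use case, the one-liner can be wrapped in a helper and applied row by row. This is only a sketch: sum_or_na is a hypothetical name, and series is made-up data with one series per column.

```r
# Hypothetical helper wrapping the one-liner above
sum_or_na <- function(x) sum(x, na.rm = any(!is.na(x)))

# Three made-up series as columns; the third time point has no data at all
series <- cbind(a = c(1, NA, NA),
                b = c(2, 3, NA),
                c = c(NA, 4, NA))

apply(series, 1, sum_or_na)     # 3 7 NA

# For comparison, rowSums with na.rm = TRUE reports 0 for the empty row
rowSums(series, na.rm = TRUE)   # 3 7 0
```

The trick works because na.rm is only set to TRUE when at least one value is present; when everything is missing, sum runs with na.rm = FALSE and returns NA.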
“The computations and the software for data analysis should be trustworthy” – John Chambers, Software for Data Analysis
I am not sure about the reasoning underlying the behaviour of sum, but it should be consistent so that people can trust it and use it as they expect.