The Zero Bug


I am going to write about an insidious statistical, data analysis, and presentation fallacy I call “the zero bug” and the habits you need to cultivate to avoid it.



Here is the zero bug in a nutshell: common data aggregation tools often cannot “count to zero” from examples, and this causes problems. Please read on for what this means, its consequences, and how to avoid the problem.
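
As a tiny warm-up sketch (the toy data frame below is made up for illustration, and is not part of the earthquake example that follows): counting rows per group with dplyr can only report the groups that actually appear, so a group with no rows never shows up as an explicit zero.

library("dplyr")

toy <- data.frame(group = c('a', 'a', 'b'),
                  stringsAsFactors = FALSE)

toy %>%
  group_by(group) %>%
  summarize(count = n())
# groups 'a' and 'b' are reported (counts 2 and 1);
# a group 'c' that has no rows simply does not appear,
# rather than appearing with count 0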

Our Example

For our example problem please consider aggregating significant earthquake events in the United States of America.

To do this we will start with data from:

The National Geophysical Data Center / World Data Service (NGDC/WDS): Significant Earthquake Database, National Geophysical Data Center, NOAA, doi:10.7289/V5TD9V7K.

The database is described as follows:

The Significant Earthquake Database contains information on destructive earthquakes from 2150 B.C. to the present that meet at least one of the following criteria: Moderate damage (approximately $1 million or more), 10 or more deaths, Magnitude 7.5 or greater, Modified Mercalli Intensity X or greater, or the earthquake generated a tsunami.

I queried the form for “North America and Hawaii”:“USA” in tab delimited form. For simplicity and reproducibility I saved the result at the URL given in the R example below. To start our example we use R to load the data from that URL.

url <- 'http://www.win-vector.com/dfiles/earthquakes.tsv'
d <- read.table(url, 
                header=TRUE, 
                stringsAsFactors = FALSE,
                sep='\t')
View(d)
head(d[,c('I_D','YEAR','INTENSITY','STATE')])
#    I_D YEAR INTENSITY STATE
# 1 6697 1500        NA    HI
# 2 6013 1668         4    MA
# 3 9954 1700        NA    OR
# 4 5828 1755         8    MA
# 5 5926 1788         7    AK
# 6 5927 1788        NA    AK

We see the data is organized such that each row is an event (with an I_D), and contains many informational columns including “YEAR” (the year the event happened) and “STATE” (the state abbreviation where the event happened). Using R tools and packages we can immediately start to summarize and visualize the data.

For example: we can count modern (say 1950 and later) US earthquakes by year.

library("ggplot2")
library("dplyr")

dModern <- d[d$YEAR>=1950, , drop=FALSE]

# histogram
ggplot(data=dModern, mapping=aes(x=YEAR)) +
  geom_histogram(binwidth=1) +
  ggtitle('number of modern USA earthquakes by year')

[Figure: histogram of the number of modern USA earthquakes by year]

Or we can use dplyr to build the count summaries by hand and present the summary as a stem-plot instead of a histogram.

# aggregate the data
byYear <- dModern %>% 
  group_by(YEAR) %>%
  summarize(count = n()) %>%
  arrange(YEAR)

# produce a stem-style plot
ggplot(data=byYear, mapping=aes(x=YEAR, y=count)) +
  geom_point(color='darkgreen') +
  geom_segment(mapping=aes(xend=YEAR, yend=0),
               color='darkgreen') +
  scale_y_continuous(limits=c(0, max(byYear$count))) +
  ggtitle('number of modern USA earthquakes by year')

[Figure: stem-plot of the number of modern USA earthquakes by year]

The problem

We have already snuck in the mistake. The by-hand aggregation “byYear” is subtly wrong. The histogram is almost correct (given its graphical convention of showing counts as bar heights), but the stem-plot reveals problems.

Here is the problem:

summary(byYear$count)

#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    1.00    1.00    2.00    2.50    3.75    7.00 

Notice the above summary implies the minimum number of significant earthquakes seen in the United States in a modern year is 1. Looking at our graphs we can see it should in fact be 0. This wasn’t so bad for graphing, but can be disastrous in calculation or in directing action.
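
To see how this can bite in a calculation, here is a small sketch using the uncorrected byYear summary built above (exact values depend on the downloaded data):

# mean over only the years that appear in byYear (about 2.5, per the summary above)
mean(byYear$count)

# rate over the full declared range of modern years; the zero-count years
# now enter the denominator, so this value is smaller
sum(byYear$count) / length(1950:2016)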

The fix

This is the kind of situation that anti_join is designed to fix (and is how replyr::replyr_coalesce deals with the problem).
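
For instance, here is a sketch of using anti_join to list the keys that silently dropped out of the aggregation (the helper frame allYears covers the explicit 1950:2016 range we use below); joining those keys back in with zero counts is then the fix:

# years in the intended range with no matching row in byYear,
# i.e. the years whose zero counts the aggregation lost
allYears <- data.frame(YEAR = 1950:2016)
anti_join(allYears, byYear, by = 'YEAR')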

A simple ad-hoc fix I recommend is to build a second synthetic summary frame that carries explicit zero counts. We then add these counts into our aggregation and get correct summaries.

For example the following code:

# add in zero summaries
zeroObs <- data.frame(YEAR=1950:2016, count=0)
byYear <- rbind(byYear, zeroObs) %>%
  group_by(YEAR) %>%
  summarize(count = sum(count)) %>%
  arrange(YEAR)

# re-plot
ggplot(data=byYear, mapping=aes(x=YEAR, y=count)) +
  geom_point(color='darkgreen') +
  geom_segment(mapping=aes(xend=YEAR, yend=0),
               color='darkgreen') +
  scale_y_continuous(limits=c(0, max(byYear$count))) +
  ggtitle('number of modern USA earthquakes by year')

gives us the corrected stem plot:

[Figure: corrected stem-plot of the number of modern USA earthquakes by year]

The figure above is the correct stem-plot. Remember: while a histogram denotes counts by the filled heights of bars, a stem-plot denotes counts by the positions of visible points. The point is that in a proper stem-plot zero counts are not invisible (and are in fact distinguishable from missing summaries).

A harder example

This issue may seem trivial, but that is partly because I deliberately picked a simple example where it is obvious there is missing data. This is not always the case. Remember: hidden errors can be worse than visible errors. In fact, even in the original histogram it was not obvious what to think about the missing years 1950 (which apparently had no significant US earthquakes) and 2017 (which is an incomplete year). We really have to explicitly specify the complete universe (also called the range or support set) of keys (as in YEAR=1950:2016), and not use a convenience such as YEAR=min(dModern$YEAR):max(dModern$YEAR).
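
A quick sketch of the difference (the exact years reported depend on the downloaded data):

# the convenience range silently adapts to whatever happens to be in the data
range(dModern$YEAR)

# checking the data against an explicitly declared universe of keys
# reveals which years have no rows at all (implicit zero counts)
setdiff(1950:2016, unique(dModern$YEAR))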

This is much more obvious if we summarize by state instead of year.

byState <- dModern %>% 
  group_by(STATE) %>%
  summarize(count = n()) %>%
  arrange(STATE)

ggplot(data=byState, mapping=aes(x=STATE, y=count)) +
  geom_point(color='darkgreen') +
  geom_segment(mapping=aes(xend=STATE, yend=0),
               color='darkgreen') +
  scale_y_continuous(limits=c(0, max(byState$count))) +
  ggtitle('number of modern USA earthquakes by state')

[Figure: stem-plot of the number of modern USA earthquakes by state]

We do not have a proper aggregation: each state's count should be represented with an explicit zero, not by implicit missingness (which can't differentiate states with zero counts from non-states or misspellings). To produce the proper summary we need a list of US state abbreviations to merge in.

A user of the above graph isn't likely to be too confused. Such a person likely knows there are 50 states and will presume the missing states have zero counts. However downstream software (which can be punishingly literal) won't inject such domain knowledge and will work on the assumption that there are 17 states and that all states have had significant earthquakes in modern times. Or suppose we are aggregating the number of treatments given to a set of patients; in this case the unacceptable confusion of not-present and zero counts can hide huge problems, such as not noticing some patients never received treatment.

R actually has a list of state abbreviations (in “datasets::state.abb”) and we can use replyr::replyr_coalesce to quickly fix our issue:

library('replyr')

support <- data.frame(STATE= c('', datasets::state.abb),
                      stringsAsFactors = FALSE)

byState <- byState %>%
  replyr_coalesce(support,
                  fills= list(count= 0)) %>%
  arrange(STATE)

An important feature of replyr_coalesce is that it checks that the count-carrying rows (the data) are contained in the support rows (the range definition). It would (intentionally) throw an error if we tried to use just datasets::state.abb as the support (or range) definition, since supplying that support signals the analyst did not expect the blank state code that is present in the data set.
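
As a sketch of that check (not output reproduced from this article), the following should fail, because the data carries a blank STATE code that the proposed support does not contain:

# intentionally errors: the blank STATE code in the data
# is not present in the proposed support/range frame
tryCatch(
  byState %>%
    replyr_coalesce(data.frame(STATE = datasets::state.abb,
                               stringsAsFactors = FALSE),
                    fills = list(count = 0)),
  error = function(e) print(e)
)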

“The Dog that Didn’t Bark”

An additional illustrative example is from Sir Arthur Conan Doyle’s story “The Adventure of Silver Blaze”.


In this story Sherlock Holmes deduces that a horse was taken not by a stranger, but by somebody well known at its stable. Holmes explains the key clue here:

   Gregory (Scotland Yard): "Is there any other point to which you 
                             would wish to draw my attention?"
   Holmes: "To the curious incident of the dog in the night-time."
   Gregory: "The dog did nothing in the night-time."
   Holmes: "That was the curious incident."

For this type of reasoning to work, Holmes has to know that there was a dog present, that the dog would have barked if a stranger approached the stables, and that none of his witnesses reported hearing the dog. Holmes needs an affirmative record that no barks were heard: a zero written in a row, not a lack of rows.

Conclusion

When describing “common pitfalls” the reaction is often: “That isn’t really interesting, as I would never make such an error.” I believe “the zero bug” is in fact common, and it goes unnoticed precisely because it tends to hide: the bug is often there and invisible, yet it can produce invalid results.

Any summary that is produced by counting un-weighted rows can never explicitly produce a value of zero (it can only hint at such through missingness). n() can never form zero, so if zero is important it must be explicitly joined in after counting. In a sense organizing rows for counting with n() censors out zeros.
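
One way to join the zeros back in after counting (a sketch of an alternative not used in this article, assuming the explicit 1950:2016 range) is tidyr::complete, which expands the summary to a declared set of keys and fills the newly created rows:

library("tidyr")

# expand the yearly summary to the full declared range of keys,
# filling the newly created rows with explicit zero counts
byYearComplete <- dModern %>%
  group_by(YEAR) %>%
  summarize(count = n()) %>%
  complete(YEAR = 1950:2016, fill = list(count = 0)) %>%
  arrange(YEAR)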

I can’t emphasize enough how important it is to only work with explicit representations (records indicating counts, even if they are zero), and not implicit representations (assuming non-matched keys indicate zero-counts). The advantages of explicit representations are one of the reasons R has a notation for missing values (“NA“) in the first place.

This is, of course, not a problem caused by using dplyr. It is a common problem in designing SQL database queries, where you can think of it as “inner-join collapse” (failing to notice rows that disappear due to join conditions).
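
In join terms (a sketch using the uncorrected byYear summary and the explicit range from above; the names allYears, collapsed, and fixed are illustrative): an inner join silently drops the unmatched keys, while a left join from the range table keeps them as NA rows that can then be turned into explicit zeros.

allYears <- data.frame(YEAR = 1950:2016)

# inner join: years with no earthquakes vanish ("inner-join collapse")
collapsed <- inner_join(allYears, byYear, by = 'YEAR')

# left join from the range table: missing years survive as NA counts,
# which we then convert to explicit zeros
fixed <- left_join(allYears, byYear, by = 'YEAR') %>%
  mutate(count = ifelse(is.na(count), 0, count))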

My advice: you should be suspicious of any use of summarize(VALUE=n()) until you find the matching zero correction (replyr_coalesce, or a comment documenting why no such correction is needed). This looking from n() to a matching range-control should become as habitual as looking from an opening parenthesis to its matching closing parenthesis. Even in data science, you really have to know you have all the little things right before you rely on the large things.
