Site icon R-bloggers

COVID-19 cumulative observed case fatality rate over time by @ellis2013nz

[This article was first published on free range statistics - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Preamble

I was slightly reluctant to add to the deluge of charts about the COVID-19 outbreak, but on the other hand making charts is one of the ways I relax and try to understand what’s going on around me. So first, to get out of the way my only advice at this point:

For a educational tool on why social distancing works, I particularly recommend this beautifully put together story and simulations from The Washington Post.

Evolving understanding over time of case fatality rate

I have a professional interest in uncertainty and in how we work with partial information and the drip feed of new information. So I was interested to create this chart, showing the case fatality rate (deaths from this disease, divided by all people diagnosed with this disease, in a given period of time) of COVID-19 over time.

What I’m showing here is the cumulative case fatality rate, based on all observations up to a given point in time. Eventually there will be a single number for the case fatality rate of the COVID-19 pandemic of 2020 to 2021, but not yet; the future is uncertain.

An edifying point here for statisticians to note is that if we treated deaths from diagnosed cases as a Bernoulli variable drawn from a population of the ‘true’ case fatality rate and naively estimated sampling error, for most of the chart above it would be negligible. Yet clearly there is a lot of uncertainty about where the rate will end up. There are many sources of uncertainty other than sampling error. Bear this in mind when next you consider an opinion poll.

Obviously a key driver of the change over time is the move of the disease into different populations, and particularly demographically older countries. Another key driver is the degree of testing, which provides the denominator of the case fatality rate. More testing means more known cases, driving the rate down. We see the impact of both these drivers when we decompose the case fatality rate by country:

The USA (grey line in the chart above, label somewhat hidden by France’s) provides a stark example of that second driver – testing was very slow to get off the ground, and as the extent of infections in that country is revealed by more testing the apparent case fatality rate is headed downwards. Italy – one of the oldest countries in the world – provides the exemplar of the first driver although this reveals itself in the high position on the latest data (7.5%!) rather than the trend in Italy’s time series. The increase over time in fatality rate there might be a result of the increasing strain on the health system, or simply an early-outbreak phenomenon from a lag in infections leading to deaths.

Here’s the code for those two simple plots. All the downstream data wrangling is done for me by Johns Hopkins and by Rami Krispin (who took the Johns Hopkins data and tidied it into an R package). All the upstream data collection, wrangling and reporting is done by the governments of various countries under great strain, mostly reporting to the WHO.

#--------------- Setup--------------------
devtools::install_github("RamiKrispin/coronavirus")
library(coronavirus)
library(tidyverse)
library(scales)

the_caption = "Source: WHO and many others via Johns Hopkins University and Rami Krispin's coronavirus R package.\nAnalysis by http://freerangestats.info"

top_countries <- coronavirus %>%
  filter(type == "confirmed") %>%
  group_by(Country.Region) %>%
  summarise(cases = sum(cases)) %>%
  top_n(8, wt = cases)

#---------------------------global total-------------------

first_non_china_d <- coronavirus %>%
  filter(Country.Region != "China" & type == "death" & cases > 0) %>%
  arrange(date) %>%
  slice(1) %>%
  pull(date)

first_italy_d <- coronavirus %>%
  filter(Country.Region == "Italy" & type == "death" & cases > 0) %>%
  arrange(date) %>%
  slice(1) %>%
  pull(date)


d1 <- coronavirus %>%
  group_by(date, type) %>%
  summarise(cases = sum(cases)) %>%
  arrange(date) %>%
  spread(type, cases) %>%
  ungroup() %>%
  mutate(cfr_today = death / confirmed,
         cfr_cumulative = cumsum(death) / cumsum(confirmed))

d1b <- d1 %>%
  filter(date %in% c(first_italy_d, first_non_china_d))
ac <- "steelblue"

d1c <- d1 %>%
  mutate(cc = cumsum(confirmed)) %>% 
  summarise(`10000` = min(date[cc > 10000]),
         `100000` = min(date[cc > 100000])) %>%
  gather(variable, date) %>%
  left_join(d1, by = "date") %>%
  mutate(label = paste0(format(as.numeric(variable), big.mark = ",", scientific = FALSE), "\ncases"))

d1 %>%
  ggplot(aes(x = date, y = cfr_cumulative)) +
  geom_line() +
  scale_y_continuous(label = percent_format(accuracy = 0.1)) +
  expand_limits(y = 0) +
  geom_point(data = d1b, colour = ac, shape = 1, size = 2) +
  annotate("text", x = first_italy_d, 
           y = filter(d1, date == first_italy_d)$cfr_cumulative - 0.001, 
           label = "First death in Italy",
           hjust = 0, size = 3, colour = ac) +
  annotate("text", x = first_non_china_d, 
           y = filter(d1, date == first_non_china_d)$cfr_cumulative + 0.001, 
           label = "First death outside China",
           hjust = 0, size = 3, colour = ac) +
  geom_text(data = d1c, aes(label = label), 
            size = 3, colour = "grey70", 
            hjust = 0.5, lineheight = 0.9, nudge_y = -0.002) +
  labs(caption = the_caption,
       x = "",
       y = "Observed case fatality rate",
       title = "Steadily increasing case fatality rate of COVID-19 in early 2020",
       subtitle = "Increase probably reflects move of the disease into older populations.
Note that actual case fatality is likely to be much lower due to undiagnosed surviving cases.")

#-----------------Country-specific totals------------------------

d2 <- coronavirus %>%
  group_by(date, Country.Region, type) %>%
  summarise(cases = sum(cases)) %>%
  group_by(date, Country.Region) %>%
  spread(type, cases) %>%
  arrange(date) %>%
  group_by(Country.Region) %>%
  mutate(cfr_cumulative = cumsum(death) / cumsum(confirmed)) %>%
  filter(!is.na(cfr_cumulative)) %>%
  ungroup() %>%
  inner_join(top_countries, by = "Country.Region") 


d2 %>%
  ggplot(aes(x = date, y = cfr_cumulative, colour = Country.Region)) +
  geom_line() +
  geom_text(data = filter(d2, date == max(date)), aes(label = Country.Region), 
            hjust = 0, check_overlap = FALSE, size = 3) +
  scale_y_continuous(label = percent_format(accuracy = 1), limits = c(0, 0.2)) +
  scale_colour_brewer(palette = "Set2") +
  expand_limits(x = max(d2$date) + 4) +
  labs(caption = the_caption,
       x = "",
       y = "Observed case fatality rate",
       title = "Country-specific case fatality rate of COVID-19 in early 2020",
       subtitle = "Eight countries with most diagnosed cases; Iran's early values truncated.\nA high level of uncertainty reflecting rapidly changing denominators as well as many unresolved cases.") +
theme(legend.position = "none")

Take care out there.

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.