Get Your Epidemiology from Epidemiologists
The COVID-19 pandemic continues to rage. I’m strongly committed to what should be the uncontroversial view that we should listen to the recommendations of those institutions and individuals with strong expertise in the relevant fields of Public Health, Epidemiology, Disease Control, and Infection Modeling. I also think that the open availability of data, and the free availability of methods to look at data, is generally a good thing. The tricky part is when these potentially conflict. For example, in a period of crisis it is reasonable to want to find out what’s happening and to inform yourself as much as possible about how events are unfolding. People who work with data of some sort will naturally want to look at the available trends themselves. But maybe those same people don’t know a great deal about how disease works, or how information about it is collected and processed, or what is likely to happen in a situation like the one we’re experiencing. At such times, there’s a balance to be struck between using the available tools to come to an informed opinion and recklessly mucking about with data when you don’t really know what you’re doing. This is especially important when, as is the case now, the Executive response to the crisis in the United States (and in several other countries) has been criminally irresponsible, to the point where even elementary facts about the spread of the disease over the past few months are being distorted.
Speaking for myself, I definitely want to look at what the trends are, and I prefer to do so by working directly with the data that official agencies and reliable reporting produce. So in this post I’ll show how I’m doing that. But I definitely don’t want to publicly mess around beyond this. While I might idly fit some models or play with various extrapolations of the data, I’m very conscious that I am not in a position to do this in a professional capacity. So that part I will firmly set aside. There are already many well-qualified people working publicly to actually analyze and model the data, as opposed to looking descriptively at what is happening.
I’m going to show you how to get the data and draw the cumulative mortality graph that appears at the end of this post.
Looking at COVID-19 Data from the European Centre for Disease Prevention and Control
Each day, the ECDC publishes a summary spreadsheet of global case and death counts since the beginning of the epidemic. This is good data collated by an EU-wide agency, and it’s what I’ve been using to keep up with the trends. There are other reliable sources, too, most notably the Johns Hopkins Coronavirus Dashboard. Here’s what I’ve been doing to get it into R. Again, my principal reason for sharing this code is not to add much of anything on the public side. It’s much more of a pedagogical exercise: if you want to look at this data, here’s one way to do that. Along the way I’ll talk about a few of the things needed to work with the data in a reasonably clean way. Then I’ll end up drawing the plot that everyone draws: cumulative trends in deaths by country, counted in days since a threshold number of fatalities.
Preparation
First we load some libraries to help us out.
```r
library(tidyverse)
library(lubridate)
library(here)
library(janitor)
library(socviz)
library(ggrepel)
library(paletteer)
```
Next, we set things up by writing some functions that will help us grab and clean the data. In reality, of course, these functions got written piecemeal and were then cleaned up and moved to the front of the file. I didn’t sit down and write them off the top of my head.
The first one is going to grab the spreadsheet from the ECDC and both save the .xlsx file to our data/ folder and create a tibble of the results.
```r
## Download today's excel file, saving it to data/ and reading it in
get_ecdc_data <- function(url = "https://www.ecdc.europa.eu/sites/default/files/documents/",
                          fname = "COVID-19-geographic-distribution-worldwide-",
                          date = lubridate::today(),
                          ext = "xlsx",
                          dest = "data") {

  target <- paste0(url, fname, date, ".", ext)
  message("target: ", target)

  destination <- fs::path(here::here("data"), paste0(fname, date), ext = ext)
  message("saving to: ", destination)

  tf <- tempfile(fileext = ext)
  curl::curl_download(target, tf)
  fs::file_copy(tf, destination)

  switch(ext,
         xls = janitor::clean_names(readxl::read_xls(tf)),
         xlsx = janitor::clean_names(readxl::read_xlsx(tf))
  )
}
```
Things to notice: we have to use curl_download() to get the file, because read_xls() cannot directly grab an Excel file from a URL in the way that, say, read_csv() can for a .csv file. So we create a temporary file handle and use curl to download the data file to it. Then we copy the file to its permanent home in our data/ folder, and we read the target file into R with the appropriate readxl function.
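One practical wrinkle worth flagging (my addition, not from the original post): the ECDC posts each day’s file at some point during the day, so asking for today’s date can fail with an HTTP error early in the morning. Here is a minimal sketch of a fallback, assuming you are happy taking yesterday’s file instead; the wrapper name is mine.

```r
## Sketch: fall back to yesterday's file if today's isn't posted yet.
## curl::curl_download() throws an error on a 404, which tryCatch() catches.
get_ecdc_latest <- function(...) {
  tryCatch(
    get_ecdc_data(date = lubridate::today(), ...),
    error = function(e) {
      message("Today's file not available yet; trying yesterday's.")
      get_ecdc_data(date = lubridate::today() - 1, ...)
    }
  )
}
```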
As we’ll see in a moment, the country codes contained in the ECDC data are not quite standard. It will be useful in the long run to make sure that every country has standardized two- and three-letter abbreviations, but some of the countries in the ECDC’s geo_id variable are missing these. This is a very common situation in data cleaning: we have a big table with some data we know is missing (e.g., a country code), we know for sure which cases the data are missing for, and we have a little lookup table that can fill in the blanks. The operation we need to perform here is called a coalescing join. Before I knew that’s what it was called, I used to do this manually (I’ll show you below). But a little googling eventually revealed both the proper name for this operation and a very useful function, written by Edward Visel, that does exactly what I want:
```r
coalesce_join <- function(x, y,
                          by = NULL, suffix = c(".x", ".y"),
                          join = dplyr::full_join, ...) {
  joined <- join(x, y, by = by, suffix = suffix, ...)

  # names of desired output
  cols <- union(names(x), names(y))

  to_coalesce <- names(joined)[!names(joined) %in% cols]
  suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]

  # remove suffixes and deduplicate
  to_coalesce <- unique(substr(
    to_coalesce,
    1,
    nchar(to_coalesce) - nchar(suffix_used)
  ))

  coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
    joined[[paste0(.x, suffix[1])]],
    joined[[paste0(.x, suffix[2])]]
  ))
  names(coalesced) <- to_coalesce

  dplyr::bind_cols(joined, coalesced)[cols]
}
```
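To see what the function does, here is a toy example of my own (not from the post). The code column of df has gaps that the little lookup table fix can fill:

```r
## Toy data: `df` is the big table with holes, `fix` is the lookup table
df  <- tibble(id = c("a", "b", "c"), code = c("A1", NA, NA))
fix <- tibble(id = c("b", "c"), code = c("B1", "C1"))

coalesce_join(df, fix, by = "id", join = dplyr::left_join)

# A tibble: 3 x 2
#   id    code
#   <chr> <chr>
# 1 a     A1
# 2 b     B1
# 3 c     C1
```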
Next we set up some country codes using ISO2 and ISO3 abbreviations.
```r
iso3_cnames <- read_csv("data/countries_iso3.csv")
iso2_to_iso3 <- read_csv("data/iso2_to_iso3.csv")

cname_table <- left_join(iso3_cnames, iso2_to_iso3)

cname_table

# A tibble: 249 x 3
   iso3  cname               iso2
   <chr> <chr>               <chr>
 1 AFG   Afghanistan         AF
 2 ALA   Åland Islands       AX
 3 ALB   Albania             AL
 4 DZA   Algeria             DZ
 5 ASM   American Samoa      AS
 6 AND   Andorra             AD
 7 AGO   Angola              AO
 8 AIA   Anguilla            AI
 9 ATA   Antarctica          AQ
10 ATG   Antigua and Barbuda AG
# … with 239 more rows

eu <- c("AUT", "BEL", "BGR", "HRV", "CYP", "CZE", "DNK", "EST", "FIN", "FRA",
        "DEU", "GRC", "HUN", "IRL", "ITA", "LVA", "LTU", "LUX", "MLT", "NLD",
        "POL", "PRT", "ROU", "SVK", "SVN", "ESP", "SWE", "GBR")

europe <- c("ALB", "AND", "AUT", "BLR", "BEL", "BIH", "BGR", "HRV", "CYP",
            "CZE", "DNK", "EST", "FRO", "FIN", "FRA", "DEU", "GIB", "GRC",
            "HUN", "ISL", "IRL", "ITA", "LVA", "LIE", "LTU", "LUX", "MKD",
            "MLT", "MDA", "MCO", "NLD", "NOR", "POL", "PRT", "ROU", "RUS",
            "SMR", "SRB", "SVK", "SVN", "ESP", "SWE", "CHE", "UKR", "GBR",
            "VAT", "RSB", "IMN", "MNE")

north_america <- c("AIA", "ATG", "ABW", "BHS", "BRB", "BLZ", "BMU", "VGB",
                   "CAN", "CYM", "CRI", "CUB", "CUW", "DMA", "DOM", "SLV",
                   "GRL", "GRD", "GLP", "GTM", "HTI", "HND", "JAM", "MTQ",
                   "MEX", "SPM", "MSR", "ANT", "KNA", "NIC", "PAN", "PRI",
                   "KNA", "LCA", "SPM", "VCT", "TTO", "TCA", "VIR", "USA",
                   "SXM")

south_america <- c("ARG", "BOL", "BRA", "CHL", "COL", "ECU", "FLK", "GUF",
                   "GUY", "PRY", "PER", "SUR", "URY", "VEN")

africa <- c("DZA", "AGO", "SHN", "BEN", "BWA", "BFA", "BDI", "CMR", "CPV",
            "CAF", "TCD", "COM", "COG", "DJI", "EGY", "GNQ", "ERI", "ETH",
            "GAB", "GMB", "GHA", "GNB", "GIN", "CIV", "KEN", "LSO", "LBR",
            "LBY", "MDG", "MWI", "MLI", "MRT", "MUS", "MYT", "MAR", "MOZ",
            "NAM", "NER", "NGA", "STP", "REU", "RWA", "STP", "SEN", "SYC",
            "SLE", "SOM", "ZAF", "SHN", "SDN", "SWZ", "TZA", "TGO", "TUN",
            "UGA", "COD", "ZMB", "TZA", "ZWE", "SSD", "COD")

asia <- c("AFG", "ARM", "AZE", "BHR", "BGD", "BTN", "BRN", "KHM", "CHN",
          "CXR", "CCK", "IOT", "GEO", "HKG", "IND", "IDN", "IRN", "IRQ",
          "ISR", "JPN", "JOR", "KAZ", "PRK", "KOR", "KWT", "KGZ", "LAO",
          "LBN", "MAC", "MYS", "MDV", "MNG", "MMR", "NPL", "OMN", "PAK",
          "PHL", "QAT", "SAU", "SGP", "LKA", "SYR", "TWN", "TJK", "THA",
          "TUR", "TKM", "ARE", "UZB", "VNM", "YEM", "PSE")

oceania <- c("ASM", "AUS", "NZL", "COK", "FJI", "PYF", "GUM", "KIR", "MNP",
             "MHL", "FSM", "UMI", "NRU", "NCL", "NZL", "NIU", "NFK", "PLW",
             "PNG", "MNP", "SLB", "TKL", "TON", "TUV", "VUT", "UMI", "WLF",
             "WSM", "TLS")
```
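The post doesn’t use these region vectors just yet, but to show why they’re handy, here’s a sketch of my own of how you might tag each row of the covid table (built below) with a region:

```r
## Sketch (mine; assumes the `covid` tibble created later in the post):
## tag each country with a region using the vectors above
covid <- covid %>%
  mutate(region = case_when(
    iso3 %in% europe        ~ "Europe",
    iso3 %in% north_america ~ "North America",
    iso3 %in% south_america ~ "South America",
    iso3 %in% africa        ~ "Africa",
    iso3 %in% asia          ~ "Asia",
    iso3 %in% oceania       ~ "Oceania",
    TRUE                    ~ "Other"
  ))
```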
Now Actually Get the Data
The next step is to read in the data. The file should be called COVID-19-geographic-distribution-worldwide- with the date appended and the extension .xlsx. But as it turns out, there is a typo in the filename: the distribution part is misspelled disbtribution. I think it must have been introduced early on in the data collection process, and so far (possibly by accident, but possibly also so as not to break a thousand scripts like this one) the ECDC has not fixed it.
```r
covid_raw <- get_ecdc_data(url = "https://www.ecdc.europa.eu/sites/default/files/documents/",
                           fname = "COVID-19-geographic-disbtribution-worldwide-",
                           ext = "xlsx")

covid_raw

# A tibble: 6,012 x 8
   date_rep              day month  year cases deaths countries_and_t…
   <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>
 1 2020-03-21 00:00:00    21     3  2020     2      0 Afghanistan
 2 2020-03-20 00:00:00    20     3  2020     0      0 Afghanistan
 3 2020-03-19 00:00:00    19     3  2020     0      0 Afghanistan
 4 2020-03-18 00:00:00    18     3  2020     1      0 Afghanistan
 5 2020-03-17 00:00:00    17     3  2020     5      0 Afghanistan
 6 2020-03-16 00:00:00    16     3  2020     6      0 Afghanistan
 7 2020-03-15 00:00:00    15     3  2020     3      0 Afghanistan
 8 2020-03-11 00:00:00    11     3  2020     3      0 Afghanistan
 9 2020-03-08 00:00:00     8     3  2020     3      0 Afghanistan
10 2020-03-02 00:00:00     2     3  2020     0      0 Afghanistan
# … with 6,002 more rows, and 1 more variable: geo_id <chr>
```
That’s our base data. The get_ecdc_data() function uses file_copy() from the fs library to move the temporary file to the data/ folder. It will not overwrite a file if it finds one with that name already there, so if you grab the data more than once a day, you’ll need to decide what to do with the file you already downloaded.
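If you do want a re-download to clobber the saved copy, one option (my suggestion, not the post’s) is to change the copy step inside get_ecdc_data(), since fs::file_copy() has an overwrite argument:

```r
## Inside get_ecdc_data(): `tf` and `destination` are the function's locals.
## Setting overwrite = TRUE replaces any same-named file already in data/
fs::file_copy(tf, destination, overwrite = TRUE)
```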
The geo_id country code column isn’t visible here. We’re going to duplicate it (naming it iso2) and then join our table of two- and three-letter country codes, which also has an iso2 column for the join to match on.
```r
covid <- covid_raw %>%
  mutate(date = lubridate::ymd(date_rep),
         iso2 = geo_id)

## merge in the iso country names
covid <- left_join(covid, cname_table)

covid

# A tibble: 6,012 x 12
   date_rep              day month  year cases deaths countries_and_t…
   <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>
 1 2020-03-21 00:00:00    21     3  2020     2      0 Afghanistan
 2 2020-03-20 00:00:00    20     3  2020     0      0 Afghanistan
 3 2020-03-19 00:00:00    19     3  2020     0      0 Afghanistan
 4 2020-03-18 00:00:00    18     3  2020     1      0 Afghanistan
 5 2020-03-17 00:00:00    17     3  2020     5      0 Afghanistan
 6 2020-03-16 00:00:00    16     3  2020     6      0 Afghanistan
 7 2020-03-15 00:00:00    15     3  2020     3      0 Afghanistan
 8 2020-03-11 00:00:00    11     3  2020     3      0 Afghanistan
 9 2020-03-08 00:00:00     8     3  2020     3      0 Afghanistan
10 2020-03-02 00:00:00     2     3  2020     0      0 Afghanistan
# … with 6,002 more rows, and 5 more variables: geo_id <chr>,
#   date <date>, iso2 <chr>, iso3 <chr>, cname <chr>
```
At this point we can notice a couple of things about the dataset. For example, not everything in the dataset is a country. This one’s a cruise ship:
```r
## Looks like a missing data code
covid %>%
  filter(cases == -9)

# A tibble: 1 x 12
  date_rep              day month  year cases deaths countries_and_t…
  <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>
1 2020-03-10 00:00:00    10     3  2020    -9      1 Cases_on_an_int…
# … with 5 more variables: geo_id <chr>, date <date>, iso2 <chr>,
#   iso3 <chr>, cname <chr>
```
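The post leaves this row alone for now, but if you wanted to treat that -9 as missing rather than as a real count, here is a one-line sketch (mine) using dplyr’s na_if():

```r
## Sketch: recode the -9 sentinel value to NA rather than leaving it
## to distort any cumulative sums later on
covid <- covid %>%
  mutate(cases = na_if(cases, -9))
```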
We can also learn, using an anti_join(), that not all of the ECDC’s geo_id country codes match up with the ISO codes:
```r
anti_join(covid, cname_table) %>%
  select(geo_id, countries_and_territories, iso2, iso3, cname) %>%
  distinct()

# A tibble: 7 x 5
  geo_id   countries_and_territories               iso2    iso3  cname
  <chr>    <chr>                                   <chr>   <chr> <chr>
1 JPG11668 Cases_on_an_international_conveyance_J… JPG116… <NA>  <NA>
2 PYF      French_Polynesia                        PYF     <NA>  <NA>
3 EL       Greece                                  EL      <NA>  <NA>
4 XK       Kosovo                                  XK      <NA>  <NA>
5 NA       Namibia                                 NA      <NA>  <NA>
6 AN       Netherlands_Antilles                    AN      <NA>  <NA>
7 UK       United_Kingdom                          UK      <NA>  <NA>
```
Let’s fix this. I made a small crosswalk file that can be coalesced into the missing values. In an added little wrinkle, we need to specify the na argument to read_csv() explicitly, because the missing country codes include Namibia, which has an ISO country code of “NA”! This is different from the missing data code NA, but read_csv() won’t know that by default.
```r
cname_xwalk <- read_csv("data/ecdc_to_iso2_xwalk.csv", na = "")

cname_xwalk

# A tibble: 4 x 3
  geo_id iso3  cname
  <chr>  <chr> <chr>
1 UK     GBR   United Kingdom
2 EL     GRC   Greece
3 NA     NAM   Namibia
4 XK     XKV   Kosovo
```
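A quick toy demonstration of my own of why the na argument matters here (read_csv() accepted literal strings directly in the readr versions current when this was written; newer readr wants them wrapped in I()):

```r
## With readr's default `na`, the string "NA" becomes a missing value,
## which would clobber Namibia's country code
read_csv("iso2\nNA\nUK")$iso2
#> [1] NA   "UK"

## With na = "", only empty cells count as missing and "NA" survives
read_csv("iso2\nNA\nUK", na = "")$iso2
#> [1] "NA" "UK"
```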
I used to do coalescing like this:
```r
# covid <- covid %>%
#   left_join(cname_xwalk, by = "geo_id") %>%
#   mutate(iso3 = coalesce(iso3.x, iso3.y),
#          cname = coalesce(cname.x, cname.y)) %>%
#   select(-iso3.x, -iso3.y, -cname.x, -cname.y)
```
Actually, I used to do it using match() and some index vectors, like an animal. But now I can use Edward Visel’s handy function instead.
```r
covid <- coalesce_join(covid, cname_xwalk,
                       by = "geo_id", join = dplyr::left_join)

## Take a look again
anti_join(covid, cname_table) %>%
  select(geo_id, countries_and_territories, iso2, iso3, cname) %>%
  distinct()

# A tibble: 7 x 5
  geo_id   countries_and_territories         iso2    iso3  cname
  <chr>    <chr>                             <chr>   <chr> <chr>
1 JPG11668 Cases_on_an_international_convey… JPG116… <NA>  <NA>
2 PYF      French_Polynesia                  PYF     <NA>  <NA>
3 EL       Greece                            EL      GRC   Greece
4 XK       Kosovo                            XK      XKV   Kosovo
5 NA       Namibia                           NA      NAM   Namibia
6 AN       Netherlands_Antilles              AN      <NA>  <NA>
7 UK       United_Kingdom                    UK      GBR   United Kin…
```
Looks like a couple of new territories have been added to the ECDC file since I made the crosswalk file. I’ll have to update that soon.
Calculate and Plot Cumulative Mortality
Now we can actually analyze the data (in the privacy of our own home). Let’s draw the plot that everyone draws, looking at cumulative counts. I think it’s better at this point to plot cumulative deaths rather than cumulative reported cases, given how uneven case reporting is. The mortality counts aren’t free of that problem, but it’s less severe there. We’ll pick an arbitrary threshold of ten deaths, set each country’s clock to day zero on the day it reaches that threshold, and count cumulative deaths from that day forward.
```r
cov_curve <- covid %>%
  select(date, cname, iso3, cases, deaths) %>%
  drop_na(iso3) %>%
  group_by(iso3) %>%
  arrange(date) %>%
  mutate(cu_cases = cumsum(cases),
         cu_deaths = cumsum(deaths)) %>%
  filter(cu_deaths > 9) %>%
  mutate(days_elapsed = date - min(date),
         end_label = ifelse(date == max(date), cname, NA))

cov_curve

# A tibble: 245 x 9
# Groups:   iso3 [21]
   date       cname iso3  cases deaths cu_cases cu_deaths days_elapsed
   <date>     <chr> <chr> <dbl>  <dbl>    <dbl>     <dbl> <drtn>
 1 2020-01-22 China CHN     140     11      526        17 0 days
 2 2020-01-23 China CHN      97      0      623        17 1 days
 3 2020-01-24 China CHN     259      9      882        26 2 days
 4 2020-01-25 China CHN     441     15     1323        41 3 days
 5 2020-01-26 China CHN     665     15     1988        56 4 days
 6 2020-01-27 China CHN     787     25     2775        81 5 days
 7 2020-01-28 China CHN    1753     25     4528       106 6 days
 8 2020-01-29 China CHN    1466     26     5994       132 7 days
 9 2020-01-30 China CHN    1740     38     7734       170 8 days
10 2020-01-31 China CHN    1980     43     9714       213 9 days
# … with 235 more rows, and 1 more variable: end_label <chr>
```
See how, at the end there, we create an end_label variable for use in the plot. It only has values for the most recent day in the dataset (i.e. it contains the country name if date is max(date), and is otherwise NA).
Now we’ll narrow our focus to a few countries and make the plot.
```r
focus_cn <- c("CHN", "GBR", "USA", "IRN", "JPN",
              "KOR", "ITA", "FRA", "ESP")

cov_curve %>%
  filter(iso3 %in% focus_cn) %>% ## focus on just a few countries, defined above
  mutate(end_label = recode(end_label,
                            `United States` = "USA",
                            `Iran, Islamic Republic of` = "Iran",
                            `Korea, Republic of` = "South Korea",
                            `United Kingdom` = "UK")) %>%
  ggplot(mapping = aes(x = days_elapsed, y = cu_deaths,
                       color = cname, label = end_label,
                       group = cname)) +
  geom_line(size = 0.8) +
  geom_text_repel(nudge_x = 1.1,
                  nudge_y = 0.1,
                  segment.color = NA) +
  guides(color = FALSE) +
  scale_color_manual(values = prismatic::clr_darken(paletteer_d("ggsci::category20_d3"), 0.2)) +
  scale_y_continuous(labels = scales::comma_format(accuracy = 1),
                     breaks = 2^seq(4, 11),
                     trans = "log2") +
  labs(x = "Days Since 10th Confirmed Death",
       y = "Cumulative Number of Deaths (log scale)",
       title = "Cumulative Deaths from COVID-19, Selected Countries",
       subtitle = paste("Data as of", format(max(cov_curve$date), "%A, %B %e, %Y")),
       caption = "Kieran Healy @kjhealy / Data: ECDC") +
  theme(plot.title = element_text(size = rel(2), face = "bold"),
        plot.subtitle = element_text(size = rel(1.5)),
        axis.text.y = element_text(size = rel(2)),
        axis.title.x = element_text(size = rel(1.5)),
        axis.title.y = element_text(size = rel(1.5)),
        axis.text.x = element_text(size = rel(2)),
        legend.text = element_text(size = rel(2)))
```
Again, a few small details polish the plot. We do a quick bit of recoding on end_label to shorten some country names, and use geom_text_repel() to put the labels at the end of the lines. We get our y-axis breaks with 2^seq(4, 11), which (as case numbers rise) will be easier to extend than manually typing all the numbers. I use a base 2 log scale for the reasons Dr Drang gives here: it’s useful to look at the doubling time, which base 2 helps you see, rather than powers of ten. (The graphs won’t look any different.) Finally, on the thematic side, we can date-stamp the title of the graph using the opaque but standard UNIX date formatting codes, with paste("Data as of", format(max(cov_curve$date), "%A, %B %e, %Y")).
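For the curious, here’s what that format string expands to on this post’s date (a quick check of mine; the output assumes an English locale):

```r
## %A = full weekday name, %B = full month name,
## %e = day of month, %Y = four-digit year
format(as.Date("2020-03-21"), "%A, %B %e, %Y")
#> [1] "Saturday, March 21, 2020"
```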
And here’s our figure.
The GitHub repository for this post also has some code to pull U.S. data from the COVID Tracking Project currently being run by a group of volunteers.
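I won’t reproduce that code here, but as a rough sketch of the idea (the endpoint URL below is my assumption based on the project’s public API, not something taken from the post or the repository):

```r
## Sketch: the COVID Tracking Project served daily state-level counts as CSV.
## The URL is an assumption; check the project's documentation before using.
us_states <- readr::read_csv("https://api.covidtracking.com/v1/states/daily.csv")
```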