Vaccination Data: An Outlier Tale

An Accounting and Data Science Nerd's Corner

1 year ago

[This article was first published on An Accounting and Data Science Nerd's Corner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I recently included the new Our World in Data data on Covid-19 hospitalizations and the vaccination progress around the world in the {tidycovid19} package. What was meant to be a short info post for package users turned into a mini case on “outliers”.

Thankfully, the OWID team makes their Covid-19 data available in a well-maintained and documented form on Github so that importing and merging it into the data that the package offers is a breeze. By the way: The dataset provided by Our World in Data also contains some additional data items that might be of interest, so I encourage everybody to take a look at their data repository on Github directly besides (or instead) relying on the data that is provided by the {tidycovid19} package.

As you can see below, the data coverage is currently limited to a small set of countries and short time series.

library(tidyverse)
library(tidycovid19)
library(kableExtra)
library(ggridges)
library(stargazer)

df <- download_merged_data(cached = TRUE, silent = TRUE)

nobs <- df %>%
  group_by(iso3c) %>%
  summarise(
    nobs_hosp = sum(!is.na(hosp_patients)),
    nobs_icu = sum(!is.na(icu_patients)),
    nobs_vacc = sum(!is.na(total_vaccinations)),
    .groups = "drop"
  ) %>%
  filter(
    nobs_hosp != 0 | nobs_icu != 0 | nobs_vacc  != 0
  ) %>%
  arrange(iso3c)

kable(nobs) %>% kable_styling()

iso3c	nobs_hosp	nobs_icu	nobs_vacc
ARE	0	0	3
ARG	0	0	4
AUT	277	277	2
BEL	295	295	2
BGR	273	273	7
BHR	0	0	18
CAN	305	305	27
CHL	0	0	7
CHN	0	0	3
CRI	0	0	3
CYP	301	301	1
CZE	309	309	2
DEU	0	290	13
DNK	0	0	14
ESP	93	93	4
EST	313	313	12
FIN	177	177	5
FRA	316	315	4
GBR	285	280	3
GIN	0	0	1
GRC	0	0	13
HRV	264	0	6
HUN	261	0	9
IRL	303	282	3
ISL	266	224	1
ISR	0	0	22
ITA	315	315	14
KWT	0	0	1
LTU	47	0	12
LUX	294	315	1
LVA	277	0	9
MEX	0	0	11
MLT	0	0	1
NLD	217	312	4
NOR	263	0	9
OMN	0	0	12
POL	263	0	11
PRT	307	307	6
ROU	0	278	13
RUS	0	0	3
SAU	0	0	1
SVK	249	0	3
SVN	300	300	7
SWE	0	280	1
USA	297	288	12

As an economist, I am interested in the effect of economic wealth on health issues, so an obvious first question to ask the data is: How does the association between vaccination status (measured by total vaccinations by 100,000 inhabitants) to GDP per capita look like.

clevel <- df %>%
  group_by(iso3c) %>%
  filter(any(!is.na(total_vaccinations))) %>%
  mutate(
    vacc_1e5pop = 1e5*(total_vaccinations/population),
    cases_1e5pop = 1e5*(confirmed/population),
    deaths_1e5pop = 1e5*(deaths/population)
  ) %>%
  summarise(
    vacc_1e5pop  = max(vacc_1e5pop, na.rm = TRUE),
    cases_1e5pop = max(cases_1e5pop, na.rm = TRUE),
    deaths_1e5pop = max(deaths_1e5pop, na.rm = TRUE),
    gdp_capita = max(gdp_capita, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  na.omit() 

plot_clevel_vacc_by_x <- function(df, xvar, xlab) {
  xvar <- enquo(xvar)
  ggplot(df, aes(x = !!xvar, y = vacc_1e5pop)) +
    geom_point() +
    scale_x_log10() +
    scale_y_log10() +
    theme_minimal() +
    labs(
      x = xlab,
      y = "Vaccinations per 100,000 inhabitants"
    ) +
    ggrepel::geom_text_repel(aes(label = iso3c)) +
    geom_smooth(method = "lm", formula = "y ~x")
}

plot_clevel_vacc_by_x(clevel, gdp_capita, "National income per capita (2010 US-$)")

Based on the above one would make the (unfortunate) observation that GDP per capita is positively associated with vaccination progress. While this does not imply causality, it is certainly in line with a lot of prior evidence that “richer” countries have access to better medical care and the failure on the international community to allocate the medical resources to the areas where they are needed most.

But hey, I hear you say, this looks like a textbook example of an “Outlier” and also like a “bad leverage point” that is highly-influential for the regression coefficient. Let’s see how the plot looks like without the 25 vaccinations that the OWID team reports for Guinea.

plot_clevel_vacc_by_x(
  clevel %>% filter(iso3c != "GIN"), 
  gdp_capita, "National income per capita (2010 US-$)"
)

Different, huh? Now it looks as if there is no meaningful association between GDP and vaccination progress. The problem with outliers is that we often do not know whether they represent valid data points hinting at an under-observed part of the distribution or whether they are an erroneous artifact of the data generating process. In economics and other observational sciences we often deal with them in an ad hoc manner by truncating or winsorizing.¹ In this case, however, we have good reasons to assume that Guinea is not a data error but simply an under-observed but completely valid observation. There are two reasons why countries might not be reported in the OWID data:

They have not started their vaccination program yet
They do not publicly report their vaccination data

Both reasons can be assumed to be more likely for poorer countries. Let’s look at the resulting selection problem by comparing countries that have vaccination data available with the ones that don’t.

has_vacc_data <- df %>%
  select(iso3c, total_vaccinations, gdp_capita, deaths, confirmed, population) %>%
  group_by(iso3c) %>%
  filter(
    !all(is.na(confirmed)) & !all(is.na(deaths)) & !all(is.na(population)) &
    !all(is.na(gdp_capita))
  ) %>%
  summarise(
    has_vacc_data = sum(!is.na(total_vaccinations)) > 0,
    gdp_capita = mean(gdp_capita),
    cases = max(1e5*(confirmed/population), na.rm = TRUE),
    deaths = max(1e5*(deaths/population), na.rm = TRUE),
    .groups = "drop"
  )

plot_sel_bias <- function(df, xvar, xlab) {
  xvar <- enquo(xvar)
  ggplot(
    data = df, 
    aes(
      x = !!xvar, y = has_vacc_data, 
      fill = has_vacc_data, height = stat(density)
    )
  ) +
    geom_density_ridges(
      stat = "binline", bins = 20, scale = 0.95
    ) +
    scale_x_log10(labels = scales::comma_format(accuracy = 0.1)) +
    labs(
      x = xlab,
      y = "",
      title = "OWID provides vaccination data"
    ) +
    theme_ridges() +
    theme(
      legend.position = "none",
      plot.title.position = "plot"
    )
}

plot_sel_bias(has_vacc_data, gdp_capita, "GDP per capita (in 2010 US-$, log-scaled)")

Now this confirms our expectations. With the sole exception of Guinea, only relatively wealthy countries are currently included in the OWID vaccination data. This by no means implies criticism of the OWID data collection process but is most likely just an outcome of them having the financial means to start the vaccination process and the governance infrastructure to report reliable data on it.

Let us see whether at least also countries that are more heavily affected by Covid-19 are more likely to show up in the vaccination data.

plot_sel_bias(has_vacc_data, cases, "Covid-19 cases per 100,000 inhabitants (log-scaled)")

plot_sel_bias(has_vacc_data, deaths, "Covid-19 deaths per 100,000 inhabitants (log-scaled)")

Now this looks reassuring. The vaccinations seem to be starting in those countries hardest hit by Covid 19. Does this also holds when we control for GDP? We focus on deaths here as deaths and cases are highly correlated

mod <- glm(
  has_vacc_data ~ log(gdp_capita) + log(deaths), 
  data = has_vacc_data %>% filter(deaths > 0),
  family = "binomial"
) 

stargazer(mod, type = "html")


	Dependent variable:

	has_vacc_data

log(gdp_capita)	1.457^***
	(0.332)

log(deaths)	0.598^***
	(0.228)

Constant	-16.070^***
	(3.374)


Observations	112
Log Likelihood	-33.091
Akaike Inf. Crit.	72.182

Note:	p<0.1; p<0.05; p<0.01

Good. Even after controlling for GDP, countries which suffer more from Covid-19 have started the vaccination process earlier. While this might also be driven by differences in data quality across countries, I take this as a piece of well needed good news these days. Last question: How does that look like for the intensive margin, meaning for the number of vaccinations for the countries that have vaccination data available?

mod <- lm(
  log(vacc_1e5pop) ~ log(gdp_capita) + log(deaths_1e5pop), 
  data = clevel 
) 

stargazer(mod, type = "html")


	Dependent variable:

	log(vacc_1e5pop)

log(gdp_capita)	1.004^***
	(0.282)

log(deaths_1e5pop)	0.059
	(0.182)

Constant	-4.446
	(2.727)


Observations	45
R²	0.265
Adjusted R²	0.230
Residual Std. Error	1.469 (df = 42)
F Statistic	7.589^*** (df = 2; 42)

Note:	p<0.1; p<0.05; p<0.01

Nope. And after omitting Guinea?

mod <- lm(
  log(vacc_1e5pop) ~ log(gdp_capita) + log(deaths_1e5pop), 
  data = clevel %>% filter(iso3c != "GIN")
) 

stargazer(mod, type = "html")


	Dependent variable:

	log(vacc_1e5pop)

log(gdp_capita)	0.301
	(0.276)

log(deaths_1e5pop)	-0.241
	(0.162)

Constant	3.992
	(2.879)


Observations	44
R²	0.074
Adjusted R²	0.029
Residual Std. Error	1.203 (df = 41)
F Statistic	1.647 (df = 2; 41)

Note:	p<0.1; p<0.05; p<0.01

Still not. In addition, you see that the GDP coefficient is now also insignificant. Outliers matter. Everybody, see for yourself by pulling the data using the {tidycovid19} package or directly from the Our World in Data Repository.

Stay healthy and enjoy!

See here for a current working paper on the topic by David Veenman and myself.↩

To leave a comment for the author, please follow the link and comment on their blog: An Accounting and Data Science Nerd's Corner.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

iso3c	nobs_hosp	nobs_icu	nobs_vacc
ARE	0	0	3
ARG	0	0	4
AUT	277	277	2
BEL	295	295	2
BGR	273	273	7
BHR	0	0	18
CAN	305	305	27
CHL	0	0	7
CHN	0	0	3
CRI	0	0	3
CYP	301	301	1
CZE	309	309	2
DEU	0	290	13
DNK	0	0	14
ESP	93	93	4
EST	313	313	12
FIN	177	177	5
FRA	316	315	4
GBR	285	280	3
GIN	0	0	1
GRC	0	0	13
HRV	264	0	6
HUN	261	0	9
IRL	303	282	3
ISL	266	224	1
ISR	0	0	22
ITA	315	315	14
KWT	0	0	1
LTU	47	0	12
LUX	294	315	1
LVA	277	0	9
MEX	0	0	11
MLT	0	0	1
NLD	217	312	4
NOR	263	0	9
OMN	0	0	12
POL	263	0	11
PRT	307	307	6
ROU	0	278	13
RUS	0	0	3
SAU	0	0	1
SVK	249	0	3
SVN	300	300	7
SWE	0	280	1
USA	297	288	12

iso3c	nobs_hosp	nobs_icu	nobs_vacc
ARE	0	0	3
ARG	0	0	4
AUT	277	277	2
BEL	295	295	2
BGR	273	273	7
BHR	0	0	18
CAN	305	305	27
CHL	0	0	7
CHN	0	0	3
CRI	0	0	3
CYP	301	301	1
CZE	309	309	2
DEU	0	290	13
DNK	0	0	14
ESP	93	93	4
EST	313	313	12
FIN	177	177	5
FRA	316	315	4
GBR	285	280	3
GIN	0	0	1
GRC	0	0	13
HRV	264	0	6
HUN	261	0	9
IRL	303	282	3
ISL	266	224	1
ISR	0	0	22
ITA	315	315	14
KWT	0	0	1
LTU	47	0	12
LUX	294	315	1
LVA	277	0	9
MEX	0	0	11
MLT	0	0	1
NLD	217	312	4
NOR	263	0	9
OMN	0	0	12
POL	263	0	11
PRT	307	307	6
ROU	0	278	13
RUS	0	0	3
SAU	0	0	1
SVK	249	0	3
SVN	300	300	7
SWE	0	280	1
USA	297	288	12

iso3c	nobs_hosp	nobs_icu	nobs_vacc
ARE	0	0	3
ARG	0	0	4
AUT	277	277	2
BEL	295	295	2
BGR	273	273	7
BHR	0	0	18
CAN	305	305	27
CHL	0	0	7
CHN	0	0	3
CRI	0	0	3
CYP	301	301	1
CZE	309	309	2
DEU	0	290	13
DNK	0	0	14
ESP	93	93	4
EST	313	313	12
FIN	177	177	5
FRA	316	315	4
GBR	285	280	3
GIN	0	0	1
GRC	0	0	13
HRV	264	0	6
HUN	261	0	9
IRL	303	282	3
ISL	266	224	1
ISR	0	0	22
ITA	315	315	14
KWT	0	0	1
LTU	47	0	12
LUX	294	315	1
LVA	277	0	9
MEX	0	0	11
MLT	0	0	1
NLD	217	312	4
NOR	263	0	9
OMN	0	0	12
POL	263	0	11
PRT	307	307	6
ROU	0	278	13
RUS	0	0	3
SAU	0	0	1
SVK	249	0	3
SVN	300	300	7
SWE	0	280	1
USA	297	288	12