Analyzing data from COVID19 R package
Introduction
The idea behind this post was to play with and explore some of the information contained in the COVID19 R package, which collects data across several governmental sources. This package is being developed by Guidotti and Ardia from COVID19 Data Hub.
Later, I will add to the analysis the historical record of deaths over recent years for some European countries and try to assess whether deaths from COVID19 are being reported accurately. This data is collected in The Human Mortality Database.
Although it may seem a bit overkill, it was such an intensive Tidyverse exercise that I decided to show pretty much all the code right here, because that is what this post is about: I don’t intend to perform a really deep analysis, but to show a simple way to tackle this problem using R and the Tidyverse toolkit.
You might pick up a couple of tricks, like using group_split() + map() to manipulate each group freely, using {{ }} (the curly-curly, or embrace, operator) to write programmatic dplyr code, some custom plotting with plotly, or the recently discovered package ggtext by @ClausWilke.
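As a rough illustration of the group_split() + map() trick, here is a minimal sketch on made-up data. The `toy` tibble and the `peak_by_country` name are hypothetical and do not come from the COVID19 package; the point is only that each group arrives in map() as its own tibble, so it can be handled freely.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data, not taken from the COVID19 package
toy <- tibble(
  Country = c("A", "A", "B", "B", "B"),
  Deaths  = c(1, 3, 2, 5, 9)
)

# Split into one tibble per country, work on each piece independently,
# then stack the per-group results back into a single tibble
peak_by_country <- toy %>%
  group_split(Country) %>%
  map(~ tibble(Country = .x$Country[1], Peak = max(.x$Deaths))) %>%
  bind_rows()

peak_by_country
```

For a plain summary like this one, group_by() + summarise() would be enough; splitting pays off when each group needs custom handling, such as fitting a model or drawing a plot per country.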
Playing with COVID19 package
Let’s start by loading data from the COVID19 package with the covid19() function. It contains lots of information, but I will keep things simple and work only with the Country, Date, Population and Deaths variables.
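# Packages this snippet relies on (assumed to be loaded at the top of the script)
library(COVID19)
library(tidyverse)
library(lubridate)

covid_deaths <- covid19(verbose = FALSE) %>%
  ungroup() %>%
  mutate(Week = week(date)) %>%
  select(Country = id, Date = date, Week, Deaths = deaths, Population = population) %>%
  # keep only dates earlier than two days before today
  filter(Date < today() - days(2)) %>%
  # deaths per million inhabitants, rounded to the nearest integer
  mutate(Deaths_by_1Mpop = round(Deaths / Population * 1e6))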
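To make the per-million normalisation concrete, here is a small worked sketch with invented numbers (the `toy` tibble is hypothetical). Two countries report the same number of deaths, but the one with the smaller population ends up with a much higher rate.

```r
library(dplyr)

# Hypothetical figures, chosen only to make the arithmetic easy to follow
toy <- tibble(
  Country    = c("A", "B"),
  Deaths     = c(500, 500),
  Population = c(1e6, 50e6)
)

# Same formula as above: deaths divided by population, scaled to one million
toy <- toy %>%
  mutate(Deaths_by_1Mpop = round(Deaths / Population * 1e6))

toy
```

Country A gets 500 deaths per million and country B gets 10, which is why the ranking by raw deaths and the ranking by deaths per million can look very different.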
I wanted to focus mainly on the most populated countries of the world because some of them are among the most affected by the virus, so I created a function for that as I will use it more than once.
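get_top_countries_df <- function(covid_deaths, top_by, top_n, since) {
  covid_deaths %>%
    # keep the 100 most populated countries on each date
    group_by(Date) %>%
    top_n(100, Population) %>%
    # keep only the latest observation of each country
    group_by(Country) %>%
    filter(Date == max(Date)) %>%
    ungroup() %>%
    # rank by the column passed as top_by; {{ }} forwards it to dplyr
    top_n(top_n, {{ top_by }}) %>%
    select(Country) %>%
    # recover the full time series of the selected countries
    inner_join(covid_deaths, ., by = "Country") %>%
    # since is a date given as yyyymmdd, e.g. 20200301
    filter(Date >= ymd(since))
}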
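The key piece of the function above is {{ top_by }}, which lets the caller pass a bare column name. Here is a minimal sketch of the same idea on made-up data. The `toy` tibble and the `top_countries()` helper are hypothetical, and the sketch uses slice_max(), which dplyr recommends over the superseded top_n(), rather than the exact call used in this post.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  Country    = c("A", "B", "C"),
  Deaths     = c(10, 30, 20),
  Population = c(100, 300, 50)
)

# Hypothetical helper: top_by is an unquoted column name,
# and {{ }} forwards it to slice_max() as if it had been typed there
top_countries <- function(df, top_by, n) {
  df %>%
    slice_max({{ top_by }}, n = n) %>%
    pull(Country)
}

by_deaths     <- top_countries(toy, Deaths, 2)
by_population <- top_countries(toy, Population, 1)
```

Without {{ }}, dplyr would look for a column literally named top_by and fail, so the embrace operator is what makes one function reusable for both Deaths and Deaths_by_1Mpop.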
Let’s start with a basic plot. You have already seen this one a thousand times.
ggplotly(
  covid_deaths %>%
    get_top_countries_df(top_by = Deaths, top_n = 10, since = 20200301) %>%
    ggplot(aes(Date, Deaths, col = Country)) +
    geom_line(size = 1, show.legend = FALSE) +
    labs(title = "Total deaths due to COVID-19", caption = "Source: covid19datahub.io") +
    theme_minimal() +
    theme_custom() +
    scale_color_tableau() +
    NULL
) %>%
  layout(
    legend = list(orientation = "h", y = 0),
    annotations = list(
      x = 1, y = 1.05, text = "Source: covid19datahub.io",
      showarrow = FALSE, xref = "paper", yref = "paper",
      font = list(size = 10)
    )
  )
What about the countries with the most deaths relative to their population? Pretty basic too.
ggplotly(
  covid_deaths %>%
    get_top_countries_df(top_by = Deaths_by_1Mpop, top_n = 10, since = 20200301) %>%
    select(-Deaths) %>%
    rename(Deaths = Deaths_by_1Mpop) %>%
    ggplot(aes(Date, Deaths, col = Country)) +
    geom_line(size = 1, show.legend = FALSE) +
    labs(title = "Total deaths per million people", caption = "Source: covid19datahub.io") +
    theme_minimal() +
    theme_custom() +
    scale_color_tableau() +
    NULL
) %>%
  layout(
    legend = list(orientation = "h", y = 0),
    annotations = list(
      x = 1, y = 1.05, text = "Source: covid19datahub.io",
      showarrow = FALSE, xref = "paper", yref = "paper",
      font = list(size = 10)
    )
  )