Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Version 0.2.0 of the coronavirus R data package was pushed today to CRAN. The coronavirus package provides a tidy format of the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) Coronavirus dataset. Version 0.2.0 catches up with the significant changes in the data that have taken place since the initial release on February 24, and changes the package status from experimental to maturing.
Additional resources:
- Github page – https://github.com/RamiKrispin/coronavirus
- Package site available here, and vignettes available here
- Supporting dashboard available here, code available here
- CSV format of the dataset available here
- Raw data available here
Key changes and new features
Below are the main updates for version 0.2.0:
- Column names – the geographic location fields were updated to follow the changes in the raw data:
  - Province.State changed to province (US states removed from the raw data)
  - Country.Region changed to country
- The data on the Github version is automatically updated on a daily basis with Github Actions
- update_dataset function – enables updating the installed version with new data that is available on the Github version, more details below
- The covid_south_korea and covid_iran datasets that were available on the dev version were removed from the package and moved to a new package, covid19wiki, for now available only on Github
Data structure
The coronavirus dataset uses a long format and has the following fields:
- date – the date of the summary
- province – the province or state, when applicable
- country – the country or region name
- lat – latitude point
- long – longitude point
- type – the type of case (confirmed, death, or recovered)
- cases – the number of daily cases (corresponding to the case type)
library(coronavirus)
head(coronavirus)
##         date province     country lat long      type cases
## 1 2020-01-22          Afghanistan  33   65 confirmed     0
## 2 2020-01-23          Afghanistan  33   65 confirmed     0
## 3 2020-01-24          Afghanistan  33   65 confirmed     0
## 4 2020-01-25          Afghanistan  33   65 confirmed     0
## 5 2020-01-26          Afghanistan  33   65 confirmed     0
## 6 2020-01-27          Afghanistan  33   65 confirmed     0
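Because the cases field holds daily counts rather than running totals, a cumulative series has to be derived by summing over dates. Here is a minimal base R sketch of that step. It uses a made-up toy data frame with the same columns as the coronavirus dataset; the object name, country name, and all values are assumptions for illustration, not package data:

```r
# Toy data with the same columns as the coronavirus dataset.
# All values are invented for illustration only.
toy <- data.frame(
  date     = as.Date(rep(c("2020-01-22", "2020-01-23", "2020-01-24"), times = 2)),
  province = "",
  country  = "Examplestan",
  lat      = 33,
  long     = 65,
  type     = rep(c("confirmed", "death"), each = 3),
  cases    = c(2L, 3L, 5L, 0L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Keep one case type, order by date, then accumulate the daily counts.
confirmed <- toy[toy$type == "confirmed", ]
confirmed <- confirmed[order(confirmed$date), ]
confirmed$cumulative <- cumsum(confirmed$cases)

confirmed[, c("date", "cases", "cumulative")]
```

The same filter, sort, and cumsum steps should carry over to the real dataset once it is restricted to a single country and case type.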
Keep the data updated
The coronavirus package provides data for an ongoing event that gets updated on a daily basis. To enable users to update the CRAN installed version with the most recent data available on the Github version, I created the update_dataset function. This function checks whether new data is available on the Github version and updates the package if needed:
update_dataset(silence = TRUE)
The silence argument, if TRUE, will automatically install updates without a prompt (default is FALSE). More details are available in the following vignette.
In order to make the new data available, you will have to restart your R session.
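One way to confirm that the refreshed data was picked up after the restart is to look at the most recent date in the dataset. The sketch below is only an illustration: latest_data_date is a hypothetical helper (not part of the package), and the package data is queried only when the package is installed:

```r
# Hypothetical helper (not part of the coronavirus package):
# returns the most recent date in a data frame with a `date` column.
latest_data_date <- function(df) max(as.Date(df$date))

# Only query the package data when the package is installed.
if (requireNamespace("coronavirus", quietly = TRUE)) {
  print(latest_data_date(coronavirus::coronavirus))
}
```

If the printed date is older than expected, the session may still be using the previously loaded version of the package.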
Summarising and visualizing
Here are some examples of summarising and visualizing the data with the dplyr, tidyr, and plotly packages.
library(dplyr)
library(tidyr)
library(plotly)
Cases summary
We will start by grouping the dataset by case type and calculating the worldwide totals, which are then used to derive the current total active cases and the recovery and death rates:
total_cases <- coronavirus %>%
  group_by(type) %>%
  summarise(cases = sum(cases)) %>%
  mutate(type = factor(type, levels = c("confirmed", "death", "recovered")))

total_cases
## # A tibble: 3 x 2
##   type        cases
##   <fct>       <int>
## 1 confirmed 4261747
## 2 death      291942
## 3 recovered 1493414
The total active cases are the difference between the total confirmed cases and the total recovered and death cases:
total_cases$cases[1] - total_cases$cases[2] - total_cases$cases[3]
## [1] 2476391
The worldwide recovery rate is:
round(100 * total_cases$cases[3] / total_cases$cases[1], 2)
## [1] 35.04
And the worldwide death rate is:
round(100 * total_cases$cases[2] / total_cases$cases[1], 2)
## [1] 6.85
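The calculations above rely on the row order of the summary table (row 1 is confirmed, row 2 is death, row 3 is recovered). As a sketch of a slightly more defensive variant, the same figures can be computed from a named vector, so each lookup is by case type rather than by position. The totals vector is introduced here for illustration and is filled with the totals reported in the summary table:

```r
# Worldwide totals as reported in the summary table.
totals <- c(confirmed = 4261747, death = 291942, recovered = 1493414)

# Look up each total by name instead of by row position.
active        <- totals[["confirmed"]] - totals[["death"]] - totals[["recovered"]]
recovery_rate <- round(100 * totals[["recovered"]] / totals[["confirmed"]], 2)
death_rate    <- round(100 * totals[["death"]] / totals[["confirmed"]], 2)

active         # 2476391
recovery_rate  # 35.04
death_rate     # 6.85
```

With the total_cases table, such a vector could be built with setNames(total_cases$cases, as.character(total_cases$type)).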
The following plot presents the distribution of cases (active, recovered, and death) over time:
coronavirus %>%
  group_by(type, date) %>%
  summarise(total_cases = sum(cases)) %>%
  # One column per case type, one row per date
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(date) %>%
  # Daily change in active cases
  mutate(active = confirmed - death - recovered) %>%
  # Running totals for each series
  mutate(active_total = cumsum(active),
         recovered_total = cumsum(recovered),
         death_total = cumsum(death)) %>%
  # Stacked area chart of the three running totals
  plot_ly(x = ~ date,
          y = ~ active_total,
          name = 'Active',
          fillcolor = '#1f77b4',
          type = 'scatter',
          mode = 'none',
          stackgroup = 'one') %>%
  add_trace(y = ~ death_total,
            name = "Death",
            fillcolor = '#E41317') %>%
  add_trace(y = ~ recovered_total,
            name = 'Recovered',
            fillcolor = 'forestgreen') %>%
  layout(title = "Distribution of Covid19 Cases Worldwide",
         legend = list(x = 0.1, y = 0.9),
         yaxis = list(title = "Number of Cases"),
         xaxis = list(title = "Source: Johns Hopkins University Center for Systems Science and Engineering"))
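The data preparation behind this plot (reshape to one column per case type, compute the daily change in active cases, then accumulate) can be checked on a small example. The following is a base R sketch of that logic rather than the pipeline itself, run on invented toy values in the same long layout:

```r
# Toy daily counts (invented values) in the same long layout.
toy <- data.frame(
  date  = as.Date(rep(c("2020-01-22", "2020-01-23", "2020-01-24"), each = 3)),
  type  = rep(c("confirmed", "death", "recovered"), times = 3),
  cases = c(10L, 0L, 0L,
             8L, 1L, 2L,
             6L, 0L, 3L),
  stringsAsFactors = FALSE
)

# Sum cases per date and type; the result is a date-by-type matrix.
wide <- tapply(toy$cases, list(as.character(toy$date), toy$type), sum)

# Daily change in active cases, then running totals for each series.
active_daily    <- wide[, "confirmed"] - wide[, "death"] - wide[, "recovered"]
active_total    <- cumsum(active_daily)
recovered_total <- cumsum(wide[, "recovered"])
death_total     <- cumsum(wide[, "death"])

data.frame(date = as.Date(rownames(wide)),
           active_total, recovered_total, death_total)
```

On each date, the three running totals add up to the cumulative number of confirmed cases, which is why they can be stacked in a single area chart.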
Distribution of confirmed cases by country
The next plot summarizes the distribution of confirmed cases by country using a treemap plot:
# Total confirmed cases per country, largest first
conf_df <- coronavirus %>%
  filter(type == "confirmed") %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases)) %>%
  arrange(-total_cases) %>%
  mutate(parents = "Confirmed") %>%
  ungroup()

# Treemap with one tile per country, sized by total confirmed cases
plot_ly(data = conf_df,
        type = "treemap",
        values = ~ total_cases,
        labels = ~ country,
        parents = ~ parents,
        domain = list(column = 0),
        name = "Confirmed",
        textinfo = "label+value+percent parent")
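The treemap labels request a percentage next to each country's value. As a rough cross-check, each country's share of the overall confirmed total can be computed by hand. This base R sketch uses invented toy values and placeholder country names, and the share_pct column is an assumption for illustration:

```r
# Toy confirmed counts (invented values, placeholder country names).
toy <- data.frame(
  country = c("A", "A", "B", "C", "C"),
  type    = "confirmed",
  cases   = c(30L, 20L, 30L, 15L, 5L),
  stringsAsFactors = FALSE
)

# Total confirmed cases per country, largest first.
by_country <- aggregate(cases ~ country,
                        data = toy[toy$type == "confirmed", ],
                        FUN = sum)
names(by_country)[2] <- "total_cases"
by_country <- by_country[order(-by_country$total_cases), ]

# Each country's share of the overall total, in percent.
by_country$share_pct <- round(100 * by_country$total_cases / sum(by_country$total_cases), 1)
by_country
```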
Package contributors
I would like to thank everyone who contributed to the package development, asked questions, and reported or filed issues about the data.
Special thanks to Amanda Dobbyn (@dobbleobble) and Jarrett Byrnes (@jebyrnes) for their pull request and suggestions that led to the kickoff of the covid19R project, and to Mine Cetinkaya-Rundel (@minebocek) for providing a better format for the dataset documentation!