Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
With all this Covid19 madness going on in the world, I felt inspired to utilize the vast amount of data we have from it. I came across a dataset by the The NY Times. From reading over the GitHub repository, they obtain this time series data from state and local governments and health departments. To get a better sense of the data please go to the link above! The objective of what I am trying to accomplish is to create an animated time series map showing how Covid19 spread throughout NJ.
Read in the data
To read in the data I will use the readr
package.
library(readr) #loads package covid19_data<-read_csv("us_counties.csv") #function that reads in csv files
Now that the data is read in.. lets check the structure of the data
str(covid19_data) ## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 38197 obs. of 6 variables: ## $ date : Date, format: "2020-01-21" "2020-01-22" ... ## $ county: chr "Snohomish" "Snohomish" "Snohomish" "Cook" ... ## $ state : chr "Washington" "Washington" "Washington" "Illinois" ... ## $ fips : chr "53061" "53061" "53061" "17031" ... ## $ cases : num 1 1 1 1 1 1 1 1 1 1 ... ## $ deaths: num 0 0 0 0 0 0 0 0 0 0 ... ## - attr(*, "spec")= ## .. cols( ## .. date = col_date(format = ""), ## .. county = col_character(), ## .. state = col_character(), ## .. fips = col_character(), ## .. cases = col_double(), ## .. deaths = col_double() ## .. )
Everything seems to look correct. All the columns seem to have the right structure!
Filter data
Being that we only want to see how covid19 spread in NJ, we will filter the data frame to only show cases in NJ. This can be done with the dplyr
package. The data also has a great deal of cases that are “Unknown”. By reading over the documentation of the data, it says many state health departments chose to report cases separately when the patient’s county of residence is unknown or pending determination. I will take this data out since we wouldn’t be able to link it to a specific county.
library(dplyr) # used for data wrangling NJ_covid19<-covid19_data%>% dplyr::filter(state == "New Jersey",county != "Unknown") #filters data frame head(NJ_covid19) # Returns the first or last parts of the data frame ## # A tibble: 6 x 6 ## date county state fips cases deaths ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 2020-03-04 Bergen New Jersey 34003 1 0 ## 2 2020-03-05 Bergen New Jersey 34003 2 0 ## 3 2020-03-06 Bergen New Jersey 34003 3 0 ## 4 2020-03-06 Camden New Jersey 34007 1 0 ## 5 2020-03-07 Bergen New Jersey 34003 3 0 ## 6 2020-03-07 Camden New Jersey 34007 1 0
Obtain county shapefile
Since the data frame doesn’t have any “shapes”, I have to download a shapefile with NJ’s counties to join to the NJ_covid19 data frame. Luckily, the New Jersey Office of GIS has a great website to download hundreds of spatial datasets across the state. By going to this website, I can download the county shapefile.
Read in county shapefile
I can read in the NJ county shapefile by using the sf
package. The sf
package is a great package to analyze spatial data.
library(sf) NJ_counties<-st_read(getwd(),"New_Jersey_Counties") #function to read in shapefile ## Reading layer `New_Jersey_Counties' from data source `/Users/kevinzolea/Desktop/Personal_Website/content/post/covid19_nj' using driver `ESRI Shapefile' ## Simple feature collection with 21 features and 22 fields ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: 193684.7 ymin: 34945.75 xmax: 657059.7 ymax: 919549.4 ## epsg (SRID): 3424 ## proj4string: +proj=tmerc +lat_0=38.83333333333334 +lon_0=-74.5 +k=0.9999 +x_0=150000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs
As you can see above, the shapefile got read into R. Now let’s plot it to make sure everythings okay.
plot(st_geometry(NJ_counties))
Join shapefile to data frame
In order to make a map with the original data, I have to join the shapefile, NJ_counties, with the NJ_covid19 data frame. This can be done by using the left_join()
function from the dplyr
package. The join will be based on the common column shared by both datasets, which would be the county column. First, I need to make the county column header in the NJ_counties dataset lowercase. This is so I can match the county columns from both datasets. Next, I have to make all the counties in the county column lowercase in both datasets, so they can match. See below.
names(NJ_counties)<-tolower(names(NJ_counties)) # Makes county column header lowercase NJ_counties$county<-tolower(NJ_counties$county)#Makes all rows in the county column lowercase NJ_covid19$county<-tolower(NJ_covid19$county)#Makes all rows in the county column lowercase
Now I can join the two datasets based on the county column in both.
NJ_covid19_shapes<-left_join(NJ_covid19,NJ_counties,by="county")%>% dplyr::select(date,county,state,cases,deaths,geometry)#selects only the columns of interest head(NJ_covid19_shapes) ## # A tibble: 6 x 6 ## date county state cases deaths geometry ## <date> <chr> <chr> <dbl> <dbl> <MULTIPOLYGON [US_survey_foot]> ## 1 2020-03-04 bergen New Je… 1 0 (((656201 783614.4, 656141.1 7834… ## 2 2020-03-05 bergen New Je… 2 0 (((656201 783614.4, 656141.1 7834… ## 3 2020-03-06 bergen New Je… 3 0 (((656201 783614.4, 656141.1 7834… ## 4 2020-03-06 camden New Je… 1 0 (((342764 423475.8, 342804.1 4234… ## 5 2020-03-07 bergen New Je… 3 0 (((656201 783614.4, 656141.1 7834… ## 6 2020-03-07 camden New Je… 1 0 (((342764 423475.8, 342804.1 4234…
Make map with ggplot2 and gganimate
All the data manipulation part is done, now I can start building the actual map. I will be using the ggplot2
package in conjuction with the gganimate
package to create the final animated time series map showing how covid19 spread throughout NJ’s counties.
library(ggplot2) #Used for plotting library(gganimate) #Used for animations library(RColorBrewer) #Used for color scale # Used to make new data frame an sf object # Must use st_as_sf in order to use geom_sf() to plot polygons NJ_covid19_shapes<-st_as_sf(NJ_covid19_shapes) # Makes plot with ggplot2 and gganimate to animate through the days covid_map<-ggplot()+ geom_sf(data = NJ_counties,fill = "white")+ geom_sf(data = NJ_covid19_shapes,aes(fill=cases))+ ggtitle("Spread of Covid19 Throughout New Jersey")+ xlab("")+ ylab("")+ labs(subtitle = "Date: {current_frame}", caption = "Date Source: The New York Times\nAuthor: Kevin Zolea")+ cowplot::background_grid(major = "none", minor = "none") + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line = element_blank(), legend.background = element_blank(), legend.position=c(-0.3,0.8), plot.background = element_blank(), panel.background = element_blank(), legend.text = element_text(size=12), legend.title = element_text(colour="black", size=12, face="bold"), plot.title=element_text(size=20, face="bold",hjust =0.5), plot.subtitle = element_text(hjust = 0.5,size=12), plot.caption = element_text(size = 11, hjust = .5, color = "black", face = "bold"))+ scale_fill_distiller("Number of Positive Cases", palette ="Reds",type = "div", direction = 1)+ transition_manual(date) animate(covid_map, nframe=27,fps = 2, end_pause = 15,height = 500, width =500)
Things to note about above code
Being that I joined a non-spatial data set with a spatial data set, I had to use st_as_sf()
to make sure the geom_sf()
function was able to plot an sf object. I had to use transition_manual()
to make the animation. This is because when I tried to use transition_time()
it made the polygons on the map move all over the place. The animate
function allows you to change how fast the animation is, how many frames to use, height, width, etc. To learn more about the gganimate
package and the ggplot2
package, you can go to the following websites to get a better understanding.
Conclusion
As you can see, the vast majority of positive cases can be seen in the northern parts of NJ. This is what can be expected due to the fact that the population is greater and its in close proximity to NY, which is the hardest hit state in the US.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.