Animated US Hexbin Map of the Avian Flu Outbreak

hrbrmstr

7 years ago

[This article was first published on rud.is » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The recent announcement of the start of egg rationing in the U.S. made me curious enough about the avian flu outbreak to try to dig into the numbers a bit. I finally stumbled upon a USDA site that had an embedded HTML table of flock outbreak statistics by state, county and date (also flock type and whether it was a commercial enterprise or “backyard” farm). Just looking at the sum of flock sizes on that page shows that nearly 50 million birds have been impacted since December, 2014.

We can scrape the data with R & rvest and then use the shapefile hexbins from previous posts to watch the spread week-over-week.

The number of packages I ended up relying on was a bit surprising. Let’s get them out of the way before focusing on the scraping and hexbin-making:

library(rvest)     # scraping
library(stringr)   # string manipulation
library(lubridate) # date conversion
library(dplyr)     # data mjnging
library(zoo)       # for locf
library(ggplot2)   # plotting
library(rgdal)     # map stuff
library(rgeos)     # map stuff

We also end up using magrittr and tidyr but only for one function, so you’ll see those with :: in the code.

Grabbing the USDA page is pretty straightforward:

url <- "http://www.aphis.usda.gov/wps/portal/aphis/ourfocus/animalhealth/sa_animal_disease_information/sa_avian_health/ct_avian_influenza_disease/!ut/p/a1/lVJbb4IwFP41e1qwFZDLI-oUnGgyswm8kAMUaAaFQNG4X7-ibnEPYtakDz3nO_kupyhAHgoYHGgGnFYMiv4daOFqa8vjKZad5c58wc7mY-Eaa13Z2qoA-AKA7xwL_53fvjpaP_-Gp_Z8jHcK2qMABTHjNc-RD3VO2zCuGCeMhwWNGmhOT7iFsOqaMK3irj2_gNESijAnUPD8tpLQlkBLQsrSqinPJi7tAwX2i4_5tSBgRUfYF_wM9mLqmCbIj2QzxZpMJMUYg6TGkSLBBCaSPEnSJIljXVH0q_kBdw_CO5sXkNnSslV9LQJTDRk7czGumy7GjnYFDOTrCw36XRJTRbt_mlo9VD1FgfuctqrVByA37szNBAPwXOpzR97gPi7tm30gb2AfQkxWVJH4ifvZLavFIsUQrA1JSUOaUV61HHnH43HUtQmMsuqA6vK9NJST9JluNlKwyPr7DT6YvRs!/?1dmy&urile=wcm%3apath%3a%2Faphis_content_library%2Fsa_our_focus%2Fsa_animal_health%2Fsa_animal_disease_information%2Fsa_avian_health%2Fsa_detections_by_states%2Fct_ai_pacific_flyway"
 
#' read in the data, extract the table and clean up the fields
#' also clean up the column names to since they are fairly nasty
 
pg <- html(url)

If you poke at the source for the page you’ll see there are two tables in the code and we only need the first one. Also, if you scan the rendered table on the USDA page by eye you’ll see that the column names are horrible for data analysis work and they are also inconsistent in the values used for various columns. Furthermore, there are commas in the flock counts and it would be handy to have the date as an actual date type. We can extract the table we need and clean all that up in a reasonably-sized dplyr pipe:

pg %>%
  html_nodes("table") %>%
  magrittr::extract2(1) %>%
  html_table(header=TRUE) %>%
  filter(`Flock size`!="pending") %>%
  mutate(Species=str_replace(tolower(Species), "s$", ""),
         `Avian influenza subtype*`=str_replace_all(`Avian influenza subtype*`, " ", ""),
         `Flock size`=as.numeric(str_replace_all(`Flock size`, ",", "")),
         `Confirmation date`=as.Date(mdy(`Confirmation date`))) %>%
  rename(state=State, county=County, flyway=Flyway, flock_type=`Flock type`,
         species=Species, subtype=`Avian influenza subtype*`, date=`Confirmation date`,
         flock_size=`Flock size`) -> birds

Let’s take a look at what we have:

glimpse(birds)
 
## Observations: 202
## Variables:
## $ state      (chr) "Iowa", "Minnesota", "Minnesota", "Iowa", "Minnesota", "Iowa",...
## $ county     (chr) "Sac", "Renville", "Renville", "Hamilton", "Kandiyohi", "Hamil...
## $ flyway     (chr) "Mississippi", "Mississippi", "Mississippi", "Mississippi", "M...
## $ flock_type (chr) "Commercial", "Commercial", "Commercial", "Commercial", "Comme...
## $ species    (chr) "turkey", "chicken", "turkey", "turkey", "turkey", "turkey", "...
## $ subtype    (chr) "EA/AM-H5N2", "EA/AM-H5N2", "EA/AM-H5N2", "EA/AM-H5N2", "EA/AM...
## $ date       (date) 2015-06-04, 2015-06-04, 2015-06-04, 2015-06-04, 2015-06-03, 2...
## $ flock_size (dbl) 42200, 415000, 24800, 19600, 37000, 26200, 17200, 1115700, 159...

To make an animated map of cumulative flock totals by week, we’ll need to

group the birds data frame by week and state
calculate the cumulative sums
fill in the gaps where there are missing state/week combinations
carry the last observations by state/week forward in this expanded data frame
make breaks for data ranges so we can more intelligently map them to colors

This ends up being a longer dplyr pipe than I usuall like to code (I think very long ones are hard to follow) but it gets the job done and is still pretty readable:

birds %>%
  mutate(week=as.numeric(format(birds$date, "%Y%U"))) %>%
  arrange(week) %>%
  group_by(week, state) %>%
  tally(flock_size) %>%
  group_by(state) %>%
  mutate(cum=cumsum(n)) %>%
  ungroup %>%
  select(week, state, cum) %>%
  mutate(week=as.Date(paste(week, 1), "%Y%U %u")) %>%
  left_join(tidyr::expand(., week, state), .) %>%
  group_by(state) %>%
  do(na.locf(.)) %>%
  mutate(state_abb=state.abb[match(state, state.name)],
         cum=as.numeric(ifelse(is.na(cum), 0, cum)),
         brks=cut(cum,
                  breaks=c(0, 200, 50000, 1000000, 10000000, 50000000),
                  labels=c("1-200", "201-50K", "50k-1m",
                           "1m-10m", "10m-50m"))) -> by_state_and_week

Now, we perform the standard animation steps:

determine where we’re going to break the data up
feed that into a loop
partition the data in the loop
render the plot to a file
combine all the individual images into an animation

For this graphic, I’m doing something a bit extra. The color ranges for the hexbin choropleth go from very light to very dark, so it would be helpful if the titles for the states went from very dark to very light, matching the state colors. The lines that do this check for state breaks that fall in the last two values and appropriately assign "black" or "white" as the color.

i <- 0
 
for (wk in unique(by_state_and_week$week)) {
 
  # filter by week
 
  by_state_and_week %>% filter(week==wk) -> this_wk
 
  # hack to let us color the state labels in white or black depending on
  # the value of the fill
 
  this_wk %>%
    filter(brks %in% c("1m-10m", "10m-50m")) %>%
    .$state_abb %>%
    unique -> white_states
 
  centers %>%
    mutate(txt_col="black") %>%
    mutate(txt_col=ifelse(id %in% white_states, "white", "black")) -> centers
 
  # setup the plot
 
  gg <- ggplot()
  gg <- gg + geom_map(data=us_map, map=us_map,
                      aes(x=long, y=lat, map_id=id),
                      color="white", fill="#dddddd", size=2)
  gg <- gg + geom_map(data=this_wk, map=us_map,
                      aes(fill=brks, map_id=state_abb),
                      color="white", size=2)
  gg <- gg + geom_text(data=centers,
                       aes(label=id, x=x, y=y, color=txt_col), size=4)
  gg <- gg + scale_color_identity()
  gg <- gg + scale_fill_brewer(name="Combined flock sizen(all types)",
                               palette="RdPu", na.value="#dddddd", drop=FALSE)
  gg <- gg + guides(fill=guide_legend(override.aes=list(colour=NA)))
  gg <- gg + coord_map()
  gg <- gg + labs(x=NULL, y=NULL,
                  title=sprintf("U.S. Avian Flu Total Impact as of %sn", wk))
  gg <- gg + theme_bw()
  gg <- gg + theme(plot.title=element_text(face="bold", hjust=0, size=24))
  gg <- gg + theme(panel.border=element_blank())
  gg <- gg + theme(panel.grid=element_blank())
  gg <- gg + theme(axis.ticks=element_blank())
  gg <- gg + theme(axis.text=element_blank())
  gg <- gg + theme(legend.position="bottom")
  gg <- gg + theme(legend.direction="horizontal")
  gg <- gg + theme(legend.title.align=1)
 
  # save the image
 
  # i'm using "quartz" here since I'm on a Mac. Use what works for you system to ensure you
  # get the best looking output png
 
  png(sprintf("output/%03d.png", i), width=800, height=500, type="quartz")
  print(gg)
  dev.off()
 
  i <- i + 1
 
}

We could use one of the R animation packages to actually make the animation, but I know ImageMagick pretty well so I just call it as a system command:

system("convert -delay 60 -loop 1 output/*png output/avian.gif")

All that results in:

If that’s a static image, open it in a new tab/window (or just click on it). I really didn’t want to do a looping gif but if you do just make the -loop 1 into -loop 0.

Now, we can just re-run the code when the USDA refreshes the data.

The code, data and sample bitmaps are on github.

To leave a comment for the author, please follow the link and comment on their blog: rud.is » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.