New R Package: cdcfluview — Retrieve Flu Data from CDC’s FluView Portal
Towards the end of 2014 I started tinkering with flu data from the CDC’s FluView portal, since early flu reports suggested this season might go the way of 2009.
While you can track the flu over at The Washington Post, I like to work with data on my own. However, the CDC’s portal is Flash-driven and there was no obvious way to get the data files programmatically. This is unfortunate, since the data set is updated weekly.
As an information security professional, I keep Burp Proxy in my arsenal. It is an application that (amongst other things) lets you configure a local proxy server for your browser and inspect all web requests. Using it, I was able to discern that the Flash portal calls out to http://gis.cdc.gov/grasp/fluview/FluViewPhase2CustomDownload.ashx with custom POST form parameters (that I also mapped out) to build the data sets it delivers back to the user.
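For readers who have not driven a form endpoint from R before, here is a minimal sketch of what such a call could look like with the httr package. Only the URL comes from this post. The form field names (`region`, `sub_region`, `data_source`, `years`) and both helper functions are hypothetical placeholders, not the parameters the portal actually expects; the real names have to be read off a proxy capture.

```r
# Endpoint quoted in the post.
fluview_url <- "http://gis.cdc.gov/grasp/fluview/FluViewPhase2CustomDownload.ashx"

# Build a named list of form fields.
# ASSUMPTION: these field names are made up for illustration only.
build_fluview_form <- function(region, sub_region, data_source, years) {
  list(
    region      = region,
    sub_region  = paste(sub_region, collapse = ","),
    data_source = data_source,
    years       = paste(years, collapse = ",")
  )
}

# Send the fields as a URL-encoded POST body (needs the httr package).
# Defined here but not called, so nothing hits the network.
post_fluview <- function(form, url = fluview_url) {
  resp <- httr::POST(url, body = form, encode = "form")
  httr::stop_for_status(resp)  # turn HTTP errors into R errors
  resp
}

form <- build_fluview_form("hhs", 1:10, "ilinet", 2000:2014)
str(form)
```

Keeping the form construction separate from the request makes it easy to inspect exactly what would be sent before anything goes over the wire.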
With that information in hand, I whipped together a small R package, cdcfluview, to interface with the same server the FluView portal does. It has a single function, get_flu_data, that lets you choose between different region/sub-region breakdowns and whether you want data from WHO, ILINet, or both. It also lets you pick which years you want data for.
One reason I wanted to work with the data was to see just how this season differs from previous ones. The view I’ll leave on the blog this time (mostly as an example of how to use the package) is a chart faceted by CDC region and plotted by CDC week, showing this season (in red) against previous ones.
library(cdcfluview)
library(magrittr)
library(dplyr)
library(ggplot2)  # required for ggplot() and the geoms below

# Pull ILINet data for all ten HHS regions, 2000 through 2014
dat <- get_flu_data(region="hhs",
                    sub_region=1:10,
                    data_source="ilinet",
                    years=2000:2014)

# Relabel the regions with city names, then derive the flu season and
# a week-within-season value: weeks 40 and later become 0, 1, 2, ...
# while weeks 1-39 keep their own number
dat %<>%
  mutate(REGION=factor(REGION,
                       levels=unique(REGION),
                       labels=c("Boston", "New York",
                                "Philadelphia", "Atlanta",
                                "Chicago", "Dallas",
                                "Kansas City", "Denver",
                                "San Francisco", "Seattle"),
                       ordered=TRUE)) %>%
  mutate(season_week=ifelse(WEEK>=40, WEEK-40, WEEK),
         season=ifelse(WEEK<40,
                       sprintf("%d-%d", YEAR-1, YEAR),
                       sprintf("%d-%d", YEAR, YEAR+1)))

# Split the current season from the historical ones
prev_years <- dat %>% filter(season != "2014-2015")
curr_year <- dat %>% filter(season == "2014-2015")

# Most recent week in the data, used for the dashed marker line
curr_week <- tail(dat, 1)$season_week

gg <- ggplot()
# Previous seasons as faint grey points
gg <- gg + geom_point(data=prev_years,
                      aes(x=season_week, y=X..WEIGHTED.ILI, group=season),
                      color="#969696", size=1, alpha=0.25)
# Current season as red points joined by a line
gg <- gg + geom_point(data=curr_year,
                      aes(x=season_week, y=X..WEIGHTED.ILI, group=season),
                      color="red", size=1.25, alpha=1)
gg <- gg + geom_line(data=curr_year,
                     aes(x=season_week, y=X..WEIGHTED.ILI, group=season),
                     size=1.25, color="#d7301f")
gg <- gg + geom_vline(xintercept=curr_week, color="#d7301f", size=0.5,
                      linetype="dashed", alpha=0.5)
gg <- gg + facet_wrap(~REGION, ncol=3)
gg <- gg + labs(x=NULL, y="Weighted ILI Index",
                title="ILINet - 1999-2015 year weighted flu index history by CDC region\nWeek Ending Jan 3, 2015 (Red == current season)\n")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg
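The season bookkeeping in that listing is easy to check on a tiny, hand-made data frame without touching the CDC server. The sketch below reuses the post's labelling rule (week 40 opens a season) in a helper I am calling `flu_season`. It also adds a hypothetical `season_position` helper that shifts weeks 1-39 by 13 so they follow weeks 40-53 rather than sharing their indices. That offset is my assumption, not something the listing does, and it leaves a one-step gap in years that have no week 53.

```r
# Label a CDC week with its flu season, using the post's rule:
# weeks 40 and later open a season, earlier weeks close the previous one.
flu_season <- function(year, week) {
  ifelse(week < 40,
         sprintf("%d-%d", year - 1, year),
         sprintf("%d-%d", year, year + 1))
}

# Position within a season.
# ASSUMPTION: weeks 1-39 are shifted by `offset` so they come after
# weeks 40-53 instead of overlapping them.
season_position <- function(week, offset = 13) {
  ifelse(week >= 40, week - 40, week + offset)
}

# Three made-up rows around the 2014/2015 year boundary
demo <- data.frame(YEAR = c(2014, 2014, 2015),
                   WEEK = c(40, 53, 1))
demo$season   <- flu_season(demo$YEAR, demo$WEEK)
demo$position <- season_position(demo$WEEK)
print(demo)
```

All three rows land in the "2014-2015" season, at positions 0, 13 and 14, so a line drawn through them runs left to right in calendar order.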
(You can see an SVG version of that plot here)
Even without looking at the statistics, it’s pretty easy to tell that this is fixing to be a pretty bad season in many regions.
As always, post bugs or feature requests on the GitHub repo, and drop a note here if you’ve found the package useful or have some other interesting views or analyses to share.