Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’m a huge fan of ProPublica. They have a super-savvy tech team, great reporters, awesome graphics folks and excel at data-driven journalism. Plus, they give away virtually everything, including data, text, graphics & tools.
I was reading @USATODAY’s piece on lead levels in drinking water across America and saw they had a mini-interactive piece included with their findings. A quick view in Developer Tools revealed the JSON and an equally quick use of curlconverter
had the data behind the interactive loaded into R in seconds (though it turns out just the straight URL to the JSON file works without the extra cURL
parameters).
I wanted to grab the data since, while I feel bad about Texas having 183 testing exceedances, I wanted to see if there’s a worse story if you normalize by population. I performed such a normalization by state (normalizing to per-100k population), but the data includes county information so a logical & necessary next step is to do that calculation as well (which a task left for another day since I think it’ll require some more munging as they used county names vs FIPS codes in the JSON and I suspect the names aren’t prefect). If any reader jumps in to do the county analysis before I do, drop a note here and I’ll update the post with a link to it. Here’s the outcome (top 10 worst states, normalized by population):
Even at the county-level, the normalization isn’t perfect since these are testing levels at specific locations per water provider in a county. But, I made an assumption (it may be wrong, I don’t do macro-level water quality analysis for a living) that 94 exceedances in a small (both in size & population) state like mine (Maine) is probably worse than 183 exceedances across a population the size of Texas. I think I’d need to take population density per county area into account to fully address this, but this is a first stab at the data and this post was written to talk about how to use ProPublica’s stateface in R vs do investigative data journalism 🙂
What’s a ‘stateface’?
The fine folks at ProPublica took the Natural Earth shapefiles and made an icon out of them. They have full code (with some super spiffy, clean, readable C code you can reuse) in their github repo, but they also provide just the . However, it’s in OTF format and R kinda needs it in TTF format for widest graphics device support, so I converted it for you (& don’t get me started about s in R). You’ll need that loaded to run the example. Note that ProPublica is still the license owner of that converted (both the OTF & TTF are free to use, but give credit where it’s due…they did all the hard work). Aannnd it looks like they already had it created (I had only looked at the zip file).
ProPublica provides many support files to use these on the web, since that’s their target environment. They do provide key mappings for the , which makes it usable in virtually any context s are supported. We’ll take advantage of those mappings in a bit, but you should check out their link above, they have some neat examples using these shapes as “sparkmaps” (i.e. inline in a sentence).
Using ‘stateface’ in R
After loading the into your system, it’s super-easy to use it in R. The family
name is “StateFace-Regular
” and this is the translation table from 2-digit state abbreviation to glyph character:
state_trans <- c(AL='B', AK='A', AZ='D', AR='C', CA='E', CO='F', CT='G', DE='H', DC='y', FL='I', GA='J', HI='K', ID='M', IL='N', IN='O', IA='L', KS='P', KY='Q', LA='R', ME='U', MD='T', MA='S', MI='V', MN='W', MS='Y', MO='X', MT='Z', NE='c', NV='g', NH='d', NJ='e', NM='f', NY='h', NC='a', ND='b', OH='i', OK='j', OR='k', PA='l', RI='m', SC='n', SD='o', TN='p', TX='q', UT='r', VT='t', VA='s', WA='u', WV='w', WI='v', WY='x', US='z') |
You now only need to do:
state_trans["ME"] ## ME ## "U" |
to get the mappings.
This is the (annotated) code to generate the bar chart above:
library(dplyr) # munging library(ggplot2) # plotting, req: devtools::intstall_github("hadley/ggplot2") library(scales) # plotting helpers library(hrbrmisc) # my themes # read in exceedance state/county data dat <- read.csv("http://rud.is/dl/h2olead.csv", stringsAsFactors=FALSE) # this is how USA TODAY's computation works, I'm just following it xcd <- count(distinct(dat, state, county, name), state, wt=exceedances) # get U.S. state population estimates for 2015 us_pop <- setNames(read.csv("http://www.census.gov/popest/data/state/totals/2015/tables/NST-EST2015-01.csv", skip=9, nrows=51, stringsAsFactors=FALSE, header=FALSE)[,c(1,9)], c("state_name", "pop")) us_pop$state_name <- sub("^\.", "", us_pop$state_name) us_pop$pop <- as.numeric(gsub(",", "", us_pop$pop)) # join them to the exceedance data state_tbl <- data_frame(state_name=state.name, state=tolower(state.abb)) us_pop <- left_join(us_pop, state_tbl) xcd <- left_join(xcd, us_pop) # compute the exceedance by 100k population & order the # states by that so we get the right bar order in ggplot xcd$per100k <- (xcd$n / xcd$pop) * 100000 xcd$state_name <- factor(xcd$state_name, levels=arrange(xcd, per100k)$state_name) xcd <- arrange(xcd, desc(state_name)) # get the top 10 worse exceedances top_10 <- head(xcd, 10) # map (heh) the stateface glpyh character to the state 2-letter code top_10$st <- state_trans[toupper(top_10$state)] gg <- ggplot(top_10, aes(x=state_name, y=per100k)) gg <- gg + geom_bar(stat="identity", width=0.75) # here's what you need to do to place the stateface glyphs gg <- gg + geom_text(aes(x=state_name, y=0.25, label=st), family="StateFace-Regular", color="white", size=5, hjust=0) gg <- gg + geom_text(aes(x=state_name, y=per100k, label=sprintf("%s total ", comma(n))), hjust=1, color="white", family="KerkisSans", size=3.5) gg <- gg + scale_x_discrete(expand=c(0,0)) gg <- gg + scale_y_continuous(expand=c(0,0)) gg <- gg + coord_flip() gg <- gg + labs(x=NULL, y=NULL, title="Lead in the water: A nationwide look; Top 10 impacted states", subtitle="Exceedance count adjusted per 100K population; total exceedance displayed", caption="Data from USA TODAY's compliation of EPA’s Safe Drinking Water Information System (SDWIS) database.") # you'll need the Kerkis loaded to use this theme # http://myria.math.aegean.gr/kerkis/ gg <- gg + theme_hrbrmstr_kerkis(grid=FALSE) # I neee to fiddle with the theme settings so these line height tweaks # aren't necessary in the future gg <- gg + theme(plot.caption=element_text(lineheight=0.7)) gg <- gg + theme(plot.title=element_text(lineheight=0.7)) gg <- gg + theme(axis.text.x=element_blank()) gg <- gg + theme(panel.margin=margin(t=5, b=5, l=20, r=20, "pt")) gg |
Why ‘stateface’?
People love maps and the bars seemed, well, lonely 🙂 Seriously, though, this (IMO) provides a nice trade off between full-on choropleths (which are usually not warranted) and bland, basic bars. The adornments here may help readers engage with the chart a bit more then they otherwise would.
Fin
As I stated, this analysis probably needs to be at the county population level to have the most efficacy, and you’d need to come up with a weighted roll-up mechanism to rank the states properly. I still think this naive normalization is better than using the raw exceednace counts, but I’d love to have comments from folks who do this macro water testing analysis for a living!
Full code is in this gist.
And, woo hoo!, Maine is #1 in blueberries, lobsters & insane governors and #3 in dangerous lead levels in public water systems.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.