Project Tycho, ggplot2 and the shameless stealing of blog ideas
Last week, Mick Watson posted a terrific article on using R to recreate the visualizations in this WSJ article on the impact of vaccination. Someone beat me to the obvious joke.
Someone also beat me to the standard response whenever base R graphics are used.
And despite devoting much of Friday morning to it, I was beaten to publication of a version using ggplot2.
Why then would I even bother to write this post? Well, because I did things a little differently; diversity of opinion and illustration of alternative approaches are good. And because on the internet, it’s quite acceptable to appropriate great ideas from other people when you lack any inspiration yourself. And because I devoted much of Friday morning to it.
Here then is my “exploration of what Mick did already, only using ggplot2 like Ben did already.”
You know what: since we’re close to the 60th anniversary of Salk’s polio vaccine field trial results, let’s do poliomyelitis instead of measles. And let’s normalise cases per 100 000 population, since Mick and Ben did not. There, I claim novelty.
1. Getting the disease data
Follow Mick’s instructions, substituting POLIOMYELITIS for MEASLES. The result is a CSV file. Except: the first 2 lines are not CSV and the header row contains an extra blank field (empty quotes), not found in the data rows. The simplest way to deal with this was to use read.csv(), which auto-creates the extra column (62), calls it “X” and fills it with NA. You can then ignore it. Here’s a function to read the CSV file and aggregate cases by year.
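library(plyr)
library(reshape2)
library(ggplot2)

readDiseaseData <- function(csv) {
  # skip the two non-CSV lines at the top of the file
  dis <- read.csv(csv, skip = 2, header = TRUE, stringsAsFactors = FALSE)
  # coerce the count columns to numeric; non-numeric entries become NA
  dis[, 3:62] <- sapply(dis[, 3:62], as.numeric)
  # keep YEAR plus the state columns (drop column 2 and the empty column 62),
  # then reshape from wide to long
  dis.m <- melt(dis[, c(1, 3:61)], variable.name = "state", id.vars = "YEAR")
  # total cases per state and year
  dis.agg <- aggregate(value ~ state + YEAR, dis.m, sum)
  return(dis.agg)
}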
We supply the downloaded CSV file as an argument:
polio <- readDiseaseData("POLIOMYELITIS_Cases_1921-1971_20150410002846.csv")
and here are the first few rows:
head(polio)
          state YEAR value
1      NEBRASKA 1921     1
2    NEW.JERSEY 1921     1
3      NEW.YORK 1921    40
4      NEW.YORK 1922     2
5 MASSACHUSETTS 1923     1
6      NEW.YORK 1923     8
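The header quirk from step 1 is easy to reproduce on a toy input. This is only a sketch: the two preamble lines and the data values below are made up for illustration and are not taken from the Project Tycho file.

```r
# Toy stand-in for the download: two non-CSV lines, then a header row
# with a trailing empty field that the data rows do not have.
txt <- paste(
  "Some title line",
  "Some description line",
  '"YEAR","WEEK","ALABAMA",""',
  "1921,1,2",
  "1921,2,3",
  sep = "\n"
)

toy <- read.csv(text = txt, skip = 2, header = TRUE, stringsAsFactors = FALSE)

names(toy)         # the empty header field gets the name "X"
all(is.na(toy$X))  # and the column it creates contains only NA
```

Since the extra column is all NA, it can simply be left out when selecting columns, which is what readDiseaseData() does.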
2. Getting the population data
I discovered US state population estimates for the years 1900 – 1990 at the US Census Bureau. The URLs are HTTPS, but omitting the “s” works fine. The data are plain text, which is good, but although they are somewhat structured (delimited), the files themselves vary a lot. Some contain only estimates; others also contain census counts. For earlier decades the numbers are in thousands with a comma (so “1,200” = 1 200 000). Later files use millions with no comma. The years of each decade are split over several lines, with different numbers of lines before and in between.
To make a long story short, any function to read these files requires many parameters to take all this into account and it looks like this:
getPopData <- function(years = "0009", skip1 = 23, skip2 = 81, rows = 49,
                       names = 1900:1909, keep = 1:11) {
  u <- paste("http://www.census.gov/popest/data/state/asrh/1980s/tables/st",
             years, "ts.txt", sep = "")
  # each file holds the decade in two blocks of rows; read them separately
  p1 <- read.table(u, skip = skip1, nrows = rows, header = FALSE, stringsAsFactors = FALSE)
  p2 <- read.table(u, skip = skip2, nrows = rows, header = FALSE, stringsAsFactors = FALSE)
  # join the two blocks on their first column, then keep only the wanted columns
  p12 <- join(p1, p2, by = "V1")
  p12 <- p12[, keep]
  colnames(p12) <- c("state", names)
  # 1900-1970 are in thousands with commas
  if (as.numeric(substring(years, 1, 1)) < 7) {
    p12[, 2:11] <- sapply(p12[, 2:11], function(x) gsub(",", "", x))
    p12[, 2:11] <- sapply(p12[, 2:11], as.numeric)
    p12[, 2:11] <- sapply(p12[, 2:11], function(x) 1000 * x)
  }
  return(p12)
}
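The comma handling inside getPopData() boils down to three steps: strip the comma, convert to numeric, multiply by 1000. Here is that idea on its own as a minimal sketch; thousands_to_count is a hypothetical helper name used only for this illustration.

```r
# Convert census-style strings given in thousands (e.g. "1,200") to head counts.
# thousands_to_count is a hypothetical helper, not part of the original code.
thousands_to_count <- function(x) {
  1000 * as.numeric(gsub(",", "", x, fixed = TRUE))
}

thousands_to_count(c("1,200", "843"))  # 1200000 and 843000
```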
So now we can create a list of data frames, one per decade, then use plyr::join_all to join on state and get a big data frame of 51 states x 91 years with population estimates.
popn <- list(
  p1900 = getPopData(),
  p1910 = getPopData(years = "1019", names = 1910:1919),
  p1920 = getPopData(years = "2029", names = 1920:1929),
  p1930 = getPopData(years = "3039", names = 1930:1939),
  p1940 = getPopData(years = "4049", skip1 = 21, skip2 = 79, names = 1940:1949),
  p1950 = getPopData(years = "5060", skip1 = 27, skip2 = 92, rows = 51,
                     names = 1950:1959, keep = c(1, 3:7, 9:13)),
  p1960 = getPopData(years = "6070", skip1 = 24, skip2 = 86, rows = 51,
                     names = 1960:1969, keep = c(1, 3:7, 9:13)),
  p1970 = getPopData(years = "7080", skip1 = 14, skip2 = 67, rows = 51,
                     names = 1970:1979, keep = c(2:8, 11:14)),
  p1980 = getPopData(years = "8090", skip1 = 11, skip2 = 70, rows = 51,
                     names = 1980:1990, keep = 1:12)
)

# full join so that states missing from the earlier decades are kept
popn.df <- join_all(popn, by = "state", type = "full")
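If the idea of a full join over a list of data frames is unfamiliar, here is a small base-R stand-in using Reduce() and merge() instead of plyr::join_all. The two toy tables and their numbers are invented for illustration; only the shape of the operation matches the code above.

```r
# Two toy "decade" tables keyed on state (values are invented)
d1 <- data.frame(state = c("NY", "NE"),
                 y1900 = c(7000, 1000),
                 stringsAsFactors = FALSE)
d2 <- data.frame(state = c("NY", "NE", "AK"),
                 y1950 = c(14000, 1300, 100),
                 stringsAsFactors = FALSE)

# Fold the list into one wide table; all = TRUE keeps states that
# appear in only some of the tables (a full join)
wide <- Reduce(function(a, b) merge(a, b, by = "state", all = TRUE),
               list(d1, d2))
wide
```

A state that is absent from an early table (AK here) still gets a row, with NA in the years it lacks.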
3. Joining the datasets
Next step: join the disease and population data. Although we specified states in the original data download, it includes things that are not states like “UPSTATE.NEW.YORK”, “DISTRICT.OF.COLUMBIA” or “PUERTO.RICO”. So let’s restrict ourselves to the 50 states helpfully supplied as variables in R. First we create a data frame containing state names and abbreviations, then match the abbreviations to the polio data.
statenames <- toupper(state.name)
statenames <- gsub(" ", ".", statenames)
states <- data.frame(sname = statenames, sabb = state.abb)
m <- match(polio$state, states$sname)
polio$abb <- states[m, "sabb"]
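A quick way to convince yourself that match() does the filtering for us: anything that is not one of the 50 built-in state names comes back as NA. The short input vector below is just an example, not the real polio data.

```r
# Lookup table built from R's built-in state.name and state.abb
lookup <- data.frame(sname = gsub(" ", ".", toupper(state.name)),
                     sabb  = state.abb,
                     stringsAsFactors = FALSE)

# Example names in the style of the disease data columns
x <- c("NEW.YORK", "PUERTO.RICO", "NEBRASKA", "DISTRICT.OF.COLUMBIA")

abb <- lookup$sabb[match(x, lookup$sname)]
abb  # "NY" NA "NE" NA
```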
Now we can melt the population data, join to the polio data on state abbreviation and calculate cases per 100 000 people.
# with no id.vars given, melt() uses the non-numeric column (state) as the id
popn.m <- melt(popn.df)
colnames(popn.m) <- c("abb", "YEAR", "pop")
# the melted year labels are a factor; convert them back to numbers
popn.m$YEAR <- as.numeric(as.character(popn.m$YEAR))
polio.pop <- join(polio, popn.m, by = c("YEAR", "abb"))
# cases per 100 000 population
polio.pop$cases <- (100000 / polio.pop$pop) * polio.pop$value

head(polio.pop)
          state YEAR value abb      pop      cases
1      NEBRASKA 1921     1  NE  1309000 0.07639419
2    NEW.JERSEY 1921     1  NJ  3297000 0.03033060
3      NEW.YORK 1921    40  NY 10416000 0.38402458
4      NEW.YORK 1922     2  NY 10589000 0.01888752
5 MASSACHUSETTS 1923     1  MA  4057000 0.02464876
6      NEW.YORK 1923     8  NY 10752000 0.07440476
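As a sanity check on the normalisation, here is the same join-and-divide done on two rows by hand, using base merge() rather than plyr::join. The case counts and populations are the 1921 Nebraska and New York values from the output above.

```r
# Case counts and populations for two states in 1921 (from the output above)
cases <- data.frame(abb = c("NY", "NE"), YEAR = c(1921, 1921),
                    value = c(40, 1), stringsAsFactors = FALSE)
pop   <- data.frame(abb = c("NE", "NY"), YEAR = c(1921, 1921),
                    pop = c(1309000, 10416000), stringsAsFactors = FALSE)

# Join on year and state, then scale to cases per 100 000 people
both <- merge(cases, pop, by = c("YEAR", "abb"))
both$cases <- 1e5 * both$value / both$pop
both
```

New York comes out at about 0.384 cases per 100 000 and Nebraska at about 0.076, which agrees with the cases column above.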
Success! Let’s get plotting.
4. Plotting
We should really indicate where data are missing but for the purposes of this post, I’ll just drop incomplete rows using na.omit().
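If you do want to know what na.omit() throws away before plotting, complete.cases() gives you the dropped rows directly. A toy sketch follows; the three rows are illustrative, with the Puerto Rico row standing in for anything that failed to match a state abbreviation or population.

```r
# Toy rows: one non-state with no abbreviation and no population
toy <- data.frame(state = c("NEW.YORK", "PUERTO.RICO", "NEBRASKA"),
                  abb   = c("NY", NA, "NE"),
                  pop   = c(10416000, NA, 1309000),
                  stringsAsFactors = FALSE)

kept    <- na.omit(toy)                  # what the plots will use
dropped <- toy[!complete.cases(toy), ]   # what silently disappears

nrow(kept)
dropped$state
```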
Technically my first attempt is an abuse of geom_dotplot, but I think it generates quite a nice effect (assuming you’re not colour-blind). Note that years have to be factorised here.
ggplot(na.omit(polio.pop)) +
  geom_dotplot(aes(x = factor(YEAR), fill = cases), color = "white", binwidth = 1,
               dotsize = 0.4, binpositions = "all", method = "histodot") +
  facet_grid(abb~.) +
  theme_bw() +
  scale_fill_continuous(low = "floralwhite", high = "red") +
  geom_vline(xintercept = 32) +
  scale_y_discrete(breaks = NULL) +
  theme(panel.border = element_blank(),
        strip.text.y = element_text(angle = 0)) +
  scale_x_discrete(breaks = seq(min(polio.pop$YEAR), max(polio.pop$YEAR), 5)) +
  labs(x = "Year", y = "cases / 100000", title = "Poliomyelitis 1921 - 1971")
For a shape more like the WSJ plots, I use geom_rect. This plot is generated quite a lot faster.
library(grid)  # provides unit()

ggplot(na.omit(polio.pop)) +
  geom_rect(aes(xmin = YEAR, xmax = YEAR+1, ymin = 0, ymax = 12, fill = cases)) +
  facet_grid(abb~.) +
  theme_bw() +
  scale_y_discrete(breaks = NULL) +
  scale_fill_continuous(low = "floralwhite", high = "red") +
  # panel.spacing was called panel.margin in ggplot2 versions before 2.2.0
  theme(panel.border = element_blank(),
        panel.spacing = unit(1, "mm"),
        strip.text.y = element_text(angle = 0)) +
  geom_vline(xintercept = 1955) +
  scale_x_continuous(breaks = seq(min(polio.pop$YEAR), max(polio.pop$YEAR), 5)) +
  labs(x = "Year", y = "cases / 100000", title = "Poliomyelitis 1921 - 1971")
Finally, let’s try the colour palette generated by Mick. From his post, it’s clear that the WSJ fiddled with bin sizes and break points to generate more yellow/orange/red for pre-vaccine years. I haven’t bothered with that here, so things look a little different.
cols <- c(colorRampPalette(c("white", "cornflowerblue"))(10),
          colorRampPalette(c("yellow", "red"))(30))

ggplot(na.omit(polio.pop)) +
  geom_rect(aes(xmin = YEAR, xmax = YEAR+1, ymin = 0, ymax = 12, fill = cases),
            color = "white") +
  facet_grid(abb~.) +
  theme_bw() +
  scale_y_discrete(breaks = NULL) +
  scale_fill_gradientn(colours = cols) +
  # panel.spacing was called panel.margin in ggplot2 versions before 2.2.0
  theme(panel.border = element_blank(),
        panel.spacing = unit(1, "mm"),
        strip.text.y = element_text(angle = 0)) +
  geom_vline(xintercept = 1955) +
  scale_x_continuous(breaks = seq(min(polio.pop$YEAR), max(polio.pop$YEAR), 5)) +
  labs(x = "Year", y = "cases / 100000", title = "Poliomyelitis 1921 - 1971")
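For anyone who does want to fiddle with the break points, it helps to look at the palette by itself first. The sketch below uses only base R. The skewed positions at the end are made-up numbers for illustration, not the WSJ’s; I believe a vector like this could be handed to the values argument of scale_fill_gradientn, but I have not tuned it against the real data.

```r
# The same 40-colour palette: 10 steps white-to-blue, then 30 yellow-to-red
cols <- c(colorRampPalette(c("white", "cornflowerblue"))(10),
          colorRampPalette(c("yellow", "red"))(30))
length(cols)  # 40

# By default the colours are spread evenly over the data range, so the
# jump from blue to yellow sits roughly a quarter of the way up
even_pos <- seq(0, 1, length.out = length(cols))
even_pos[11]

# Hypothetical alternative: squeeze the blue ramp into the lowest 5% of
# the range so that more of the scale is yellow/orange/red
skew_pos <- c(seq(0, 0.05, length.out = 10),
              seq(0.06, 1, length.out = 30))
```

The position vector has to be the same length as the colour vector and run from 0 to 1.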
Summary
Not too much work for some quite attractive output, thanks to great R packages; Hadley, love your work.
As ever, the main challenge is getting the raw data into shape. At some point I’ll wrap all this up as Rmarkdown and send it off to RPubs.