Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Over on The DO Loop, @RickWicklin does a nice job visualizing the causes of airline crashes in SAS using a mosaic plot. More often than not, I find mosaic plots can be a bit difficult to grok, but Rick’s use was spot on and I believe it shows the data pretty well, but I also thought I’d take the opportunity to:
- Give @jennybc’s new googlesheets a spin
- Show some
dplyr
&tidyr
data wrangling (never can have too many examples) - Crank out some
ggplot
zero-based streamgraph-y area charts for the data with some extraggplot
wrangling for good measure
I also decided to use the colors in the original David McCandless/Kashan visualization.
Getting The Data
As I mentioned, @jennybc made a really nice package to interface with Google Sheets, and the IIB site makes the data available, so I copied it to my Google Drive and gave her package a go:
library(googlesheets) library(ggplot2) # we'll need the rest of the libraries later library(dplyr) # but just getting them out of the way library(tidyr) # this will prompt for authentication the first time my_sheets <- list_sheets() # which one is the flight data one grep("Flight", my_sheets$sheet_title, value=TRUE) ## [1] "Copy of Flight Risk JSON" "Flight Risk JSON" # get the sheet reference then the data from the second tab flights <- register_ss("Flight Risk JSON") flights_csv <- flights %>% get_via_csv(ws = "93-2014 FINAL") # take a quick look glimpse(flights_csv) ## Observations: 440 ## Variables: ## $ date (chr) "d", "1993-01-06", "1993-01-09", "1993-01-31", "1993-02-08", "1993-02-28", "... ## $ plane_type (chr) "t", "Dash 8-311", "Hawker Siddeley HS-748-234 Srs", "Shorts SC.7 Skyvan 3-1... ## $ loc (chr) "l", "near Paris Charles de Gualle", "near Surabaya Airport", "Mt. Kapur", "... ## $ country (chr) "c", "France", "Indonesia", "Indonesia", "Iran", "Taiwan", "Macedonia", "Nor... ## $ ref (chr) "r", "D-BEAT", "PK-IHE", "9M-PID", "EP-ITD", "B-12238", "PH-KXL", "LN-TSA", ... ## $ airline (chr) "o", "Lufthansa Cityline", "Bouraq Indonesia", "Pan Malaysian Air Transport"... ## $ fat (chr) "f", "4", "15", "14", "131", "6", "83", "3", "6", "2", "32", "55", "132", "4... ## $ px (chr) "px", "20", "29", "29", "67", "22", "56", "19", "22", "17", "38", "47", "67"... ## $ cat (chr) "cat", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A2", "A1", "A1", "A1... ## $ phase (chr) "p", "approach", "initial_climb", "en_route", "en_route", "approach", "initi... ## $ cert (chr) "cert", "confirmed", "probable", "probable", "confirmed", "probable", "confi... ## $ (chr) "meta", "human_error", "mechanical", "weather", "human_error", "weather", "h... ## $ cause (chr) "cause", "pilot & ATC error", "engine failure", "low visibility", "pilot err... ## $ notes (chr) "n", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,... # the spreadsheet has a "helper" row for javascript, so we nix it flights_csv <- flights_csv[-1,] # js vars removal # and we convert some columns while we're at it flights_csv %>% mutate(date=as.Date(date), fat=as.numeric(fat), px=as.numeric(px)) -> flights_csv |
A Bit of Cleanup
Despite being a spreadsheet, the data needs some cleanup and there’s no real need to include “grounded” or “unknown” in the flight phase given the limited number of incidents in those categories. I’d actually mention that descriptively near the visual if this were anything but a blog post.
The area chart also needs full values for each category combo per year, so we use expand
from tidyr
with left_join
and mutate
to fill in the gaps.
Finally, we make proper, ordered labels:
flights_csv %>% mutate(year=as.numeric(format(date, "%Y"))) %>% mutate(phase=tolower(phase), phase=ifelse(grepl("take", phase), "takeoff", phase), phase=ifelse(grepl("climb", phase), "takeoff", phase), phase=ifelse(grepl("ap", phase), "approach", phase)) %>% count(year, meta, phase) %>% left_join(expand(., year, meta, phase), ., c("year", "meta", "phase")) %>% mutate(n=ifelse(is.na(n), 0, n)) %>% filter(!phase %in% c("grounded", "unknown")) %>% mutate(phase=factor(phase, levels=c("takeoff", "en_route", "approach", "landing"), labels=c("Takeoff", "En Route", "Approach", "Landing"), ordered=TRUE)) -> flights_dat |
I probably took some liberties lumping “climb” in with “takeoff”, but I’d’ve asked an expert for a production piece just as I would hope folks doing work for infosec reports or visualizations would consult someone knowledgable in cybersecurity.
The Final Plot
I’m a big fan of an incremental, additive build idiom for ggplot
graphics. By using the gg <- gg + …
style one can move lines around, comment them out, etc without dealing with errant +
signs. It also forces a logical separation of ggplot elements. Personally, I tend to keep my build orders as follows:
- main
ggplot
call with mappings if the graph is short, otherwise add the mappings to thegeom
s - all
geom_
orstat_
layers in the order I want them, and using line breaks to logically separate elements (likeaes
) or to wrap long lines for easier readability. - all
scale_
elements in order from axes to line to shape to color to fill to alpha; I’m not as consistent as I’d like here, but keeping to this makes it really easy to quickly hone in on areas that need tweaking facet
call (if any)- label setting, always with
labs
unless I really have a need for usingggtitle
- base
theme_
call - all other
theme
elements, one pergg <- gg +
line
I know that’s not everyone’s cup of tea, but it’s just how I roll ggplot
-style.
For this plot, I use a smoothed stacked plot with a custom smoother and also use Futura Medium for the text . Substitute your own fav if you don’t have Futura Medium.
gg <- ggplot(flights_dat, aes(x=year, y=n, group=meta)) gg <- gg + stat_smooth(mapping=aes(fill=meta), geom="area", position="stack", method="gam", formula=y~s(x)) gg <- gg + scale_fill_manual(name="Reason:", values=flights_palette, labels=c("Criminal", "Human Error", "Mechanical", "Unknown", "Weather")) gg <- gg + scale_y_continuous(breaks=c(0, 5, 10, 13)) gg <- gg + facet_grid(~phase) gg <- gg + labs(x=NULL, y=NULL, title="Crashes by year, by reason & flight phase") gg <- gg + theme_bw() gg <- gg + theme(legend.position="bottom") gg <- gg + theme(text=element_text(family="Futura Medium")) gg <- gg + theme(plot.title=element_text(face="bold", hjust=0)) gg <- gg + theme(panel.grid=element_blank()) gg <- gg + theme(panel.border=element_blank()) gg <- gg + theme(strip.background=element_rect(fill="#525252")) gg <- gg + theme(strip.text=element_text(color="white")) gg |
That ultimately produces:
with the facets ordered by takeoff, flying, approaching landing and actual landing phases. Overall, things have gotten way better, though I haven’t had time to look in to the bump between 2005 and 2010 for landing crashes.
As an aside, Boeing has a really nice PDF on some of this data with quite a bit more detail.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.