More Airline Crashes via the Hadleyverse
I saw a fly-by #rstats mention of more airplane accident data on, of all places, LinkedIn (email) today, which took me to a GitHub repo by @philjette. It seems there’s a web site (run by what seems to be a single human) that tracks plane crashes. Here’s a tweet from @philjette announcing it:
> Wrote some R code for looking at historical crash data https://t.co/bgrMj3PIZu #data #r #DataMining pic.twitter.com/zYzpOyD9JY
>
> — PhilJ (@philjette) March 26, 2015
The repo contains the R code that scrapes the site; it’s (mostly) old-school R and works really well. I’m collecting and conjuring many bits of R for the classes I’m teaching in the fall and thought it would be useful to replicate @philjette’s example in modern Hadleyverse style (i.e. `dplyr`, `rvest`, etc.). I even submitted a pull request to him with the additional version. I’ve replicated it below with some additional comments for those wanting to jump into the Hadleyverse. No shiny `ggplot2` graphs this time, I’m afraid; this is all raw code, but it will hopefully be useful to those learning the modern ropes.
Just to get the setup bits out of the way, here are all the packages I’ll be using:
```r
library(dplyr)
library(rvest)
library(magrittr)
library(stringr)
library(pbapply)

# sprintf template for a single year's crash page; get_data() below relies on
# it. NOTE: the exact URL is my assumption here (one "year/year.htm" archive
# page per year on the site @philjette's repo scrapes)
crash_base <- "http://www.planecrashinfo.com/%d/%d.htm"
```
Phil made a function to grab data for a whole year, so I did the same and gave it a default parameter of the current year (programmatically). I also tossed in some parameter checking for good measure.
The basic setup is to:

- grab the HTML for the page of a given year
- extract and format the crash dates
- extract location & operator information, which is made slightly annoying since the site uses a `<br>` and includes spurious newlines within a single `<td>` element (see the short sketch after this list)
- extract aircraft type and registration (same issues as the previous column)
- extract accident details, which are embedded in a highly formatted column that requires `str_match_all` to handle (well)
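To make the `<br>` wrangling concrete, here’s a minimal sketch against made-up markup (the tiny HTML string and its values are hypothetical, but the XPath axes are exactly the ones used in the full function below):

```r
# a <td> that packs two values around a <br>, like the location/operator column
snippet <- html("<table><tr><td>Somewhere, Someplace<br/>Some Operator</td></tr></table>")

# text node *before* the <br>
snippet %>%
  html_nodes(xpath="//td/br/preceding-sibling::text()") %>%
  html_text()
## [1] "Somewhere, Someplace"

# text node *after* the <br>
snippet %>%
  html_nodes(xpath="//td/br/following-sibling::text()") %>%
  html_text()
## [1] "Some Operator"
```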
Some things worth mentioning:

- `data_frame` is super-helpful in not creating `factors` from the character vectors
- `bind_rows` and `bind_cols` are a nice alternative to using `data.table` functions
- I think `stringr` needs a more pipe-friendly replacement for `gsub` and, perhaps, even `ifelse` (yes, I guess I could submit a PR); the `.` just feels wrong in pipes to me, still. See the sketch after this list for what I mean
- if you’re not using `pbapply` functions (free progress bars for everyone!) you should be, especially for long scraping operations
- sometimes XPath entries can be less verbose than CSS (and easier to craft), and I have no issue mixing them in scraping code when necessary
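To show what I mean by pipe-friendly, here’s the kind of data-first wrapper I’m wishing for (a throwaway helper of my own, not anything in `stringr`):

```r
# hypothetical data-first gsub so the `.` placeholder drops out of pipelines
gsub_ <- function(x, pattern, replacement) gsub(pattern, replacement, x)

c("  Near Somewhere ", "Off Someplace\n") %>%
  gsub_("(^[[:space:]]*|[[:space:]]*$|\n)", "") %>%
  gsub_("^(Near|Off) ", "")
## [1] "Somewhere" "Someplace"
```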
Here’s the new `get_data` function:
```r
#' retrieve crash data for a given year
#' defaults to current year
#' earliest year in the database is 1920
get_data <- function(year=as.numeric(format(Sys.Date(), "%Y"))) {

  if (year < 1920 | year > as.numeric(format(Sys.Date(), "%Y"))) {
    stop("year must be >=1920 and <=current year", call.=FALSE)
  }

  # get crash date (first column; drop the header cell, then parse)

  pg <- html(sprintf(crash_base, year, year))
  pg %>%
    html_nodes("table > tr > td:nth-child(1)") %>%
    html_text() %>%
    extract(-1) %>%
    as.Date(format="%d %b %Y") %>%
    data_frame(date=.) -> date

  # get location and operator (one <td> holds both, split by a <br>)

  loc_op <- bind_rows(lapply(1:length(date), function(i) {

    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      gsub("(^[[:space:]]*|[[:space:]]*$)", "", .) %>%
      gsub("^(Near|Off) ", "", .) -> loc

    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      gsub("(^[[:space:]]*|[[:space:]]*$|\n)", "", .) -> op

    data_frame(location=loc, operator=op)

  }))

  # get type & registration (same <br>-separated layout; "?" means unknown)

  type_reg <- bind_rows(lapply(1:length(date), function(i) {

    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      gsub("(^[[:space:]]*|[[:space:]]*$|\n)", "", .) %>%
      ifelse(.=="?", NA, .) -> typ

    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      gsub("(^[[:space:]]*|[[:space:]]*$|\n)", "", .) %>%
      ifelse(.=="?", NA, .) -> reg

    data_frame(type=typ, registration=reg)

  }))

  # get fatalities (the column packs "aboard/fatalities(ground)" into one string)

  pg %>%
    html_nodes("table > tr > td:nth-child(4)") %>%
    html_text() %>%
    str_match_all("([[:digit:]]+)/([[:digit:]]+)\\(([[:digit:]]+)\\)") %>%
    lapply(function(x) {
      data_frame(aboard=as.numeric(x[2]), fatalities=as.numeric(x[3]),
                 ground=as.numeric(x[4]))
    }) %>%
    bind_rows %>%
    tail(-1) -> afg

  bind_cols(date, loc_op, type_reg, afg)

}
```
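If you want to kick the tires on a single year first, something like this works (`glimpse` is just one way to eyeball the resulting columns):

```r
# fetch one year and inspect the structure of the result
crashes_2014 <- get_data(2014)
glimpse(crashes_2014)
```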
While that gets one year, it’s super-simple to get all crashes since 1950:
```r
crashes <- bind_rows(pblapply(1950:2015, get_data))
```
Yep. That’s it. Now `crashes` contains a `data.frame` (well, a `tbl_df`) of all the crashes since 1950, ready for further analysis.
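Since it’s a well-formed table, quick summaries fall out of a couple of `dplyr` verbs; here’s a sketch (assuming the column names built in `get_data` above):

```r
# crashes and total fatalities per year (sketch; uses the columns from get_data)
crashes %>%
  mutate(year=format(date, "%Y")) %>%
  group_by(year) %>%
  summarise(n_crashes=n(), total_fatalities=sum(fatalities, na.rm=TRUE))
```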
For the class I’m teaching, I’ll be extending this to grab the extra details for each crash link and then performing more data science-y operations.
If you’ve got any streamlining tips or alternate ways to handle the scraping, Hadleyverse-style, please drop a note in the comments. Also, definitely check out Phil’s great solution, especially to compare it to this new version.