Plane Crash Data – Part 1: Web Scraping

[This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Several months ago I stumbled across the Kaggle data set Airplane Crashes Since 1908. Since I couldn’t find the data source, I searched the web for historical plane crash data and quickly found the web page http://www.planecrashinfo.com. On this site you can find various tables inside tables with lots of information on aviation accidents of the last century. What you also find is a nice exercise on web scraping to collect the data on the web by your own.

This is what I’ll do in part one of the plane crash series. The second part treats google maps API requests, here I’m going to retrieve the geocoordinates for the locations in the data set on plane crashes. In the last part of the series I will visualize the flight routes and the crash locations using the leaflet package in R.

<pre class="r"><code># load required packages library("dplyr") library("xml2") library("rvest") library("tidyr") library("magrittr")

In order to understand the structure of the website we are going to query, I advise you to go ahaed and visit http://www.planecrashinfo.com, http://www.planecrashinfo.com/2016/2016 and for example http://www.planecrashinfo.com/2016/2016-1. The first site is the landing page. You find recent plane crashes and incidents there:

http://www.planecrashinfo.com

If you enter the second page, you can find data on all plane crashes happening in 2016:

Overview of accidents in 2016 (planecrashinfo.com/2016/2016)

By adding a -1 to the url, you access the first entry in the table and get detailed information on that:

Details about first crash in 2016 (planecrashinfo.com/2016/2016-1)

So, we first generate the URLs for the years 2014, 2015, and 2016.

<pre class="r"><code># select from which years to get plane crash data and create URL for each years <- c("2014", "2015", "2016") urlYears <- paste0("http://www.planecrashinfo.com/", years,                     "/", years, ".htm") </code>

Next, we enter the table of the respective year, retrieve the number of entries in the table, and generate specific URLs for all entries.

<pre class="r"><code># get URL for each plane crash table from each year selected urls <- mapply(function(urlYear, year) {   rowNumber <- urlYear %>%      read_html %>%      html_table(header = TRUE) %>%      .[[1]] %>% nrow      urlsSpec <- paste0("http://www.planecrashinfo.com/", year,                       "/", year, "-", 1:rowNumber, ".htm") }, urlYears, years ) %>% unlist </code>

Now we can access the specific tables and retrieve the information.

<pre class ="r"><code># retrieve information  data <- lapply(urls, function(url) {   url %>%      # read each crash table     read_html %>%      html_table %>%      data.frame %>%       setNames(c("Vars", "Vals")) %>%     # header is a colunm and values are in a column -> tidy up     spread(Vars, Vals) %>%      .[-1] })</code>

We put together the list of data frames to one data set and choose appropriate variable names. We remove the colon and use lower case letters for the variable names.

<pre class="r"><code># data list to data.frame and set appropriate variable names data %<>%    bind_rows %>%    setNames(gsub(":", "", colnames(.)) %>% tolower)</code>

Now we choose only the variables relevant for our analysis.

<pre class="r"><code># pick relevant variables data %<>% select(date, location, route, operator, aboard, fatalities) </code>

Next, we want to clean up the strings contained in the location variable.

<pre class="r"><code># clean location variable data %<>% mutate(location = gsub("Near | near", "", location)) %>%    mutate(location = gsub(",.*,", ",", location))</code>

Finally, we extract the number of people aboard by splitting the strings of the variable aboard and chosing the first element. We do the same with the variable fatalities in order to extract the number of people who died. Lastly, we calculate a ratio, giving us the percentage of people who died.

<pre class="r"><code># extract total number of people aboard and fatalities  data$aboard <- strsplit(data$aboard, " ") %>%    lapply("[[", 1)  %>%    as.numeric data$fatalities <- strsplit(data$fatalities, " ") %>%    lapply("[[", 1) %>%    as.numeric data$deathRate <- round((data$fatalities/data$aboard) * 100, 2)</code>

Now the data is ready to be complemented with geocoordinates and then visualised.

Further parts of the Plane Crash series:

To leave a comment for the author, please follow the link and comment on their blog: INWT-Blog-RBloggers.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)