Plane Crash Data – Part 1: Web Scraping
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Several months ago I stumbled across the Kaggle data set Airplane Crashes Since 1908. Since I couldn’t find the data source, I searched the web for historical plane crash data and quickly found the web page http://www.planecrashinfo.com. On this site you can find various tables inside tables with lots of information on aviation accidents of the last century. What you also find is a nice exercise on web scraping to collect the data on the web by your own.
This is what I’ll do in part one of the plane crash series. The second part treats google maps API requests, here I’m going to retrieve the geocoordinates for the locations in the data set on plane crashes. In the last part of the series I will visualize the flight routes and the crash locations using the leaflet package in R.
<pre class="r"><code># load required packages library("dplyr") library("xml2") library("rvest") library("tidyr") library("magrittr")
In order to understand the structure of the website we are going to query, I advise you to go ahaed and visit http://www.planecrashinfo.com, http://www.planecrashinfo.com/2016/2016 and for example http://www.planecrashinfo.com/2016/2016-1. The first site is the landing page. You find recent plane crashes and incidents there:
If you enter the second page, you can find data on all plane crashes happening in 2016:
By adding a -1 to the url, you access the first entry in the table and get detailed information on that:
So, we first generate the URLs for the years 2014, 2015, and 2016.
<pre class="r"><code># select from which years to get plane crash data and create URL for each years <- c("2014", "2015", "2016") urlYears <- paste0("http://www.planecrashinfo.com/", years, "/", years, ".htm") </code>
Next, we enter the table of the respective year, retrieve the number of entries in the table, and generate specific URLs for all entries.
<pre class="r"><code># get URL for each plane crash table from each year selected urls <- mapply(function(urlYear, year) { rowNumber <- urlYear %>% read_html %>% html_table(header = TRUE) %>% .[[1]] %>% nrow urlsSpec <- paste0("http://www.planecrashinfo.com/", year, "/", year, "-", 1:rowNumber, ".htm") }, urlYears, years ) %>% unlist </code>
Now we can access the specific tables and retrieve the information.
<pre class ="r"><code># retrieve information data <- lapply(urls, function(url) { url %>% # read each crash table read_html %>% html_table %>% data.frame %>% setNames(c("Vars", "Vals")) %>% # header is a colunm and values are in a column -> tidy up spread(Vars, Vals) %>% .[-1] })</code>
We put together the list of data frames to one data set and choose appropriate variable names. We remove the colon and use lower case letters for the variable names.
<pre class="r"><code># data list to data.frame and set appropriate variable names data %<>% bind_rows %>% setNames(gsub(":", "", colnames(.)) %>% tolower)</code>
Now we choose only the variables relevant for our analysis.
<pre class="r"><code># pick relevant variables data %<>% select(date, location, route, operator, aboard, fatalities) </code>
Next, we want to clean up the strings contained in the location variable.
<pre class="r"><code># clean location variable data %<>% mutate(location = gsub("Near | near", "", location)) %>% mutate(location = gsub(",.*,", ",", location))</code>
Finally, we extract the number of people aboard by splitting the strings of the variable aboard and chosing the first element. We do the same with the variable fatalities in order to extract the number of people who died. Lastly, we calculate a ratio, giving us the percentage of people who died.
<pre class="r"><code># extract total number of people aboard and fatalities data$aboard <- strsplit(data$aboard, " ") %>% lapply("[[", 1) %>% as.numeric data$fatalities <- strsplit(data$fatalities, " ") %>% lapply("[[", 1) %>% as.numeric data$deathRate <- round((data$fatalities/data$aboard) * 100, 2)</code>
Now the data is ready to be complemented with geocoordinates and then visualised.
Further parts of the Plane Crash series:
- Plane Crash Data - Part 1: Web Scraping
- Plane Crash Data - Part 2: Google Maps Geocoding API Request
- Plane Crash Data - Part 3: Visualisation
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.