Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
For a recent project, I needed a database of all aviation incidents. As I really wanted to try Hadley’s new rvest package, I thought I would give it a try and share the code with you.
The data of aviation incidents starting in 1919 from the Aviation Safety Network can be found here: http://aviation-safety.net/database/.
First, we need to install and load the rvest package, as well as dplyr, which I love for removing lots of messy code (if you are unfamiliar with the pipe operator %>%, have a look at this description: http://www.r-statistics.com/2014/08/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/).
install.packages("rvest")
install.packages("dplyr")
library(rvest)
library(dplyr)
Let’s try out some functions of rvest.
Say we want to read all incidents that happened in the year 1920: http://aviation-safety.net/database/dblist.php?Year=1920. We need to find the right HTML table to download and a way to address it, or more precisely, its XPath. This can be found with the browser's “inspect element” tool (right-click on the table, choose “inspect element”, then right-click on the element in the code and choose “copy XPath”). In our case the XPath is
//*[@id="contentcolumnfull"]/div/table
To load the html data to R we can use:
url <- "http://aviation-safety.net/database/dblist.php?Year=1920"
# load the html code to R (read_html() replaces the deprecated html())
incidents1920 <- url %>% read_html()
# filter for the right xpath node
incidents1920 <- incidents1920 %>% html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table')
# convert to a data.frame
incidents1920 <- incidents1920 %>% html_table() %>% data.frame()
# or in one go
incidents1920 <- url %>% read_html() %>%
  html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>%
  html_table() %>% data.frame()
This gives us a small data.frame of 4 accidents.
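To see what the XPath step does without touching the network, here is a minimal sketch that runs the same pipeline on a small inline HTML string. The markup and its rows are made up for illustration; only the id and nesting mirror the XPath above. It assumes a current rvest version in which read_html() is available.

```r
library(rvest)

# A tiny stand-in page. The id and nesting mimic the XPath used above,
# but the rows are placeholders, not real incident data.
page <- read_html(paste0(
  '<html><body><div id="contentcolumnfull"><div><table>',
  '<tr><th>date</th><th>type</th><th>fatalities</th></tr>',
  '<tr><td>row 1 date</td><td>row 1 type</td><td>0</td></tr>',
  '<tr><td>row 2 date</td><td>row 2 type</td><td>2</td></tr>',
  '</table></div></div></body></html>'
))

# html_nodes() returns every node matching the XPath,
# html_table() converts each matched table, giving a list
tables <- page %>%
  html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>%
  html_table()

# pick the first (and here only) table
tbl <- as.data.frame(tables[[1]])
dim(tbl)
names(tbl)
```

Because html_table() returns one element per matched table, it is worth checking length(tables) before assuming the XPath matched exactly one node.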
But what happens if we have more than one page of data per year? We certainly don’t want to paste everything by hand. Take 1962 for example http://aviation-safety.net/database/dblist.php?Year=1962, which has 3 pages. Luckily we can get the number of pages by using rvest as well.
We follow the steps above to get the number of pages per year with the XPath //*[@id="contentcolumnfull"]/div/div[2]. With some cleaning we get the maximum page number as:
url <- "http://aviation-safety.net/database/dblist.php?Year=1962"
pages <- url %>% read_html() %>%
  html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/div[2]') %>%
  html_text() %>% strsplit(" ") %>% unlist() %>%
  as.numeric() %>% max()
pages
# [1] 3
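The cleaning step relies on every space-separated token being a number. If the pager text ever contains a label or other non-numeric token, as.numeric() yields NA for it and max() then returns NA. The following is a minimal sketch of a more forgiving variant in base R; max_page() is a hypothetical helper name, and the input strings are made up rather than taken from the site.

```r
# Hypothetical helper: pull every run of digits out of a pager string
# and return the largest one, falling back to 1 when none is found.
max_page <- function(txt) {
  digits <- unlist(regmatches(txt, gregexpr("[0-9]+", txt)))
  if (length(digits) == 0) return(1)
  max(as.numeric(digits))
}

max_page("1 2 3")          # plain list of page numbers
max_page("pages: 1 2 3")   # made-up pager text with a label
max_page("")               # no pager text: treated as a single page

# For comparison, the split-and-convert approach on the labelled text:
suppressWarnings(max(as.numeric(c("pages:", "1", "2", "3"))))  # NA
```

The fallback to 1 is a design choice for years that fit on a single page and may show no pager at all; whether the site behaves that way is an assumption to verify.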
Now we can write a small loop to get all incidents of 1962, as the link changes with the page number, i.e. from:
http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=1
to
http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=2
The code for the loop looks like this:
# initiate an empty data.frame, in which we will store the data
# (rbind() drops zero-row data.frames, so no columns need to be declared)
dat <- data.frame()
# loop through all page numbers
for (page in 1:pages){
  # create the new URL for the current page
  url <- paste0("http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=", page)
  # get the html data and convert it to a data.frame
  incidents <- url %>% read_html() %>%
    html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>%
    html_table() %>% data.frame()
  # combine the data
  dat <- rbind(dat, incidents)
}
# quick look at the dimensions of the data
dim(dat)
# [1] 211 9
This gives us a data.frame with 211 incidents for the year 1962.
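The loop above grows dat with rbind() on every pass, which is fine for three pages. As a sketch of an alternative pattern, the page URLs can be built in one vectorised paste0() call and the per-page results collected in a list before binding them once. Here fetch_page() is a hypothetical stand-in that returns made-up rows so the pattern can be tried offline; in practice it would wrap the scraping pipeline shown above.

```r
# Hypothetical stand-in for the real scraping step: returns a small
# made-up data.frame instead of contacting the website.
fetch_page <- function(url) {
  data.frame(url = url, row = 1:2, stringsAsFactors = FALSE)
}

base_url <- "http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page="
pages <- 3

# paste0() is vectorised, so all page URLs are built at once;
# seq_len() also behaves sensibly if pages were 0 (unlike 1:pages)
urls <- paste0(base_url, seq_len(pages))

# collect one data.frame per page, then bind them in a single step
page_list <- lapply(urls, fetch_page)
dat <- do.call(rbind, page_list)

dim(dat)
```

Since dplyr is already loaded in this post, dplyr::bind_rows(page_list) would be an equivalent way to do the final binding step.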
Lastly, we can write a loop to gather the data over multiple years:
# set-up of initial values
startyear <- 1960
endyear <- 1965
url_init <- "http://aviation-safety.net/database/dblist.php?Year="
# initiate an empty data.frame, in which we will store the data
# (rbind() drops zero-row data.frames, so no columns need to be declared)
dat <- data.frame()
for (year in startyear:endyear){
  # get url for this year
  url_year <- paste0(url_init, year)
  # get the number of pages for this year
  pages <- url_year %>% read_html() %>%
    html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/div[2]') %>%
    html_text() %>% strsplit(" ") %>% unlist() %>%
    as.numeric() %>% max()
  # loop through the pages
  for (page in 1:pages){
    url <- paste0(url_year, "&lang=&page=", page)
    # get the html data and convert it to a data.frame
    incidents <- url %>% read_html() %>%
      html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>%
      html_table() %>% data.frame()
    # combine the data
    dat <- rbind(dat, incidents)
  }
}
dim(dat)
# [1] 1268 9
For the years 1960-1965 there are 1,268 recorded aviation incidents, which we can now use in R.
