Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
For a project I recently faced the issue of getting a database of all aviation incidents. As I really wanted to try Hadley’s new rvest-package, I thought I will give it a try and share the code with you.
The data of aviation incidents starting in 1919 from the Aviation Safety Network can be found here: http://aviation-safety.net/database/.
First, we needed to install and load the rvest-package, as well as dplyr, which I love for removing lots of messy code (if you are unfamiliar with the piping-operator %>% have a look at this description: http://www.r-statistics.com/2014/08/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/)
install.packages("rvest") install.packages("dplyr") require("rvest") require("dplyr")
Let’s try out some functions of rvest.
Say we want to read all incidents that happened in the year 1920: http://aviation-safety.net/database/dblist.php?Year=1920. We need to find the right html table to download and the link to it, to be more precise, the XPath. This can be done by using “inspect element” (right-click on the table, inspect element, right click on the element in the code and “copy XPath”). In our case the XPath is
“//*[@id=”contentcolumnfull”]/div/table”.
To load the html data to R we can use:
url <- "http://aviation-safety.net/database/dblist.php?Year=1920" # load the html code to R incidents1920 <- url %>% html() # filter for the right xpath node incidents1920 <- incidents1920 %>% html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') # convert to a data.frame incidents1920 <- incidents1920 %>% html_table() %>% data.frame() # or in one go incidents1920 <- url %>% html() %>% html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>% html_table() %>% data.frame()
Which gives us a small data.frame of 4 accidents.
But what happens if we have more than one page of data per year? We certainly don’t want to paste everything by hand. Take 1962 for example http://aviation-safety.net/database/dblist.php?Year=1962, which has 3 pages. Luckily we can get the number of pages by using rvest as well.
We follow the steps above to get the number of pages per year with the XPath “//*[@id=”contentcolumnfull”]/div/div[2]“, with some cleaning we get the maximum pagenumber as:
url <- "http://aviation-safety.net/database/dblist.php?Year=1962" pages <- url %>% html() %>% html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/div[2]') %>% html_text() %>% strsplit(" ") %>% unlist() %>% as.numeric() %>% max() pages # [1] 3
Now we can write a small loop to get all incidents of 1962, as the link changes with the page number, ie from:
http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=1
to
http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=2
The code for the loop looks like this:
# initiate empty data.frame, in which we will store the data dat <- data.frame(date = numeric(0), type = numeric(0), registration = numeric(0), operator = numeric(0), fatalities = numeric(0), location = numeric(0), category = numeric(0)) # loop through all page numbers for (page in 1:pages){ # create the new URL for the current page url <- paste0("http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=", page) # get the html data and convert it to a data.frame incidents <- url %>% html() %>% html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>% html_table() %>% data.frame() # combine the data dat <- rbind(dat, incidents) } # quick look at the dimensions of the data dim(dat) # [1] 211 9
which gives us a data.frame consisting of 211 incidents of the year 1962.
Lastly, we can write a loop to gather the data over multiple years:
# set-up of initial values startyear <- 1960 endyear <- 1965 url_init <- "http://aviation-safety.net/database/dblist.php?Year=" # initiate empty dataframe, in which we will store the data dat <- data.frame(date = numeric(0), type = numeric(0), registration = numeric(0), operator = numeric(0), fatalities = numeric(0), location = numeric(0), category = numeric(0)) for (year in startyear:endyear){ # get url for this year url_year <- paste0(url_init, year) # get pages pages <- url_year %>% html() %>% html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/div[2]') %>% html_text() %>% strsplit(" ") %>% unlist() %>% as.numeric() %>% max() # loop through the pages for (page in 1:pages){ url <- paste0(url_year,"&lang=&page=", page) # get the html data and convert it to a data.frame incidents <- url %>% html() %>% html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>% html_table() %>% data.frame() # combine the data dat <- rbind(dat, incidents) } } dim(dat) # [1] 1268 9
In the years 1960-1965 there are 1.268 recorded aviation incidents, which we can now use in R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.