Airline Performance Comparison with R/Shiny
Open the R Shiny app in a new window here!
In this project I set out to build an interactive app that lets a user compare airline performance between two cities in the continental US, in order to better inform his or her flying decisions. It is useful to have metrics beyond price alone when choosing an airline with which to build frequent flyer miles, get an airline-affiliated credit card, etc. For example, I currently live in New York but my immediate family is in Nashville. Since that is the route I fly most often, I want to get a sense of which airlines offer the most flights and how they perform in terms of delays and cancellations.
The Shiny package in R provides a really nice and intuitive web application framework, so I was able to take advantage of its features to build an interactive app driven by US Department of Transportation airline on-time performance data. The app is linked above so feel free to try it out! In this post I’ll show some selections of the code that were used to build the app and explain them in more detail.
The source code is separated into three files, each with a different function:
- A UI file (ui.R) which determines the look and feel of the app and provides inputs for the user to select.
- A “server” file (server.R) that generates output in response to the user’s input.
- A global file (global.R) to include the required R libraries, load the data and get it into the proper format (via merging, sorting, arranging, etc.), and define any functions used in the server file.
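To make the division of labor concrete, here is a minimal single-file sketch in which each piece plays the role of one of the three files. It is not the app's actual code: the `carriers` vector, the input ID `carrier_select`, and the output ID `picked` are hypothetical placeholders chosen for illustration.

```r
library(shiny)

# Role of global.R: objects defined once and visible to both UI and server.
carriers <- c('Carrier A', 'Carrier B')  # hypothetical placeholder data

# Role of ui.R: layout and user inputs.
ui <- fluidPage(
  selectInput('carrier_select', label = 'Carrier', choices = carriers),
  textOutput('picked')
)

# Role of server.R: build outputs from the current input values.
server <- function(input, output) {
  output$picked <- renderText({
    paste('Selected:', input$carrier_select)
  })
}

# Bundle UI and server into an app object.
app <- shinyApp(ui = ui, server = server)
# In an interactive session, runApp(app) starts the app.
```

Splitting the same pieces across ui.R, server.R, and global.R, as done in this project, keeps larger apps easier to navigate.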
First, here is the code for ui.R:
[code language="R"]
shinyUI(pageWithSidebar(
  headerPanel('Airline Comparison Tool'),

  # make sidebar with user inputs
  sidebarPanel(
    selectInput('origin_select', label = 'Origin',
                choices = origin$CITY_NAME, selected = 'New York, NY'),
    selectInput('dest_select', label = 'Destination',
                choices = dest$CITY_NAME, selected = 'San Francisco, CA'),
    dateRangeInput('dateRange', label = 'Date Range',
                   start = min(flights$FL_DATE), end = max(flights$FL_DATE)),
    selectInput('early_time', label = 'Departure Time (earliest)',
                choices = times, selected = '00:00'),
    selectInput('late_time', label = 'Departure Time (latest)',
                choices = times, selected = '24:00')
  ),

  # output plots to main panel with tabs to select type
  mainPanel(
    tabsetPanel(type = 'tabs',
      tabPanel('Flights', plotOutput('countPlot')),
      tabPanel('Delays', plotOutput('delayPlot')),
      tabPanel('Reason for Delay', plotOutput('typePlot')),
      tabPanel('Cancellations', plotOutput('cancelPlot'))
    ),
    # show flight path in main panel
    plotOutput('mapPlot')
  )
))
[/code]
Here we see how the basic layout of the app is defined. There is a sidebar with user inputs, including origin/destination as well as some filters based on date and departure time.
There are also a few tabs being created here to show the output plots from server.R in the main panel along with a flight path map for reference.
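The `times` vector used as the choices for the two departure-time inputs is defined outside ui.R and is not shown in this post. As a rough sketch, assuming hourly steps and assuming that scheduled departure times arrive as `hhmm` numbers (for example 930 for 09:30), it could be built as follows. The helper `to_hhmm` is a hypothetical name, not part of the app.

```r
# Hourly choices from '00:00' to '24:00' (the hourly step is an assumption).
times <- sprintf('%02d:00', 0:24)

# Hypothetical helper: convert hhmm numbers such as 930 into 'HH:MM' strings.
to_hhmm <- function(x) {
  padded <- sprintf('%04d', as.integer(x))
  paste0(substr(padded, 1, 2), ':', substr(padded, 3, 4))
}

dep_times <- to_hhmm(c(5, 930, 1745))

# Zero-padded 'HH:MM' strings sort in chronological order, so they can be
# compared directly against the selected time window.
in_window <- dep_times >= '06:00' & dep_times <= '12:00'
```

Keeping every time in the same zero-padded format is what makes plain string comparison against the selected inputs safe.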
Next is a selection of the code from server.R:
[code language="R"]
# show flight delays plot in main panel
output$delayPlot = renderPlot({

  # get delay by carrier
  subset_delay = filter(flights,
                        ORIGIN_CITY_NAME == input$origin_select &
                        DEST_CITY_NAME == input$dest_select &
                        FL_DATE >= input$dateRange[1] &
                        FL_DATE <= input$dateRange[2] &
                        CRS_DEP_TIME >= input$early_time &
                        CRS_DEP_TIME <= input$late_time)
  medians = group_by(subset_delay, CARRIER_NAME) %>%
    summarise(median(ARR_DELAY, na.rm = TRUE))
  medians_sorted = sort(unlist(medians[2]))
  delayTitle = paste("Arrival delay from", input$origin_select, "to", input$dest_select)

  # make plot
  p = ggplot(subset_delay,
             aes(x = reorder(CARRIER_NAME, ARR_DELAY, na.rm = TRUE, FUN = median),
                 y = ARR_DELAY))
  p + geom_boxplot(middle = medians_sorted, aes(fill = CARRIER_NAME)) +
    scale_fill_brewer(palette = "Set2", name = "Carriers") +
    ylim(-50, 50) +
    xlab('') +
    ylab('Arrival Delay (minutes)') +
    ggtitle(delayTitle) +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          text = element_text(size = 16)) +
    theme(legend.position="none")
})
[/code]
The code above is just one segment of server.R, shown to demonstrate how it takes the user’s input, filters the data (with dplyr), and then visualizes it (with ggplot2). This example is for the delay time boxplots, but the output for the other tabs is generated in a similar fashion.
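To see the filter-then-summarise step in isolation, here is a small sketch on made-up data. The `flights_toy` data frame and its values are invented for illustration; only the column names mirror the ones used in server.R.

```r
library(dplyr)

# Toy data: two carriers flying out of New York (values are made up).
flights_toy <- data.frame(
  CARRIER_NAME     = c('A', 'A', 'A', 'B', 'B', 'B'),
  ORIGIN_CITY_NAME = 'New York, NY',
  DEST_CITY_NAME   = c('Nashville, TN', 'Nashville, TN', 'Chicago, IL',
                       'Nashville, TN', 'Nashville, TN', 'Nashville, TN'),
  ARR_DELAY        = c(10, -5, 30, NA, 20, 4),
  stringsAsFactors = FALSE
)

# Keep one route, then compute the median arrival delay per carrier.
medians_toy <- flights_toy %>%
  filter(ORIGIN_CITY_NAME == 'New York, NY',
         DEST_CITY_NAME == 'Nashville, TN') %>%
  group_by(CARRIER_NAME) %>%
  summarise(median_delay = median(ARR_DELAY, na.rm = TRUE),
            n_flights = n()) %>%
  arrange(median_delay)
```

In the app, the hard-coded city names are replaced by `input$origin_select` and `input$dest_select`, so the summary is recomputed whenever the user changes a selection.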
The first tab in the application’s main panel is a bar chart showing the number of flights broken down by carrier:
The next tab shows boxplots of the arrival delay time for each airline, where a negative delay time means that the flight arrived ahead of schedule:
The next tab breaks these delay times down into proportions by type of delay. Here is a description of each type:
- Late Aircraft: previous flight arrived late.
- National Aviation System: non-extreme weather conditions, airport operations, heavy traffic volume, and air traffic control.
- Weather: extreme weather that prevents or delays operation of a flight.
- Security: evacuation of a terminal, security breach, inoperative screening equipment, and long lines in excess of 29 minutes at screening areas.
- Carrier: circumstances within the airline’s control such as maintenance or crew problems, aircraft cleaning, baggage loading, fueling, etc.
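One way such proportions could be computed is sketched below, assuming the per-cause delay-minute columns `CARRIER_DELAY`, `WEATHER_DELAY`, `NAS_DELAY`, `SECURITY_DELAY`, and `LATE_AIRCRAFT_DELAY` from the on-time performance data. The values are made up, and the app itself may calculate the breakdown differently (for example, per carrier or by number of delayed flights rather than minutes).

```r
# Toy delay minutes by cause; a row of NAs stands for a flight without
# a recorded delay breakdown (values are made up).
delays_toy <- data.frame(
  CARRIER_DELAY       = c(10,  0, NA, 5),
  WEATHER_DELAY       = c( 0,  0, NA, 0),
  NAS_DELAY           = c( 5, 20, NA, 0),
  SECURITY_DELAY      = c( 0,  0, NA, 0),
  LATE_AIRCRAFT_DELAY = c( 0, 10, NA, 0)
)

# Total minutes per cause, ignoring missing values.
totals <- colSums(delays_toy, na.rm = TRUE)

# Share of all delay minutes attributed to each cause.
props <- totals / sum(totals)
```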
The last tab shows the number of cancellations, also broken down by the types described above:
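As a sketch of how cancellations could be tallied, the example below assumes the `CANCELLED` flag and the single-letter `CANCELLATION_CODE` column from the on-time performance data (A for carrier, B for weather, C for National Aviation System, D for security). The toy rows are invented, and the app's own counting code is not shown in this post.

```r
# Toy data: one row per flight (values are made up).
cancel_toy <- data.frame(
  CARRIER_NAME      = c('A', 'A', 'B', 'B', 'B'),
  CANCELLED         = c(1, 0, 1, 1, 0),
  CANCELLATION_CODE = c('B', '', 'A', 'B', ''),
  stringsAsFactors  = FALSE
)

# Map the single-letter codes to readable labels.
code_labels <- c(A = 'Carrier', B = 'Weather',
                 C = 'National Aviation System', D = 'Security')

# Keep cancelled flights only and attach the readable reason.
cancelled <- cancel_toy[cancel_toy$CANCELLED == 1, ]
cancelled$REASON <- factor(unname(code_labels[cancelled$CANCELLATION_CODE]),
                           levels = unname(code_labels))

# Cross-tabulate cancellations by carrier and reason.
counts <- table(cancelled$CARRIER_NAME, cancelled$REASON)
```

Using a factor with all four levels keeps reasons with zero cancellations in the table, which is convenient when the result feeds a plot with a fixed legend.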
The last section of code is contained in the global.R file. I won’t go into much detail here, but the code in this file handles the loading, merging, sorting, and formatting of the data used by the application. However, I will highlight a function that is called from server.R to plot a map of the flight path:
[code language="R"]
map_plot = function(from, to){

  # get longitude/latitude at origin/destination
  lat_o <- origin$LAT[origin$CITY_NAME == from]
  long_o <- origin$LONG[origin$CITY_NAME == from]
  lat_d <- dest$LAT[dest$CITY_NAME == to]
  long_d <- dest$LONG[dest$CITY_NAME == to]

  # create map
  xlim = c(-125, -62.5)
  map('state', col = '#f2f2f2', fill = TRUE, xlim = xlim, boundary = TRUE, lty = 0)

  # draw the great-circle path between the two cities
  inter <- gcIntermediate(c(long_o, lat_o), c(long_d, lat_d), n = 50, addStartEnd = TRUE)
  lines(inter, col = 'red', lwd = 2)

  # label and mark both endpoints
  text(long_o, lat_o, from, col = 'blue', adj = c(-0.1, 1.25))
  text(long_d, lat_d, to, col = 'blue', adj = c(-0.1, 1.25))
  points(long_d, lat_d, cex = 1.5)
  points(long_o, lat_o, cex = 1.5)
}
[/code]
Here I’m using the maps and geosphere packages in R to make a flight path plot for reference that shows up below the data plots in the main panel. For example:
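The path itself comes from `gcIntermediate`, which returns points along the great circle between two locations. The sketch below shows that step on its own, using stand-in `origin` and `dest` data frames with approximate coordinates. These stand-ins are assumptions for illustration; in the app the lookup tables are built in global.R.

```r
library(geosphere)

# Stand-in lookup tables with approximate coordinates (assumed values).
origin <- data.frame(CITY_NAME = 'New York, NY', LAT = 40.71, LONG = -74.01,
                     stringsAsFactors = FALSE)
dest   <- data.frame(CITY_NAME = 'Nashville, TN', LAT = 36.16, LONG = -86.78,
                     stringsAsFactors = FALSE)

# geosphere expects points as (longitude, latitude), in that order.
p_from <- c(origin$LONG, origin$LAT)
p_to   <- c(dest$LONG, dest$LAT)

# 50 intermediate points plus the two endpoints along the great circle.
inter <- gcIntermediate(p_from, p_to, n = 50, addStartEnd = TRUE)

# In server.R, the map is presumably rendered with something like:
#   output$mapPlot = renderPlot({ map_plot(input$origin_select, input$dest_select) })
```

The result is a two-column matrix of longitude and latitude, which is why it can be passed straight to `lines()` on top of the base map.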
Currently, the main limitation of this application is that I was only able to reasonably use one month of flight data. I used the most recently available USDOT data from May 2015, which contains info on nearly 500,000 flights. In order to include more than a single month I need to be able to handle a very large amount of data (using Hadoop, for example). Getting that part incorporated will be the next step in the development of this app.