Introduction to the sfo Package

Rami Krispin

1 year ago

[This article was first published on Rami Krispin, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Happy to announce (in a one month delay) a new package – sfo, which provides monthly statistics of the air passenger traffic at San Francisco International (SFO) airport. The package is available on CRAN.

Additional information available:

Source code – https://github.com/RamiKrispin/sfo
Package site – https://ramikrispin.github.io/sfo/index.html
Vignettes – https://ramikrispin.github.io/sfo/articles
Data source – San Francisco Open Data portal

The package provides the following two datasets:

sfo_passengers – air traffic passengers statistics
sfo_stats – air traffic landing statistics

This post focuses on analyzing the sfo_passengers dataset.

sfo_passengers

The sfo_passengers dataset provides monthly statistics about the air traffic passengers in San Francisco International Airport between July 2005 and September 2020. That includes monthly information on the total number of passengers arriving or departing from SFO by airline operator, geographic type (domestic and international), and region (e.g., US, Europe, Mexico, Asia):

library(sfo)

data("sfo_passengers")

str(sfo_passengers)
## 'data.frame':    22576 obs. of  12 variables:
##  $ activity_period            : int  202009 202009 202009 202009 202009 202009 202009 202009 202009 202009 ...
##  $ operating_airline          : chr  "United Airlines" "United Airlines" "United Airlines" "United Airlines" ...
##  $ operating_airline_iata_code: chr  "UA" "UA" "UA" "UA" ...
##  $ published_airline          : chr  "United Airlines" "United Airlines" "United Airlines" "United Airlines" ...
##  $ published_airline_iata_code: chr  "UA" "UA" "UA" "UA" ...
##  $ geo_summary                : chr  "International" "International" "International" "International" ...
##  $ geo_region                 : chr  "Mexico" "Mexico" "Mexico" "Mexico" ...
##  $ activity_type_code         : chr  "Enplaned" "Enplaned" "Enplaned" "Deplaned" ...
##  $ price_category_code        : chr  "Other" "Other" "Other" "Other" ...
##  $ terminal                   : chr  "Terminal 3" "Terminal 3" "International" "International" ...
##  $ boarding_area              : chr  "F" "E" "G" "G" ...
##  $ passenger_count            : int  6712 396 376 6817 3851 3700 71 83 65 45 ...

This dataset enables us to answer questions about the passengers’ flights characteristics at SFO, such as:

How many passengers traveling every month at SFO airport
What is the airlines’ operators distribution
Flights geographic distribution

The examples below demonstrated how to summarise and visualize the data to answer those types of questions.

Monthly number of passengers

To see the greater picture, it is always nice to start with a time-series view of the total number of passengers at SFO. Before starting, let’s reformat the activity_period variable into a Date object:

sfo_passengers$date <- as.Date(paste(substr(sfo_passengers$activity_period, 1,4), 
                                     substr(sfo_passengers$activity_period, 5,6), 
                                     "01", sep ="/"))

We set the first day of the month as the default day of the date object.

Next, we will use the dplyr package to aggregate the data by the date object, and visualize the output with the plotly package:

library(dplyr)
library(plotly)

df_total <- sfo_passengers %>%
  group_by(date) %>%
  summarise(total = sum(passenger_count))

head(df_total)
## # A tibble: 6 x 2
##   date         total
##   <date>       <int>
## 1 2005-07-01 3225769
## 2 2005-08-01 3195866
## 3 2005-09-01 2740553
## 4 2005-10-01 2770715
## 5 2005-11-01 2617333
## 6 2005-12-01 2671797
plot_ly(data = df_total,
        x = ~ date,
        y = ~ total,
        type = "scatter",
        mode = "lines") %>%
  layout(title = "Monthly Air Traffic Passengers at SFO",
         yaxis = list(title = "Number of Passengers"),
         xaxis = list(title = "Source: San Francisco data portal (DataSF)"))

As can be observed in the plot above, as expected, the Covid19 significantly impacted the flights to SFO.

Likewise, we can aggregate the data by the activity type (arrival, departure, transit):

sfo_passengers %>%
  group_by(date, activity_type_code) %>%
  summarise(total = sum(passenger_count), .groups = "drop") %>%
  as.data.frame() %>%
  plot_ly(x = ~ date,
          y = ~ total,
          color = ~ activity_type_code,
          type = "scatter",
          mode = "lines") %>%
  layout(title = "Monthly Air Traffic Passengers at SFO by Activity Type",
         yaxis = list(title = "Number of Passengers"),
         xaxis = list(title = "Source: San Francisco data portal (DataSF)"),
         legend = list(x = 0.1, y = 0.9))

Airlines’ operators summary

The dataset provides two indicators about the airline company:

operating_airline filed describes the aircraft operator name, and
published_airline filed describes the name of the airline that issues the ticket and books revenue

In most of the cases it is the same company, in other cases, such as code-sharing operator it may two different companies. Let’s review the characteristics of those variables:

length(unique(sfo_passengers$operating_airline))
## [1] 94
length(unique(sfo_passengers$published_airline))
## [1] 84

You can notice that there are fewer published airlines compared to operating airlines. This potentially means that some operating airlines have lines with code sharing. For simplicity, we will focus on the operating_airline variable.

As 2020 was no ordinary year for most of us, and in particular for the airline industry, we will focus use 2019 to explore the airline operators’ distribution:

d19 <- sfo_passengers %>% 
  filter(activity_period >= 201901 & activity_period < 202001) %>%
  group_by(published_airline, geo_summary) %>%
  summarise(total = sum(passenger_count), .groups = "drop")

head(d19)
## # A tibble: 6 x 3
##   published_airline                        geo_summary     total
##   <chr>                                    <chr>           <int>
## 1 ABC Aerolineas S.A. de C.V. dba Interjet International  120633
## 2 Aer Lingus                               International  179430
## 3 Aeromexico                               International  339192
## 4 Air Canada                               International 1123314
## 5 Air China                                International  257998
## 6 Air France                               International  359462

Before we summarize the data by airlines, let’s see the distribution of flights by geo type (domestic vs. international):

d19 %>% group_by(geo_summary) %>%
  summarise(total_geo = sum(total)) %>%
  plot_ly(labels = ~ geo_summary,
          values = ~ total_geo) %>%
  add_pie(hole = 0.6)

Close 75% of the air passenger traffic in SFO during 2019 were domestic. That is 42M passengers out of 57.4M passengers. Let’s use a treemap plot to see the distribution of passengers by airline and geo:

 plot_ly(data = d19 %>% filter(geo_summary == "Domestic"),
          labels = ~ published_airline, 
          values = ~ total,
          type = "treemap",
          parents=  ~geo_summary,
          domain = list(row=0),
          name = "Domestic",
          textinfo="label+value+percent parent") %>%
    add_trace(data = d19 %>% filter(geo_summary == "International"),
            labels = ~ published_airline, 
            values = ~ total,
            type = "treemap",
            parents=  ~geo_summary,
            domain = list(row=1),
            name = "International",
            textinfo="label+value+percent parent") %>%
    layout(title = "Passengers Distribution During 2019 by Geo",
      grid=list(columns=1, rows=2))

Air Traffic Summary

Finally, it would be interesting to leverage the different categorical variables to create a summary plot with Sankey. We will use the sankey_ly function, a plotly wrapper that automatically creates the data transformation, and a Sankey plot with plotly. We will rank the top 20 airlines during 2019 and plot the passengers’ distribution by the airline, travel type (domestic or international), travel geo, activity type (deplaned, enplaned, and transit), and fare type (low or other):

rank <- sfo_passengers %>%
  filter(activity_period >= 201901 & activity_period <= 201912) %>%
  group_by(operating_airline) %>%
  summarise(total = sum(passenger_count), .groups = "drop") %>%
  arrange(-total) %>%
  slice_head(n = 20) %>%
  select(operating_airline)


sfo_passengers %>%
  filter(activity_period >= 201901 & activity_period <= 201912, operating_airline %in% rank$operating_airline) %>%
  group_by(operating_airline, geo_summary, geo_region, activity_type_code, 
           price_category_code, terminal) %>%
  summarise(total = sum(passenger_count), .groups = "drop") %>%
  sankey_ly(cat_cols = c("operating_airline","geo_summary",
                      "geo_region", "activity_type_code",
                      "price_category_code"),
          num_col = "total",
          title = "SFO Air Traffic Passenger Dist. During 2019")

To leave a comment for the author, please follow the link and comment on their blog: Rami Krispin.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.