Creating nice tables using R Markdown
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
One of the neat tools available via a variety of packages in R is the creation of beautiful tables using data frames stored in R. In what follows, I’ll discuss these different options using data on departing flights from Seattle and Portland in 2014. (More information and the source code for this R package is available at https://github.com/ismayc/pnwflights14.)
We begin by ensuring the needed packages are installed and then load them into our R session.
# List of packages required for this analysis pkg <- c("dplyr", "knitr", "devtools", "DT", "xtable") # Check if packages are not installed and assign the # names of the packages not installed to the variable new.pkg new.pkg <- pkg[!(pkg %in% installed.packages())] # If there are any packages in the list that aren't installed, # install them if (length(new.pkg)) { install.packages(new.pkg, repos = "http://cran.rstudio.com") } # Load the packages into R library(dplyr) library(knitr) library(DT) library(xtable) # Install Chester's pnwflights14 package (if not already) if (!require(pnwflights14)){ library(devtools) devtools::install_github("ismayc/pnwflights14") } library(pnwflights14) # Load the flights dataset data("flights", package = "pnwflights14")
The dataset provides for the development of a lot of interesting questions. Here I will delve further into some of the questions I addressed in two recent workshops I led in the Fall 2015 Data @ Reed Research Skills Workshop Series. (Slides available at http://rpubs.com/cismay.)
The questions I will analyze by creating tables are
- Which destinations had the worst arrival delays (on average) from the two PNW airports?
- How does the maximum departure delay vary by month for each of the two airports?
- How many flights departed for each airline from each of the airports?
The kable
function in the knitr
package
To address the first question, we will use the dplyr
package written by Hadley Wickham as below. We’ll use the top_n
function to isolate the 5 worst mean arrival delays.
worst_arr_delays <- flights %>% group_by(dest) %>% summarize(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>% arrange(desc(mean_arr_delay)) %>% top_n(n = 5, wt = mean_arr_delay)
This information is helpful but you may not necessarily know to which airport each of these FAA airport codes refers. One of the other data sets included in the pnwflights14
package is airports
that lists the names. Here we will do a match to identify the names of these airports using the inner_join
function in dplyr
.
data("airports", package = "pnwflights14") joined_worst <- inner_join(worst_arr_delays, airports, by = c("dest" = "faa")) %>% select(name, dest, mean_arr_delay) %>% rename("Airport Name" = name, "Airport Code" = dest, "Mean Arrival Delay" = mean_arr_delay)
Lastly we output this table cleanly using the kable
function.
kable(joined_worst)
Airport Name | Airport Code | Mean Arrival Delay |
---|---|---|
Cleveland Hopkins Intl | CLE | 26.150000 |
William P Hobby | HOU | 10.250000 |
Metropolitan Oakland Intl | OAK | 10.067460 |
San Francisco Intl | SFO | 8.864937 |
Bellingham Intl | BLI | 8.673913 |
Oddly enough, flights to Cleveland (from PDX and SEA) had the worst arrival delays in 2014. Houston also had around a 10 minute delay on average. Surprisingly, the airport in Bellingham, WA (only around 100 miles north of SEA) had the fifth largest mean arrival delay.
The DT package
In order to answer the second question, we’ll again make use of the various functions in the dplyr
package.
dep_delays_by_month <- flights %>% group_by(origin, month) %>% summarize(max_delay = max(dep_delay, na.rm = TRUE))
The DT
package provides a nice interface for viewing data frames in R. I’ve specified a few extra options here to show all 12 months by default and to automatically set the width. Go ahead and play around with the filter boxes at the top of each column too. (An excellent tutorial on DT
is available at https://rstudio.github.io/DT/.)
datatable(dep_delays_by_month, filter = 'top', options = list( pageLength = 12, autoWidth = TRUE ))
The created table in HTML is available here.
If you click on the max_delay
column header, you should see that the maximum departure delay for PDX was in March and for Seattle was in May.
The xtable
package to produce nice tables in a PDF
Again, we find ourselves using the extremely helpful dplyr
package to answer this question and to create the underpinnings of our table to display. We merge the flights data with the airlines
data to get the names of the airlines from the two letter carrier code.
data("airlines", package = "pnwflights14") by_airline <- flights %>% group_by(origin, carrier) %>% summarize(count = n()) %>% inner_join(x = ., y = airlines, by = "carrier") %>% arrange(desc(count))
The xtable
package and its xtable
function (and also the kable
function you saw earlier) provide the functionality to generate HTML code or LaTeX code to produce a table. We will focus on producing the LaTeX code in this example.
print(xtable(by_airline), comment = FALSE)
begin{table}[ht]
centering
begin{tabular}{rllrl}
hline
& origin & carrier & count & name
hline
1 & PDX & AS & 12844 & Alaska Airlines Inc.
2 & PDX & WN & 11193 & Southwest Airlines Co.
3 & PDX & OO & 9841 & SkyWest Airlines Inc.
4 & PDX & UA & 6061 & United Air Lines Inc.
5 & PDX & DL & 5168 & Delta Air Lines Inc.
6 & PDX & US & 2361 & US Airways Inc.
7 & PDX & AA & 2187 & American Airlines Inc.
8 & PDX & F9 & 1362 & Frontier Airlines Inc.
9 & PDX & B6 & 1287 & JetBlue Airways
10 & PDX & VX & 666 & Virgin America
11 & PDX & HA & 365 & Hawaiian Airlines Inc.
12 & SEA & AS & 49616 & Alaska Airlines Inc.
13 & SEA & WN & 12162 & Southwest Airlines Co.
14 & SEA & DL & 11548 & Delta Air Lines Inc.
15 & SEA & UA & 10610 & United Air Lines Inc.
16 & SEA & OO & 8869 & SkyWest Airlines Inc.
17 & SEA & AA & 5399 & American Airlines Inc.
18 & SEA & US & 3585 & US Airways Inc.
19 & SEA & VX & 2606 & Virgin America
20 & SEA & B6 & 2253 & JetBlue Airways
21 & SEA & F9 & 1336 & Frontier Airlines Inc.
22 & SEA & HA & 730 & Hawaiian Airlines Inc.
hline
end{tabular}
end{table}
If you don’t know LaTeX, I’ve also duplicated a similar table using kable
for you to compare:
kable(by_airline)
origin | carrier | count | name |
---|---|---|---|
PDX | AS | 12844 | Alaska Airlines Inc. |
PDX | WN | 11193 | Southwest Airlines Co. |
PDX | OO | 9841 | SkyWest Airlines Inc. |
PDX | UA | 6061 | United Air Lines Inc. |
PDX | DL | 5168 | Delta Air Lines Inc. |
PDX | US | 2361 | US Airways Inc. |
PDX | AA | 2187 | American Airlines Inc. |
PDX | F9 | 1362 | Frontier Airlines Inc. |
PDX | B6 | 1287 | JetBlue Airways |
PDX | VX | 666 | Virgin America |
PDX | HA | 365 | Hawaiian Airlines Inc. |
SEA | AS | 49616 | Alaska Airlines Inc. |
SEA | WN | 12162 | Southwest Airlines Co. |
SEA | DL | 11548 | Delta Air Lines Inc. |
SEA | UA | 10610 | United Air Lines Inc. |
SEA | OO | 8869 | SkyWest Airlines Inc. |
SEA | AA | 5399 | American Airlines Inc. |
SEA | US | 3585 | US Airways Inc. |
SEA | VX | 2606 | Virgin America |
SEA | B6 | 2253 | JetBlue Airways |
SEA | F9 | 1336 | Frontier Airlines Inc. |
SEA | HA | 730 | Hawaiian Airlines Inc. |
With the originating airport duplicating across all of the airlines, it would be nice if we could reduce this duplication and just bold PDX or SEA and have each appear once. Awesomely enough, the rle
function in R will be of great help to us in this endeavor. It counts how many times a value is repeated in a table. We will then make a call to the multirow
function in LaTeX in a sneaky way of pasting the appropriate text in addition to using the force
option for sanitizing the text into LaTeX.
We add in a few options to make the output of the table a little nicer by specifying horizontal lines and removing the default rownames.
rle.lengths <- rle(by_airline$origin)$lengths first <- !duplicated(by_airline$origin) by_airline$origin[!first] <- "" by_airline$origin[first] <- paste("\multirow{", rle.lengths, "}{*}{\textbf{", by_airline$origin[first], "}}") print(xtable(by_airline), comment = FALSE, hline.after=c(-1,0,nrow(by_airline), 11), sanitize.text.function = force, include.rownames = FALSE)
The resulting table produced by LaTeX can be found at Overleaf.com at https://www.overleaf.com/read/wvrpxpwrbvnk.
We see that Alaska Airlines had the most flights out of both airports with Southwest coming in second at both airports.
(The generating R Markdown file for this HTML document—saved in the .Rmd extension—is available here.)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.