Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
It is because I am not a graph analysis expert that I though it important to write this article. For someone who thinks in terms of single rectangular data sets, it is a bit of a mental leap to understand how to apply tidy principles to a more robust object, such as a graph table. Thankfully, there are two packages that make this work much easier:
tidygraph
– Provides a way fordplyr
to interact with graphsggraph
– Extension toggplot2
for graph analysis
Quick intro
Simply put, graph theory studies relationships between objects in a group. Visually, we can think of a graph as a series of interconnected circles, each representing a member of a group, such as people in a Social Network. Lines drawn between the circles represent a relationship between the members, such as friendships in a Social Network. Graph analysis helps with figuring out things such as the influence of a certain member, or how many friends are in between two members. A more formal definition and detailed explanation of Graph Theory can be found in Wikipedia here.
Example
Using an example, this article will introduce concepts of graph analysis work, and how tidyverse
and tidyverse
-adjacent tools can be used for such analysis.
Data source
The tidytuesday weekly project encourages new and experienced users to use the tidyverse
tools to analyze data sets that change every week. I have been using that opportunity to lean new tools and techniques. One of the most recent data sets relates to French trains; it contains aggregate daily total trips per connecting stations.
library(readr) url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-26/small_trains.csv" small_trains <- read_csv(url) head(small_trains) ## # A tibble: 6 x 13 ## year month service departure_stati… arrival_station journey_time_avg ## <int> <int> <chr> <chr> <chr> <dbl> ## 1 2017 9 Nation… PARIS EST METZ 85.1 ## 2 2017 9 Nation… REIMS PARIS EST 47.1 ## 3 2017 9 Nation… PARIS EST STRASBOURG 116. ## 4 2017 9 Nation… PARIS LYON AVIGNON TGV 161. ## 5 2017 9 Nation… PARIS LYON BELLEGARDE (AI… 164. ## 6 2017 9 Nation… PARIS LYON BESANCON FRANC… 129. ## # … with 7 more variables: total_num_trips <int>, ## # avg_delay_all_departing <dbl>, avg_delay_all_arriving <dbl>, ## # num_late_at_departure <int>, num_arriving_late <int>, ## # delay_cause <chr>, delayed_number <dbl>
Data Preparation
Even though it was meant to analyze delays, I thought it would be interesting to use the data to understand how stations connect with each other. A new summarized data set is created, called routes, which contains a single entry for each connected station. It also includes the average journey time it takes to go between stations.
library(dplyr) routes <- small_trains %>% group_by(departure_station, arrival_station) %>% summarise(journey_time = mean(journey_time_avg)) %>% ungroup() %>% mutate(from = departure_station, to = arrival_station) %>% select(from, to, journey_time) routes ## # A tibble: 130 x 3 ## from to journey_time ## <chr> <chr> <dbl> ## 1 AIX EN PROVENCE TGV PARIS LYON 186. ## 2 ANGERS SAINT LAUD PARIS MONTPARNASSE 97.5 ## 3 ANGOULEME PARIS MONTPARNASSE 146. ## 4 ANNECY PARIS LYON 225. ## 5 ARRAS PARIS NORD 52.8 ## 6 AVIGNON TGV PARIS LYON 161. ## 7 BARCELONA PARIS LYON 358. ## 8 BELLEGARDE (AIN) PARIS LYON 163. ## 9 BESANCON FRANCHE COMTE TGV PARIS LYON 131. ## 10 BORDEAUX ST JEAN PARIS MONTPARNASSE 186. ## # … with 120 more rows
The next step is to transform the tidy data set, into a graph table. In order to prepare routes for this transformation, it has to contain two variables specifically named: from and to, which are the names that tidygraph
expects to see. Those variables should contain the name of each member (e.g., “AIX EN PROVENCE TGV”), and the relationship (“AIX EN PROVENCE TGV” -> “PARIS LYON”) .
In graph terminology, a member of the group is called a node (or vertex) in the graph, and a relationship between nodes is called an edge.
library(tidygraph) graph_routes <- as_tbl_graph(routes) graph_routes ## # A tbl_graph: 59 nodes and 130 edges ## # ## # A directed simple graph with 1 component ## # ## # Node Data: 59 x 1 (active) ## name ## <chr> ## 1 AIX EN PROVENCE TGV ## 2 ANGERS SAINT LAUD ## 3 ANGOULEME ## 4 ANNECY ## 5 ARRAS ## 6 AVIGNON TGV ## # … with 53 more rows ## # ## # Edge Data: 130 x 3 ## from to journey_time ## <int> <int> <dbl> ## 1 1 39 186. ## 2 2 40 97.5 ## 3 3 40 146. ## # … with 127 more rows
The as_tbl_graph()
function splits the routes table into two:
Node Data – Contains all of the unique values found in the from and to variables. In this case, it is a table with a single column containing the names of all of the stations.
Edge Data – Is a table of all relationships between from and to. A peculiarity of
tidygraph
is that it uses the row position of the node as the identifier for from and to, instead of its original name.
Another interesting thing about tidygraph
is that it allows us to attach more information about the node or edge in an additional column. In this case, journey_time is not really needed to create the graph table, but it may be needed for the analysis we plan to perform. The as_tbl_graph()
function automatically created the column for us.
Thinking about graph_routes as two tibbles
inside a larger table graph, was one of the two major mental breakthroughs I had during this exercise. At that point, it became evident that dplyr
needs a way to know which of the two tables (nodes or edges) to perform the transformations on. In tidygraph
, this is done using the activate()
function. To showcase this, the nodes table will be “activated” in order to add two new string variables derived from name.
library(stringr) graph_routes <- graph_routes %>% activate(nodes) %>% mutate( title = str_to_title(name), label = str_replace_all(title, " ", "\n") ) graph_routes ## # A tbl_graph: 59 nodes and 130 edges ## # ## # A directed simple graph with 1 component ## # ## # Node Data: 59 x 3 (active) ## name title label ## <chr> <chr> <chr> ## 1 AIX EN PROVENCE TGV Aix En Provence Tgv "Aix\nEn\nProvence\nTgv" ## 2 ANGERS SAINT LAUD Angers Saint Laud "Angers\nSaint\nLaud" ## 3 ANGOULEME Angouleme Angouleme ## 4 ANNECY Annecy Annecy ## 5 ARRAS Arras Arras ## 6 AVIGNON TGV Avignon Tgv "Avignon\nTgv" ## # … with 53 more rows ## # ## # Edge Data: 130 x 3 ## from to journey_time ## <int> <int> <dbl> ## 1 1 39 186. ## 2 2 40 97.5 ## 3 3 40 146. ## # … with 127 more rows
It was really impressive how easy it was to manipulate the graph table, because once one of the two tables are activated, all of the changes can be made using tidyverse
tools. The same approach can be used to extract data from the graph table. In this case, a list of all the stations is pulled into a single character vector.
stations <- graph_routes %>% activate(nodes) %>% pull(title) stations ## [1] "Aix En Provence Tgv" "Angers Saint Laud" ## [3] "Angouleme" "Annecy" ## [5] "Arras" "Avignon Tgv" ## [7] "Barcelona" "Bellegarde (Ain)" ## [9] "Besancon Franche Comte Tgv" "Bordeaux St Jean" ## [11] "Brest" "Chambery Challes Les Eaux" ## [13] "Dijon Ville" "Douai" ## [15] "Dunkerque" "Francfort" ## [17] "Geneve" "Grenoble" ## [19] "Italie" "La Rochelle Ville" ## [21] "Lausanne" "Laval" ## [23] "Le Creusot Montceau Montchanin" "Le Mans" ## [25] "Lille" "Lyon Part Dieu" ## [27] "Macon Loche" "Madrid" ## [29] "Marne La Vallee" "Marseille St Charles" ## [31] "Metz" "Montpellier" ## [33] "Mulhouse Ville" "Nancy" ## [35] "Nantes" "Nice Ville" ## [37] "Nimes" "Paris Est" ## [39] "Paris Lyon" "Paris Montparnasse" ## [41] "Paris Nord" "Paris Vaugirard" ## [43] "Perpignan" "Poitiers" ## [45] "Quimper" "Reims" ## [47] "Rennes" "Saint Etienne Chateaucreux" ## [49] "St Malo" "St Pierre Des Corps" ## [51] "Strasbourg" "Stuttgart" ## [53] "Toulon" "Toulouse Matabiau" ## [55] "Tourcoing" "Tours" ## [57] "Valence Alixan Tgv" "Vannes" ## [59] "Zurich"
Visualizing
In graphs, the absolute position of the each node is not as relevant as it is with other kinds of visualizations. A very minimal ggplot2
theme is set to make it easier to view the plotted graph.
library(ggplot2) thm <- theme_minimal() + theme( legend.position = "none", axis.title = element_blank(), axis.text = element_blank(), panel.grid = element_blank(), panel.grid.major = element_blank(), ) theme_set(thm)
To create the plot, start with ggraph()
instead of ggplot2()
. The ggraph
package contains geoms
that are unique to graph analysis. The package contains geoms
to specifically plot nodes, and other geoms
for edges.
As a first basic test, the point geom
will be used, but instead of callinggeom_point()
, we call geom_node_point()
. The edges are plotted using geom_edge_diagonal()
.
library(ggraph) graph_routes %>% ggraph(layout = "kk") + geom_node_point() + geom_edge_diagonal()
To make it easier to see where each station is placed in this plot, the geom_node_text()
is used. Just as with regular geoms
in ggplot2
, other attributes such as size
, color
, and alpha
can be modified.
graph_routes %>% ggraph(layout = "kk") + geom_node_text(aes(label = label, color = name), size = 3) + geom_edge_diagonal(color = "gray", alpha = 0.4)
Morphing time!
The second mental leap was understanding how a graph algorithm is applied. Typically, the output of a model function is a model object, not a data object. With tidygraph
, the process begins and ends with a graph table. The steps are these:
- Start with a graph table
- Temporarily transform the graph to comply with the model that is requested (
morph()
) - Add additional transformations to the morphed data using
dplyr
(optional) - Restore the original graph table, but modified to keep the changes made during the morph
The shortest path algorithm defines the “length” as the number of edges in between two nodes. There may be multiple routes to get from point A to point B, but the algorithm chooses the one with the fewest number of “hops”. The way to call the algorithm is inside the morph()
function. Even though to_shortest_path()
is a function in itself, and it is possible run it without morph()
, it is not meant to be used that way. In the example, the journey_time is used as weights
to help the algorithm find an optimal route between the Arras and the Nancy stations. The print output of the morphed graph will not be like the original graph table.
from <- which(stations == "Arras") to <- which(stations == "Nancy") shortest <- graph_routes %>% morph(to_shortest_path, from, to, weights = journey_time) shortest ## # A tbl_graph temporarily morphed to a shortest path representation ## # ## # Original graph is a directed simple graph with 1 component ## # consisting of 59 nodes and 130 edges
It is possible to make more transformations with the use of activate()
and dplyr
functions. The results can be previewed, or committed back to the original R variable using unmorph()
. By default, nodes are active in a morphed graph, so there is no need to set that explicitly.
shortest %>% mutate(selected_node = TRUE) %>% unmorph() ## # A tbl_graph: 59 nodes and 130 edges ## # ## # A directed simple graph with 1 component ## # ## # Node Data: 59 x 4 (active) ## name title label selected_node ## <chr> <chr> <chr> <lgl> ## 1 AIX EN PROVENCE T… Aix En Provence T… "Aix\nEn\nProvence\n… NA ## 2 ANGERS SAINT LAUD Angers Saint Laud "Angers\nSaint\nLaud" NA ## 3 ANGOULEME Angouleme Angouleme NA ## 4 ANNECY Annecy Annecy NA ## 5 ARRAS Arras Arras TRUE ## 6 AVIGNON TGV Avignon Tgv "Avignon\nTgv" NA ## # … with 53 more rows ## # ## # Edge Data: 130 x 3 ## from to journey_time ## <int> <int> <dbl> ## 1 1 39 186. ## 2 2 40 97.5 ## 3 3 40 146. ## # … with 127 more rows
While it was morphed, only the few nodes that make up the connections between the Arras and Nancy stations were selected. A simple mutate()
adds a new variable called selected_node, which tags those nodes with TRUE. The new variable and value is retained once the rest of the nodes are restored via the unmorph()
command.
To keep the change, the shortest variable is updated with the changes made to both edges and nodes.
shortest <- shortest %>% mutate(selected_node = TRUE) %>% activate(edges) %>% mutate(selected_edge = TRUE) %>% unmorph()
The next step is to coerce each NA into a 1, and the shortest route into a 2. This will allow us to easily re-arrange the order that the edges are drawn in the plot, ensuring that the route will be drawn at the top.
shortest <- shortest %>% activate(nodes) %>% mutate(selected_node = ifelse(is.na(selected_node), 1, 2)) %>% activate(edges) %>% mutate(selected_edge = ifelse(is.na(selected_edge), 1, 2)) %>% arrange(selected_edge) shortest ## # A tbl_graph: 59 nodes and 130 edges ## # ## # A directed simple graph with 1 component ## # ## # Edge Data: 130 x 4 (active) ## from to journey_time selected_edge ## <int> <int> <dbl> <dbl> ## 1 1 39 186. 1 ## 2 2 40 97.5 1 ## 3 3 40 146. 1 ## 4 4 39 225. 1 ## 5 6 39 161. 1 ## 6 7 39 358. 1 ## # … with 124 more rows ## # ## # Node Data: 59 x 4 ## name title label selected_node ## <chr> <chr> <chr> <dbl> ## 1 AIX EN PROVENCE T… Aix En Provence T… "Aix\nEn\nProvence\n… 1 ## 2 ANGERS SAINT LAUD Angers Saint Laud "Angers\nSaint\nLaud" 1 ## 3 ANGOULEME Angouleme Angouleme 1 ## # … with 56 more rows
A simple way to plot the route is to use the selected_ variables to modify the alpha
. This will highlight the shortest path, without completely removing the other stations. This is a personal design choice, so experimenting with different ways of highlighting the results is always recommended.
shortest %>% ggraph(layout = "kk") + geom_edge_diagonal(aes(alpha = selected_edge), color = "gray") + geom_node_text(aes(label = label, color =name, alpha = selected_node ), size = 3)
The selected_ fields can also be used in other dplyr
functions to analyze the results. For example, to know the aggregate information about the trip, selected_edge is used to filter the edges, and then the totals can be calculated. There is no summarise()
function for graph tables; this make sense because the graph table would become a summarized table with such a function. Since the end result we seek is a total rather than another graph table, a simple as_tibble()
command will coerce the edges, which will then allows us to finish the calculation.
shortest %>% activate(edges) %>% filter(selected_edge == 2) %>% as_tibble() %>% summarise( total_stops = n() - 1, total_time = round(sum(journey_time) / 60) ) ## # A tibble: 1 x 2 ## total_stops total_time ## <dbl> <dbl> ## 1 8 23
Re-using the code
To compile most of the code in a single chunk, here is an example of how to re-run the shortest path for a different set of stations: the Laval and Montpellier stations.
from <- which(stations == "Montpellier") to <- which(stations == "Laval") shortest <- graph_routes %>% morph(to_shortest_path, from, to, weights = journey_time) %>% mutate(selected_node = TRUE) %>% activate(edges) %>% mutate(selected_edge = TRUE) %>% unmorph() %>% activate(nodes) %>% mutate(selected_node = ifelse(is.na(selected_node), 1, 2)) %>% activate(edges) %>% mutate(selected_edge = ifelse(is.na(selected_edge), 1, 2)) %>% arrange(selected_edge) shortest %>% ggraph(layout = "kk") + geom_edge_diagonal(aes(alpha = selected_edge), color = "gray") + geom_node_text(aes(label = label, color =name, alpha = selected_node ), size = 3)
Additional, the same code can be recycled to obtain the trip summarized data.
shortest %>% activate(edges) %>% filter(selected_edge == 2) %>% as_tibble() %>% summarise( total_stops = n() - 1, total_time = round(sum(journey_time) / 60) ) ## # A tibble: 1 x 2 ## total_stops total_time ## <dbl> <dbl> ## 1 3 10
Shiny app
To see how to use this kind of analysis inside Shiny, please refer to this application. It lets the user select two stations, and it returns the route, plus the summarized data. The source code is embedded in the app.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.