Site icon R-bloggers

Spatial networks in R with sf and tidygraph

[This article was first published on r-spatial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Spatial networks in R with sf and tidygraph

Lucas van der Meer, Robin Lovelace & Lorena Abad September 26, 2019

Introduction

Street networks, shipping routes, telecommunication lines, river bassins. All examples of spatial networks: organized systems of nodes and edges embedded in space. For most of them, these nodes and edges can be associated with geographical coordinates. That is, the nodes are geographical points, and the edges geographical lines.

Such spatial networks can be analyzed using graph theory. Not for nothing, Leonhard Eulers famous work on the Seven Bridges of Köningsberg, which laid the foundations of graph theory and network analysis, was in essence a spatial problem.

In R, there are advanced, modern tools for both the analysis of spatial data and networks. Furthermore, several packages have been developed that cover (parts of) spatial network analysis. As stated in this github issue and this tweet, concensus on how best to represent and analyse spatial networks has proved elusive.

This blogpost demonstrates an approach to spatial networks that starts with a set of geographic lines, and leads to an object ready to be used for network analysis. Along the way, we will see that there are still steps that need to be taken before the process of analyzing spatial networks in R is user friendly and efficient.

Existing R packages for spatial networks

Although R was originally designed as a language for statistical computing, an active ‘R-spatial’ ecosystem has evolved. Powerful and high performance packages for spatial data analysis have been developed, thanks largely to interfaces to mature C/C++ libraries such as GDAL, GEOS and PROJ, notably in the package sf (see section 1.5 of Geocomputation with R for a brief history). Likewise, a number of packages for graph representation and analysis have been developed, notably tidygraph, which is based on igraph.

Both sf and tidygraph support the tibble class and the broader ‘tidy’ approach to data science, which involves data processing pipelines, type stability and a convention of representing everything as a data frame (well a tibble, which is a data frame with user friendly default settings). In sf, this means storing spatial vector data as objects of class sf, which are essentially the same as a regular data frame (or tibble), but with an additional ‘sticky’ list column containing a geometry for each feature (row), and attributes such as bounding box and CRS. Tidygraph stores networks in objects of class tbl_graph. A tbl_graph is an igraph object, but enables the user to manipulate both the edges and nodes elements as if they were data frames also.

Both sf and tidygraph are relatively new packages (first released on CRAN in 2016 and 2017, respectively). It is unsurprising, therefore, that they have yet to be combined to allow a hybrid, tibble-based representation of spatial networks.

Nevertheless, a number of approaches have been developed for representing spatial networks, and some of these are in packages that have been published on CRAN. stplanr, for instance, contains the SpatialLinesNetwork class, which works with both the sp (a package for spatial data analysis launched in 2005) and sf packages. dodgr is a more recent package that provides analytical tools for street networks, with a focus on directed graphs (that can have direction-dependent weights, e.g. representing a one-way street). Other packages seeking to implement spatial networks in R include spnetwork, a package that defined a class system combining sp and igraph, and shp2graph, which provides tools to switch between sp and igraph objects.

Each package has its merits that deserve to be explored in more detail (possibly in a subsequent blog post). The remainder of this post outlines an approach that combines sf and igraph objects in a tidygraph object.

Set-up

The following code chunk will install the packages used in this post:

# We'll use remotes to install packages, install it if needs be:
if(!"remotes" %in% installed.packages()) {
  install.packages("remotes")
}

cran_pkgs = c(
  "sf",
  "tidygraph",
  "igraph",
  "osmdata",
  "dplyr",
  "tibble",
  "ggplot2",
  "units",
  "tmap",
  "rgrass7",
  "link2GI",
  "nabor"
)

remotes::install_cran(cran_pkgs)

library(sf)
library(tidygraph)
library(igraph)
library(dplyr)
library(tibble)
library(ggplot2)
library(units)
library(tmap)
library(osmdata)
library(rgrass7)
library(link2GI)
library(nabor)

Getting the data

As an example, we use the street network of the city center of Münster, Germany. We will get the data from OpenStreetMap. Packages like dodgr have optimized their code for such data, however considering that we want to showcase this workflow for any source of data, we will generate an object of class sf containing only LINESTRING geometries. However, streets that form loops, are returned by osmdata as polygons, rather than lines. These, we will convert to lines, using the osm_poly2line function. One additional variable, the type of street, is added to show that the same steps can be used for sf objects that contain any number of additional variables.

muenster <- opq(bbox =  c(7.61, 51.954, 7.636, 51.968)) %>% 
  add_osm_feature(key = 'highway') %>% 
  osmdata_sf() %>% 
  osm_poly2line()

muenster_center <- muenster$osm_lines %>% 
  select(highway)

muenster_center

## Simple feature collection with 2233 features and 1 field
## geometry type:  LINESTRING
## dimension:      XY
## bbox:           xmin: 7.603762 ymin: 51.94823 xmax: 7.641418 ymax: 51.97241
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
##             highway                       geometry
## 4064462     primary LINESTRING (7.624554 51.955...
## 4064463     primary LINESTRING (7.626498 51.956...
## 4064467 residential LINESTRING (7.630898 51.955...
## 4064474     primary LINESTRING (7.61972 51.9554...
## 4064476     primary LINESTRING (7.619844 51.954...
## 4064482    tertiary LINESTRING (7.616395 51.957...
## 4064485     service LINESTRING (7.63275 51.9603...
## 4984982   secondary LINESTRING (7.614156 51.967...
## 4985138    cycleway LINESTRING (7.61525 51.9673...
## 4985140 residential LINESTRING (7.616774 51.968...

ggplot(data = muenster_center) + geom_sf()

< !-- -->

From sf to tbl_graph: a step wise approach

Step 1: Clean the network

To perform network analysis, we need a network with a clean topology. In theory, the best way to clean up the network topology is by manual editing, but this can be very labour intensive and time consuming, mainly for large networks. The v.clean toolset from the GRASS GIS software provides automated functionalities for this task, and is therefore a popular instrument within the field of spatial network analysis. As far as we know, there is no R equivalent for this toolset, but fortunately, the rgrass7 and link2GI packages enable us to easily ‘bridge’ to GRASS GIS. Obviously, this requires to have GRASS GIS installed on your computer. For an in depth description of combining R with open source GIS software, see Chapter 9 of Geocomputation with R. Take into account that the linking process may take up some time, especially on Windows operating systems. Also, note that there have been some large changes recently to the rgrass7 package, which enabled a better integration with sf. However, it also means that the code below will not work when using an older version of rgrass7, so make sure to update if needed.

Here, we will clean the network topology by breaking lines at intersections and also breaking lines that form a collapsed loop. This will be followed by a removal of duplicated geometry features. Once done, we will read the data back into R, and convert again into an sf object with LINESTRING geometry. For this we use rgrass7, which requires some set-up code that can be seen in the source code of this post.

# Add data to GRASS spatial database  
writeVECT(
  SDF = muenster_center, 
  vname = 'muenster_center', 
  v.in.ogr_flags = 'overwrite'
)

# Execute the v.clean tool
execGRASS("g.proj", flags = c("c", "quiet"), proj4 = proj4)
execGRASS(
  cmd = 'v.clean', 
  input = 'muenster_center', 
  output = 'muenster_cleaned',        
  tool = 'break', 
  flags = c('overwrite', 'c')
)

# Read back into R
use_sf()
muenster_center <- readVECT('muenster_cleaned') %>%
  rename(geometry = geom) %>%
  select(-cat)

muenster_center

## Simple feature collection with 4667 features and 1 field
## geometry type:  LINESTRING
## dimension:      XY
## bbox:           xmin: 7.603762 ymin: 51.94823 xmax: 7.641418 ymax: 51.97241
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
##        highway                       geometry
## 1      service LINESTRING (7.63275 51.9603...
## 2    secondary LINESTRING (7.614156 51.967...
## 3     cycleway LINESTRING (7.61525 51.9673...
## 4      footway LINESTRING (7.629304 51.967...
## 5        steps LINESTRING (7.627696 51.965...
## 6      service LINESTRING (7.633612 51.965...
## 7  residential LINESTRING (7.630564 51.957...
## 8      service LINESTRING (7.613545 51.960...
## 9     cycleway LINESTRING (7.619781 51.957...
## 10 residential LINESTRING (7.62373 51.9643...

Step 2: Give each edge a unique index

The edges of the network, are simply the linestrings in the data. Each of them gets a unique index, which can later be related to their start and end node.

edges <- muenster_center %>%
  mutate(edgeID = c(1:n()))

edges

## Simple feature collection with 4667 features and 2 fields
## geometry type:  LINESTRING
## dimension:      XY
## bbox:           xmin: 7.603762 ymin: 51.94823 xmax: 7.641418 ymax: 51.97241
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
##        highway                       geometry edgeID
## 1      service LINESTRING (7.63275 51.9603...      1
## 2    secondary LINESTRING (7.614156 51.967...      2
## 3     cycleway LINESTRING (7.61525 51.9673...      3
## 4      footway LINESTRING (7.629304 51.967...      4
## 5        steps LINESTRING (7.627696 51.965...      5
## 6      service LINESTRING (7.633612 51.965...      6
## 7  residential LINESTRING (7.630564 51.957...      7
## 8      service LINESTRING (7.613545 51.960...      8
## 9     cycleway LINESTRING (7.619781 51.957...      9
## 10 residential LINESTRING (7.62373 51.9643...     10

Step 3: Create nodes at the start and end point of each edge

The nodes of the network, are the start and end points of the edges. The locations of these points can be derived by using the st_coordinates function in sf. When given a set of linestrings, this function breaks down each of them into the points they are built up from. It returns a matrix with the X and Y coordinates of those points, and additionally an integer indicator L1 specifying to which line a point belongs. These integer indicators correspond to the edge indices defined in step 1. That is, if we convert the matrix into a data.frame or tibble, group the features by the edge index, and only keep the first and last feature of each group, we have the start and end points of the linestrings.

nodes <- edges %>%
  st_coordinates() %>%
  as_tibble() %>%
  rename(edgeID = L1) %>%
  group_by(edgeID) %>%
  slice(c(1, n())) %>%
  ungroup() %>%
  mutate(start_end = rep(c('start', 'end'), times = n()/2))

nodes

## # A tibble: 9,334 x 4
##        X     Y edgeID start_end
##    <dbl> <dbl>  <dbl> <chr>    
##  1  7.63  52.0      1 start    
##  2  7.63  52.0      1 end      
##  3  7.61  52.0      2 start    
##  4  7.61  52.0      2 end      
##  5  7.62  52.0      3 start    
##  6  7.62  52.0      3 end      
##  7  7.63  52.0      4 start    
##  8  7.63  52.0      4 end      
##  9  7.63  52.0      5 start    
## 10  7.63  52.0      5 end      
## # … with 9,324 more rows

Step 4: Give each node a unique index

Each of the nodes in the network needs to get a unique index, such that they can be related to the edges. However, we need to take into account that edges can share either startpoints and/or endpoints. Such duplicated points, that have the same X and Y coordinate, are one single node, and should therefore get the same index. Note that the coordinate values as displayed in the tibble are rounded, and may look the same for several rows, even when they are not. We can use the group_indices function in dplyr to give each group of unique X,Y-combinations a unique index.

nodes <- nodes %>%
  mutate(xy = paste(.$X, .$Y)) %>% 
  mutate(nodeID = group_indices(., factor(xy, levels = unique(xy)))) %>%
  select(-xy)

nodes

## # A tibble: 9,334 x 5
##        X     Y edgeID start_end nodeID
##    <dbl> <dbl>  <dbl> <chr>      <int>
##  1  7.63  52.0      1 start          1
##  2  7.63  52.0      1 end            2
##  3  7.61  52.0      2 start          3
##  4  7.61  52.0      2 end            4
##  5  7.62  52.0      3 start          5
##  6  7.62  52.0      3 end            6
##  7  7.63  52.0      4 start          7
##  8  7.63  52.0      4 end            8
##  9  7.63  52.0      5 start          9
## 10  7.63  52.0      5 end           10
## # … with 9,324 more rows

Step 5: Combine the node indices with the edges

Now each of the start and endpoints have been assigned a node ID in step 4, so that we can add the node indices to the edges. In other words, we can specify for each edge, in which node it starts, and in which node it ends.

source_nodes <- nodes %>%
  filter(start_end == 'start') %>%
  pull(nodeID)

target_nodes <- nodes %>%
  filter(start_end == 'end') %>%
  pull(nodeID)

edges = edges %>%
  mutate(from = source_nodes, to = target_nodes)

edges

## Simple feature collection with 4667 features and 4 fields
## geometry type:  LINESTRING
## dimension:      XY
## bbox:           xmin: 7.603762 ymin: 51.94823 xmax: 7.641418 ymax: 51.97241
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
##        highway                       geometry edgeID from to
## 1      service LINESTRING (7.63275 51.9603...      1    1  2
## 2    secondary LINESTRING (7.614156 51.967...      2    3  4
## 3     cycleway LINESTRING (7.61525 51.9673...      3    5  6
## 4      footway LINESTRING (7.629304 51.967...      4    7  8
## 5        steps LINESTRING (7.627696 51.965...      5    9 10
## 6      service LINESTRING (7.633612 51.965...      6   11 12
## 7  residential LINESTRING (7.630564 51.957...      7   13 14
## 8      service LINESTRING (7.613545 51.960...      8   15 16
## 9     cycleway LINESTRING (7.619781 51.957...      9   17 18
## 10 residential LINESTRING (7.62373 51.9643...     10   19 20

Step 6: Remove duplicate nodes

Having added the unique node ID’s to the edges data, we don’t need the duplicated start and endpoints anymore. After removing them, we end up with a tibble in which each row represents a unique, single node. This tibble can be converted into an sf object, with POINT geometries.

nodes <- nodes %>%
  distinct(nodeID, .keep_all = TRUE) %>%
  select(-c(edgeID, start_end)) %>%
  st_as_sf(coords = c('X', 'Y')) %>%
  st_set_crs(st_crs(edges))

nodes

## Simple feature collection with 3329 features and 1 field
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 7.603762 ymin: 51.94823 xmax: 7.641418 ymax: 51.97241
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 3,329 x 2
##    nodeID            geometry
##     <int>         <POINT [°]>
##  1      1   (7.63275 51.9603)
##  2      2 (7.631843 51.96061)
##  3      3 (7.614156 51.96724)
##  4      4 (7.613797 51.96723)
##  5      5  (7.61525 51.96736)
##  6      6 (7.615148 51.96745)
##  7      7 (7.629304 51.96712)
##  8      8  (7.629308 51.9673)
##  9      9 (7.627696 51.96534)
## 10     10  (7.62765 51.96534)
## # … with 3,319 more rows

Step 7: Convert to tbl_graph

The first six steps led to one sf object with LINESTRING geometries, representing the edges of the network, and one sf object with POINT geometries, representing the nodes of the network. The tbl_graph function allows us to convert these two into a tbl_graph object. There are two tricky parts in this step that need to be highlighted. One, is that the columns containing the indices of the source and target nodes should either be the first two columns of the sf object, or be named ‘to’ and ‘from’, respectively. Secondly, inside the tbl_graph function, these columns are converted into a two-column matrix. However, an sf object has a so-called ‘sticky geometry’, which means that the geometry column sticks to the attributes whenever specific columns are selected. Therefore, the matrix created inside tbl_graph has three columns instead of two, and that causes an error. Therefore, we first need to convert the sf object to a regular data.frame or tibble, before we can construct a tbl_graph. In the end, this doesn’t matter, since both the nodes and edges will be ‘integrated’ into an igraph structure, and loose their specific sf characteristics.

graph = tbl_graph(nodes = nodes, edges = as_tibble(edges), directed = FALSE)

graph

## # A tbl_graph: 3329 nodes and 4667 edges
## #
## # An undirected multigraph with 34 components
## #
## # Node Data: 3,329 x 2 (active)
##   nodeID            geometry
##    <int>         <POINT [°]>
## 1      1   (7.63275 51.9603)
## 2      2 (7.631843 51.96061)
## 3      3 (7.614156 51.96724)
## 4      4 (7.613797 51.96723)
## 5      5  (7.61525 51.96736)
## 6      6 (7.615148 51.96745)
## # … with 3,323 more rows
## #
## # Edge Data: 4,667 x 5
##    from    to highway                                       geometry edgeID
##   <int> <int> <fct>                                 <LINESTRING [°]>  <int>
## 1     1     2 service           (7.63275 51.9603, 7.631843 51.96061)      1
## 2     3     4 secondary       (7.614156 51.96724, 7.613797 51.96723)      2
## 3     5     6 cycleway  (7.61525 51.96736, 7.615165 51.96744, 7.615…      3
## # … with 4,664 more rows

Step 8: Putting it together

To make the approach more convenient, we can combine all steps above into a single function, that takes a cleaned sf object with LINESTRING geometries as input, and returns a spatial tbl_graph.

sf_to_tidygraph = function(x, directed = TRUE) {
  
  edges <- x %>%
    mutate(edgeID = c(1:n()))
  
  nodes <- edges %>%
    st_coordinates() %>%
    as_tibble() %>%
    rename(edgeID = L1) %>%
    group_by(edgeID) %>%
    slice(c(1, n())) %>%
    ungroup() %>%
    mutate(start_end = rep(c('start', 'end'), times = n()/2)) %>%
    mutate(xy = paste(.$X, .$Y)) %>% 
    mutate(nodeID = group_indices(., factor(xy, levels = unique(xy)))) %>%
    select(-xy)
  
  source_nodes <- nodes %>%
    filter(start_end == 'start') %>%
    pull(nodeID)

  target_nodes <- nodes %>%
    filter(start_end == 'end') %>%
    pull(nodeID)

  edges = edges %>%
    mutate(from = source_nodes, to = target_nodes)
  
  nodes <- nodes %>%
    distinct(nodeID, .keep_all = TRUE) %>%
    select(-c(edgeID, start_end)) %>%
    st_as_sf(coords = c('X', 'Y')) %>%
    st_set_crs(st_crs(edges))
  
  tbl_graph(nodes = nodes, edges = as_tibble(edges), directed = directed)
  
}

sf_to_tidygraph(muenster_center, directed = FALSE)

## # A tbl_graph: 3329 nodes and 4667 edges
## #
## # An undirected multigraph with 34 components
## #
## # Node Data: 3,329 x 2 (active)
##   nodeID            geometry
##    <int>         <POINT [°]>
## 1      1   (7.63275 51.9603)
## 2      2 (7.631843 51.96061)
## 3      3 (7.614156 51.96724)
## 4      4 (7.613797 51.96723)
## 5      5  (7.61525 51.96736)
## 6      6 (7.615148 51.96745)
## # … with 3,323 more rows
## #
## # Edge Data: 4,667 x 5
##    from    to highway                                       geometry edgeID
##   <int> <int> <fct>                                 <LINESTRING [°]>  <int>
## 1     1     2 service           (7.63275 51.9603, 7.631843 51.96061)      1
## 2     3     4 secondary       (7.614156 51.96724, 7.613797 51.96723)      2
## 3     5     6 cycleway  (7.61525 51.96736, 7.615165 51.96744, 7.615…      3
## # … with 4,664 more rows

Combining the best of both worlds

Having the network stored in the tbl_graph structure, with a geometry list column for both the edges and nodes, enables us to combine the wide range of functionalities in sf and tidygraph, in a way that fits neatly into the tidyverse.

With the activate() verb, we specify if we want to manipulate the edges or the nodes. Then, most dplyr verbs can be used in the familiar way, also when directly applied to the geometry list column. For example, we can add a variable describing the length of each edge, which, later, we use as a weight for the edges.

graph <- graph %>%
  activate(edges) %>%
  mutate(length = st_length(geometry))

graph

## # A tbl_graph: 3329 nodes and 4667 edges
## #
## # An undirected multigraph with 34 components
## #
## # Edge Data: 4,667 x 6 (active)
##    from    to highway                              geometry edgeID   length
##   <int> <int> <fct>                        <LINESTRING [°]>  <int>      [m]
## 1     1     2 service  (7.63275 51.9603, 7.631843 51.96061)      1 71.2778…
## 2     3     4 seconda… (7.614156 51.96724, 7.613797 51.967…      2 24.7146…
## 3     5     6 cycleway (7.61525 51.96736, 7.615165 51.9674…      3 12.2303…
## 4     7     8 footway  (7.629304 51.96712, 7.629304 51.967…      4 20.0122…
## 5     9    10 steps    (7.627696 51.96534, 7.62765 51.9653…      5  3.2926…
## 6    11    12 service  (7.633612 51.96548, 7.633578 51.965…      6  7.4291…
## # … with 4,661 more rows
## #
## # Node Data: 3,329 x 2
##   nodeID            geometry
##    <int>         <POINT [°]>
## 1      1   (7.63275 51.9603)
## 2      2 (7.631843 51.96061)
## 3      3 (7.614156 51.96724)
## # … with 3,326 more rows

With one flow of pipes, we can ‘escape’ the graph structure, turn either the edges or nodes back into real sf objects, and, for example, summarise the data based on a specific variable.

graph %>%
  activate(edges) %>%
  as_tibble() %>%
  st_as_sf() %>%
  group_by(highway) %>%
  summarise(length = sum(length))

## Simple feature collection with 17 features and 2 fields
## geometry type:  MULTILINESTRING
## dimension:      XY
## bbox:           xmin: 7.603762 ymin: 51.94823 xmax: 7.641418 ymax: 51.97241
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 17 x 3
##    highway         length                                          geometry
##  * <fct>              [m]                             <MULTILINESTRING [°]>
##  1 corridor        9.377… ((7.620506 51.96262, 7.620542 51.96266), (7.6205…
##  2 cycleway    19275.917… ((7.619683 51.95395, 7.619641 51.95378, 7.619559…
##  3 footway     42822.070… ((7.640529 51.95325, 7.640528 51.95323, 7.640531…
##  4 path         8193.341… ((7.624007 51.95379, 7.624223 51.95378, 7.624253…
##  5 pedestrian  11399.337… ((7.620362 51.95471, 7.620477 51.9547), (7.62012…
##  6 primary      3574.842… ((7.625556 51.95272, 7.625594 51.95284, 7.625714…
##  7 primary_li…   184.385… ((7.617285 51.96609, 7.617286 51.96624, 7.617295…
##  8 residential 22729.748… ((7.614509 51.95351, 7.614554 51.95346), (7.6230…
##  9 secondary    4460.683… ((7.629784 51.95446, 7.629842 51.95444), (7.6312…
## 10 secondary_…   160.708… ((7.635309 51.95946, 7.635705 51.95948), (7.6349…
## 11 service     27027.822… ((7.624803 51.95393, 7.625072 51.95393), (7.6158…
## 12 steps        1321.841… ((7.634423 51.9546, 7.634438 51.95462), (7.61430…
## 13 tertiary     4356.365… ((7.607112 51.94991, 7.607126 51.94992, 7.607183…
## 14 tertiary_l…    43.856… ((7.623592 51.96612, 7.623568 51.96611, 7.623468…
## 15 track         389.866… ((7.610671 51.95778, 7.610571 51.95759, 7.610585…
## 16 unclassifi…   612.816… ((7.634492 51.95613, 7.634689 51.95611), (7.6343…
## 17 <NA>         3161.883… ((7.634374 51.95579, 7.634545 51.95575, 7.634662…

Switching back to sf objects is useful as well when plotting the network, in a way that preserves its spatial properties.

ggplot() +
  geom_sf(data = graph %>% activate(edges) %>% as_tibble() %>% st_as_sf()) + 
  geom_sf(data = graph %>% activate(nodes) %>% as_tibble() %>% st_as_sf(), size = 0.5)

< !-- -->

Or, alternatively, in only a few lines of code, plot the network as an interactive map. On this page, the interactive map might show as an image, but here, you should be able to really interact with it!

tmap_mode('view')

tm_shape(graph %>% activate(edges) %>% as_tibble() %>% st_as_sf()) +
  tm_lines() +
tm_shape(graph %>% activate(nodes) %>% as_tibble() %>% st_as_sf()) +
  tm_dots() +
tmap_options(basemaps = 'OpenStreetMap')

< !-- -->

All nice and well, but these are not things that we necessarily need the graph representation for. The added value of tidygraph, is that it opens the door to the functions of the igraph library, all specifically designed for network analysis, and enables us to use them inside a ‘tidy’ workflow. To cover them all, we would need to write a book, but let’s at least show a few examples below.

Centrality measures

Centraltity measures describe the importances of nodes in the network. The simplest of those measures is the degree centrality: the number of edges connected to a node. Another example is the betweenness centrality, which, simply stated, is the number of shortest paths that pass through a node. In tidygraph, we can calculate these and many other centrality measures, and simply add them as a variable to the nodes.

The betweenness centrality can also be calculated for edges. In that case, it specifies the number of shortest paths that pass through an edge.

graph <- graph %>%
  activate(nodes) %>%
  mutate(degree = centrality_degree()) %>%
  mutate(betweenness = centrality_betweenness(weights = length)) %>%
  activate(edges) %>%
  mutate(betweenness = centrality_edge_betweenness(weights = length))

graph

## # A tbl_graph: 3329 nodes and 4667 edges
## #
## # An undirected multigraph with 34 components
## #
## # Edge Data: 4,667 x 7 (active)
##    from    to highway                  geometry edgeID   length betweenness
##   <int> <int> <fct>            <LINESTRING [°]>  <int>      [m]       <dbl>
## 1     1     2 service (7.63275 51.9603, 7.6318…      1 71.2778…       90271
## 2     3     4 second… (7.614156 51.96724, 7.61…      2 24.7146…       32946
## 3     5     6 cyclew… (7.61525 51.96736, 7.615…      3 12.2303…       12409
## 4     7     8 footway (7.629304 51.96712, 7.62…      4 20.0122…       22043
## 5     9    10 steps   (7.627696 51.96534, 7.62…      5  3.2926…       83380
## 6    11    12 service (7.633612 51.96548, 7.63…      6  7.4291…       17119
## # … with 4,661 more rows
## #
## # Node Data: 3,329 x 4
##   nodeID            geometry degree betweenness
##    <int>         <POINT [°]>  <dbl>       <dbl>
## 1      1   (7.63275 51.9603)      3      117041
## 2      2 (7.631843 51.96061)      3       94899
## 3      3 (7.614156 51.96724)      3       32881
## # … with 3,326 more rows

ggplot() +
  geom_sf(data = graph %>% activate(edges) %>% as_tibble() %>% st_as_sf(), col = 'grey50') + 
  geom_sf(data = graph %>% activate(nodes) %>% as_tibble() %>% st_as_sf(), aes(col = betweenness, size = betweenness)) +
  scale_colour_viridis_c(option = 'inferno') +
  scale_size_continuous(range = c(0,4))

< !-- -->

ggplot() +
  geom_sf(data = graph %>% activate(edges) %>% as_tibble() %>% st_as_sf(), aes(col = betweenness, size = betweenness)) +
  scale_colour_viridis_c(option = 'inferno') +
  scale_size_continuous(range = c(0,4))

< !-- -->

Shortest paths

A core part of spatial network analysis is generally finding the path between two nodes that minimizes either the travel distance or travel time. In igraph, there are several functions that can be used for this purpose, and since a tbl_graph is just a subclass of an igraph object, we can directly input it into every function in the igraph package.

The function distances, for example, returns a numeric matrix containing the distances of the shortest paths between every possible combination of nodes. It will automatically choose a suitable algorithm to calculate these shortest paths.

distances <- distances(
  graph = graph,
  weights = graph %>% activate(edges) %>% pull(length)
)

distances[1:5, 1:5]

##            [,1]       [,2]       [,3]      [,4]       [,5]
## [1,]    0.00000   71.27789 1670.22046 1694.9351 1596.80669
## [2,]   71.27789    0.00000 1619.80567 1644.5203 1546.39190
## [3,] 1670.22046 1619.80567    0.00000   24.7146   77.64829
## [4,] 1694.93506 1644.52027   24.71460    0.0000  102.36289
## [5,] 1596.80669 1546.39190   77.64829  102.3629    0.00000

The function ‘shortest_paths’ not only returns distances, but also the indices of the nodes and edges that make up the path. When we relate them to their corresponding geometry columns, we get the spatial representation of the shortest paths. Instead of doing this for all possible combinations of nodes, we can specify from and to which nodes we want to calculate the shortest paths. Here, we will show an example of a shortest path from one node to another, but it is just as well possible to do the same for one to many, many to one, or many to many nodes. Whenever the graph is weighted, the Dijkstra algoritm will be used under the hood. Note here that we have to define the desired output beforehand: vpath means that only the nodes (called vertices in igraph) are returned, epath means that only the edges are returned, and both returns them both.

from_node <- graph %>%
  activate(nodes) %>%
  filter(nodeID == 3044) %>%
  pull(nodeID)

to_node <- graph %>%
  activate(nodes) %>%
  filter(nodeID == 3282) %>%
  pull(nodeID)

path <- shortest_paths(
  graph = graph,
  from = from_node,
  to = to_node,
  output = 'both',
  weights = graph %>% activate(edges) %>% pull(length)
)

path$vpath

## [[1]]
## + 50/3329 vertices, from 2ee524d:
##  [1] 3044 3043 1581 1076 1059 1058 1270 1489  609  608 1549 1550 2998 2448
## [15] 2057 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1468
## [29] 1469 1470 1471 1704 1916 1476   23   24 3064  740  743 1053 1490 1447
## [43] 1446 1445 3052 1444 1443 3051 1442 3282

path$epath

## [[1]]
## + 49/4667 edges from 2ee524d:
##  [1] 3043--3044 1581--3043 1076--1581 1059--1076 1058--1059 1058--1270
##  [7] 1270--1489  609--1489  608-- 609  608--1549 1549--1550 1550--2998
## [13] 2448--2998 2057--2448 1528--2057 1528--1529 1529--1530 1530--1531
## [19] 1531--1532 1532--1533 1533--1534 1534--1535 1535--1536 1536--1537
## [25] 1537--1538 1538--1539 1468--1539 1468--1469 1469--1470 1470--1471
## [31] 1471--1704 1704--1916 1476--1916   23--1476   23--  24   24--3064
## [37]  740--3064  740-- 743  743--1053 1053--1490 1447--1490 1446--1447
## [43] 1445--1446 1445--3052 1444--3052 1443--1444 1443--3051 1442--3051
## [49] 1442--3282

A subgraph of the original graph can now be created, containing only those nodes and edges that are part of the calculated path. Note here that something inconvenient happens: igraph seems to have it’s own underlying structure for attaching indices to nodes, which makes that the from and to columns in the egdes data don’t match with our nodeID column in the nodes data anymore, but instead simply refer to the row numbers of the nodes data.

path_graph <- graph %>%
    subgraph.edges(eids = path$epath %>% unlist()) %>%
    as_tbl_graph()

path_graph

## # A tbl_graph: 50 nodes and 49 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 50 x 4 (active)
##   nodeID            geometry degree betweenness
##    <int>         <POINT [°]>  <dbl>       <dbl>
## 1     23  (7.62837 51.96292)      4      425585
## 2     24 (7.628317 51.96308)      3      499622
## 3    608 (7.626629 51.95707)      3      196573
## 4    609 (7.626621 51.95699)      3      390732
## 5    740 (7.629458 51.96364)      3      158948
## 6    743 (7.629787 51.96392)      3      156575
## # … with 44 more rows
## #
## # Edge Data: 49 x 7
##    from    to highway                  geometry edgeID   length betweenness
##   <int> <int> <fct>            <LINESTRING [°]>  <int>      [m]       <dbl>
## 1     1     2 reside… (7.62837 51.96292, 7.628…     12 18.6254…      426879
## 2     3     4 reside… (7.626629 51.95707, 7.62…    355  8.2067…      198070
## 3     8     9 reside… (7.626552 51.95648, 7.62…    643  6.5847…      149777
## # … with 46 more rows

This subgraph can now be analyzed, for example by calculating the total length of the path, …

path_graph %>%
  activate(edges) %>%
  as_tibble() %>%
  summarise(length = sum(length))

## # A tibble: 1 x 1
##     length
##        [m]
## 1 1362.259

… and plotted.

ggplot() +
  geom_sf(data = graph %>% activate(edges) %>% as_tibble() %>% st_as_sf(), col = 'darkgrey') +
  geom_sf(data = graph %>% activate(nodes) %>% as_tibble() %>% st_as_sf(), col = 'darkgrey', size = 0.5) +
  geom_sf(data = path_graph %>% activate(edges) %>% as_tibble() %>% st_as_sf(), lwd = 1, col = 'firebrick') +
  geom_sf(data = path_graph %>% activate(nodes) %>% filter(nodeID %in% c(from_node, to_node)) %>% as_tibble() %>% st_as_sf(), size = 2)

< !-- -->

However, often we will be interested in shortest paths between geographical points that are not necessarily nodes in the network. For example, we might want to calculate the shortest path from the railway station of Münster to the cathedral.

muenster_station <- st_point(c(7.6349, 51.9566)) %>% 
  st_sfc(crs = 4326)

muenster_cathedral <- st_point(c(7.626, 51.962)) %>%
  st_sfc(crs = 4326)

ggplot() +
  geom_sf(data = graph %>% activate(edges) %>% as_tibble() %>% st_as_sf(), col = 'darkgrey') +
  geom_sf(data = graph %>% activate(nodes) %>% as_tibble() %>% st_as_sf(), col = 'darkgrey', size = 0.5) +
  geom_sf(data = muenster_station, size = 2, col = 'firebrick') +
  geom_sf(data = muenster_cathedral, size = 2, col = 'firebrick') +
  geom_sf_label(data = muenster_station, aes(label = 'station'), nudge_x = 0.004) +
  geom_sf_label(data = muenster_cathedral, aes(label = 'cathedral'), nudge_x = 0.005)

< !-- -->

To find the route on the network, we must first identify the nearest points on the network. The nabor package has a well performing function to do so. It does, however, require the coordinates of the origin and destination nodes to be given in a matrix.

# Coordinates of the origin and destination node, as matrix
coords_o <- muenster_station %>%
  st_coordinates() %>%
  matrix(ncol = 2)

coords_d <- muenster_cathedral %>%
  st_coordinates() %>%
  matrix(ncol = 2)

# Coordinates of all nodes in the network
nodes <- graph %>%
  activate(nodes) %>%
  as_tibble() %>%
  st_as_sf()

coords <- nodes %>%
  st_coordinates()

# Calculate nearest points on the network.
node_index_o <- knn(data = coords, query = coords_o, k = 1)
node_index_d <- knn(data = coords, query = coords_d, k = 1)
node_o <- nodes[node_index_o$nn.idx, ]
node_d <- nodes[node_index_d$nn.idx, ]

Like before, we use the ID to calculate the shortest path, and plot it:

path <- shortest_paths(
  graph = graph,
  from = node_o$nodeID, # new origin
  to = node_d$nodeID,   # new destination
  output = 'both',
  weights = graph %>% activate(edges) %>% pull(length)
)

path_graph <- graph %>%
    subgraph.edges(eids = path$epath %>% unlist()) %>%
    as_tbl_graph()

ggplot() +
  geom_sf(data = graph %>% activate(edges) %>% as_tibble() %>% st_as_sf(), col = 'darkgrey') +
  geom_sf(data = graph %>% activate(nodes) %>% as_tibble() %>% st_as_sf(), col = 'darkgrey', size = 0.5) +
  geom_sf(data = path_graph %>% activate(edges) %>% as_tibble() %>% st_as_sf(), lwd = 1, col = 'firebrick') +
  geom_sf(data = muenster_station, size = 2) +
  geom_sf(data = muenster_cathedral, size = 2)  +
  geom_sf_label(data = muenster_station, aes(label = 'station'), nudge_x = 0.004) +
  geom_sf_label(data = muenster_cathedral, aes(label = 'cathedral'), nudge_x = 0.005)

< !-- -->

It worked! We calculated a path from the rail station to the centre, a common trip taken by tourists visiting Muenster. Clearly this is not a finished piece of work but the post has demonstrated what is possible. Future functionality should look to make spatial networks more user friendly, including provision of ‘weighting profiles’, batch routing and functions that reduce the number of steps needed to work with spatial network data in R.

For alternative approaches and further reading, the following resources are recommended:

To leave a comment for the author, please follow the link and comment on their blog: r-spatial.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.