Heatmapping My New York City Marathon Training

R | JLaw's R Blog

12 months ago

[This article was first published on R | JLaw's R Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Motivation

This post was inspired by my wife who used the GPS data from her Strava app to plot her running routes during 2020. Since I don’t run nearly as much as I used to, I need to go back to when I was training for the NYC marathon to find enough running to make such a map worthwhile. Also presenting a challenge is that I’m a bit of a luddite when it comes to running technology. I don’t have a GPS watch and I don’t run with a phone. To track my runs I manually enter my routes and workouts into MapMyRun and I time my runs with an ol’ fashioned sportswatch.

While this works for me on the road, it made the data gathering process for this visualization more difficult. And while MapMyRun does have TCX files for each workout, its not that useful if the data didn’t come from a GPS watch.

At the end of the day, my goal with this analysis is to make a cool looking heatmap of my training routes for the NYC Marathon… or at least to make a visualization that was cooler looking that my wife’s.

For those who can’t wait… this was final output:

Libraries Used

This analysis uses four main packages. Tidyverse for data manipulation, sf for modifying spatial data, tigris for getting the basemaps to plot my routes and extra to bring in new s for the plots.

library(tidyverse) # Data Manipulation
library(sf) # Manipulation Spatial Data
library(tigris) # Getting Tract and Roads Spatial Data
library(extra) # Better Fonts For GGPLOT

Gathering Data

If I had a GPS watch or used Strava, I could just download all my files which would contain Geo information and plot it directly. But because I do everything manually, I needed to jump through some hoops. From my MapMyRun account I was able to download:

user_workout_history.csv – Containing all of my workouts along with a column for route_id.
GPX files for each route that I had saved.

This led to the semi-painful manual process of using the first file to write down each route id that I had run, look up that route, and download the individual GPX file. Fortunately, I’m a creature of habit and and ran the same routes often, so there were only 24 to individually download.

The User Workout History File

This file was a CSV file exported from MapMyRun which contained one row for each workout I did along with meta-data such as date, time, speed, etc. However, the important column is route id which will be used to join the geo-data from the route’s GPX files.

runs <- read_csv('data/user_workout_history.csv') %>% 
  # Create Route ID column
  mutate(route_id = str_extract(RouteID, '\\d+') %>% as.integer)

The Route GPX Files

As mentioned above the geocoded data for each route lives in GPX files, one for each of the 24 routes. Since I would apply the same pre-processing to each file this is a good candidate for the map_dfr function to construct the data frame.

The following code uses dir() to get a list of all the files in the directory as vectors, the keep() function trims the vector to only the GPX files, and each GPX file is then passed into read_sf to read in the geo-data. The data is subset to only two columns, and a route_id is created based on the numbers in the file name.

Finally, geo-data in sf lives in a GEOMETRY column. However, in order to get the latitudes and longitudes as individual columns I use st_coodinates to creates “X” and “Y” columns for longitude and latitude.

all_routes <- map_dfr(
  #Get all gpx files in the directory
  keep(dir('data'), ~str_detect(.x, "gpx")),
  #Read them in
  ~read_sf(paste0('data/',.x), layer = "track_points") %>% 
    #keep the segment id and the geometry field
    select(track_seg_point_id, geometry) %>% 
    # create a route_id based on the file
    mutate(route_id = parse_number(.x))
) %>% 
  #Extract Lat and Long as Columns
  cbind(., st_coordinates(.))

After the processing the data looks like:

track_seg_point_id	route_id	X	Y	geometry
0	111694131	-73.97597	40.77624	POINT (-73.97597 40.77624)
1	111694131	-73.97555	40.77605	POINT (-73.97555 40.77605)
2	111694131	-73.97555	40.77605	POINT (-73.97555 40.77605)
3	111694131	-73.97546	40.77582	POINT (-73.97546 40.77582)
4	111694131	-73.97546	40.77582	POINT (-73.97546 40.77582)
5	111694131	-73.97552	40.77527	POINT (-73.97552 40.77527)

Combining Runs and Routes

With all the workouts in the runs data and all the routes in the all_routes data, a simple inner-join will combine them. This will duplicates routes that I ran multiple times, which in this case would be the desired behavior.

#Join Routes to Runs to Duplicate 
runs_and_routes <- runs %>% 
  inner_join(all_routes, by = "route_id")

Creating a map of NYC

Since the goal is to create a heatmap of the various routes I ran as part of marathon training, I need a map that contains all of the possible roads in NYC. The tigris package allows for the access to US Census TIGER shapefiles. One of the levels is “roads”, which can be downloaded using the road() function where the first parameter is state and 2nd parameter is county (New York County is Manhattan):

###Download Roads Map from Tigris
nyc <- roads("NY", "New York")
ggplot() + geom_sf(data = nyc) + ggthemes::theme_map()

The function provides road data for all of Manhattan. However, I did not run every part of Manhattan, so it would make more sense to truncate the map to areas where I did run.

In order to do this, I first need to define a boundary box based on my routes. Given a geometry, the st_bbox() function from sf will return a “bbox” object containing the four corners of my routes.

st_bbox(runs_and_routes$geometry)
##      xmin      ymin      xmax      ymax 
## -74.01880  40.70806 -73.93118  40.82113

However, this will not provide any padding around my running routes which will make for a worse visualization. So I will use map2_dbl to add a delta of 0.01 to the maximum values and remove a delta of -0.01 to the minimum values to slightly increase the bounding box.

### Construct Bounding Boxes and Expand Limits By A Delta
bbox <- map2_dbl(
  st_bbox(runs_and_routes$geometry),
  names(st_bbox(runs_and_routes$geometry)),
  ~if_else(str_detect(.y, 'min'), .x - .01, .x + .01)
)

bbox
##      xmin      ymin      xmax      ymax 
## -74.02880  40.69806 -73.92118  40.83113

With an updated bounding box, I can now crop the initial map with my bounding box using the st_crop() function. Also, in order to make the Coordinate Reference Systems the same, I use st_crs() and st_transform to make sure the NYC map is using the same coordinates as my routes.

#Set CRS for NYC to CRS for Running Routes And Crop to the Bounding Box
nyc2 <- st_transform(nyc, crs = st_crs(runs_and_routes$geometry)) %>% 
  st_crop(bbox)

ggplot() + geom_sf(data = nyc2) + ggthemes::theme_map()

We’ve now cut off Governor’s Island from the bottom left corner as well as parts of Northern Manhattan that I never ran to.

Constructing the Heatmap

With the new basemap created and the route data in its own data frame. I can create the heatmap using stat_density2d with the route data and geom_sf with the map data. From the stat_density2d piece I pass in the routes data and set the fill value to be the count at each X and Y using the after_stat() option. The n parameter sets the number of grid points in each directions for the density.

The base map is very rectangular where it is tall but skinny. This made it difficult to add titles. To make things look better, I use ggdraw from the cowplot package to create a new drawing layer and add titles/captions to that layer.

p <- ggplot() + 
  #Construct the Heatmap Portion
  stat_density2d(data = runs_and_routes,
                 aes(x = X, y = Y, fill = after_stat(count)),
                 geom = 'tile',
                 contour = F,
                 n = 1024
                 ) +
  #Draw the Map of Manhattan
  geom_sf(data = nyc2, color = '#999999', alpha = .15) + 
  scale_fill_viridis_c(option = "B", guide = F) + 
  ggthemes::theme_map() + 
  theme(
    panel.background = element_rect(fill = 'black'),
    plot.background = element_rect(fill = 'black')
  )

cowplot::ggdraw(p) + 
  labs(title = "JLaw's Marathon Training Heatmap",
       caption = "**Author**: JLaw") + 
  theme(panel.background = element_rect(fill = "black"),
        plot.background = element_rect(fill = 'black'),
        plot.title = element_text(color = "#DDDDDD",
                                  family = 'Nirmala UI',
                                  #face = 'bold',
                                  size = 18),
        plot.caption = ggtext::element_markdown(color = '#DDDDDD',
                                    family = 'Calibri Light',
                                    hjust = 1,
                                    size = 12),
        

  )

Concluding Thoughts

I’m really happy with how this came out. It also provides some information about my running habits, mainly that I ran in Central Park a lot and that you can roughly tell where I worked at the time as that area is slightly hotter. There are some parts of Manhattan that I did run but don’t show up well in the map because I might have only run there once. An exploration of how much of Manhattan did I run will be covered in a follow-up post.

To leave a comment for the author, please follow the link and comment on their blog: R | JLaw's R Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.