How to Scrape and Store Strava Data Using R
This post by Julian During is the third place winner in the Call for Documentation contest. Julian is a data scientist from Germany working in the manufacturing industry. Julian loves working with R (especially the tidyverse ecosystem), sports, black coffee and cycling.
I am an avid runner and cyclist. For the past couple of years, I have recorded almost all my activities on some kind of GPS device.
I record my runs with a Garmin device and my bike rides with a Wahoo device, and I synchronize both accounts on Strava. I figured that it would be nice to directly access my data from my Strava account.
In the following text, I will describe the process of getting Strava data into R, processing the data, and then creating a visualization of activity routes. You can find the original analysis in this GitHub repository.
You will need the following packages:
library(tarchetypes)
library(conflicted)
library(tidyverse)
library(lubridate)
library(jsonlite)
library(targets)
library(httpuv)
library(httr)
library(pins)
library(fs)
library(readr)

conflict_prefer("filter", "dplyr")
Set Up Targets
The whole data pipeline is implemented with the help of the targets package. You can learn more about the package and its functionality in its documentation.
In order to reproduce the analysis, perform the following steps:
- Clone the repository: https://github.com/duju211/pin_strava
- Install the packages listed in the ‘libraries.R’ file
- Run the targets pipeline by executing the targets::tar_make() command
- Follow the instructions printed in the console
Target Plan
The manifest of the target plan looks like this:
name | command | pattern | cue_mode |
---|---|---|---|
my_app | define_strava_app() | NA | thorough |
my_endpoint | define_strava_endpoint() | NA | thorough |
act_col_types | list(moving = col_logical(), velocity_smooth = col_number(), grade_smooth = col_number(), distance = col_number(), altitude = col_number(), time = col_integer(), lat = col_number(), lng = col_number(), cadence = col_integer(), watts = col_integer()) | NA | thorough |
my_sig | define_strava_sig(my_endpoint, my_app) | NA | always |
df_act_raw | read_all_activities(my_sig) | NA | thorough |
df_act | pre_process_act(df_act_raw, athlete_id) | NA | thorough |
act_ids | pull(distinct(df_act, id)) | NA | thorough |
df_meas | read_activity_stream(act_ids, my_sig) | map(act_ids) | never |
df_meas_all | bind_rows(df_meas) | NA | thorough |
df_meas_wide | meas_wide(df_meas_all) | NA | thorough |
df_meas_pro | meas_pro(df_meas_wide) | NA | thorough |
gg_meas | vis_meas(df_meas_pro) | NA | thorough |
df_meas_norm | meas_norm(df_meas_pro) | NA | thorough |
gg_meas_save | save_gg_meas(gg_meas) | NA | thorough |
We will go through the most important targets in detail.
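For readers new to targets, here is a minimal sketch of how rows of a manifest like the one above could be declared in a `_targets.R` file. It is not the repository's actual script: `fake_ids()` and `fake_stream()` are hypothetical stand-ins for the Strava functions, used only to show how the `pattern` and `cue_mode` columns correspond to `tar_target()` arguments.

```r
library(targets)

# Hypothetical stand-ins for the Strava helpers (not part of the original post)
fake_ids <- function() c("a1", "b2", "c3")
fake_stream <- function(id) data.frame(id = id, n_char = nchar(id))

# A targets script ends with a list of target objects
plan <- list(
  # No cue given: the default cue mode is "thorough"
  tar_target(act_ids, fake_ids()),
  # Dynamic branching: one branch per ID. With cue mode "never", a branch
  # that already exists in the store is not rebuilt on later runs.
  tar_target(
    df_meas,
    fake_stream(act_ids),
    pattern = map(act_ids),
    cue = tar_cue(mode = "never")
  ),
  # Downstream targets see the branches combined into one object
  tar_target(df_meas_all, df_meas)
)

plan
```

Saved as `_targets.R`, a plan like this could be inspected with `targets::tar_manifest()` and run with `targets::tar_make()`.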
OAuth Dance from R
The Strava API requires an ‘OAuth dance’, described below.
1. Create an OAuth Strava app
To get access to your Strava data from R, you must first create a Strava API application. The steps are documented on the Strava Developer site. While creating the app, you’ll have to give it a name. In my case, I named it r_api.
After you have created your personal API application, you can find your Client ID and Client Secret in the Strava API settings. Save the Client ID as STRAVA_KEY and the Client Secret as STRAVA_SECRET in your R environment [1].
STRAVA_KEY=<Client ID>
STRAVA_SECRET=<Client Secret>
Then, you can run the function define_strava_app shown below.
name | command | pattern | cue_mode |
---|---|---|---|
my_app | define_strava_app() | NA | thorough |
define_strava_app <- function() {
  oauth_app(
    appname = "r_api",
    key = Sys.getenv("STRAVA_KEY"),
    secret = Sys.getenv("STRAVA_SECRET")
  )
}
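Because `Sys.getenv()` silently returns an empty string for an unset variable, a missing key only shows up later as a failed authorization. The following sketch checks the two variables up front. `check_strava_env()` is a hypothetical helper and not part of the original pipeline.

```r
# Hypothetical helper: stop early if a required environment variable is unset
check_strava_env <- function(vars = c("STRAVA_KEY", "STRAVA_SECRET")) {
  values <- Sys.getenv(vars, unset = "")
  missing_vars <- vars[!nzchar(values)]
  if (length(missing_vars) > 0) {
    stop(
      "Missing environment variable(s): ",
      paste(missing_vars, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}
```

Calling `check_strava_env()` at the top of `define_strava_app()` would be one way to use it.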
2. Define an endpoint
Define an endpoint called my_endpoint using the function define_strava_endpoint.
The authorize parameter is the URL to which the user is sent to authorize the app, and the access parameter is the URL used to exchange the resulting authorization code for an access token.
name | command | pattern | cue_mode |
---|---|---|---|
my_endpoint | define_strava_endpoint() | NA | thorough |
define_strava_endpoint <- function() {
  oauth_endpoint(
    request = NULL,
    authorize = "https://www.strava.com/oauth/authorize",
    access = "https://www.strava.com/oauth/token"
  )
}
3. The final authentication step
Before you can execute the following steps, you have to authorize the app in the web browser.
name | command | pattern | cue_mode |
---|---|---|---|
my_sig | define_strava_sig(my_endpoint, my_app) | NA | always |
define_strava_sig <- function(endpoint, app) {
  oauth2.0_token(
    endpoint,
    app,
    scope = "activity:read_all,activity:read,profile:read_all",
    type = NULL,
    use_oob = FALSE,
    as_header = FALSE,
    use_basic_auth = FALSE,
    cache = FALSE
  )
}
The information in my_sig can now be used to access Strava data. Set the cue_mode of the target to ‘always’ so that the following API calls are always executed with an up-to-date authorization token.
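To illustrate why an up-to-date token matters, the sketch below checks whether a token is still valid. It assumes that the credentials list carries an `expires_at` field holding the expiry time in seconds since the Unix epoch. `token_is_fresh()` and the example values are hypothetical and not part of the original pipeline.

```r
# Hypothetical helper: TRUE if the token is valid for at least `margin_secs` more
token_is_fresh <- function(credentials, now = Sys.time(), margin_secs = 60) {
  expires_at <- as.numeric(credentials$expires_at)
  # A missing or malformed expiry is treated as "not fresh"
  if (length(expires_at) != 1 || is.na(expires_at)) {
    return(FALSE)
  }
  as.numeric(now) + margin_secs < expires_at
}

# Example with made-up credentials
now <- as.POSIXct("2021-10-01 12:00:00", tz = "UTC")
fresh_creds <- list(access_token = "abc", expires_at = as.numeric(now) + 3600)
stale_creds <- list(access_token = "abc", expires_at = as.numeric(now) - 1)

token_is_fresh(fresh_creds, now = now)
token_is_fresh(stale_creds, now = now)
```

With the cue mode set to ‘always’, the pipeline avoids this check altogether by requesting a new token on every run.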
Access Activities
You are now authenticated and can directly access your Strava data.
1. Load all activities
Load a table that gives an overview of all your activities. Because the total number of activities is unknown, request one page at a time in a while loop. The loop stops as soon as a page comes back empty, which means there are no more activities to read.
name | command | pattern | cue_mode |
---|---|---|---|
df_act_raw | read_all_activities(my_sig) | NA | thorough |
read_all_activities <- function(sig) {
  activities_url <- parse_url("https://www.strava.com/api/v3/athlete/activities")

  act_vec <- vector(mode = "list")
  # Dummy one-row tibble so that the loop condition is TRUE on the first pass
  df_act <- tibble::tibble(init = "init")
  i <- 1L

  # Request one page at a time until a page comes back empty
  while (nrow(df_act) != 0) {
    r <- activities_url %>%
      modify_url(query = list(
        access_token = sig$credentials$access_token[[1]],
        page = i
      )) %>%
      GET()

    df_act <- content(r, as = "text") %>%
      fromJSON(flatten = TRUE) %>%
      as_tibble()

    # Store only non-empty pages
    if (nrow(df_act) != 0) act_vec[[i]] <- df_act
    i <- i + 1L
  }

  # Combine all pages and parse the start date
  df_activities <- act_vec %>%
    bind_rows() %>%
    mutate(start_date = ymd_hms(start_date))
}
The resulting data frame consists of one row per activity:
## # A tibble: 605 x 60
##    resource_state name  distance moving_time elapsed_time total_elevation~ type
##             <int> <chr>    <dbl>       <int>        <int>            <dbl> <chr>
##  1              2 "Hes~   31153.        4699         5267            450   Ride
##  2              2 "Bam~    5888.        2421         2869            102.  Run
##  3              2 "Lin~   33208.        4909         6071            430   Ride
##  4              2 "Mon~   74154.       10721        12500            641   Ride
##  5              2 "Cha~   34380         5001         5388            464.  Ride
##  6              2 "Mor~    5518.        2345         2563             49.1 Run
##  7              2 "Bin~   10022.        3681         6447            131   Run
##  8              2 "Tru~   47179.        8416        10102            898   Ride
##  9              2 "Sho~   32580.        5646         6027            329.  Ride
## 10              2 "Mit~   33862.        5293         6958            372   Ride
## # ... with 595 more rows, and 53 more variables: workout_type <int>, id <dbl>,
## #   external_id <chr>, upload_id <dbl>, start_date <dttm>,
## #   start_date_local <chr>, timezone <chr>, utc_offset <dbl>,
## #   start_latlng <list>, end_latlng <list>, location_city <lgl>,
## #   location_state <lgl>, location_country <chr>, start_latitude <dbl>,
## #   start_longitude <dbl>, achievement_count <int>, kudos_count <int>,
## #   comment_count <int>, athlete_count <int>, photo_count <int>, ...
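The read-until-empty pattern used above can be tried without any network access. The following sketch replaces the API call with a hypothetical `fetch_page()` function that serves pages from a small in-memory table. `read_all_pages()`, `fake_fetch()`, and `fake_db` are illustrative names, and the `max_pages` guard is an addition that the original function does not have.

```r
# Generic pagination loop: keep requesting pages until one comes back empty
read_all_pages <- function(fetch_page, max_pages = 1000L) {
  pages <- list()
  i <- 1L
  while (i <= max_pages) {
    page <- fetch_page(i)
    if (nrow(page) == 0) break
    pages[[i]] <- page
    i <- i + 1L
  }
  # Returns NULL if the very first page is already empty
  do.call(rbind, pages)
}

# Made-up "server" with five activities, served two per page
fake_db <- data.frame(
  id = 1:5,
  type = c("Ride", "Run", "Ride", "Ride", "Run")
)

fake_fetch <- function(page, per_page = 2L) {
  from <- (page - 1L) * per_page + 1L
  if (from > nrow(fake_db)) return(fake_db[0, ])
  to <- min(page * per_page, nrow(fake_db))
  fake_db[from:to, ]
}

all_activities <- read_all_pages(fake_fetch)
nrow(all_activities)
```

The upper bound on the number of pages is a small safety net: if the server never returns an empty page, the loop still terminates.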
2. Preprocess activities
Make sure that all ID columns have a character format and improve the column names.
name | command | pattern | cue_mode |
---|---|---|---|
df_act | pre_process_act(df_act_raw, athlete_id) | NA | thorough |
pre_process_act <- function(df_act_raw, athlete_id) {
  df_act <- df_act_raw %>%
    mutate(
      across(contains("id"), as.character),
      `athlete.id` = athlete_id
    )
}
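One detail of this selection is worth knowing: `contains("id")` matches every column whose name contains the letters "id" anywhere, not only ID columns. The sketch below shows the effect on a toy tibble. The column `has_video` is made up for illustration, and whether this matters for real Strava data depends on the columns the API returns. The stricter regular expression is one possible alternative, not the author's code.

```r
library(dplyr)

# Toy data; has_video is a made-up column whose name happens to contain "id"
toy <- tibble(
  id = 6218628649,
  upload_id = 6612345678,
  has_video = FALSE
)

# Loose selection: also converts has_video to character
loose <- toy %>%
  mutate(across(contains("id"), as.character))

# Stricter selection: only names that are "id" or end in "_id" or ".id"
strict <- toy %>%
  mutate(across(matches("(^|[_.])id$"), as.character))

sapply(loose, class)
sapply(strict, class)
```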
3. Extract activity IDs
Use dplyr::pull() to extract all activity IDs.
name | command | pattern | cue_mode |
---|---|---|---|
act_ids | pull(distinct(df_act, id)) | NA | thorough |
Read Measurements
1. Read the ‘stream’ data from Strava
A ‘stream’ is a nested list (JSON format) with all available measurements of the corresponding activity.
To get the available variables and turn the result into a data frame, define a helper function read_activity_stream. This function takes the ID of an activity and the authentication token that you created earlier.
The target is defined with dynamic branching, which maps over all activity IDs. Set the cue mode to never so that each branch runs only once: activities that have already been downloaded are not requested again.
name | command | pattern | cue_mode |
---|---|---|---|
df_meas | read_activity_stream(act_ids, my_sig) | map(act_ids) | never |
read_activity_stream <- function(id, sig) {
  act_url <- parse_url(
    stringr::str_glue("https://www.strava.com/api/v3/activities/{id}/streams")
  )
  access_token <- sig$credentials$access_token[[1]]

  # Request the measurement series of the activity as a comma-separated list
  r <- modify_url(
    act_url,
    query = list(
      access_token = access_token,
      keys = paste0(
        "distance,time,latlng,altitude,velocity_smooth,cadence,watts,",
        "temp,moving,grade_smooth"
      )
    )
  ) %>%
    GET()

  # Fail early on HTTP errors
  stop_for_status(r)

  # One row per measurement series, tagged with the activity ID
  fromJSON(content(r, as = "text"), flatten = TRUE) %>%
    as_tibble() %>%
    mutate(id = id)
}
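To get a feeling for the structure that `fromJSON()` produces here, the following sketch parses a small hand-written JSON string shaped like a stream response. The values are made up and the real response contains more series, so treat this as an illustration of the parsing step only.

```r
library(jsonlite)

# Made-up stream response with two measurement series of three points each
stream_json <- '[
  {"type": "distance", "data": [0, 5, 10.9],
   "series_type": "distance", "original_size": 3, "resolution": "high"},
  {"type": "latlng", "data": [[48.10, 9.10], [48.11, 9.11], [48.12, 9.12]],
   "series_type": "distance", "original_size": 3, "resolution": "high"}
]'

df_stream <- fromJSON(stream_json, flatten = TRUE)

# One row per series; the data column is a list column
df_stream$type
str(df_stream$data)
```

The `latlng` series arrives as a two-column matrix while the other series are plain vectors, which is why the `data` column has to be a list column.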
2. Bind the single targets into one data frame
You can do this using dplyr::bind_rows().
name | command | pattern | cue_mode |
---|---|---|---|
df_meas_all | bind_rows(df_meas) | NA | thorough |
The data now is represented by one row per measurement series:
## # A tibble: 4,821 x 6
##    type            data              series_type original_size resolution id
##    <chr>           <list>            <chr>               <int> <chr>      <chr>
##  1 moving          <lgl [4,706]>     distance             4706 high       62186~
##  2 latlng          <dbl [4,706 x 2]> distance             4706 high       62186~
##  3 velocity_smooth <dbl [4,706]>     distance             4706 high       62186~
##  4 grade_smooth    <dbl [4,706]>     distance             4706 high       62186~
##  5 distance        <dbl [4,706]>     distance             4706 high       62186~
##  6 altitude        <dbl [4,706]>     distance             4706 high       62186~
##  7 heartrate       <int [4,706]>     distance             4706 high       62186~
##  8 time            <int [4,706]>     distance             4706 high       62186~
##  9 moving          <lgl [301]>       distance              301 high       62138~
## 10 latlng          <dbl [301 x 2]>   distance              301 high       62138~
## # ... with 4,811 more rows
3. Turn the data into a wide format
name | command | pattern | cue_mode |
---|---|---|---|
df_meas_wide | meas_wide(df_meas_all) | NA | thorough |
meas_wide <- function(df_meas) {
  pivot_wider(df_meas, names_from = type, values_from = data)
}
In this format, every activity is one row again:
## # A tibble: 605 x 14
##    series_type original_size resolution id         moving latlng velocity_smooth
##    <chr>               <int> <chr>      <chr>      <list> <list> <list>
##  1 distance             4706 high       6218628649 <lgl ~ <dbl ~ <dbl [4,706]>
##  2 distance              301 high       6213800583 <lgl ~ <dbl ~ <dbl [301]>
##  3 distance             4905 high       6179655557 <lgl ~ <dbl ~ <dbl [4,905]>
##  4 distance            10640 high       6160486739 <lgl ~ <dbl ~ <dbl [10,640]>
##  5 distance             4969 high       6153936896 <lgl ~ <dbl ~ <dbl [4,969]>
##  6 distance             2073 high       6115020306 <lgl ~ <dbl ~ <dbl [2,073]>
##  7 distance             1158 high       6097842884 <lgl ~ <dbl ~ <dbl [1,158]>
##  8 distance             8387 high       6091990268 <lgl ~ <dbl ~ <dbl [8,387]>
##  9 distance             5587 high       6073551706 <lgl ~ <dbl ~ <dbl [5,587]>
## 10 distance             5281 high       6057232328 <lgl ~ <dbl ~ <dbl [5,281]>
## # ... with 595 more rows, and 7 more variables: grade_smooth <list>,
## #   distance <list>, altitude <list>, heartrate <list>, time <list>,
## #   cadence <list>, watts <list>
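The reshaping step is easy to reproduce on a toy example. In the sketch below, `toy_long` is made-up data with one row per measurement series, mirroring the long format shown earlier. After `pivot_wider()` there is one row per activity and one list column per series type.

```r
library(tidyr)
library(tibble)

# Made-up long data: two activities with two series each
toy_long <- tibble(
  id = c("a", "a", "b", "b"),
  type = c("distance", "time", "distance", "time"),
  data = list(c(0, 5, 11), 0:2, c(0, 4), 0:1)
)

# One row per activity, one list column per series type
toy_wide <- pivot_wider(toy_long, names_from = type, values_from = data)

toy_wide
```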
4. Preprocess and unnest the data
The column latlng needs special attention, because each of its elements is a two-column matrix that holds latitude and longitude. Separate the two measurements before unnesting all list columns.
name | command | pattern | cue_mode |
---|---|---|---|
df_meas_pro | meas_pro(df_meas_wide) | NA | thorough |
meas_pro <- function(df_meas_wide) {
  df_meas_wide %>%
    mutate(
      lat = map_if(
        .x = latlng, .p = ~ !is.null(.x), .f = ~ .x[, 1]
      ),
      lng = map_if(
        .x = latlng, .p = ~ !is.null(.x), .f = ~ .x[, 2]
      )
    ) %>%
    select(-c(latlng, original_size, resolution, series_type)) %>%
    unnest(where(is_list))
}
After this step, every row is one point in time and every column is a measurement at this point in time (if there was any activity at that moment).
## # A tibble: 2,176,926 x 12
##    id      moving velocity_smooth grade_smooth distance altitude heartrate  time
##    <chr>   <lgl>            <dbl>        <dbl>    <dbl>    <dbl>     <dbl> <dbl>
##  1 621862~ FALSE             0             1.8      0       527        149     0
##  2 621862~ TRUE              0             1.2      5       527.       150     1
##  3 621862~ TRUE              0             0.9     10.9     527.       150     2
##  4 621862~ TRUE              5.68          0.8     17       527.       150     3
##  5 621862~ TRUE              5.81          0.8     23.3     527.       150     4
##  6 621862~ TRUE              5.88          0.8     29.4     527.       150     5
##  7 621862~ TRUE              6.13          0.8     35.6     527.       151     6
##  8 621862~ TRUE              6.15          0       41.6     527.       150     7
##  9 621862~ TRUE              6.14          0       47.8     527.       150     8
## 10 621862~ TRUE              6.13          0.8     53.9     527.       150     9
## # ... with 2,176,916 more rows, and 4 more variables: cadence <dbl>,
## #   watts <dbl>, lat <dbl>, lng <dbl>
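The split-then-unnest step can be sketched on a single made-up activity. The example below is a simplified illustration: it uses `map()` instead of `map_if()` because the toy data contains no missing `latlng` entries, and it names the list columns explicitly instead of selecting them with `where()`.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

# Made-up activity with three measurement points
toy_wide <- tibble(
  id = "a",
  distance = list(c(0, 5, 11)),
  latlng = list(matrix(c(48.10, 48.11, 48.12, 9.10, 9.11, 9.12), ncol = 2))
)

toy_points <- toy_wide %>%
  mutate(
    lat = map(latlng, ~ .x[, 1]), # first matrix column: latitude
    lng = map(latlng, ~ .x[, 2])  # second matrix column: longitude
  ) %>%
  select(-latlng) %>%
  unnest(c(distance, lat, lng))

toy_points
```

The result has one row per measurement point, with `lat` and `lng` as ordinary numeric columns.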
Create Visualisation
Visualize the final data by plotting its geospatial information. Every facet is one activity. Keep the rest of the plot as minimal as possible.
name | command | pattern | cue_mode |
---|---|---|---|
gg_meas | vis_meas(df_meas_pro) | NA | thorough |
vis_meas <- function(df_meas_pro) {
  df_meas_pro %>%
    filter(!is.na(lat)) %>%
    ggplot(aes(x = lng, y = lat)) +
    geom_path() +
    facet_wrap(~ id, scales = "free") +
    theme(
      axis.line = element_blank(),
      axis.text.x = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks = element_blank(),
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      legend.position = "bottom",
      panel.background = element_blank(),
      panel.border = element_blank(),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      plot.background = element_blank(),
      strip.text = element_blank()
    )
}
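If you want to experiment with the plot before downloading any data, the following sketch draws two made-up routes in the same faceted style. It uses `theme_void()` as a shorter way to remove most plot decorations. This is a possible simplification and not the author's code, so the result may differ in small details from the theme above.

```r
library(ggplot2)

# Two made-up routes with four points each
toy_routes <- data.frame(
  id = rep(c("a", "b"), each = 4),
  lng = c(9.10, 9.11, 9.12, 9.11, 9.20, 9.22, 9.21, 9.20),
  lat = c(48.10, 48.11, 48.10, 48.09, 48.20, 48.21, 48.22, 48.20)
)

p <- ggplot(toy_routes, aes(x = lng, y = lat)) +
  geom_path() +
  facet_wrap(~ id, scales = "free") +
  theme_void() +
  theme(strip.text = element_blank()) # hide the facet labels as well

p
```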
And there it is: All your Strava data in a few tidy data frames and a nice-looking plot. Future updates to the data shouldn’t take too long, because only measurements from new activities will be downloaded. With all your Strava data up to date, there are a lot of appealing possibilities for further data analyses of your fitness data.
Note from the Editor: Julian’s post neatly breaks down complex tasks, walking readers through the steps as well as rationale of his decisions. His use of the targets package demonstrates how an organized workflow enables replicability and ease. In addition, Julian showcases how the R programming language can fulfill a vision sparked by one’s passions. It is an inspiring example of how we can use R to create something that is informative, beautiful, and personal.
[1] You can edit your R environment by running usethis::edit_r_environ(), saving the keys, and then restarting R.