King of the Mountain: using R to bag a Strava KOM
One of the best features of Strava is the battle to be King (or Queen) of the Mountain. Originally, in cycling, segments were typically climbs or difficult sections of road, and the simple idea was: who can complete the segment in the quickest time? The fastest rider is the KOM/QOM, King or Queen of the Mountain.
Segments quickly expanded to cover pretty much any section of a course, and running segments were added to keep them separate from cycling. In running, if you are KOM/QOM you hold the Course Record (CR). There’s a brief rant below about how frustrating this could-be-amazing feature is, if you are interested.
I have a few CRs/KOMs, literally a few, and it would be nice to have more! Strava has been going for a long time, and these days it is very hard to set a course record. Partly for reasons in the rant below, but mainly because a lot of people have logged their runs and rides on Strava, and inevitably some fast people have done so over these segments. Add in the people who target segments specifically to beat the times, and it becomes difficult to find one that is beatable.
Data science to the rescue
What’s that you say? Something that’s hard or difficult? Let’s see if data science has the answer!
My idea was to find all the local segments for running, and analyse which ones are beatable. Then go out and set a course record. It is not feasible to do this manually due to the number of segments and the fact that Strava doesn’t display all the segment data nicely, so we’ll use R to do it.
There’s a wonderful library, {rStrava}, that allows us to interact with the Strava API from within R. I won’t discuss how to set it up; there are guides elsewhere. Briefly, you need to register an app with Strava (they’re OK with individuals doing nerdy stuff with limited data using the API, but they have tightened up in the last year or so on bigger projects), and then use the keys to get an access token.
```r
library(rStrava)
library(dplyr)
library(ggplot2)
# for mapping
library(leaflet)
library(htmlwidgets)
library(googleway)
library(mapview)

# create authentication token
# requires user-created app name, id, and secret from Strava website

# initial setup ####
# Strava key
app_name <- "foobar"
app_client_id <- "1234"
app_secret <- "1234"
stoken <- httr::config(token = strava_oauth(
  app_name,
  app_client_id,
  app_secret,
  app_scope = "activity:read_all",
  cache = TRUE
))
# refresh a stale token using the command below
# stoken <- httr::config(token = readRDS('.httr-oauth')[[1]])
```
OK, so now we’re good to go. Let’s pull the data for all segments within a region of interest.
We need a nice big region to get a load of segments, BUT the request will only return the first 10 segments in a region. So we’ll break the region down into a 10 x 10 grid of squares and search within each, to maximise the number of segments we get to analyse. The command `get_explore()` allows us to pull the ids of segments within bounds. The bounds are written as a string,

```r
bnds <- "52.2970, -1.7326, 52.3876, -1.4598"
```

i.e. bottom-left and top-right lat/long.
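As a quick check on where the step sizes in the code below come from (my arithmetic, not part of the original script): dividing the region span by 10 gives the increments.

```r
# derive the lat/long increments for the 10 x 10 grid from the region corners
lat_min <- 52.2970; lat_max <- 52.3876
lon_min <- -1.7326; lon_max <- -1.4598
(lat_max - lat_min) / 10 # 0.00906
(lon_max - lon_min) / 10 # 0.02728
```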
```r
# define bottom-left and top-right corners of a 10 x 10 array of bounding boxes
# using 0.00906 increments for lat and 0.02728 for long
# starting at 52.2970, -1.7326
# ending at 52.3876, -1.4598
lats <- seq(52.2970, 52.3876, 0.00906)
longs <- seq(-1.7326, -1.4598, 0.02728)
# create a list of bounding boxes
bnds <- lapply(lats, function(x) {
  lapply(longs, function(y) {
    paste0(x, ",", y, ",", x + 0.00906, ",", y + 0.02728)
  })
}) %>% unlist()

# for each element in bnds, get the explore data
all_segments <- data.frame()
for (i in seq_along(bnds)) {
  data <- get_explore(stoken, bnds[i], activity_type = "running")
  data$segments %>%
    lapply(function(x) x$id) %>%
    unlist() -> segments
  if (length(segments) == 0) next
  df <- data.frame(bnds = bnds[i], segments = segments)
  all_segments <- rbind(all_segments, df)
}
# unique values in all_segments
all_segments_list <- unique(all_segments$segments)
```
Note that the API currently has a read limit of 2,000 requests per day and 100 per 15 minutes. For this first part of the code we’re only making around 100 requests, so we’re fine. Later we’ll need to throttle the requests.
We now have a list of all segment ids within the region of interest, a total of 483 segments (running only).
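It’s worth doing the rate-limit arithmetic before the next step (a back-of-envelope sketch using the limits above):

```r
# 100 requests per 15 min means at least 9 s between requests
(15 * 60) / 100 # 9 s minimum spacing
# so fetching details for all 483 segments with a 9.5 s sleep takes
483 * 9.5 / 60  # ~76 minutes
```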
Next, we need to get the data associated with each segment id. A tricky thing here is that the data fields differ between segments, so we need to establish a core set of columns of interest and select those, allowing us to assemble one big data frame without errors from incompatible column numbers.
```r
# columns of interest
keepcols <- c(
  "activity_type", "athlete_count", "athlete_segment_stats.effort_count",
  "average_grade", "created_at", "distance", "effort_count",
  "elevation_high", "elevation_low", "elevation_profile",
  "elevation_profiles.dark_url", "elevation_profiles.light_url",
  "end_latlng1", "end_latlng2", "hazardous", "id", "map.id",
  "map.polyline", "map.resource_state", "maximum_grade", "name",
  "private", "resource_state", "star_count", "starred",
  "start_latlng1", "start_latlng2", "total_elevation_gain",
  "updated_at", "xoms.destination.href", "xoms.destination.name",
  "xoms.destination.type", "xoms.kom", "xoms.overall"
)

all_segment_data <- data.frame()
for (seg in all_segments_list) {
  # wait for 9.5 seconds (time out is 100 requests in 15 min)
  Sys.sleep(9.5)
  data <- get_segment(stoken, id = seg)
  data %>% compile_segment() -> data2
  if (!is.null(data2) & nrow(data2) > 0) {
    # keep cols
    data2 <- data2[, keepcols]
    all_segment_data <- rbind(all_segment_data, data2)
  }
}
```
Note that we use a sleep time here so that we don’t get locked out of the API for making too many requests. If you get an error during this loop, it is likely because a segment is missing one of the columns that we want to select. This happened for one segment; I just deleted its id from the list and restarted the loop from there to get it to completion.
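If you would rather not babysit the loop, a small defensive tweak (my variant, not what I ran for the post) pads any missing columns with NA before selecting, so a segment with a missing field cannot halt the run:

```r
# inside the loop, before selecting keepcols:
# pad absent columns with NA so data2[, keepcols] never fails
missing_cols <- setdiff(keepcols, names(data2))
if (length(missing_cols) > 0) data2[missing_cols] <- NA
data2 <- data2[, keepcols]
```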
Also note that I am only getting KOM data (and the overall leader, although this is usually the KOM) and not QOM. This is because not all segments have QOM data, i.e. they have not been attempted by any women runners. Any women out there wanting to try this: there are quite a few QOMs still up for grabs!
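If you do want QOMs, something like the snippet below inside the loop should work. Note that `xoms.qom` is my assumption for the flattened field name, and it will simply be absent on segments with no women’s efforts:

```r
# capture the QOM time where the segment has one, NA otherwise
# ("xoms.qom" is an assumed field name, by analogy with xoms.kom)
data2$qom_time <- if ("xoms.qom" %in% names(data2)) data2$xoms.qom else NA
```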
Now that we have the data, let’s calculate the pace of the best time, to figure out which records are beatable. We have to do this because the segments vary in length, so pace (or speed) is the only fair way to compare them.
```r
# in the column xoms.kom, the time is written as a string
# it is either in the format "00:00:00", "00:00", or "00s"
# convert to seconds
all_segment_data$overall_time <- sapply(all_segment_data$xoms.kom, function(x) {
  if (grepl("s", x)) {
    as.numeric(gsub("s", "", x))
  } else {
    # determine if it is "00:00:00" or "00:00" by counting the ":"
    if (nchar(x) - nchar(gsub(":", "", x)) == 1) {
      x <- paste0("00:", x)
    }
    as.numeric(strsplit(x, ":")[[1]][1]) * 60 * 60 +
      as.numeric(strsplit(x, ":")[[1]][2]) * 60 +
      as.numeric(strsplit(x, ":")[[1]][3])
  }
})
# convert distance from character to numeric
all_segment_data$distance <- as.numeric(all_segment_data$distance)
# calculate pace in minutes per km
all_segment_data$pace <- (all_segment_data$overall_time / 60) /
  (all_segment_data$distance / 1000)
# pace is in decimal minutes, convert it to a string in mm:ss format
all_segment_data$pace <- format(
  as.POSIXct(all_segment_data$pace * 60, origin = "1970-01-01"),
  format = "%M:%S"
)
# order by pace
all_segment_data %>%
  arrange(pace) -> final_data
# save final_data
saveRDS(final_data, "Output/Data/segments.rds")
saveRDS(all_segments, "Output/Data/all_segments.rds")
```
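As a quick sanity check of the time parsing, here is the same logic wrapped in a standalone helper (a test sketch, equivalent to the body of the `sapply()` call above):

```r
# parse the three CR time formats to seconds
to_secs <- function(x) {
  if (grepl("s", x)) return(as.numeric(gsub("s", "", x)))
  # pad "mm:ss" to "hh:mm:ss"
  if (nchar(x) - nchar(gsub(":", "", x)) == 1) x <- paste0("00:", x)
  p <- as.numeric(strsplit(x, ":")[[1]])
  p[1] * 3600 + p[2] * 60 + p[3]
}
to_secs("45s")     # 45
to_secs("04:58")   # 298
to_secs("1:02:03") # 3723
```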
It’s worth saving these R objects because of the hassle of getting the data via the API.
Our final data frame (`final_data`) has all the segment data organised by pace, so checking the beatable segments is quite simple. Let’s see what we have in there.
Analysing the data
As a first look, we can plot the course record time as a function of segment length. There are some very long segments in this region, so we’ll zoom in on those 1.5 km or less.
Now let’s look at pace (the code for all of this is shown below).
I’m looking for segments where the course record equates to a pace of 4 min/km or slower. These are the beatable segments. The plots show that there are a few of these in the region of interest, as well as plenty around the 3 min/km mark, which would be much harder to beat.
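Because `pace` is a zero-padded mm:ss string, lexicographic comparison works, so pulling out the candidates is a one-liner (a sketch):

```r
# segments whose CR pace is 4 min/km or slower, i.e. potentially beatable
final_data %>%
  filter(pace >= "04:00") %>%
  select(name, distance, xoms.kom, pace, effort_count)
```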
There are a bunch of segments with silly-fast paces. Over 6% of the segments have a best pace of 1:40 min/km or quicker, which is faster than Usain Bolt. These are clearly garbage, which is annoying!
There’s also one segment with a very slow pace. I checked it, and it seems not to be an error. The segment appears to be from a women-only charity run held on private property; clearly a man jogged the course too and uploaded it to Strava.
We can map out all the segments in R or just show the segments that are beatable.
The code for visualisation is here:
```r
ggplot(final_data, aes(x = distance, y = overall_time)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() +
  labs(x = "Distance (m)", y = "Overall Time (s)")
ggsave("Output/Plots/distance_vs_time.png", width = 1200, height = 900, units = "px")

final_data %>%
  filter(distance < 1500) %>%
  ggplot(aes(x = distance, y = overall_time)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() +
  labs(x = "Distance (m)", y = "Overall Time (s)")
ggsave("Output/Plots/distance_vs_time_1500.png", width = 1200, height = 900, units = "px")

final_data %>%
  filter(distance < 1500) %>%
  ggplot(aes(x = distance, y = as.POSIXct(pace, format = "%M:%S"))) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() +
  scale_y_datetime(date_labels = "%M:%S") +
  labs(x = "Distance (m)", y = "Pace (mm:ss/km)")
ggsave("Output/Plots/distance_vs_pace.png", width = 1200, height = 900, units = "px")

# fraction of course records faster than Usain Bolt pace
nrow(final_data[final_data$pace < "01:40", ]) / nrow(final_data) * 100
# [1] 6.431535

# histogram of pace
final_data %>%
  ggplot(aes(x = as.POSIXct(pace, format = "%M:%S"))) +
  geom_histogram(bins = 100) +
  theme_bw() +
  scale_x_datetime(date_labels = "%M:%S") +
  labs(x = "Pace (mm:ss/km)", y = "Frequency")
ggsave("Output/Plots/pace_histogram.png", width = 1200, height = 900, units = "px")

## Mapping ----
## map boundaries
lats.range <- c(52.2905, 52.3941)
lons.range <- c(-1.4553, -1.7391)

# make map
map <- leaflet(options = leafletOptions(zoomControl = FALSE)) %>%
  addProviderTiles("CartoDB.Positron",
    options = providerTileOptions(noWrap = TRUE, minZoom = 12, maxZoom = 12)
  ) %>%
  fitBounds(
    lng1 = min(lons.range), lat1 = max(lats.range),
    lng2 = max(lons.range), lat2 = min(lats.range)
  )

# overlay each segment
for (i in 1:nrow(final_data)) {
  activity <- final_data[i, ]
  # uncomment to map only the beatable segments (CR pace slower than 4 min/km)
  # if (activity$pace < "04:00") next
  coords <- googleway::decode_pl(activity$map.polyline)
  map <- addPolylines(map,
    lng = coords$lon, lat = coords$lat,
    color = "blue", opacity = 1 / 2, weight = 2
  )
}
map
```
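The snippet above only prints the map in the viewer. {htmlwidgets} and {mapview} were loaded at the start, presumably for saving the map out; something like this (my assumption, not shown in the original code):

```r
# save the interactive map as HTML and a static snapshot as PNG
htmlwidgets::saveWidget(map, "Output/Plots/segment_map.html")
mapview::mapshot(map, file = "Output/Plots/segment_map.png")
```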
Let’s set a new course record!
From the map, I could see three beatable segments that could easily be run in one loop. These included “Holly Hop”, which had only been tackled 95 times and had a best time of 4:58 for just over 1 km. The course record was set in 2013, which made me wonder whether the low number of attempts and the long-standing record meant the segment was somehow inaccessible. Anyway, I made a mental note of where the segments were, put on my running shoes, and set off.
Well, I set the CR for Holly Hop but failed on the other two. I was 1 s short on A Mighty Hill and 3 s short on Get up the Hill. Maybe next time…
The analysis showed there are a bunch of others for me to attempt in the future, too.
Critique of Strava segments
Segments are basically Strava’s unique selling point (USP). Alternative platforms have tried to launch equivalent features without success; conversely, there’s almost nothing else Strava does that you can’t do on one of the alternatives. Bizarrely, Strava seems never to have understood that segments are its USP.
Frustratingly, bad data plagues the segment leaderboards. As noted above, 6.4% of the segments in my region had course records faster than Usain Bolt pace. These are almost certainly bad data, and anything up to 2:30 min/km is questionable IMO. The bad data comes from the following (a toy filter to flag it is sketched after the list):
- GPS blips – technical error.
- Accidental user error – runner gets in a car and forgets to turn the device off. Segments along roads are often contaminated this way.
- Accidental user error – runner pauses and later restarts, but the intervening distance is recorded as though it had been run.
- Data doping – manipulation of GPX/FIT files.
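As a toy illustration of how easy a first-pass sweep would be, here is a crude flagging of our own data (the thresholds are my choices, from the discussion above):

```r
# flag course records by plausibility of their pace
final_data %>%
  mutate(flag = case_when(
    pace < "01:40" ~ "impossible (faster than Bolt)",
    pace < "02:30" ~ "questionable",
    TRUE ~ "plausible"
  )) %>%
  count(flag)
```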
Strava’s attitude to all this is to shrug. They want users to police the leaderboards and report dodgy timings. They could easily sweep the leaderboards algorithmically (as even the toy example above suggests), but they don’t; it’s totally bizarre that they let their best feature deteriorate. They could also make more of segments by running yearly KOM/QOM competitions, which would give bad data less chance of wrecking things (they sort of do this now, but they could make much more of it). OK, that’s my rant over. There are plenty of other, more extensive rants out there, e.g. here, about the most frustrating company in the exercise business.
—
The post title comes from “King of the Mountain” by Kate Bush from her Aerial album.