Analyzing extreme skiing and snowboarding in R: Freeride World Tour 1996–2018
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The Freeride World Tour (FWT) has been hosting extreme skiing & snowboarding events since 1996. Having just wrapped up the 2018 season in March I did an analysis on rankings and past FWT winners using R.
If you haven’t heard of the FWT yet it’s an exciting sport where riders choose gnarley-looking lines through cliff-faces, cornices and nasty couloirs (like this line). There’s no artificial jumps or half-pipes just a gate at the top and one at the bottom. The judges use five criteria (Difficulty of Line, Control, Fluidity, Jumps and Technique) which are scored from 0 to 100.
My desire to do this project was mainly to practice some web-scraping, accessing the Twitter API and reinforce my own understanding of the concepts applied within. Skiing and snowboarding are forgotten when it comes to sports analytics — I mean even tennis has an R package– so I thought it would be cool project.
#FWT18 on Twitter
First, I collected data from the Twitter API using the TwitteR package. To do this, I needed to set up a developer account to authenticate the connection from R (a good tutorial on how to do this is here).
library(twitteR) # set objects api_key <- "your api key" api_secret <- "your api secret" access_token <- "your access token" access_token_secret <- "your access token secret" # authenticate setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
Quickly visualize the frequency of tweets mentioning #FWT18
I get a warning that I asked the Twitter API for a maximum of 1,000 tweets but it only returned 58 from the past 9 days.
It would be nice to get a longer history but the twitter API only indexs a few days worth of tweets and does not allow historic queries (there’s a [Python package that can](https://github.com/Jefferson-Henrique/GetOldTweets-python) but I haven’t tried it out yet.
Where in the world are people discussing #FWT18?
It looks like most of the tweets originated from Europe and North America, although we see a few tweets from Japan, Australia and Africa.
Note that geo-tagged tweets are only available for those who have opted in for that in the settings, which is a mere 3% of Twitter users.
What are some of the top words coming up in tweets about #FWT18?
This time I included @FreerideWTour and the Twitter handles of a few of the riders by using a function that looks for mentions and hash-tags and then create a wordcloud for the most popular things being mentioned along the #FWT18 tag.
Looks like crashes and weekend are used often.
Exploring FWT18 Rankings
Since the data is not available as a .txt or a .csv file on the website, nor do they provide and API I needed to crawl for it.
It’s worth mentioning that administrators may want to protect certain parts of their website for a number of reasons, “such as indexing of an unannounced site, traversal of parts of the site which require vast resources of the server, recursive traversal of an infinite URL space, etc.”
Therefore, one should always check if they have permission. One way to do this, is to use the robotstxt package to check if your webbot has permission to access certain parts of a webpage (Thanks to Maëlle Salmon for that tip).
library(robotstxt) paths_allowed("https://www.freerideworldtour.com/rider/") [1] TRUE
Okay, it looks like we have permission.
Unfortunately the code for the FWT 2018 rankings page is “fancy” meaning one needs to click the drop-down arrows to get a riders score for each event.
I think the data is being loaded with JavaScript which means that I would need to use a program which can programmatically click the button. I’ve heard splashr or RSelenium may accomplish this. But, I’m new to web-scraping and only familiar with rvest so I came up with a (relatively) quick work-around.
I placed the names from the 2018 roster into a dataset and loaded it as an object. I can automatically crawl every rider by feeding these names to rvest with a for loop to the end of https://www.freerideworldtour.com/rider/
# read in the names from 2018 roster roster <- read_csv("https://ndownloader.figshare.com/files/11173433") # create a url prefix url_base <- "https://www.freerideworldtour.com/rider/" riders <- roster$name # Assemble the dataset output <- data_frame() for (i in riders) { temp <- read_html(paste0(url_base, i)) %>% html_node("div") %>% html_text() %>% gsub("\\s*\\n+\\s*", ";", .) %>% gsub("pts.", "\n", .) %>% read.table(text = ., fill = T, sep = ";", row.names = NULL, col.names = c("Drop", "Ranking", "FWT", "Events", "Points")) %>% subset(select = 2:5) %>% dplyr::filter( !is.na(as.numeric(as.character(Ranking))) & as.character(Points) != "" ) %>% dplyr::mutate(name = i) output <- bind_rows(output, temp) }
I was going to look at the overall standings for each category (skiing and snowboarding) broken-down by how many points athletes earned at each FWT event in 2018; however, I noticed there was something odd going on.
# How many riders in the roster? unique(roster) # A tibble: 56 x 3 name sport sex <chr> <chr> <chr> 1 aymar-navarro ski male 2 berkeley-patt~ ski male 3 carl-regner-e~ ski male 4 conor-pelton ski male 5 craig-murray ski male 6 drew-tabke ski male 7 fabio-studer ski male 8 felix-wiemers ski male 9 george-rodney ski male 10 grifen-moller ski male # ... with 46 more rows # How many names in the output object? unique(output$name) [1] "aymar-navarro" [2] "berkeley-patterson" [3] "carl-regner-eriksson" [4] "conor-pelton" [5] "craig-murray" [6] "drew-tabke" [7] "fabio-studer" [8] "felix-wiemers" [9] "george-rodney" [10] "grifen-moller" [11] "ivan-malakhov" [12] "kristofer-turdell" [13] "leo-slemett" [14] "logan-pehota" [15] "loic-collomb-patton" [16] "markus-eder" [17] "mickael-bimboes" [18] "reine-barkered" [19] "ryan-faye" [20] "sam-lee" [21] "stefan-hausl" [22] "taisuke-kusunoki" [23] "thomas-rich" [24] "trace-cooke" [25] "yann-rausis" [26] "arianna-tricomi" [27] "elisabeth-gerritzen" [28] "eva-walkner" [29] "hazel-birnbaum" [30] "jaclyn-paaso" [31] "kylie-sivell" [32] "lorraine-huber" [33] "rachel-croft" [34] "blake-hamm" [35] "christoffer-granbom" [36] "clement-bochatay" [37] "davey-baird"
Turns out the name elias-elhardt, which was row 40, is creating an issue. I’m not sure why this was but since Elias only competed in the qualifiers let’s simply remove him then re-run the code from above.
# Remove Elias Elhardt roster <- roster[-40,] riders <- roster$name
Historical FWT winners
The FWT lists past event winners on their website. I gathered the data of all winners from the 23 tours between 1996 and 2018 and included their age from either the website or a quick web-search. The dataset can be found on figshare.
# load the data df <- read_csv("https://ndownloader.figshare.com/files/11300864") # Get summary statistics on age of winners df %>% summarize(mean_age = median(age, na.rm = TRUE), max_age = max(age, na.rm = TRUE), min_age = min(age, na.rm = TRUE)) mean_age max_age min_age 1 29 43 15 # Find minimum age of winner by sex and sport df %>% group_by(sex, sport) %>% slice(which.min(age)) %>% dplyr::select(name, sex, sport, age) # A tibble: 4 x 4 # Groups: sex, sport [4] name sex sport age <chr> <chr> <chr> <int> 1 Arianna Tricomi female ski 23 2 Michelle Gmitro female snowboard 16 3 George Rodney male ski 21 4 Cyril Neri male snowboard 15
How many times have riders won FWT events?
The large number of riders who won at least one FWT event dwarfs those unique athlets who won a considerable number of events.
We can have another look at the data in a table.
Xavier De Le Rue is near the top with 8 and appears under both the ski and snowboarder categories? That’s strange. We can replace that mistake in the data set like this.
df$sport[df$name == "Xavier De Le Rue"] <- "snowboard"
Which countries have the most winners?
Looks like USA, France and Austria produce some of the best free riders.
How old are the winners from each country?
We can use box plots to get an idea of how the age distribution for each country looks like.
However, a better way to get a visual overview of distributions is with a rain cloud plot. Since a few of the countries only had one competitor let’s remove them.
It may also be interesting to ask how has the age of winners changed in the history of the competition.
Are FWT winners getting younger with time?
There doesn’t appear to be any trend with age and winning FWT events over time.
It would have been nice to play around with data on how judges scored each riders run from 0 to 100 for each of the five categories for each event but it doesn’t look like that sort of information is available to the public at this time.
The full code for this analysis can be found on Github.
Analyzing extreme skiing and snowboarding in R: Freeride World Tour 1996–2018 was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.