[This article was first published on max humber, and kindly contributed to R-bloggers.]
rvest and purrr are wonderful bedfellows. The packages share the underlying tidyverse API, and it feels simple and almost natural to combine them when scraping the web.
Here is a slimmed-down, worked recipe for how to leverage rvest and purrr in Fantasy Hockey.
Step 0. Load packages.
    library(tidyverse)
    library(rvest)
    library(purrr)
    library(stringr)
Step 1. Fetch a single position page.
I use stringr to adjust the url for the different position pages. You’ll notice that I’m only grabbing name and goals. Feel free to grab whatever!
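    # Scrape the name and goals columns for one position page
    p_fetch <- function(position = "C") {
      # Build the url for the requested position
      url <- str_c(
        sep = "",
        "https://www.fantasysp.com/projections/hockey/weekly/",
        position)
      page <- read_html(url)
      # 2nd cell of each row is read as the name, 4th cell as the goals
      names <- page %>%
        html_nodes("td:nth-child(2)") %>%
        html_text()
      goals <- page %>%
        html_nodes("td:nth-child(4)") %>%
        html_text()
      df <- tibble(name = names, goals)
      return(df)
    }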
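If the td:nth-child() selectors look opaque, here is a minimal offline sketch of what they pick up. The inline table is a made-up stand-in for the real projections page (the player names, teams and numbers are all hypothetical), so the actual page layout may well differ.

```r
library(rvest)
library(dplyr)

# Hypothetical two-row table standing in for a projections page
html <- "<table>
  <tr><td>1</td><td>Player One</td><td>EDM</td><td>3</td></tr>
  <tr><td>2</td><td>Player Two</td><td>TOR</td><td>2</td></tr>
</table>"
page <- read_html(html)

# td:nth-child(2) matches every cell that is the 2nd child of its row
names <- page %>%
  html_nodes("td:nth-child(2)") %>%
  html_text()

# td:nth-child(4) matches every cell that is the 4th child of its row
goals <- page %>%
  html_nodes("td:nth-child(4)") %>%
  html_text()

toy <- tibble(name = names, goals)
print(toy)
```

Note that html_text() hands back character vectors, so goals is still text at this point. That is why it gets run through as.numeric during cleaning.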
Step 2. Pull all the position pages.
I use pmap from purrr to iterate through the Centre, Left-Wing, Right-Wing and Defense position projection pages (I left out the Goalies for obvious reasons).
    p_pull <- function() {
      params <- tibble(position = c("C", "LW", "RW", "D"))
      # pmap calls p_fetch once per row, matching columns to arguments by name
      df <- params %>%
        pmap(p_fetch) %>%
        bind_rows()
      return(df)
    }
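To see what pmap is doing here without hitting the website, the sketch below swaps p_fetch for a hypothetical fake_fetch that just returns a one-row tibble. Everything about fake_fetch is made up for illustration. The point is only that pmap walks the params tibble row by row and passes each column to the argument of the same name.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for p_fetch() that needs no network access
fake_fetch <- function(position = "C") {
  tibble(name = paste0("Player_", position), goals = "1")
}

params <- tibble(position = c("C", "LW", "RW", "D"))

# One call per row: fake_fetch(position = "C"), fake_fetch(position = "LW"), ...
pieces <- params %>% pmap(fake_fetch)

# Stack the list of tibbles into a single tibble
df_toy <- bind_rows(pieces)
print(df_toy)
```

Because the matching is by name, the column in params has to be called position, the same as the function argument.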
Step 3. Clean the data.
This leans heavily on separate, but it works to get everything into a format that I like.
    p_clean <- function() {
      df <- p_pull() %>%
        separate(
          name,
          into = c("junk", "first", "last", "meta"),
          sep = "(?=[A-Z][a-z])|(?<=[a-z])(?=[A-Z])",
          fill = "right",
          extra = "merge") %>%
        separate(meta, into = c("team", "position"), sep = "\\s") %>%
        mutate(name = str_c(first, last, sep = "")) %>%
        mutate(goals = as.numeric(goals)) %>%
        drop_na() %>%
        mutate(length = str_length(team)) %>%
        filter(length <= 3) %>%
        select(name, team, position, goals)
      return(df)
    }

    df <- p_clean()
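Leaving the regex aside, the tail end of p_clean can be sketched on a hand-made tibble. The rows below are invented, and the idea that the team-length filter exists to throw out rows where the split went wrong is my reading of the pipeline, not something it states. The suppressWarnings() wrapper is also an addition, there only to keep the toy output quiet.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical rows as they might look after the two separate() calls
raw <- tibble(
  name     = c("Player One", "Player Two", "Player Three"),
  team     = c("EDM", "DavidEDM", "TOR"),
  position = c("C", "C", "LW"),
  goals    = c("3", "2", "n/a")
)

clean <- raw %>%
  # Text that is not a number becomes NA
  mutate(goals = suppressWarnings(as.numeric(goals))) %>%
  # Drop any row that now contains an NA
  drop_na() %>%
  # Keep only rows whose team code is at most three characters long
  mutate(length = str_length(team)) %>%
  filter(length <= 3) %>%
  select(name, team, position, goals)

print(clean)
```

Only the first row survives: the second has a team value that is too long, and the third has a goals value that cannot be parsed as a number.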
Step 4. Calculate the replacement level.
I use pmap again to pump through each position to get the mean value for the top X players. It’s a little overkill, but really flexible.
    p_replacement <- function(pos, slots) {
      rp <- df %>%
        filter(position == pos) %>%
        arrange(desc(goals)) %>%
        filter(row_number() <= slots) %>%
        group_by(position) %>%
        summarise(goals = mean(goals))
      return(rp)
    }

    p_vorp <- function() {
      # slots depend on how many position players start for each team
      # if there are 10 teams and 2 LW per team then slots -> 10 * 2 = 20
      params <- tribble(
        ~pos, ~slots,
        "C",  20,
        "LW", 20,
        "RW", 20,
        "D",  20)
      rp <- params %>%
        pmap(p_replacement) %>%
        bind_rows()
      return(rp)
    }
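The "mean of the top X players" step is easy to check by hand on a tiny example. The sketch below uses three made-up centres and a hypothetical helper, top_mean, that mirrors the filter, arrange and summarise chain but takes the data as an argument instead of reading the global df.

```r
library(dplyr)

# Hypothetical projections: three centres, goals already numeric
toy <- tibble(
  name     = c("P1", "P2", "P3"),
  position = c("C", "C", "C"),
  goals    = c(4, 2, 1)
)

# Mean goals of the top `slots` players at one position
top_mean <- function(data, pos, slots) {
  data %>%
    filter(position == pos) %>%
    arrange(desc(goals)) %>%
    filter(row_number() <= slots) %>%
    summarise(goals = mean(goals)) %>%
    pull(goals)
}

# With 2 slots, only the players with 4 and 2 goals count
result <- top_mean(toy, "C", 2)
print(result)  # 3
```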
    replacement <- p_vorp()

    # calculate value over replacement player
    vorp <- df %>%
      left_join(replacement, by = "position") %>%
      mutate(goals_vorp = goals.x - goals.y) %>%
      rename(goals = goals.x, goals_rp = goals.y) %>%
      select(-goals_rp) %>%
      arrange(desc(goals_vorp))
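In case goals.x and goals.y look like they came out of nowhere: both tables have a goals column, so left_join adds the .x and .y suffixes to tell them apart. Here is a small sketch with invented numbers (toy_df and toy_rp are hypothetical stand-ins for df and replacement).

```r
library(dplyr)

# Hypothetical cleaned projections
toy_df <- tibble(
  name     = c("P1", "P2", "P3"),
  position = c("C", "C", "D"),
  goals    = c(4, 2, 1)
)

# Hypothetical replacement levels, one row per position
toy_rp <- tibble(
  position = c("C", "D"),
  goals    = c(3, 1)
)

# The shared goals column comes back as goals.x (player) and goals.y (replacement)
joined <- toy_df %>% left_join(toy_rp, by = "position")
print(names(joined))

toy_vorp <- joined %>%
  mutate(goals_vorp = goals.x - goals.y) %>%
  rename(goals = goals.x, goals_rp = goals.y) %>%
  select(-goals_rp) %>%
  arrange(desc(goals_vorp))

print(toy_vorp)
```

P1 comes out on top at one goal above replacement, P3 sits exactly at replacement, and P2 lands one goal below.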