Web Scraping with rvest: Exploring Sports Industry Jobs
Web scraping with rvest is easy and, surprisingly, comes in handy in situations that you may not have thought of.
For example, one of the unique things about academia is the constant need to stay “ahead of the curve,” meaning being nimble enough as a program to shift the curriculum around and give students training and education in the areas that are in demand within the industry (sports analytics included).
This is particularly essential in the field of sport management.
Case in point: I am currently serving on a “task force” within our department charged with redefining the “future of our program.” In other words: creating a five-year plan to align our program to fit the needs of the sports industry.
We want our students to have the necessary skills and education required to be as competitive on the job market as possible. And, as if the sports industry job market were not tough enough, the COVID-19 pandemic has only made it more difficult for graduates.
Because of this, we – as a committee – have devised multiple ways to “survey” the industry to see where it is heading in terms of popular and in-demand jobs.
A quick way to get a “broad” view of this is by simply scraping online job posting sites such as TeamWork Online or the NCAA Job Market.
Scraping these sites is relatively easy.
And the code is typically easy to adapt between sites: usually you only need to change the URL structure and the page sequencing to make it work.
Let’s take a look at how to scrape, for example, the NCAA Job Market website.
Scraping Using rvest: Setting Up a Data Frame
The first step in the web scraping process is setting up a data frame where all the information will be stored.
It takes a little bit of forward thinking in order to do this correctly.
For example, take a look at the NCAA Job Market website.
When first examining a website for scraping, you need to consider the exact information that you want to collect.
What would be beneficial? What would provide insightful data?
In this case, I see three things I want to scrape:
- The title of the job itself
- The institution that is hiring
- And the location of the job
Because of that, I need to create a data frame that includes those three variables. Doing so in RStudio is simple:
listings <- data.frame(title = character(),
                       school = character(),
                       location = character(),
                       stringsAsFactors = FALSE)
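A quick str() call confirms the empty frame has the three character columns waiting to be filled:

str(listings)
## 'data.frame': 0 obs. of 3 variables:
##  $ title   : chr(0)
##  $ school  : chr(0)
##  $ location: chr(0)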
Once you create your data frame, it is time to start constructing your script that will actually pull the information off of the site. The first step is developing the sequencing and ensuring that you provide correct URL structure.
Web Scraping with rvest: Sequencing and URL Construction
If you visit the NCAA Job Market as of this writing, you will see that there are currently seven pages of jobs, with 25 jobs posted per page.
You, of course, want to grab all of the information beyond just the first page. In order to do this, you have to instruct rvest on how the URLs are structured on the website.
If you click ahead to Page 2 of the NCAA Job Market, you will see in your browser that the URL is structured as such:
https://ncaamarket.ncaa.org/jobs/?page=2
With that in mind, the code starts like this:
for (i in 1:7) {
  url_ds <- paste0("https://ncaamarket.ncaa.org/jobs/?page=", i)
Basically, you are instructing R to continuously “paste and go” that URL structure, running through the numbers 1 through 7 after page=.
As it does so, it pulls the information off of all seven pages.
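If you want to see exactly what the loop is producing before any scraping happens, you can print the generated URLs on their own:

for (i in 1:7) {
  print(paste0("https://ncaamarket.ncaa.org/jobs/?page=", i))
}
## [1] "https://ncaamarket.ncaa.org/jobs/?page=1"
## ...
## [1] "https://ncaamarket.ncaa.org/jobs/?page=7"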
This is honestly the trickiest part of scraping, at least to me.
It takes a little trial and error sometimes to figure it out but, once you do it enough times, the process of piecing together this little puzzle becomes easier.
Once you get this part sorted out, you can move onto pulling the information for all the variables we listed above (title, school, and location).
Scraping with rvest: Pulling the Variables
At this point, the last thing you need to do is instruct rvest where exactly the information you are looking for is located on the site.
To better understand this, let’s look at the code:
# read the page once; each read_html() call would otherwise re-download it
page <- read_html(url_ds)

#job title
title <- page %>%
  html_nodes('#jobURL') %>%
  html_text() %>%
  str_extract("(\\w+.+)+")

#school
school <- page %>%
  html_nodes('.bti-ui-job-result-detail-employer') %>%
  html_text() %>%
  str_extract("(\\w+).+")

#location
location <- page %>%
  html_nodes('.bti-ui-job-result-detail-location') %>%
  html_text() %>%
  str_extract("(\\w+).+")
As you can see, rvest reads the HTML of the URL you provided once per page, and each variable is then pulled from that parsed result.
The most important part here, though, is the html_nodes section.
It is here that you tell rvest where to look for the information.
To get this information, you first need to install the Chrome extension called SelectorGadget.
Once you do that, visit the website, turn on SelectorGadget, and click on the information you want to scrape. When I highlight a job title on the NCAA Job Market, for example, SelectorGadget reports that the title can be selected with the CSS selector ‘#jobURL.’
I take that information and simply insert it into the html_nodes section of the code.
And then do the same for school and location.
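Before dropping those selectors into the loop, it is worth sanity-checking them on a single page. If a selector is wrong, html_nodes() quietly returns an empty set, and if the three selectors match different numbers of elements, the columns will not line up when they are bound together. A quick check, assuming the site still lists 25 jobs per page:

library(rvest)

page <- read_html("https://ncaamarket.ncaa.org/jobs/?page=1")
length(html_nodes(page, '#jobURL'))                             # should be 25
length(html_nodes(page, '.bti-ui-job-result-detail-employer'))  # should be 25
length(html_nodes(page, '.bti-ui-job-result-detail-location'))  # should be 25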
Once all of that is in place, the last step is to rbind each page’s results onto the data frame. The complete code looks like this:
library(rvest)
library(stringr)

listings <- data.frame(title = character(),
                       school = character(),
                       location = character(),
                       stringsAsFactors = FALSE)

for (i in 1:7) {
  url_ds <- paste0("https://ncaamarket.ncaa.org/jobs/?page=", i)

  # read each page once, then pull every field from the parsed result
  page <- read_html(url_ds)

  # job title
  title <- page %>%
    html_nodes('#jobURL') %>%
    html_text() %>%
    str_extract("(\\w+.+)+")

  # school
  school <- page %>%
    html_nodes('.bti-ui-job-result-detail-employer') %>%
    html_text() %>%
    str_extract("(\\w+).+")

  # location
  location <- page %>%
    html_nodes('.bti-ui-job-result-detail-location') %>%
    html_text() %>%
    str_extract("(\\w+).+")

  listings <- rbind(listings,
                    data.frame(title, school, location, stringsAsFactors = FALSE))
}
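Once the loop finishes, a quick inspection confirms everything landed where it should:

nrow(listings)      # 7 pages x up to 25 jobs each, so just under 200 rows
head(listings, 3)   # spot-check the first few titles, schools, and locations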
Lastly, if you want to visualize the information, a wordcloud is a good choice. One way to build it is with the wordcloud package, after splitting the titles into words and dropping a few filler terms:

library(wordcloud)   # also attaches RColorBrewer, which provides brewer.pal()

title_words <- tolower(unlist(strsplit(listings$title, "\\W+")))
title_words <- title_words[nzchar(title_words) & !title_words %in% c("experience", "will", "work")]
word_freqs <- sort(table(title_words), decreasing = TRUE)

wordcloud(names(word_freqs), as.numeric(word_freqs), max.words = 5000,
          colors = brewer.pal(8, "Paired"), random.order = FALSE)
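And if you prefer raw counts to a graphic, the same frequency table the cloud was drawn from tells the story directly:

head(word_freqs, 10)   # the ten most common words across all job titles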
Scraping with rvest: Conclusion
As mentioned, the process of web scraping with rvest is not overly difficult.
Once you figure out the URL structure, the rest kind of falls in place. Of course, the use of SelectorGadget makes it even easier since you do not have to manually dig through the HTML to find where the information you want to scrape is nested.
As for the information gathered from the NCAA Job Market, it should not be surprising that coaching is an in-demand job. Specifically, it looks like Assistant Women’s Coaches are in especially high demand.
Running the same process on other sites, such as Indeed or TeamWork Online, yields vastly different results, however.
You have to keep in mind that there is a limited amount of data on the NCAA Job Market: just under 200 jobs.
On the other hand, a search for “sports” on Indeed returns over a thousand results. TeamWork Online has nearly 700 jobs posted.
So, as you can imagine, the wordclouds and the conclusions you can draw from scraping those websites are a bit broader than those from the NCAA Job Market.
All said, though, the process of web scraping with rvest quickly leads to broad, overarching results that can fuel a more nuanced discussion.