Web Scraping to Item Response Theory – A College Football Adventure

[This article was first published on Educate-R - R, and kindly contributed to R-bloggers.]

Web Scraping to Item Response Theory: A College Football Adventure

Brandon LeBeau, Andrew Zieffler, and Kyle Nickodem

University of Iowa & University of Minnesota

Background

Data Available

Coach

Example Coach Data

##   Year Team Win Loss Tie     Pct  PF  PA Delta        coach
## 1 2010 Iowa   8    5   0 0.61538 376 221   155 Kirk Ferentz
## 2 2011 Iowa   7    6   0 0.53846 358 310    48 Kirk Ferentz
## 3 2012 Iowa   4    8   0 0.33333 232 275   -43 Kirk Ferentz
## 4 2013 Iowa   8    5   0 0.61538 342 246    96 Kirk Ferentz
## 5 2014 Iowa   7    6   0 0.53846 367 333    34 Kirk Ferentz

Game by Game

Example GBG Data

##    Team           Official Year       Date WL          Opponent PF PA
## 1  Iowa University of Iowa 2014  8/30/2014  W     Northern Iowa 31 23
## 2  Iowa University of Iowa 2014   9/6/2014  W     Ball St. (IN) 17 13
## 3  Iowa University of Iowa 2014  9/13/2014  L          Iowa St. 17 20
## 4  Iowa University of Iowa 2014  9/20/2014  W   Pittsburgh (PA) 24 20
## 5  Iowa University of Iowa 2014  9/27/2014  W       Purdue (IN) 24 10
## 6  Iowa University of Iowa 2014 10/11/2014  W           Indiana 45 29
## 7  Iowa University of Iowa 2014 10/18/2014  L          Maryland 31 38
## 8  Iowa University of Iowa 2014  11/1/2014  W Northwestern (IL) 48  7
## 9  Iowa University of Iowa 2014  11/8/2014  L         Minnesota 14 51
## 10 Iowa University of Iowa 2014 11/15/2014  W          Illinois 30 14
## 11 Iowa University of Iowa 2014 11/22/2014  L         Wisconsin 24 26
## 12 Iowa University of Iowa 2014 11/28/2014  L          Nebraska 34 37
## 13 Iowa University of Iowa 2014   1/2/2015  L         Tennessee 28 45
##              Location
## 1       Iowa City, IA
## 2       Iowa City, IA
## 3       Iowa City, IA
## 4      Pittsburgh, PA
## 5  West Lafayette, IN
## 6       Iowa City, IA
## 7    College Park, MD
## 8       Iowa City, IA
## 9     Minneapolis, MN
## 10      Champaign, IL
## 11      Iowa City, IA
## 12      Iowa City, IA
## 13   Jacksonville, FL

Team

Web Scraping

Iowa Coaches Over Time

Iowa State Coaches Over Time

Strengths in web scraping

Challenges of web scraping

When is Web Scraping Worthwhile?


HTML Basics

  • <h1> to <h6>
  • <b> <i>
  • <a href="http://www.google.com">
  • <table>
  • <p>
  • <ul> & <li>
  • <div>
  • <img>
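
As a rough illustration of how these tags look from R, the sketch below parses a small, made-up HTML fragment and pulls out a few of the elements listed above. The fragment and the object names (page, heading, items, link) are hypothetical and are not part of the original slides.

```r
library(rvest)

# A made-up HTML fragment that uses several of the tags listed above
page <- read_html('
<html><body>
  <h1>Iowa Football</h1>
  <p>Coached by <b>Kirk Ferentz</b>.</p>
  <ul><li>2013</li><li>2014</li></ul>
  <a href="http://www.google.com">Search</a>
</body></html>')

# Text inside the <h1> heading
heading <- page %>% html_nodes("h1") %>% html_text()

# Text of each <li> item in the unordered list
items <- page %>% html_nodes("li") %>% html_text()

# Destination stored in the href attribute of the <a> tag
link <- page %>% html_nodes("a") %>% html_attr("href")
```

Each tag name doubles as a CSS selector, which is the same mechanism the rvest examples later in the deck rely on.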

HTML Code Example

Tools for web scraping

Basics of rvest

SelectorGadget

Combine SelectorGadget with rvest

library(rvest)
wiki_kirk <- read_html("https://en.wikipedia.org/wiki/Kirk_Ferentz")
wiki_kirk_extract <- wiki_kirk %>%
    html_nodes(".vcard td , .vcard th")
head(wiki_kirk_extract)

## {xml_nodeset (6)}
## [1] <td colspan="2" style="text-align:center"><a href="/wiki/File:Kirk_p ...
## [2] <th scope="row">Sport(s)</th>
## [3] <td class="category">\n  <a href="/wiki/American_football" title="Am ...
## [4] <th colspan="2" style="text-align:center;background-color: lightgray ...
## [5] <th scope="row">Title</th>
## [6] <td>\n  <a href="/wiki/Head_coach" title="Head coach">Head coach</a> ...

Extract text

wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_text()
head(wiki_kirk_extract)

## [1] "\nFerentz at the 2010 Orange Bowl\n"
## [2] "Sport(s)"                           
## [3] "Football"                           
## [4] "Current position"                   
## [5] "Title"                              
## [6] "Head coach"
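
Extracted cell text often comes back padded with whitespace such as newlines. One possible cleanup, sketched here on a made-up fragment rather than the live Wikipedia page, is the trim argument of html_text(). The fragment and the names cells, raw_text, and trim_text are assumptions for illustration.

```r
library(rvest)

# Hypothetical fragment that mimics table cells padded with newlines
cells <- read_html('<table class="vcard">
  <tr><td>
Ferentz at the 2010 Orange Bowl
</td></tr>
  <tr><th>Sport(s)</th></tr>
</table>') %>%
  html_nodes(".vcard td , .vcard th")

# Default behaviour keeps the surrounding whitespace
raw_text <- html_text(cells)

# trim = TRUE strips leading and trailing whitespace from each element
trim_text <- html_text(cells, trim = TRUE)
```

Trimming at extraction time avoids a separate string-cleaning pass, although whitespace inside the text is left untouched.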

Encoding problems

wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_text() %>%
  guess_encoding()

##       encoding language confidence
## 1        UTF-8                1.00
## 2 windows-1252       en       0.36
## 3 windows-1250       ro       0.18
## 4 windows-1254       tr       0.13
## 5     UTF-16BE                0.10
## 6     UTF-16LE                0.10

Fix Encoding Problems

wiki_kirk <- read_html("https://en.wikipedia.org/wiki/Kirk_Ferentz", 
                       encoding = 'UTF-8')
wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_text() %>% 
  repair_encoding()

Extract html tags

wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_name()
head(wiki_kirk_extract)

## [1] "td" "th" "td" "th" "th" "td"

Extract html attributes

wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_attrs()
head(wiki_kirk_extract)

## [[1]]
##             colspan               style 
##                 "2" "text-align:center" 
## 
## [[2]]
## scope 
## "row" 
## 
## [[3]]
##      class 
## "category" 
## 
## [[4]]
##                                          colspan 
##                                              "2" 
##                                            style 
## "text-align:center;background-color: lightgray;" 
## 
## [[5]]
## scope 
## "row" 
## 
## [[6]]
## named character(0)

Extract links

wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard a") %>%
  html_attr('href')
head(wiki_kirk_extract)

## [1] "/wiki/File:Kirk_pressconference_orangebowl2010.JPG"
## [2] "/wiki/American_football"                           
## [3] "/wiki/Head_coach"                                  
## [4] "/wiki/Iowa_Hawkeyes_football"                      
## [5] "/wiki/Big_Ten_Conference"                          
## [6] "/wiki/Iowa_City,_Iowa"

Valid Links

valid_links <- paste0('https://www.wikipedia.org', wiki_kirk_extract)
head(valid_links)

## [1] "https://www.wikipedia.org/wiki/File:Kirk_pressconference_orangebowl2010.JPG"
## [2] "https://www.wikipedia.org/wiki/American_football"                           
## [3] "https://www.wikipedia.org/wiki/Head_coach"                                  
## [4] "https://www.wikipedia.org/wiki/Iowa_Hawkeyes_football"                      
## [5] "https://www.wikipedia.org/wiki/Big_Ten_Conference"                          
## [6] "https://www.wikipedia.org/wiki/Iowa_City,_Iowa"
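
An alternative to pasting a host name onto each link by hand is to resolve the relative links against the URL of the page they came from. The sketch below uses url_absolute() from xml2 for this, which is a different technique from the paste0() call above. The two links are copied from the earlier output; the object names are hypothetical.

```r
library(xml2)

# Relative links of the kind returned by html_attr('href')
relative_links <- c("/wiki/American_football", "/wiki/Head_coach")

# Resolve each link against the URL of the page that was scraped
absolute_links <- url_absolute(
  relative_links,
  base = "https://en.wikipedia.org/wiki/Kirk_Ferentz"
)
```

Because the base is the scraped page itself, the resolved links stay on the same host (en.wikipedia.org) as the source page.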

Extract Tables

record_kirk <- wiki_kirk %>%
  html_nodes(".wikitable") %>%
  .[[1]] %>%
  html_table(fill = TRUE)
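
To see what html_table() returns without depending on the live page, the same pipeline can be run on a small inline table. This is only a sketch: the HTML is made up, its four values per row are copied from the 2013 and 2014 rows of the example coach data, and the names page and record are assumptions.

```r
library(rvest)

# Hypothetical table; values copied from the example coach data
page <- read_html('
<table class="wikitable">
  <tr><th>Year</th><th>Team</th><th>Win</th><th>Loss</th></tr>
  <tr><td>2013</td><td>Iowa</td><td>8</td><td>5</td></tr>
  <tr><td>2014</td><td>Iowa</td><td>7</td><td>6</td></tr>
</table>')

# Select every matching table, keep the first, and convert it to a data frame
record <- page %>%
  html_nodes(".wikitable") %>%
  .[[1]] %>%
  html_table()
```

The header row of th cells becomes the column names, and each remaining row becomes one row of the data frame.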

Caveats to Web Scraping

Data Modeling

IRT modeling

Example code with lme4

library(lme4)
fm1a <- glmer(wingbg ~ 0 + (1|coach) + (1|Team), 
              data = yby_coach, family = binomial)
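
Read as a model statement, the formula above has no fixed intercept and two random intercepts, so the log-odds that wingbg equals 1 (presumably a win indicator) are the sum of a coach effect and a team effect. The sketch below only illustrates that arithmetic in base R. The effect values and names are made up and are not estimates from the authors' data.

```r
# Hypothetical random-intercept values on the logit scale (not real estimates)
coach_effect <- c(coach_a = 0.40, coach_b = -0.25)
team_effect  <- c(team_x = 0.30, team_y = -0.10)

# Log-odds = coach effect + team effect (no fixed intercept),
# converted to a probability with the inverse logit, plogis()
win_prob <- function(coach, team) {
  unname(plogis(coach_effect[coach] + team_effect[team]))
}

p_strong <- win_prob("coach_a", "team_x")  # inverse logit of 0.40 + 0.30
p_weak   <- win_prob("coach_b", "team_y")  # inverse logit of -0.25 - 0.10
```

With a fitted model, ranef(fm1a) returns the estimated coach and team effects that would take the place of these made-up values.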

Plot Showing Team Ability

Connect
