[This article was first published on Educate-R - R, and kindly contributed to R-bloggers.]
Web Scraping to Item Response Theory: A College Football Adventure
Brandon LeBeau, Andrew Zieffler, and Kyle Nickodem
University of Iowa & University of Minnesota
Background
- Began after Tim Brewster was fired
- Wanted to try to predict next great coach
Data Available
- Data is available at three levels
  - Coach
  - Game by Game
  - Team
Coach
- Data
  - Overall record
  - Team history
- Not Available
  - Coordinator history
Example Coach Data
##   Year Team Win Loss Tie     Pct  PF  PA Delta        coach
## 1 2010 Iowa   8    5   0 0.61538 376 221   155 Kirk Ferentz
## 2 2011 Iowa   7    6   0 0.53846 358 310    48 Kirk Ferentz
## 3 2012 Iowa   4    8   0 0.33333 232 275   -43 Kirk Ferentz
## 4 2013 Iowa   8    5   0 0.61538 342 246    96 Kirk Ferentz
## 5 2014 Iowa   7    6   0 0.53846 367 333    34 Kirk Ferentz
Game by Game
- Data
  - Final score of each game
  - Date played
  - Location
- Not Available
  - No information within a game
Example GBG Data
##    Team           Official Year       Date WL          Opponent PF PA
## 1  Iowa University of Iowa 2014  8/30/2014  W     Northern Iowa 31 23
## 2  Iowa University of Iowa 2014   9/6/2014  W     Ball St. (IN) 17 13
## 3  Iowa University of Iowa 2014  9/13/2014  L          Iowa St. 17 20
## 4  Iowa University of Iowa 2014  9/20/2014  W   Pittsburgh (PA) 24 20
## 5  Iowa University of Iowa 2014  9/27/2014  W       Purdue (IN) 24 10
## 6  Iowa University of Iowa 2014 10/11/2014  W           Indiana 45 29
## 7  Iowa University of Iowa 2014 10/18/2014  L          Maryland 31 38
## 8  Iowa University of Iowa 2014  11/1/2014  W Northwestern (IL) 48  7
## 9  Iowa University of Iowa 2014  11/8/2014  L         Minnesota 14 51
## 10 Iowa University of Iowa 2014 11/15/2014  W          Illinois 30 14
## 11 Iowa University of Iowa 2014 11/22/2014  L         Wisconsin 24 26
## 12 Iowa University of Iowa 2014 11/28/2014  L          Nebraska 34 37
## 13 Iowa University of Iowa 2014   1/2/2015  L         Tennessee 28 45
##              Location
## 1       Iowa City, IA
## 2       Iowa City, IA
## 3       Iowa City, IA
## 4      Pittsburgh, PA
## 5  West Lafayette, IN
## 6       Iowa City, IA
## 7    College Park, MD
## 8       Iowa City, IA
## 9     Minneapolis, MN
## 10      Champaign, IL
## 11      Iowa City, IA
## 12      Iowa City, IA
## 13   Jacksonville, FL
Team
- Data
  - Overall team record
  - Team statistics
  - Rankings
  - Conference Affiliation
- Data is very similar to that of the coach level
Web Scraping
- Data were obtained from many sources
  - Much from http://cfbdatawarehouse.com
  - Also used Wikipedia, ESPN, and Rivals
[Figure: Iowa Coaches Over Time]
[Figure: Iowa State Coaches Over Time]
Strengths in web scraping
- Data is relatively easily obtained
- Structured process for obtaining data
- Can be easily updated
Challenges of web scraping
- At the mercy of the website
  - Many sites are old
  - Not up to date on current design standards
- Data validation can be difficult and time-consuming
- Need some basic knowledge of HTML
When is Web Scraping Worthwhile?
- Best when scraping many pages
- Particularly when web addresses are not structured
- Useful when data need to be updated
- Not useful if only scraping a single page/table
HTML Basics
- HTML is structured by start tags (e.g. <table>) and end tags (e.g. </table>)
- Common tags
  - <h1>–<h6>
  - <b>
  - <i>
  - <a href="http://www.google.com">
  - <table>
  - <p>
  - <ul> & <li>
  - <div>
  - <img>
- Highly structured pages are the easiest to scrape
HTML Code Example
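As a minimal sketch of what such a page looks like, the snippet below stores a small, made-up HTML fragment in an R string and parses it with rvest. The fragment, the class name coach-list, and the object names are assumptions for illustration and are not taken from the original slides.

```r
library(rvest)

# Hypothetical HTML fragment: a heading followed by an unordered list.
example_html <- '
<html>
  <body>
    <h1>Iowa Coaches</h1>
    <ul class="coach-list">
      <li>Kirk Ferentz</li>
      <li>Hayden Fry</li>
    </ul>
  </body>
</html>'

# read_html() accepts a literal HTML string as well as a URL.
page <- read_html(example_html)

# Each start tag (e.g. <li>) is paired with an end tag (</li>);
# the text between them is what html_text() returns.
heading <- html_text(html_node(page, "h1"))
coaches <- html_text(html_nodes(page, ".coach-list li"))

print(heading)
print(coaches)
```

Because the list items share a parent with a class, one selector is enough to pull all of them, which is what makes highly structured pages easy to scrape.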
Tools for web scraping
- R
- Python
  - beautiful soup: http://www.crummy.com/software/BeautifulSoup/
- Misc
  - SelectorGadget: http://selectorgadget.com/
Basics of rvest
- read_html is the most basic function; it reads a page into R
- html_node or html_nodes select parts of the page
  - These functions need css selectors or xpath
  - SelectorGadget is the easiest way to get these
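The difference between html_node and html_nodes, and between a css selector and an xpath, can be sketched on a small inline table. The HTML string and the class name record are made up for illustration; only the rvest functions themselves come from the slides.

```r
library(rvest)

# Hypothetical two-row table used only for illustration.
page <- read_html('
<table class="record">
  <tr><td>2013</td><td>8-5</td></tr>
  <tr><td>2014</td><td>7-6</td></tr>
</table>')

# html_node() returns only the first match for the CSS selector.
first_cell <- page %>% html_node(".record td") %>% html_text()

# html_nodes() returns every match.
all_cells <- page %>% html_nodes(".record td") %>% html_text()

# The same selection written as XPath instead of CSS.
xpath_cells <- page %>%
  html_nodes(xpath = "//table[@class='record']//td") %>%
  html_text()

print(first_cell)
print(all_cells)
```

Either selector style works; css selectors tend to be shorter, while xpath can express conditions that css cannot.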
SelectorGadget
- SelectorGadget is a JavaScript add-on for web browsers
- Can quickly identify a css selector or xpath to select the correct portion of a web page
Combine SelectorGadget with rvest
library(rvest)
wiki_kirk <- read_html("https://en.wikipedia.org/wiki/Kirk_Ferentz")
wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard td , .vcard th")
head(wiki_kirk_extract)
## {xml_nodeset (6)}
## [1] <td colspan="2" style="text-align:center"><a href="/wiki/File:Kirk_p ...
## [2] <th scope="row">Sport(s)</th>
## [3] <td class="category">\n <a href="/wiki/American_football" title="Am ...
## [4] <th colspan="2" style="text-align:center;background-color: lightgray ...
## [5] <th scope="row">Title</th>
## [6] <td>\n <a href="/wiki/Head_coach" title="Head coach">Head coach</a> ...
Extract text
- Use the html_text function
wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_text()
head(wiki_kirk_extract)
## [1] "\nFerentz at the 2010 Orange Bowl\n"
## [2] "Sport(s)"
## [3] "Football"
## [4] "Current position"
## [5] "Title"
## [6] "Head coach"
Encoding problems
- Two solutions to fix encoding problems
  - guess_encoding: guess the encoding
  - repair_encoding: fix encoding problems
wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_text() %>%
  guess_encoding()
##       encoding language confidence
## 1        UTF-8                1.00
## 2 windows-1252       en       0.36
## 3 windows-1250       ro       0.18
## 4 windows-1254       tr       0.13
## 5     UTF-16BE                0.10
## 6     UTF-16LE                0.10
Fix Encoding Problems
- Best practice is to reload the page with the correct encoding
wiki_kirk <- read_html("https://en.wikipedia.org/wiki/Kirk_Ferentz",
                       encoding = 'UTF-8')
- Can also repair encoding after the fact
wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_text() %>%
  repair_encoding()
Extract html tags
- Use the html_name function
wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_name()
head(wiki_kirk_extract)
## [1] "td" "th" "td" "th" "th" "td"
Extract html attributes
- Use the html_attrs function
wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_attrs()
head(wiki_kirk_extract)
## [[1]]
##             colspan               style
##                 "2" "text-align:center"
##
## [[2]]
## scope
## "row"
##
## [[3]]
##      class
## "category"
##
## [[4]]
##                                          colspan
##                                              "2"
##                                            style
## "text-align:center;background-color: lightgray;"
##
## [[5]]
## scope
## "row"
##
## [[6]]
## named character(0)
Extract links
- Use the html_attr function (singular) to pull out a single attribute, here href
wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard a") %>%
  html_attr('href')
head(wiki_kirk_extract)
## [1] "/wiki/File:Kirk_pressconference_orangebowl2010.JPG"
## [2] "/wiki/American_football"
## [3] "/wiki/Head_coach"
## [4] "/wiki/Iowa_Hawkeyes_football"
## [5] "/wiki/Big_Ten_Conference"
## [6] "/wiki/Iowa_City,_Iowa"
Valid Links
- The extracted links are relative, so the site address must be prepended to make them valid
- The paste0 function is helpful for this
valid_links <- paste0('https://www.wikipedia.org', wiki_kirk_extract)
head(valid_links)
## [1] "https://www.wikipedia.org/wiki/File:Kirk_pressconference_orangebowl2010.JPG"
## [2] "https://www.wikipedia.org/wiki/American_football"
## [3] "https://www.wikipedia.org/wiki/Head_coach"
## [4] "https://www.wikipedia.org/wiki/Iowa_Hawkeyes_football"
## [5] "https://www.wikipedia.org/wiki/Big_Ten_Conference"
## [6] "https://www.wikipedia.org/wiki/Iowa_City,_Iowa"
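An alternative to pasting strings together is to resolve each relative link against the address of the page it came from. The sketch below uses url_absolute from the xml2 package (which rvest builds on); the two example links and the choice of base URL are assumptions for illustration.

```r
library(xml2)

# Relative links of the kind returned by html_attr('href').
relative_links <- c("/wiki/American_football", "/wiki/Head_coach")

# Resolve each link against the page it was scraped from.
base_url <- "https://en.wikipedia.org/wiki/Kirk_Ferentz"
absolute_links <- url_absolute(relative_links, base_url)

print(absolute_links)
```

This keeps the host of the scraped page (here en.wikipedia.org), and should also cope with links that are already absolute.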
Extract Tables
- The html_table function is useful to scrape well formatted tables
record_kirk <- wiki_kirk %>%
  html_nodes(".wikitable") %>%
  .[[1]] %>%
  html_table(fill = TRUE)
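To see what html_table returns without depending on the live Wikipedia page, the sketch below runs it on a small inline table. The HTML string is made up for illustration; its numbers mirror the 2013 and 2014 rows of the example coach data.

```r
library(rvest)

# Hypothetical table; the figures mirror the coach-level example data.
page <- read_html('
<table class="wikitable">
  <tr><th>Year</th><th>Win</th><th>Loss</th></tr>
  <tr><td>2013</td><td>8</td><td>5</td></tr>
  <tr><td>2014</td><td>7</td><td>6</td></tr>
</table>')

# html_node() selects a single table, so html_table() returns one
# data frame rather than a list of data frames.
record <- page %>%
  html_node(".wikitable") %>%
  html_table()

print(record)
```

The header row becomes the column names and each remaining row becomes an observation, which is why well formatted tables need little cleaning afterwards.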
Caveats to Web Scraping
- Keep in mind that scraping uses the website's bandwidth
  - Avoid repeating bandwidth-expensive operations
  - Better to scrape once, then rerun only to update the data
- Some websites restrict scraping through copyright or terms of use; check before scraping
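One way to scrape once and reuse the result is to keep a local copy of each page and only contact the website when no copy exists. The sketch below is an assumption about how this might be done, not the authors' code: fetch_cached, fake_fetch, and the example URL are hypothetical names, and the downloader is passed in as an argument so the example runs without network access.

```r
# Fetch a page only when no local copy exists yet.
# `fetch` is injectable so the sketch can run without network access;
# in practice it might wrap download.file() or rvest::read_html().
fetch_cached <- function(url, cache_file, fetch) {
  if (!file.exists(cache_file)) {
    Sys.sleep(1)  # brief pause so repeated requests are spaced out
    writeLines(fetch(url), cache_file)
  }
  readLines(cache_file)
}

# Stand-in downloader that counts how often it is called.
n_requests <- 0
fake_fetch <- function(url) {
  n_requests <<- n_requests + 1
  "<html><body>cached page</body></html>"
}

cache_file <- tempfile(fileext = ".html")
first  <- fetch_cached("http://example.com/page", cache_file, fake_fetch)
second <- fetch_cached("http://example.com/page", cache_file, fake_fetch)

# The second call is served from the local file.
print(n_requests)
```

Deleting the cached file is then the explicit signal that a page should be scraped again to update the data.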
Data Modeling
- Research Questions
  - Who is the next great coach?
  - What characteristics are in common for these coaches?
IRT modeling
- So far we have explored the win/loss records of teams in the BCS era with item response theory (IRT)
- IRT is commonly used to model assessment data to estimate item parameters and person ‘ability’
- We recode the Win/Loss/Tie game-by-game results
  - 1 = Win
  - 0 = Otherwise
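The recoding can be sketched in base R using the 2014 Iowa results from the example game-by-game data. The data frame name gbg is an assumption; the column name wingbg matches the response variable used in the model code.

```r
# Game-by-game results for Iowa's 2014 season, as in the example data.
gbg <- data.frame(
  Opponent = c("Northern Iowa", "Ball St. (IN)", "Iowa St.", "Pittsburgh (PA)",
               "Purdue (IN)", "Indiana", "Maryland", "Northwestern (IL)",
               "Minnesota", "Illinois", "Wisconsin", "Nebraska", "Tennessee"),
  WL = c("W", "W", "L", "W", "W", "W", "L", "W", "L", "W", "L", "L", "L"),
  stringsAsFactors = FALSE
)

# 1 = win, 0 = otherwise (a loss or a tie).
gbg$wingbg <- as.integer(gbg$WL == "W")

table(gbg$wingbg)
```

Comparing against "W" rather than "L" means ties fall into the 0 category, as the coding scheme requires.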
Example code with lme4
- A 1 parameter multilevel IRT model can be fitted using glmer in the lme4 package
library(lme4)
fm1a <- glmer(wingbg ~ 0 + (1|coach) + (1|Team),
              data = yby_coach, family = binomial)
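Read as an equation, the glmer formula appears to specify the model below: no fixed intercept (the 0 term) plus a random intercept for each coach and each team on the logit scale. The symbols are notation chosen here for illustration, not the authors'.

```latex
\operatorname{logit}\,\Pr(y_g = 1) = u_{c[g]} + v_{t[g]}, \qquad
u_c \sim N(0, \sigma^2_{\text{coach}}), \quad
v_t \sim N(0, \sigma^2_{\text{team}})
```

Here y_g is wingbg for game g, and c[g] and t[g] index the coach and team recorded for that row of yby_coach. The estimated u and v values play the role of the 'ability' parameters in the IRT framing.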
[Figure: Plot Showing Team Ability]
Connect
- e-mail: brandon-lebeau (at) uiowa.edu
- Twitter: @blebeau11; https://twitter.com/blebeau11
- Linkedin: https://www.linkedin.com/in/lebeaubr
- Website: http://educate-r.org