Scraping XML Tables with R
[This article was first published on Analyst At Large » R, and kindly contributed to R-bloggers].
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A couple of my good friends also recently started a sports analytics blog. We’ve decided to collaborate on a couple of studies revolving around NBA data found at www.basketball-reference.com. This will be the first part of that project!
Data scientists need data. The internet has lots of data. How can I get that data into R? Scrape it!
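People have been scraping websites for as long as there have been websites. It’s gotten pretty easy with R, Python, or whatever other tool you prefer. This post shows how to use R to scrape the demographic information for all NBA and ABA players listed at www.basketball-reference.com. The script builds one index-page URL per letter of the alphabet, reads the first table on each page, and stacks the results into a single data frame.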
Here’s the code:
###### Settings
library(XML)

###### URLs
# One player index page per letter of the alphabet
url <- paste0("http://www.basketball-reference.com/players/", letters, "/")
len <- length(url)

###### Reading data
# The first table on each page holds the player list
tbl <- readHTMLTable(url[1])[[1]]
for (i in 2:len) {
  tbl <- rbind(tbl, readHTMLTable(url[i])[[1]])
}

###### Formatting data
colnames(tbl) <- c("Name", "StartYear", "EndYear", "Position", "Height", "Weight", "BirthDate", "College")
# Convert the whole column, not just its first element
tbl$BirthDate <- as.Date(tbl$BirthDate, format = "%B %d, %Y")
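The loop above grows the table with rbind on every pass. A common alternative, sketched here, is to read each page into a list and bind everything once at the end. To keep the sketch runnable without network access, `read_page` is a hypothetical stand-in for `readHTMLTable(u)[[1]]`, and the two columns and the sample birth date are made up for illustration; only three of the 26 URLs are used.

```r
# Hypothetical stand-in for readHTMLTable(u)[[1]]: returns a one-row
# data frame with made-up values so the pattern runs offline.
read_page <- function(u) {
  data.frame(
    Source = u,
    BirthDate = "February 7, 1974",
    stringsAsFactors = FALSE
  )
}

# Same URL pattern as the script above, limited to three letters
urls <- paste0("http://www.basketball-reference.com/players/", letters[1:3], "/")

# Read every page into a list, then bind the pieces in a single call
pages <- lapply(urls, read_page)
players <- do.call(rbind, pages)

# Convert the full column to Date; month names are matched
# using the current locale, so this assumes English month names
players$BirthDate <- as.Date(players$BirthDate, format = "%B %d, %Y")
```

Swapping the body of `read_page` for the real `readHTMLTable` call should give the same overall shape as the loop, assuming every page returns a table with identical columns.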
To leave a comment for the author, please follow the link and comment on their blog: Analyst At Large » R.