Scrape Web Data Using R
Plenty of people have been scraping data from the web with R for a while now, but I just completed my first project and wanted to share the code with you. It was a little hard to work through some of the “issues”, but I had great help from @DataJunkie on Twitter.
As an aside, if you are learning R and coming from another package like SPSS or SAS, I highly advise following the hashtag #rstats on Twitter; you will be amazed by the kinds of data analysis going on right now.
One note: when I read in my table, it contained a weird set of characters. I suspect some sort of encoding issue, but luckily I was able to work around it by recoding the data from a character factor to a number, using the stringr package and some basic regular expressions.
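To see what that cleanup does in isolation, here is a minimal sketch on made-up values; the stray leading characters are my guess at what a mis-decoded non-breaking space looks like, so your table may differ:

library(stringr)

# made-up values standing in for a column that arrived with stray leading bytes
x <- factor(c("Â 271", "Â 12.5", "Â -3"))

# extract just the numeric part of each entry and convert it
as.numeric(str_match(as.character(x), "-*\\d{1,3}[.]*[0-9]*"))
# [1] 271.0  12.5  -3.0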
Bring on fantasy football!
################################################################
## Help from the following sources:
## @DataJunkie on twitter
## http://www.regular-expressions.info/reference.html
## http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package
## http://stackoverflow.com/questions/2443127/how-can-i-use-r-rcurl-xml-packages-to-scrape-this-webpage
################################################################

library(XML)
library(stringr)

# build the URL
url <- paste("http://sports.yahoo.com/nfl/stats/byposition?pos=QB",
             "&conference=NFL&year=season_2009",
             "&timeframe=Week1", sep="")

# read the tables and select the one that has the most rows
tables <- readHTMLTable(url)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
tables[[which.max(n.rows)]]   # printed for inspection

# select the table we need - read as a dataframe
my.table <- tables[[7]]

# delete extra columns and keep data rows
View(head(my.table, n=20))
my.table <- my.table[3:nrow(my.table), c(1:3, 5:12, 14:18, 20:21, 23:24)]

# rename every column
c.names <- c("Name", "Team", "G", "QBRat", "P_Comp", "P_Att", "P_Yds", "P_YpA",
             "P_Lng", "P_Int", "P_TD", "R_Att", "R_Yds", "R_YpA", "R_Lng",
             "R_TD", "S_Sack", "S_SackYa", "F_Fum", "F_FumL")
names(my.table) <- c.names

# the data get read in with weird symbols - need to remove - initially stored
# as character factors; for the loops, I am manually telling the code which
# regex to use - assumes constant behavior depending on where the weird
# characters are -- is this an encoding issue?
front <- c(1)
back <- c(4:ncol(my.table))

# drop the first two (junk) characters from the Name column
for(f in front) {
  test.front <- as.character(my.table[, f])
  tt.front <- str_sub(test.front, start=3)
  my.table[, f] <- tt.front
}

# pull the numeric part out of each stats column and convert it
for(b in back) {
  test <- as.character(my.table[, b])
  tt.back <- as.numeric(str_match(test, "-*\\d{1,3}[.]*[0-9]*"))
  my.table[, b] <- tt.back
}

str(my.table)
View(my.table)

# clear memory and quit R without saving the workspace
rm(list=ls())
q(save="no")
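Since the URL is assembled with paste(), the same scrape extends naturally to more than one week. A minimal sketch, assuming the timeframe parameter accepts values Week1 through Week17 (an assumption I have not verified for every week):

library(XML)

# hypothetical helper: fetch the QB table for a given week number
get.week <- function(week) {
  url <- paste("http://sports.yahoo.com/nfl/stats/byposition?pos=QB",
               "&conference=NFL&year=season_2009",
               "&timeframe=Week", week, sep="")
  tables <- readHTMLTable(url)
  n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
  tables[[which.max(n.rows)]]   # assume the stats table is the largest one
}

# first three weeks as a list of raw data frames (each still needs the cleanup above)
weeks <- lapply(1:3, get.week)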