Huh… I didn’t realize just how similar rvest was to XML until I did a bit of digging.
After my wonderful experience using dplyr and tidyr recently, I decided to revisit some of my old RUNNING code and see if it could use an upgrade by swapping out the XML dependency with rvest.
Ultra Signup: Treasure Trove of Ultra Data
If you’re into ultra running, then you probably know about Ultra Signup and the kinds of data you can find there: current and historical race results, lists of entrants for upcoming races, results by runner, etc. I’ve done quite a bit of web scraping on their pages, and you can see some of the fun things I’ve done with the data over on my running blog.
rvest versus XML
This post will discuss the mechanics of using rvest vs. XML to scrape the entrants list for the upcoming Rock/Creek StumpJump 50k.
library(magrittr)
library(RCurl)
library(XML)
library(rvest)

# Entrants Page for the Rock/Creek StumpJump 50k Race
URL <- "http://ultrasignup.com/entrants_event.aspx?did=31114"
Downloading and Parsing the URL
rvest is definitely compact, needing only one function. I like it.
rvest_doc <- html(URL)
XML gets its work done with the help of RCurl’s getURL function.
XML_doc <- htmlParse(getURL(URL),asText=TRUE)
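For what it’s worth, htmlParse can often fetch a plain-http URL on its own, so the getURL step isn’t strictly required here; I still lean on getURL when things get fussier (headers, cookies, https). A minimal sketch, assuming libxml can reach this page directly (XML_doc2 is just a throwaway name of mine):

# Possible shortcut: let htmlParse fetch and parse the page itself.
# getURL remains handy for anything beyond a simple http GET.
XML_doc2 <- htmlParse(URL)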
And come to find out they return the exact same classed object. I didn’t know that!
class(rvest_doc)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument"
## [4] "XMLAbstractDocument"

all.equal( class(rvest_doc), class(XML_doc) )
## [1] TRUE
Searching for the HTML Table
rvest seems to pooh-pooh using XPath to select nodes in a DOM, recommending CSS selectors instead. Still, the code is nice and compact.
rvest_table_node <- html_node(rvest_doc,"table.ultra_grid")
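That said, if I’m reading the rvest docs right, html_node() also takes an xpath argument, so you can keep using an XPath selector if CSS isn’t your thing (whether the argument is there may depend on your rvest version; rvest_table_node_xp is just an illustrative name):

# Hypothetical alternative, assuming the installed rvest's html_node()
# accepts an xpath argument: same node, selected with XPath instead of CSS.
rvest_table_node_xp <- html_node(rvest_doc, xpath='//table[@class="ultra_grid"]')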
XML here uses XPath, which I don’t think is that hard to understand once you get used to it. The only other hitch is that we have to choose the first node returned from getNodeSet.
XML_table_node <- getNodeSet(XML_doc,'//table[@class="ultra_grid"]')[[1]]
But each still returns the exact same classed object.
class(rvest_table_node)
## [1] "XMLInternalElementNode" "XMLInternalNode"
## [3] "XMLAbstractNode"

all.equal( class(rvest_table_node), class(XML_table_node) )
## [1] TRUE
From HTML Table to Data Frame
rvest returns a nice stringy data frame here.
rvest_table <- html_table(rvest_table_node)
XML, meanwhile, must submit to the camelHumpDisaster of an argument name and the much-reviled stringsAsFactors=FALSE convention.
XML_table <- readHTMLTable(XML_table_node, stringsAsFactors=FALSE)
Still, they return almost equal data frames.
all.equal(rvest_table,XML_table)
## [1] "Component \"Results\": Modes: numeric, character"
## [2] "Component \"Results\": target is numeric, current is character"

all.equal( rvest_table$Results, as.integer(XML_table$Results) )
## [1] TRUE
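Since the only difference is that readHTMLTable left the Results column as character, a quick coercion should (I’d expect) make the two data frames fully equal:

# Coerce the character Results column to integer to match rvest's version;
# after this, all.equal() on the whole data frames should come back TRUE.
XML_table$Results <- as.integer(XML_table$Results)
all.equal(rvest_table, XML_table)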
Magrittr For More Elegance
Adding in the way cool magrittr pipe system makes rvest really shine in compactness.
rvest_table <- html(URL) %>%
  html_node("table.ultra_grid") %>%
  html_table()
XML is not as elegant, having to use named arguments in getNodeSet and expose the internal function .subset2 to pull out the first node.
XML_table <- htmlParse(getURL(URL),asText=TRUE) %>%
  getNodeSet(path='//table[@class="ultra_grid"]') %>%
  .subset2(n=1) %>%
  readHTMLTable(stringsAsFactors=FALSE)
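If .subset2 bothers you, one possibly tidier spelling is magrittr’s extract2() alias (its pipe-friendly name for [[), assuming it behaves as documented:

# Sketch of the same pipeline using magrittr's extract2() alias for `[[`
# in place of the internal .subset2().
XML_table <- htmlParse(getURL(URL),asText=TRUE) %>%
  getNodeSet(path='//table[@class="ultra_grid"]') %>%
  extract2(1) %>%
  readHTMLTable(stringsAsFactors=FALSE)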
Summing Things Up
rvest is definitely elegant and compact syntactic sugar, which I’m drawn to these days. But scraping web pages reveals the dirtiest data among dirty data, and for now I think I’ll stick with the power of XML over syntactic sugar.
Meh… who am I kidding, I’m just lazy. And old.