This is a follow-up to a post I uploaded earlier today about web scraping data off Google Scholar. In that post I was frustrated because I'm not smart enough to use xpathSApply to get the kind of results I wanted. Fast-forward to the evening, though: whilst having dinner with a friend, she mentioned as a passing remark that she had finally figured out that day how to pass a function to another function in R, e.g.
example <- function(x, FUN1, FUN2) {
  a <- sapply(x, FUN1)
  b <- sapply(a, FUN2)
  return(b)
}

example(c(-16, -9, -4, 0, 4, 9, 16), abs, sqrt)
# [1] 4 3 2 0 2 3 4
Now that might be a little thing to others, but to me it is amazing because I had never figured it out before! Anyway, using this new piece of knowledge I was able to take another shot at the scraping problem by rolling my own version of xpathSApply, and was thus able to complete the task successfully!
# Required packages: RCurl for fetching the page, XML for parsing and XPath
library(RCurl)
library(XML)

# One function to rule them all...
get_google_scholar_df <- function(u) {
  # download the raw HTML for the given Google Scholar URL
  html <- getURL(u)

  # parse HTML into tree structure
  doc <- htmlParse(html)

  # I hacked my own version of xpathSApply to deal with cases that return NULL,
  # which were causing me problems
  GS_xpathSApply <- function(doc, path, FUN) {
    path.base <- "/html/body/div[@class='gs_r']"
    # how many search results are on the page?
    nodes.len <- length(xpathSApply(doc, path.base))
    # build one indexed XPath per result, e.g. .../div[@class='gs_r'][3]/...
    paths <- sapply(1:nodes.len, function(i) gsub(path.base,
                      paste(path.base, "[", i, "]", sep = ""),
                      path, fixed = TRUE))
    xx <- sapply(paths, function(xpath) xpathSApply(doc, xpath, FUN), USE.NAMES = FALSE)
    # results with no matching node become NA instead of being dropped
    xx[sapply(xx, length) < 1] <- NA
    xx <- as.vector(unlist(xx))
    return(xx)
  }

  # construct data frame
  df <- data.frame(
    footer = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/span[@class='gs_fl']", xmlValue),
    title = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3", xmlValue),
    type = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3/span", xmlValue),
    publication = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/span[@class='gs_a']", xmlValue),
    description = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']", xmlValue),
    cited_by = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/span[@class='gs_fl']/a[contains(.,'Cited by')]/text()", xmlValue),
    cited_ref = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/span[@class='gs_fl']/a[contains(.,'Cited by')]", xmlAttrs),
    title_url = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3/a", xmlAttrs),
    view_as_html = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/span[@class='gs_fl']/a[contains(.,'View as HTML')]", xmlAttrs),
    view_all_versions = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/span[@class='gs_fl']/a[contains(.,' versions')]", xmlAttrs),
    from_domain = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/span[@class='gs_ggs gs_fl']/a", xmlValue),
    related_articles = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/span[@class='gs_fl']/a[contains(.,'Related articles')]", xmlAttrs),
    library_search = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/span[@class='gs_fl']/a[contains(.,'Library Search')]", xmlAttrs),
    stringsAsFactors = FALSE)

  # Clean up extracted text
  df$title <- sub(".*\\] ", "", df$title)
  df$description <- sapply(1:dim(df)[1], function(i) gsub(df$publication[i], "", df$description[i], fixed = TRUE))
  df$description <- sapply(1:dim(df)[1], function(i) gsub(df$footer[i], "", df$description[i], fixed = TRUE))
  df$type <- gsub("\\]", "", gsub("\\[", "", df$type))
  df$cited_by <- as.integer(gsub("Cited by ", "", df$cited_by, fixed = TRUE))

  # remove footer as it is now redundant after doing clean up
  df <- df[, -1]

  # free doc from memory
  free(doc)

  return(df)
}
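To see why I needed to hack my own wrapper at all, here is a minimal illustration using a made-up scrap of HTML (the two fake gs_r results below are mine, not real Google Scholar markup). Plain xpathSApply() simply drops results where a node is absent, so fields scraped with different paths come back as vectors of different lengths and cannot be bound into a data frame; GS_xpathSApply() above pads those gaps with NA instead.

library(XML)

# Hypothetical two-result page: the second result has no gs_a span
html <- '<html><body>
  <div class="gs_r"><h3>First result</h3><span class="gs_a">Journal A</span></div>
  <div class="gs_r"><h3>Second result</h3></div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

xpathSApply(doc, "//div[@class='gs_r']/h3", xmlValue)
# [1] "First result"  "Second result"
xpathSApply(doc, "//div[@class='gs_r']/span[@class='gs_a']", xmlValue)
# [1] "Journal A"   <- only one value; the missing node is silently dropped, not returned as NA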
Then, given a Google Scholar URL, we can scrape the following information for each search result:
u <- "http://scholar.google.com/scholar?as_q=baldur%27s+gate+2&num=20&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdtf=&as_sdts=5&hl=en"

df <- get_google_scholar_df(u)

t(df[1, ])
# title             "Baldur's gate and history: Race and alignment in digital role playing games"
# type              "PDF"
# publication       "C Warnes - Digital Games Research Conference (DiGRA), 2005 - digra.org"
# description       "... It is argued that games like Baldur's Gate I and II cannot be properly understood without\nreference to the fantasy novels that inform them. ... Columbia University Press, New York, 2003.\npp 2-3. 12. 8. Hess, Rhyss. Baldur's Gate and Tales of the Sword Coast. ... \n"
# cited_by          "8"
# cited_ref         "/scholar?cites=13835674724285845934&as_sdt=2005&sciodt=0,5&hl=en&oe=ASCII&num=20"
# title_url         "http://digra.org:8080/Plone/dl/db/06276.04067.pdf"
# view_as_html      "http://scholar.googleusercontent.com/scholar?q=cache:rpHocNswAsAJ:scholar.google.com/+baldur%27s+gate+2&hl=en&oe=ASCII&num=20&as_sdt=0,5"
# view_all_versions "/scholar?cluster=13835674724285845934&hl=en&oe=ASCII&num=20&as_sdt=0,5"
# from_domain       "[PDF] from digra.org"
# related_articles  "/scholar?q=related:rpHocNswAsAJ:scholar.google.com/&hl=en&oe=ASCII&num=20&as_sdt=0,5"
# library_search    NA
I think that’s kind of cool. Everything is wrapped into one function, which I rather like. This could be extended further by having a function construct a series of Google Scholar URLs with whatever parameters you require, including how many pages of results you want, and then calling get_google_scholar_df() on each URL in a loop; the resulting data frames could then be merged, and there you have it — a nice little database to do whatever you want with (a rough sketch of such a helper is below). Not sure what you might want to do with it, but there it is all the same. This was a fun little XPath exercise, and even though I didn’t learn how to achieve what I wanted with xpathSApply, it was nice to meta-hack a version of my own to get what I wanted. Awesome stuff.
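As a rough sketch of that extension, here is one way such a pager might look. This is only an illustration under a couple of assumptions: the helper name get_google_scholar_pages is made up, and I'm assuming the start query parameter steps through result pages in increments of the page size. Google Scholar may also throttle or block rapid automated requests, so go gently.

# Hypothetical helper: build one URL per page of results and stack the data frames.
# Assumes 'start' (0, 20, 40, ...) controls the result offset and 'num' the page size.
get_google_scholar_pages <- function(query, pages = 2, per_page = 20) {
  base <- "http://scholar.google.com/scholar"
  dfs <- lapply(seq_len(pages), function(p) {
    u <- paste(base, "?q=", URLencode(query, reserved = TRUE),
               "&num=", per_page,
               "&start=", (p - 1) * per_page,
               "&hl=en", sep = "")
    Sys.sleep(2)  # be polite between requests
    get_google_scholar_df(u)
  })
  do.call(rbind, dfs)  # merge the per-page data frames into one
}

# e.g. df_all <- get_google_scholar_pages("baldur's gate 2", pages = 3)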