
Google Scholar (still) sucks

[This article was first published on BMB's commonplace, and kindly contributed to R-bloggers.]
(This is a follow-up to my previous post on the topic.)

I was encouraged by the appearance of two R-based Scholar-scrapers within a week of each other. One, by Kay Cichini, converts the result pages to text mode and scrapes from there (there's a slightly hacked version by Tony Breyal on GitHub); the other, by Tony Breyal (GitHub version here), uses XPath.
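To give a flavour of the XPath approach, here is a minimal sketch (my own paraphrase, not Tony's actual code; the node classes "gs_rt" and "gs_a" are guesses at Scholar's current markup and may well change, and the query is just a placeholder):

library(XML)

## fetch one page of results (note: scripted access to /scholar pages is
## itself subject to the robots.txt caveat discussed below)
u <- "http://scholar.google.com/scholar?q=some+search+terms"
page <- paste(readLines(u, warn = FALSE), collapse = "\n")
doc <- htmlParse(page, asText = TRUE)

## pull out the result titles and the (truncated) author/source lines
titles <- xpathSApply(doc, "//h3[@class='gs_rt']", xmlValue)
byline <- xpathSApply(doc, "//div[@class='gs_a']", xmlValue)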

I started poking around with these functions — they each do some things I like and have some limitations.
Neither of them does what I really want, which is to extract the full bibliographic information. However, when I looked more closely at what GS actually gives you, I got frustrated again. The full title is available, but the bibliographic information comes only in a severely truncated form: the author list and the publication (source) title are both cut off if they are too long (!!; e.g. check out this search).

Since the "save to [reference manager]" links are available on the page (e.g. this link to BibTeX information; see these instructions on setting a fake cookie), one could in principle go and visit them all, but … this is where we run into trouble. Google Scholar's robots.txt file contains the line Disallow: /scholar, which according to the robot-exclusion protocol technically means that we're not allowed to use a script to visit links starting with http://scholar.google.ca/scholar.bib... as in the example above (a quick way to check the robots.txt entry for yourself is sketched in the postscript below). Google Scholar also blocks IP addresses that make too many rapid queries (this is mentioned on the GS help page, and on the aforementioned Python scraper page). It would be easy enough to circumvent that by pausing appropriately between retrievals, but I'm not comfortable writing general-purpose code to do so.

So: Google Scholar offers only a reduced amount of information on the pages it returns, and prohibits us from spidering them to retrieve the full bibliographic information. Argh.

As a side effect of all this, I did take a quick look for existing bibliographic-information-handling packages in R (with sos::findFn("bibliograph*")) and found a few things. So: there's a little more infrastructure out there, but nothing (it seems) that will do what I want without breaking or bending rules.

If anyone's feeling really bored, here are the features I'd like:
I wonder if it’s worth complaining to Google?
