Google Scholar (still) sucks
(This is a follow-up to my previous post on the topic.)
I was encouraged by the appearance of two R-based Scholar-scrapers within a week of each other. One, by Kay Cichini, converts the result pages to plain text and scrapes from there (there's a slightly hacked version by Tony Breyal on github). The other, by Tony Breyal (github version here), uses XPath.
I started poking around with these functions — they each do some things I like and have some limitations.
- Cichini’s version:
- is based on plain old text-scraping, which is easy for me to understand.
- has a nice loop for fetching multiple pages of results automatically.
- has (to me) a silly output format — her code automatically generates a word cloud, and can dump a csv file to disk if requested. It would be easy, and would make more sense, to break this up into separate functions: a scraper that returns a data frame and a wordcloud creator that accepts a data frame as input (see the sketch after this list) …
- Breyal’s version:
- is based on XPath, which seems more magical to me but is probably more robust in the long run.
- extracts citation counts.
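For concreteness, here's a minimal sketch of the two-function split suggested above. Everything in it is hypothetical: the function names are mine, and the gs_rt/gs_rs class names in the XPath expressions are just whatever Scholar's result pages happened to use when I looked.

```r
library(XML)        # htmlParse(), xpathSApply()
library(wordcloud)  # wordcloud()

## scraper: fetch one page of results, return a plain data frame
## (assumes one title and one snippet per hit)
scrape_scholar <- function(url) {
  doc <- htmlParse(url)
  data.frame(
    title    = xpathSApply(doc, "//h3[@class='gs_rt']", xmlValue),
    abstract = xpathSApply(doc, "//div[@class='gs_rs']", xmlValue),
    stringsAsFactors = FALSE)
}

## wordcloud creator: takes the scraper's data frame as input
scholar_cloud <- function(d) {
  words <- table(unlist(strsplit(tolower(d$abstract), "[^a-z]+")))
  wordcloud(names(words), as.numeric(words))
}
```

With that split, the data frame is reusable for anything else (CSV dump, citation analysis) rather than being consumed by the word cloud.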
There's a catch, though: Google Scholar's robots.txt file contains the line

Disallow: /scholar

which, according to the definition of the robots-exclusion protocol, technically means that we're not allowed to use a script to visit links starting with http://scholar.google.ca/scholar.bib... as in the example above. Google Scholar does block IP addresses that do too many rapid queries (this is mentioned on the GS help page, and on the aforementioned Python scraper page). It would be easy to circumvent this by pausing appropriately between retrievals, but I'm not comfortable with writing general-purpose code to do that. So: Google Scholar offers a reduced amount of information on the pages it returns, and prohibits us from spidering the links that would give us the full bibliographic information. Argh.
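Checking the rule itself from R is easy enough. This is a crude sketch, not a real robots-exclusion parser (it ignores User-agent sections and Allow rules), using the scholar.google.ca URL from above:

```r
## Fetch robots.txt and see whether a given path is disallowed
robots <- readLines("http://scholar.google.ca/robots.txt")
rules  <- sub("^Disallow: *", "", grep("^Disallow:", robots, value = TRUE))
rules  <- rules[nzchar(rules)]   # an empty Disallow means "allow everything"
path   <- "/scholar.bib"
any(sapply(rules, function(r) substr(path, 1, nchar(r)) == r))
## should be TRUE given the Disallow: /scholar rule above
```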
) and found: CITAN
: a Scopus-centric package that uses a SQLite backend and does heavy-duty bibliometric analysis (h-indices, etc.)RISmed
is Pubmed-centric and defines aReference
class (seems sensible but geared pretty narrowly towards article-type references). It imports RIS format (a common tagged format used by ISI and others)ris
: a similar (?) package without the PubMed interfacebibtex
: parses BibTeX filesRMendeley
from the ROpenSci project
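Of these, bibtex is the easiest to show in a couple of lines. A quick sketch, assuming a local file called refs.bib (the file name is made up):

```r
library(bibtex)               # install.packages("bibtex")
refs <- read.bib("refs.bib")  # parse the file into a bibentry vector
length(refs)                  # number of entries parsed
unlist(refs$title)            # pull one field across all entries
```

None of these solves the underlying data-source problem, though: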
- ISI is big and evil and explicitly disallows scripted access.
- PubMed doesn’t cover ecology as well as I’d like.
- I might be able to use Scopus but would prefer something Open (this is precisely why GS’s cripplage annoys me so much).
- Mendeley is nice, and perhaps has most of what I really want, but ideally I would prefer something with systematic coverage [my understanding is that the Mendeley databases would have everything that everyone has bothered to include in their personal databases …]
- I wonder if JSTOR would like to play … ?
What I'd want from such a tool:

- scrape or otherwise save information to a variety of useful fields (author, date, source title, title, keywords, abstract?)
- save/identify various types (e.g. article/book chapter etc.)
- allow dump to CSV file (see the sketch below)
- citation information would be cool (e.g. to generate co-citation graphs) but might get big
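As a sketch of the output format I have in mind, one row per reference; every value below is a dummy placeholder:

```r
## Hypothetical output format for a scraper, with invented values
refs <- data.frame(
  author   = c("A. Author; B. Coauthor", "C. Author"),
  date     = c("2010", "2011"),
  source   = c("Journal of Examples", "Imaginary Press"),
  title    = c("A first example title", "A second example title"),
  type     = c("article", "book chapter"),
  keywords = c("ecology; theory", "statistics"),
  abstract = c(NA, NA),
  stringsAsFactors = FALSE)

write.csv(refs, "scholar_refs.csv", row.names = FALSE)  # the CSV dump
```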
I wonder if it’s worth complaining to Google?