If I Had a Text File, I’d Hack Regexes in the Morning

[This article was first published on "R-bloggers" via Tal Galili in Google Reader, and kindly contributed to R-bloggers.]

Yesterday the topic of academic citation counts came up, so I decided to write some tools for exploring them. The first thing I did was build a cheap screen scraper in Ruby that pulls citation counts from Google Scholar. You'll see the ugly hack I produced below.

module CitationTools
  require 'rubygems'
  require 'open-uri'
 
  def get_ten_most_cited_works_for_author(author_name)
    # First, let's clean up the author's name before using it in a URL.
    escaped_author_name = author_name.gsub(/\s+/, '+')
 
    # Let's create a variable we'll place the Google Scholar HTML in.
    page_content = nil
 
    # Let's figure out the right URL for Google Scholar.
    url = "http://scholar.google.com/scholar?q=#{escaped_author_name}"
 
    # Let's access that URL using open-uri and get the HTML from the page.
    # (URI.open is the open-uri entry point on current Ruby versions.)
    URI.open(url) do |page|
      page_content = page.read
    end
 
    # Let's scan the HTML for the names of this author's works.
    # scan returns one-element arrays per match, so flatten to plain strings.
    work_titles = page_content.scan(/<p class=g>.*?>([^<]+)(?:<\/a><\/span>)?(?:(?:< size=-1>)|(?:\s+-\s+<span class=a>)|(?:\s+-\s+<a class=fl))/).flatten
 
    # Let's scan the HTML for the citation counts for each work.
    cite_counts = page_content.scan(/Cited by (\d+)/).flatten
 
    # Let's set aside an array of hashes to store all of this data.
    works = []
 
    # As long as we have the same number of titles and counts, we're good.
    if work_titles.size == cite_counts.size
      work_titles.each_with_index do |title, index|
        works << {:title => title, :citation_count => cite_counts[index]}
      end
      return works
    else
      # Report on STDERR so the message never ends up in redirected output.
      warn "Failed to process HTML for #{author_name}"
      return nil
    end
 
  end
end
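
Two steps in the method above are easy to try without touching the network: escaping the author name and pairing titles with counts. The sketch below is only an illustration under stated assumptions. The helper names (`build_scholar_url`, `extract_titles`, `extract_cite_counts`, `pair_titles_with_counts`) and the `SAMPLE_HTML` fragment are made up for this example, and the sample markup is not real Google Scholar HTML. It uses `URI.encode_www_form` from the standard library in place of the hand-rolled `gsub`, and it shows why `scan` with a capture group needs a `flatten`.

```ruby
require 'uri'

# Hypothetical fragment standing in for a results page. The markup is
# invented for illustration and is not real Google Scholar HTML.
SAMPLE_HTML = <<~HTML
  <p class=g><a href="x">First Paper</a> - <span class=a>A. Author</span> Cited by 120
  <p class=g><a href="y">Second Paper</a> - <span class=a>A. Author</span> Cited by 45
HTML

# Build the query string with the standard library so that characters
# other than spaces (ampersands, accents) are escaped as well.
def build_scholar_url(author_name)
  "http://scholar.google.com/scholar?" + URI.encode_www_form('q' => author_name)
end

# With one capture group, scan returns [["120"], ["45"]]; flatten
# reduces that to ["120", "45"] before converting to integers.
def extract_cite_counts(html)
  html.scan(/Cited by (\d+)/).flatten.map { |count| count.to_i }
end

# Simplified title pattern that only targets the sample markup above.
def extract_titles(html)
  html.scan(/<p class=g><a [^>]*>([^<]+)<\/a>/).flatten
end

# Pair each title with its count, or return nil when the lists disagree.
def pair_titles_with_counts(titles, counts)
  return nil unless titles.size == counts.size
  titles.zip(counts).map do |title, count|
    { :title => title, :citation_count => count }
  end
end

puts build_scholar_url('Donald Knuth')
works = pair_titles_with_counts(extract_titles(SAMPLE_HTML),
                                extract_cite_counts(SAMPLE_HTML))
works.each { |work| puts "#{work[:title]}: #{work[:citation_count]}" }
```

Returning nil on a size mismatch mirrors the check in the method above: if the two regexes disagree about how many results are on the page, pairing by position would silently attach counts to the wrong titles.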

With that in hand, I wrote a simple wrapper that reads a list of authors from a file called authors.txt and pulls their information from Google Scholar. The wrapper then prints CSV to STDOUT, which can be redirected to a file for later analysis.

# Let's include a mix-in with some methods for parsing Google Scholar data.
# (require_relative finds CitationTools.rb next to this script.)
require_relative 'CitationTools'
include CitationTools
 
# Let's pick a haphazard sample of authors.
authors = File.readlines('authors.txt').map { |line| line.chomp }
 
# Let's add a header line to our output.
puts '"Author","Work","Citations"'
 
# And then let's iterate over those authors.
authors.each do |author_name|
  cited_work_data = get_ten_most_cited_works_for_author(author_name)
 
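  if cited_work_data.nil?
    # Report on STDERR so the CSV on STDOUT stays clean, then move on
    # to the next author instead of calling each on nil.
    warn "Skipping #{author_name}"
    next
  end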
 
  cited_work_data.each do |cited_work|
    puts "\"#{author_name}\",\"#{cited_work[:title]}\",#{cited_work[:citation_count]}"
  end
end
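
One thing the wrapper's output line does not handle is a title that itself contains a double quote, which would break the CSV row. A minimal sketch of the usual fix follows, assuming the common CSV convention of doubling embedded quotes. The helpers `csv_field` and `csv_row` are hypothetical names for this example, the sample row is made-up data, and Ruby's standard `csv` library is the more complete option for real use.

```ruby
# Quote one CSV field: wrap it in double quotes and double any embedded quotes.
def csv_field(value)
  '"' + value.to_s.gsub('"', '""') + '"'
end

# Build one output row in the same column order as the header line.
def csv_row(author_name, title, citation_count)
  [csv_field(author_name), csv_field(title), citation_count.to_i].join(',')
end

# Made-up example data, not real citation counts.
puts '"Author","Work","Citations"'
puts csv_row('Jane Doe', 'On "Quoted" Titles, and Commas', 12)
```

Because every text field is quoted, commas inside a title need no special treatment; only the quote character has to be escaped.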

Then I coded up a simple barplot in R to give you a sense of the citation count for the first few authors that came to mind. The result is below.

Now I think the goal should be to put these tools to good use.
