ngramr – an R package for Google Ngrams
The recent post How common are common words? made use of unusually explicit language for the Stubborn Mule. As expected, a number of email subscribers reported that the post fell foul of their email filters. Here I will return to the topic of n-grams, while keeping the language cleaner, and describe the R package I developed to generate n-gram charts.
Rather than an explicit language warning, this post carries a technical language warning: regular readers of the blog who are not familiar with the R statistical computing system may want to stop reading now!
The Google Ngram Viewer is a tool for tracking the frequency of words or phrases across the vast collection of scanned texts in Google Books. As an example, the chart below shows the frequency of the words “Marx” and “Freud”. It appears that Marx peaked in popularity in the late 1970s and has been in decline ever since. Freud persisted for a decade longer but has likewise been in decline.
The Ngram Viewer will display an n-gram chart, but it does not provide the underlying data for your own analysis. All is not lost, though: the chart is produced using JavaScript, so the n-gram data is buried in the code embedded in the source of the web page. It looks something like this:
// Add column headings, with escaping for JS strings.
data.addColumn('number', 'Year');
data.addColumn('number', 'Marx');
data.addColumn('number', 'Freud');
// Add graph data, without autoescaping.
data.addRows(
  [[1900, 2.0528437403299904e-06, 1.2246303970897543e-07],
   [1901, 1.9467918036752963e-06, 1.1974195999187031e-07],
   ...
   [2008, 1.1858645848406013e-05, 1.3913611155658145e-05]]
);
With the help of the RJSONIO package, it is easy enough to parse this data into an R dataframe. Roughly, the approach is to fetch the chart page, pull out the JSON array passed to data.addRows and convert it into a dataframe.
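A minimal sketch of that approach, assuming the page structure shown above (the function name ngram_scrape, the query parameters and the regular expression are illustrative, not the exact code behind the package):

library(RJSONIO)

# Illustrative sketch: fetch the Ngram Viewer page for the given phrases
# and scrape the JavaScript data block (the data.addRows(...) call above).
ngram_scrape <- function(phrases, year_start = 1900, year_end = 2008) {
  query <- URLencode(paste(phrases, collapse = ","), reserved = TRUE)
  url <- paste0("https://books.google.com/ngrams/graph?content=", query,
                "&year_start=", year_start, "&year_end=", year_end)
  page <- paste(readLines(url, warn = FALSE), collapse = " ")
  # Extract the JSON array passed to data.addRows( ... )
  rows_json <- sub(".*addRows\\(\\s*(\\[\\[.*?\\]\\])\\s*\\).*", "\\1",
                   page, perl = TRUE)
  # Parse the JSON and reshape: one row per year, one column per phrase
  rows <- lapply(fromJSON(rows_json, asText = TRUE, simplify = FALSE), unlist)
  df <- as.data.frame(do.call(rbind, rows))
  names(df) <- c("Year", phrases)
  df
}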
I realise that is not particularly beautiful, so to make life easier I have bundled everything up neatly into an R package which I have called ngramr, hosted on GitHub.
The core functions are ngram, which queries the Ngram Viewer and returns a dataframe of frequencies; ngrami, which does the same thing in a case-insensitive manner (by which I mean that, for example, the results for “mouse”, “Mouse” and “MOUSE” are all combined); and ggram, which retrieves the data and plots the results using ggplot2. All of these functions allow you to specify various options, including the date range and the language corpus (Google can provide results for US English, British English or a number of other languages, including German and Chinese).
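For example, something along these lines (the corpus code "eng_us_2012" and the argument names shown are indicative; see the package help for the definitive interface):

library(ngramr)

# Frequencies of "Marx" and "Freud" since 1900, as a dataframe
freq <- ngram(c("Marx", "Freud"), year_start = 1900)
head(freq)

# Case-insensitive query: combines "mouse", "Mouse", "MOUSE", ...
freq_ci <- ngrami("mouse", year_start = 1950)

# Retrieve and plot in one step, using the US English corpus
ggram(c("Marx", "Freud"), corpus = "eng_us_2012", year_start = 1900)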
The package is easy to install from GitHub and I may also post it on CRAN.
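Something like the following should do it, assuming you have the devtools package and that the repository is seancarmody/ngramr (check the GitHub page for the canonical instructions):

# Install straight from GitHub (repository name assumed to be seancarmody/ngramr)
install.packages("devtools")
devtools::install_github("seancarmody/ngramr")
library(ngramr)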
I would be very interested in feedback from anyone who tries out this package and will happily consider implementing any suggested enhancements.