
Google Insights and RCurl

[This article was first published on Dan Knoepfle's Blog, and kindly contributed to R-bloggers.]

Google Insights is nifty. If you’re logged in to your Google account, you can download the results as a CSV file. This is straightforward if you’re using a browser; if you’re trying to retrieve the results of queries using R, however, things get more complicated.

The following code retrieves the results of a Google Insights search for “Sarah Palin” as a data.frame. It uses the RCurl package to do all of the hard work.

username <- "username@gmail.com"
password <- "password_here"

loginURL <- "https://accounts.google.com/accounts/ServiceLogin"
authenticateURL <- "https://accounts.google.com/accounts/ServiceLoginAuth"

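library(RCurl)

## one curl handle shared by every request, so the login cookies persist
ch <- getCurlHandle()

curlSetOpt(curl = ch,
            ssl.verifypeer = FALSE,   ## note: this turns off SSL certificate verification
            useragent = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13",
            timeout = 60,
            followlocation = TRUE,
            cookiejar = "./cookies",  ## file that cookies are written to
            cookiefile = "./cookies") ## file that cookies are read from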


## do Google Account login
loginPage <- getURL(loginURL, curl = ch)

library(stringr)

## pull the hidden GALX token out of the login form; str_match() returns
## the full match in column 1 and the captured value in column 2
galx <- str_match(string = loginPage,
                  pattern = regex('name="GALX"\\s*value="([^"]+)"',
                                  ignore_case = TRUE))[, 2]

authenticatePage <- postForm(authenticateURL, .params = list(Email = username, Passwd = password, GALX = galx), curl = ch)


## get Google Insights results CSV
insightsURL <- "http://www.google.com/insights/search/overviewReport"
resultsText <- getForm(insightsURL, .params = list(q = "Sarah Palin", cmpt = "q", content = 1, export = 1), curl = ch)

if(isTRUE(unname(attr(resultsText, "Content-Type")[1] == "text/csv"))) {
  ## got CSV file

  ## create temporary connection from results
  tt <- textConnection(resultsText)

  resultsCSV <- read.csv(tt, header = FALSE)

  ## close connection
  close(tt)
} else {
  ## something went wrong

  ## probably need to log in again?
  resultsCSV <- NULL
  warning("Did not receive a CSV response; you may need to log in again.")
}
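
The token-extraction step in the script can be tried without logging in to anything. The following is a minimal sketch that uses base R (`regexec` and `regmatches`) instead of stringr, run against a made-up HTML fragment. The helper name `extract_hidden_field` and the sample markup are assumptions for illustration only; the real login page may be laid out differently.

```r
## Hypothetical helper (not part of the original script): return the value
## of a named form field from raw HTML, or NA if the field is not found.
## Assumes `field` contains no regex metacharacters.
extract_hidden_field <- function(html, field) {
  pattern <- sprintf('name="%s"\\s*value="([^"]+)"', field)
  ## regmatches() on a regexec() result gives the full match first,
  ## followed by each captured group
  m <- regmatches(html, regexec(pattern, html, ignore.case = TRUE))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

## made-up markup standing in for the login page
sample_html <- '<input type="hidden" name="GALX" value="abc123XYZ">'

galx_demo    <- extract_hidden_field(sample_html, "GALX")
missing_demo <- extract_hidden_field("<html></html>", "GALX")
```

Returning `NA` when the field is absent makes it easy to check that the token was actually found before posting the login form.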


I don’t have much else to say about this, but I hope that it will be helpful to someone.

You can restrict the query geographically, or refine it in other ways, by adding the parameters that appear in the URL when you change your search in the Google Insights web interface. For instance, a basic search for “QUERY” gives the URL http://www.google.com/insights/search/#q=QUERY&cmpt=q, whereas the same search restricted to the state of New York gives http://www.google.com/insights/search/#q=QUERY&geo=US-NY&cmpt=q; the added parameter is “geo=US-NY”. To incorporate this into the script, change

resultsText <- getForm(insightsURL, .params = list(q = "Sarah Palin", cmpt = "q", content = 1, export = 1), curl = ch)

to have the additional parameter in the .params list:

resultsText <- getForm(insightsURL, .params = list(q = "Sarah Palin", cmpt = "q", geo = "US-NY", content = 1, export = 1), curl = ch)
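
If you switch between restricted and unrestricted searches often, the parameter list could be built by a small helper instead of being edited by hand. This is only a sketch: `build_insights_params` is a hypothetical name, and it assumes the order of the query parameters does not matter to the server.

```r
## Hypothetical helper: build the .params list for getForm(), adding the
## geo restriction only when one is supplied.
build_insights_params <- function(query, geo = NULL) {
  params <- list(q = query, cmpt = "q", content = 1, export = 1)
  if (!is.null(geo)) {
    params$geo <- geo
  }
  params
}

all_params <- build_insights_params("Sarah Palin")
ny_params  <- build_insights_params("Sarah Palin", geo = "US-NY")

## With the curl handle from the script above, the call would then be:
## resultsText <- getForm(insightsURL, .params = ny_params, curl = ch)
```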

[Updated 2012-04-24]
