Scraping with Selenium
If you’ve ever…
felt like you’re playing Simon Says with mouse clicks when repeatedly extracting data in chunks from a front-end interface to a database on the web, well, you probably are. There’s probably a better solution – Selenium.
used XML or httr in R or urllib2 in Python, you’ve probably encountered the situation where the source code you’ve scraped from a website doesn’t contain all the information you see in your browser. Selenium can probably help.
How it works
Selenium is a web automation tool. While not developed specifically for web scraping, Selenium does it pretty dang well. Selenium literally “drives” your browser, so it can see anything you see when you right-click and Inspect Element in Chrome or Firefox. This vastly widens the universe of content that can be extracted through automation, but it can be slow since all content must be rendered in the browser.
There are headless browsers (invisible browsers with no GUI) such as phantomjs that speed some of this up. That said, I’ve found that Selenium works best for targeted extraction where the user knows exactly what they want.
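For instance, here’s a minimal sketch of pointing the RSelenium package (used later in this post) at phantomjs instead of Firefox. It assumes the phantomjs binary is installed and reachable by a Selenium Server already running on port 4444:

library('RSelenium')

remDr <- remoteDriver(browserName="phantomjs", port=4444)  # same API as Firefox, but no visible browser window
remDr$open(silent=T)
remDr$navigate("https://www.fidelity.com")                 # the page is still fully rendered, just off-screen
print(remDr$getTitle())
remDr$close()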
Example
I set out to collect tickers for all mutual funds in the asset allocation fund type. Fidelity provides a list
of all these funds here.
1,586 funds as of today in 80 conveniently paginated URLs. Each URL ends in &pgNo=5
to indicate you want page 5 (or whatever number between 1 and 80).
In my browser, when I hover my mouse over one of the fund names in the table, I see the 5 character ticker I’m looking for. I also see the tickers directly on the webpage when I click the link to each fund. Here for example, where it says PSLDX in the top left. However, if possible I’d like to scrape the tickers from the table rather than the individual fund pages. This would mean 80 pages to scrape rather than 1,586.
Take 1: traditional http request
When possible, it makes sense to use the simple traditional methods. So I first tried to extract these tickers with the popular httr
R package
by making standard http requests.
library('httr')

url <- 'https://www.fidelity.com/fund-screener/evaluator.shtml#!&ft=BAL_all&ntf=N&expand=%24FundType&rsk=5'
page <- GET(url)
Success…
print(http_status(page))

## $category
## [1] "success"
##
## $message
## [1] "success: (200) OK"
But did our http request return the information we want?
page_text <- content(page, as='text')
Nope – can’t find the tickers (one of them anyway).
grepl('GMMAX', page_text, ignore.case=T)

## [1] FALSE
Nope – can’t even find the fund name that I see in the table from the webpage in my browser.
grepl('Aberdeen', page_text, ignore.case=T)

## [1] FALSE
It appears that the content of interest is being generated dynamically with Javascript and Ajax on this webpage. So the raw HTML of this page doesn’t help us much.
My plan B was to grab the url for each fund from the table, navigate to that fund’s page, and extract the ticker from there.
However, these links weren’t in our http response. I did notice that the URLs for each fund followed a simple, consistent structure.
https://fundresearch.fidelity.com/mutual-funds/summary/72201F433 for example.
I thought maybe I could find 72201F433, which looks like some sort of fund ID, in a list of all fund IDs somewhere in the http response.
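A quick check in the same spirit as the grepl calls above tells us whether that ID turns up anywhere in the raw response:

grepl('72201F433', page_text, ignore.case=T)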
No dice. Plan C – Selenium.
Take 2: Selenium
I used the RSelenium R package for this mini project. There are also Selenium bindings for Python, Java, C#, Javascript and Ruby which make replicating this process in your programming language of choice relatively straightforward.
Step 1: Fire up Selenium
library('RSelenium')

checkForServer()  # search for and download Selenium Server java binary. Only need to run once.
startServer()     # run Selenium Server binary
remDr <- remoteDriver(browserName="firefox", port=4444)  # instantiate remote driver to connect to Selenium Server
remDr$open(silent=T)  # open web browser
Step 2: Start scraping
To figure out which DOM elements I wanted Selenium to extract, I used the Chrome Developer Tools, which can be invoked by right-clicking a fund in the table and selecting Inspect Element. The HTML displayed here contains exactly what we want: the content we didn’t see with our http request.
Since I want to grab all the funds at once, I tell Selenium to select the whole table. Going a few levels up from the individual cell in the table I’ve selected, I see that <tbody id="tbody">
is the HTML tag that contains the entire table, so I tell Selenium to find this element. I use the nifty highlightElement
function to confirm graphically in the browser that this is what I think it is.
Then it’s business as usual. I parse the string output from Selenium into an HTML tree and use XPath to parse the table for just the fund name and ticker.
library('XML')

master <- c()
n <- 5  # number of pages to scrape. 80 pages in total. I just scraped 5 pages for this example.

for(i in 1:n) {
  # create URL for each page to scrape
  site <- paste0("https://www.fidelity.com/fund-screener/evaluator.shtml#!&ntf=N&ft=BAL_all&msrV=advanced&sortBy=FUND_MST_MSTAR_CTGY_NM&pgNo=", i)
  remDr$navigate(site)                                   # navigate to webpage
  elem <- remDr$findElement(using="id", value="tbody")   # get big table in text string
  elem$highlightElement()                                # just for interactive use in browser. not necessary.
  elemtxt <- elem$getElementAttribute("outerHTML")[[1]]  # gets us the HTML
  elemxml <- htmlTreeParse(elemtxt, useInternalNodes=T)  # parse string into HTML tree to allow for querying with XPath
  fundList <- unlist(xpathApply(elemxml, '//input[@title]', xmlGetAttr, 'title'))  # parse out just the fund name and ticker using XPath
  master <- c(master, fundList)                          # append fund lists from each page together
}

head(master)

## [1] "Fidelity® Global Balanced Fund (FGBLX)"
## [2] "Fidelity® Global Strategies Fund (FDYSX)"
## [3] "Fidelity Freedom® 2055 Fund (FDEEX)"
## [4] "Aberdeen Dynamic Allocation Fund Class A (GMMAX)"
## [5] "Aberdeen Dynamic Allocation Fund Class C (GMMCX)"
## [6] "AllianceBernstein Real Asset Strategy Advisor Class (AMTYX)"
Step 3: Extract ticker
Nothing fancy here – just separating the ticker from the fund name.
master2 <- data.frame(sapply(master, function(x) substr(x, nchar(x)-5, nchar(x)-1)))
master2$name <- sapply(master, function(x) substr(x, 0, nchar(x)-8))
names(master2) <- c('ticker', 'name')

head(master2)

##   ticker                                                name
## 1  FGBLX                      Fidelity® Global Balanced Fund
## 2  FDYSX                    Fidelity® Global Strategies Fund
## 3  FDEEX                         Fidelity Freedom® 2055 Fund
## 4  GMMAX            Aberdeen Dynamic Allocation Fund Class A
## 5  GMMCX            Aberdeen Dynamic Allocation Fund Class C
## 6  AMTYX AllianceBernstein Real Asset Strategy Advisor Class
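As an aside, if the fixed-width substr feels fragile, a regex that pulls out whatever sits inside the trailing parentheses is a slightly more forgiving alternative. This is just a sketch of the same idea, not the code I ran:

# Alternative: extract the ticker from the trailing parentheses rather than
# relying on it always being exactly 5 characters long
master2 <- data.frame(ticker = sub('.*\\(([^)]+)\\)\\s*$', '\\1', master),
                      name   = sub('\\s*\\([^)]+\\)\\s*$', '', master),
                      stringsAsFactors = FALSE)
head(master2)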
What else can Selenium do?
My little example makes use of the simple functionality Selenium provides for web scraping: rendering HTML that is dynamically generated with Javascript or Ajax. Since Selenium is actually a web automation tool, one can be much more sophisticated with it, using it to automate a human navigating a webpage with mouse clicks and filling out and submitting forms. This can be a huge time saver for researchers who rely on front-end web interfaces to extract data in chunks.
Here’s a basic example using Python.
Here’s a basic example using R.
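And just to sketch what that kind of automation looks like with RSelenium (the URL and element locators below are made-up placeholders, not part of the Fidelity example above):

# Hypothetical page and form fields, purely for illustration
remDr$navigate("https://example.com/search")                         # go to a page with a form
searchBox <- remDr$findElement(using="name", value="q")              # locate the search field
searchBox$sendKeysToElement(list("asset allocation funds"))          # type into it
searchBox$sendKeysToElement(list(key="enter"))                       # submit the form, like hitting Enter
nextLink <- remDr$findElement(using="css selector", value="a.next")  # find a 'next page' link
nextLink$clickElement()                                              # click it, just as a user would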
Getting set up
On the several computers I use, I’ve found the setup process to range from seamless to frustrating.
The most frustrating issue I encountered while setting up on my Mac was this error message:
> remDr$open()
[1] "Connecting to remote server"
Error:  Summary: UnknownError
        Detail: An unknown server-side error occurred while processing the command.
        class: java.lang.IllegalStateException
I was able to resolve it by killing all processes running on port 4444 and trying again.
At the terminal:
lsof -i :4444
Kill PIDs of any processes listed. For example:
kill 30681