Web Scraping Google Sheets with RSelenium
I love to learn new things, and one of the ways I learn best is by doing. It's also been said that you never fully understand a topic until you're able to explain it, and I think blogging is a low-barrier way of doing exactly that.
Someone I met at a local data science meetup in Montréal wanted help web scraping team standings from Puzzled Pint. I jumped at the chance because I knew this would be my opportunity to finally learn RSelenium!
Static Scraping vs. Dynamic Scraping
Static scraping ignores JavaScript. It fetches web pages from the server without the help of a browser. You get exactly what you see in "view page source", and then you slice and dice it. If the content you're looking for is available there, you need go no further. However, if the content lives inside something like an `iframe`, you need dynamic scraping.
Dynamic scraping uses an actual browser (or a headless browser) and lets JavaScript do its thing. Then it queries the DOM to extract the content it's looking for. Sometimes you need to automate the browser by simulating a user to get the content you need; for example, to collect the same details from later pages of a site, you might first have to navigate to the next page by clicking a Next button at the bottom of the results.
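To make the contrast concrete, here is a minimal static-scraping sketch using rvest (an illustration, not code from the original workflow): it fetches the standings page and shows that only the `iframe` tags are visible in the raw HTML, not the tables they load.

library(rvest)

# static scraping: fetch the raw HTML the server returns (no JavaScript, no browser)
page <- read_html("http://www.puzzledpint.com/standings/")

# the standings tables themselves are not in the source; only the iframe tags that load them
iframes <- html_nodes(page, "iframe")
html_attr(iframes, "src")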
Setting up RSelenium with Docker
RSelenium provides R bindings for the Selenium Webdriver API. Selenium is a project focused on automating web browsers.
You need to follow the installation instructions for Docker Toolbox on Windows or Ubuntu.
Docker is used to run applications by using containers. Containers are simply a bundle of libraries and other dependencies in one package. You can think of it like a virtual machine, but rather than creating a whole OS it allows applications to use the same Linux kernel with only the things not already running on the host computer. Basically, it gives a significant performance boost and reduces the size of the application. Moreover, you can rest assured that the application will run on any other Linux machine regardless of any customized settings that machine might have that could differ from the machine used for writing and testing the code.
You’ll also need to install TightVNC which will allow you to see how you’re manipulating the web page in real-time with RSelenium.
Next, follow the instructions to create a Docker container running a Selenium server with its own Firefox.
Note: Once you've set up the Docker container (and every time you restart your computer or otherwise start fresh), open the Docker Quickstart Terminal and run the following command.
docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.0
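If you want to confirm the container is actually up before moving on (an optional check, not part of the original write-up), list the running containers:

docker ps
# you should see the selenium/standalone-firefox:2.53.0 image listed, with port 4445 mapped to 4444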
Now that you’ve booted your Docker Quickstart Terminal go into R and connect to a running server.
library(RSelenium)

remDr <- remoteDriver(
  remoteServerAddr = "192.168.99.100",
  port = 4445L
)
remDr$open()
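As a quick sanity check (optional, not in the original post), you can ask the Selenium server for its status; if the connection details above are right, it returns information about the server rather than an error:

# should return build/OS details for the Selenium server if the connection succeeded
remDr$getStatus()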
Navigate to the page using Rselenium.
# navigate to the website of interest
remDr$navigate("http://www.puzzledpint.com/standings/")

# confirm you got there
remDr$getTitle()
Okay let’s get a live screen shot of the site in the Viewer tab of RStudio.
remDr$screenshot(display = TRUE)
Keep in mind that this is just a static screen shot. You're going to want to use TightVNC to get a live view of your interactions while you're developing your pipeline, so you can see how you're interacting with the website.
It's important to be watching TightVNC as you use …$highlightElement() in between your …$findElement() and …$switchToFrame()/…$clickElement() commands, so that you actually know you're selecting the appropriate things!
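As a rough illustration of that pattern (a sketch rather than code from the original post; the CSS selector here is just a placeholder):

# find an element, flash it in the TightVNC view, then interact with it
elem <- remDr$findElement(using = "css", "iframe")  # placeholder selector
elem$highlightElement()   # the element blinks in the live browser so you can confirm it's the right one
elem$clickElement()       # ...or remDr$switchToFrame(elem) if the element is a frame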
Open the TightVNC Viewer and enter the address of the Docker machine, in this case 192.168.99.100 (VNC port 5901), in the Remote Host: field. Click Connect and, for the password, enter the word: secret.
Note: If TightVNC ever stops working (on Windows 10 it did often) and gives you the Error: “NO CONNECTION COULD BE MADE BECAUSE THE TARGET MACHINE ACTIVELY REFUSED IT” then follow the steps for “Debugging Using VNC” here.
Accessing Elements in the DOM
Web pages are a set of nested objects, together known as the Document Object Model, or DOM for short. The DOM is a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML and XML documents. Interacting with the DOM will be very important for us with RSelenium.
Hadley Wickham recommends using SelectorGadget, a Chrome extension, to help identify the web page elements you need, and he recommends this page for learning more about selectors.
For example, by using SelectorGadget you can identify the table you're interested in. In this case it says it's an iframe. To isolate just the Montréal standings, we'll click another box to select only the one of interest: iframe:nth-child(68).
In the context of a web browser, a frame is a part of a web page or browser window which displays content independently of its container, with the ability to load that content independently. In this case the website is pulling in data from another source to display these tables interactively, apart from the main standings page. Luckily for me they are all from Google Sheets, which makes things much easier. Unfortunately, you can't find the links for these sheets with SelectorGadget. You will need to take a closer look at the source code using the developer tool called `Inspector` in either Chrome or Firefox. On Windows with Firefox, click Open Menu, then Web Developer > Inspector, or just Ctrl+Shift+C. Then I used the search box to look for the link (src=) for Montréal.
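Another option, sketched below (this is my illustration, not the approach taken in the original post), is to let RSelenium list the src attribute of every iframe on the standings page and then search those for the Google Sheets links:

# collect the src attribute of every iframe on the standings page
remDr$navigate("http://www.puzzledpint.com/standings/")
iframeElems <- remDr$findElements(using = "css", "iframe")
iframe_srcs <- unlist(lapply(iframeElems, function(el) el$getElementAttribute("src")))

# keep only the Google Sheets links
grep("docs.google.com", iframe_srcs, value = TRUE)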
For me it was a big pain to actually find what I was looking for, as sometimes the highlighted element looks like what you want but isn't. For example:
In the end, I guess figuring out the HTML involves gradual "Denial and Error" attempts.
Web Scraping Google Sheets with RSelenium
Legal Disclaimer
It’s worth mentioning that administrators may want to protect certain parts of their website for a number of reasons, such as “indexing of an unannounced site, traversal of parts of the site which require vast resources of the server, recursive traversal of an infinite URL space, etc.”
Therefore, one should always check whether they have permission. One way to do this is to use the robotstxt package to check if your web-bot has permission to access certain parts of a web page.
# check permissions
library(robotstxt)
paths_allowed("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")
If it returns TRUE for the specific page, you have permission. Alternatively, just look at the robots.txt file at the root URL of the main site to get a broader sense of what is (and isn't) allowed.
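If you'd rather inspect the whole file from R (an optional extra, not in the original post), the same package can fetch it for you:

# fetch and print the entire robots.txt for the domain
robotstxt::get_robotstxt("docs.google.com")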
Method # 1
Sometimes websites are composed using frames. These are in effect separate web pages brought together in a frameset. We will need to jump back and forth between these frames.
library(RSelenium)
library(XML)
library(janitor)
library(lubridate)
library(magrittr)
library(dplyr)

remDr <- remoteDriver(
  remoteServerAddr = "192.168.99.100",
  port = 4445L
)
remDr$open()

# Now open TightVNC to follow along with Selenium driving the browser.
# Set 192.168.99.100:5901 and password: secret

# navigate to the main page
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")
# https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=552987877

# look for table element
tableElem <- remDr$findElement(using = "id", "pageswitcher-content")

# switch to table
remDr$switchToFrame(tableElem)

# parse html
doc <- htmlParse(remDr$getPageSource()[[1]])
table_tmp <- readHTMLTable(doc)
table_tmp <- table_tmp[[1]][-2, -1]
table_tmp <- table_tmp[-1, ]
colnames(table_tmp) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")
table_tmp$city <- rep("montreal", nrow(table_tmp))
table_tmp$date <- rep(Sys.Date() - 5, nrow(table_tmp))
Now that we have the first month, we can create a for loop for the rest of the dates. First, let's switch back to the outer frame and select the elements we will be manipulating.
# switch back to the main/outer frame
remDr$switchToFrame(NULL)

# find the elements you'll manipulate with Inspector mode in a browser
webElems <- remDr$findElements(using = "css", ".switcherItem")       # Month/Year tabs at the bottom
arrowElems <- remDr$findElements(using = "css", ".switcherArrows")   # Arrows to scroll left and right at the bottom
tableElem <- remDr$findElement(using = "id", "pageswitcher-content") # The inner table frame
So I know there are going to be many tables, but just how many? We can check this via length(webElems).
There are actually 49 tables in total, but since we started on the first one above there are only 48 links. Rather than hard-coding 1:48, it's better to do it via code, as more tables will be added in the future.
# Create NULL object to be used in the for loop
big_df <- NULL

for (i in seq(length(webElems))) { # for every Month/Year tab
  # choose the i'th Month/Year tab
  webElem <- webElems[[i]]
  webElem$clickElement()

  # Find the tableElem again, otherwise you get a StaleElementReference
  ## TO DO: look into WebDriverWait: https://stackoverflow.com/questions/5709204/random-element-is-no-longer-attached-to-the-dom-staleelementreferenceexception
  tableElem <- remDr$findElement(using = "id", "pageswitcher-content") # The inner table frame

  # switch to table frame
  remDr$switchToFrame(tableElem)
  Sys.sleep(3)

  # parse html with XML package
  doc <- htmlParse(remDr$getPageSource()[[1]])
  Sys.sleep(3)

  # Extract data from the HTML table in the HTML document
  # (use a new name so we don't overwrite table_tmp, the first month scraped above)
  table_list <- readHTMLTable(doc)
  Sys.sleep(3)

  # put this into a format you can use
  table <- table_list[[1]][-2, -1]
  table <- table[-1, ]

  # rename the columns
  colnames(table) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")

  # add city name to a column
  table$city <- rep("Montreal", nrow(table))

  # add the Month/Year this table was extracted from
  today <- Sys.Date() %m-% months(i + 1)
  table$date <- today

  # concatenate each table together
  big_df <- dplyr::bind_rows(big_df, table)

  # Switch back to the main frame
  remDr$switchToFrame(NULL)

  # once you "click" the arrow element it is "held down"; as far as I know there is no way to "unclick",
  # so to prevent it from scrolling too far, keep the sleep short
  arrowElem <- arrowElems[[1]]
  arrowElem$clickElement()
  # give it "just enough time" to scroll right
  Sys.sleep(0.3)

  # switch back to the outer frame to re-start the loop
  remDr$switchToFrame(NULL)
}

temp1 <- dplyr::bind_rows(table_tmp, big_df)
The problem here is that the for loop eventually fails at the end when it tries to click the right arrow but it’s as far to the right as it can go — therefore it won’t download the last few tables (~5). Typically one would handle such conditions with something like:
check <- try(expression, silent = TRUE) # or suppressMessages(try(expression, silent = TRUE))

if (any(class(check) == "try-error")) {
  # do stuff
}
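An equivalent pattern with tryCatch(), shown purely as an illustration (it is not part of the original workflow):

result <- tryCatch({
  arrowElem$clickElement()  # the step that might fail
  TRUE
}, error = function(e) FALSE)

if (!result) {
  # do stuff, e.g. break out of the loop
}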
These patterns usually work fine, including with Selenium. The issue encountered here, however, was that clicking the arrow once would always bring me to the last visible sheets, skipping everything in the middle. Therefore, my workaround to get the remaining sheets was this:
# copy (Ctrl+C) one sheet at a time from the Google Sheet, then use read.delim() to read the clipboard into an object
march    <- read.delim("clipboard")
february <- read.delim("clipboard")
january  <- read.delim("clipboard")
december <- read.delim("clipboard")
november <- read.delim("clipboard")

# add the city and date
january$city <- rep("montreal", nrow(january))
january$date <- rep("2015-01-30", nrow(january))
february$city <- rep("montreal", nrow(february))
february$date <- rep("2015-02-15", nrow(february))
march$city <- rep("montreal", nrow(march))
march$date <- rep("2015-03-15", nrow(march))
december$city <- rep("montreal", nrow(december))
december$date <- rep("2014-12-15", nrow(december))
november$city <- rep("montreal", nrow(november))
november$date <- rep("2014-11-15", nrow(november))

# clean up the column names
january  %<>% janitor::clean_names()
february %<>% janitor::clean_names()
march    %<>% janitor::clean_names()
december %<>% janitor::clean_names()
november %<>% janitor::clean_names()

# concatenate these five dataframes together
xyz <- bind_rows(march, february, january, december, november)

# add NA for team size (these sheets don't have it)
xyz$team_size <- rep(NA, nrow(xyz))

# reorder the columns
xyz %<>% dplyr::select(team_name, team_size, start_time, end_time, total_time, puzzles, city, date)

# rename this column to match temp1
xyz <- rename(xyz, puzzels_solved = puzzles)

# change to a character
xyz$puzzels_solved <- as.character(xyz$puzzels_solved)

# convert characters into actual dates
xyz$date <- as.Date(xyz$date)

# concatenate this onto the larger dataframe
temp2 <- bind_rows(temp1, xyz)

# save the object
readr::write_csv(temp2, "puzzeld_pint_raw.csv")
Method # 2
Big thanks to Nate on SO for pointing out an alternative solution that solves the task of scraping the tables, but not the task of exception handling in the above sense.
# table yielding function
# just for readability in the loop
create_table <- function(remDr) {
  # parse html with XML package
  doc <- XML::htmlParse(remDr$getPageSource()[[1]])
  Sys.sleep(3)

  # Extract data from HTML table in HTML document
  table_tmp <- XML::readHTMLTable(doc)
  Sys.sleep(3)

  # put this into a format you can use
  table <- table_tmp[[1]][-2, -1]

  # add a check-up for size mismatch
  table_fields <- as.character(t(table[1, ]))
  if (!any(grepl("size", tolower(table_fields)))) {
    table <- table[-1, ]
    # rename the columns
    colnames(table) <- c("team_name", "start_time", "end_time", "total_time", "puzzels_solved")
    table$team_size <- NA_integer_
    table <- table[, c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")]
  } else {
    table <- table[-1, ]
    # rename the columns
    colnames(table) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")
  }

  # add city name to a column
  table$city <- rep("Montreal", nrow(table))

  # add the Month/Year this table was extracted from
  today <- Sys.Date()
  lubridate::month(today) <- lubridate::month(today) + 1
  table$date <- today

  # returns the table
  table
}

# II. Scraping the content
#
# 1. use selenium to generate the pages
# 2. use create_table to extract the table
#
big_df <- NULL
for (k in seq_along(sheet_url)) {
  # 1. navigate to the page
  remDr$navigate(sheet_url[k])
  # remDr$screenshot(display = TRUE) # maybe one wants to see progress
  table <- create_table(remDr)

  # 2. concatenate each table together
  big_df <- dplyr::bind_rows(big_df, table)

  # inform progress
  cat(paste0('\nGathered table for: \t', sheet_months[k]))
}

# close session
remDr$close()
To perform the task, the first step was to generate the links to all the spreadsheets in the document (a rough sketch of this step appears after the list). To do this:
- Navigate once to the document
- Extract source code
- Extract the sheet months and URLs (via gid digit) using regex
- Once this is done, loop through the URLs, gather and bind the tables
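The code for that first step isn't reproduced here, so below is a rough sketch of how the sheet_months and sheet_url objects used in the loop above might be built; the regular expressions and the URL template are my assumptions about how the published Google Sheet is structured, not Nate's exact code:

# navigate once to the published spreadsheet and grab its source
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")
page_source <- remDr$getPageSource()[[1]]

# the Month/Year tab labels, e.g. "May 2019" (assumed pattern)
sheet_months <- unique(unlist(regmatches(page_source, gregexpr("[A-Z][a-z]+ 20[0-9]{2}", page_source))))

# the gid digit of each sheet (assumed pattern), used to build one URL per sheet
gids <- unique(unlist(regmatches(page_source, gregexpr("gid=[0-9]+", page_source))))
sheet_url <- paste0("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#", gids)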
There is a small function called create_table which returns the final table in the proper format, with a safety measure for the number of columns (some of the spreadsheets do not have the team_size field; in those cases it is set to NA_integer_).
If you find this article useful, feel free to share it with others or recommend this article!

As always, if you have any questions or comments, feel free to leave your feedback below, or you can always reach me on LinkedIn. Till then, see you in the next post!