[This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers].
While trying to answer this Stack Overflow question about scraping understat.com,
I wanted to take RSelenium for a spin.
A few years ago, Selenium and R weren't particularly good friends (Python + Selenium was the more common pairing, for instance), but
that seems to have changed.
The work and documentation of the package author and rOpenSci made the difference.
After a few tries at getting the XPath expressions right, I actually found RSelenium pretty nice.
I share some recipes here for one particular context: scraping a paginated table that is
not plain HTML but the result of JavaScript executed in the browser.
One thing that wasn't particularly easy in Selenium at first was extracting a sub-element, such as the HTML code of a table, rather than the source of the whole page.
I used the innerHTML attribute for this.
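As a minimal sketch of that idea, the snippet below parses an HTML fragment with rvest. The string `inner_html` is a made-up stand-in for what RSelenium returns from `elem$getElementAttribute("innerHTML")[[1]]`, and the helper name `parse_first_table` is hypothetical. The team names and points are illustrative only.

```r
library(rvest)

# Made-up stand-in for the string returned in a live session by
#   elem$getElementAttribute("innerHTML")[[1]]
# where elem is an RSelenium web element that wraps the table.
inner_html <- '
<table>
  <thead><tr><th>Team</th><th>Pts</th></tr></thead>
  <tbody>
    <tr><td>A</td><td>10</td></tr>
    <tr><td>B</td><td>7</td></tr>
  </tbody>
</table>'

# Hypothetical helper: parse a fragment and return its first table.
parse_first_table <- function(html_fragment) {
  page <- read_html(html_fragment)  # a string containing "<" is parsed as literal HTML
  tables <- html_table(page)        # one data frame per <table> found
  tables[[1]]
}

standings <- parse_first_table(inner_html)
print(standings)
```

Note that the innerHTML of a `<table>` element itself does not include the `<table>` tag, so this sketch assumes the selected element is a container (a `div`, for example) around the table.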
This example shows how to emulate clicks to navigate from one element to another in the HTML page, with a particular focus on moving from page to page in a paginated table.
Here is a YouTube video with subtitles that I made to illustrate it (no voice).
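The page-to-page loop can be sketched roughly as follows. This is not the exact code from the post: the function name `scrape_paginated` and its arguments are hypothetical, and the two XPath expressions must be found by inspecting the target page. It assumes `remDr` is an already opened RSelenium remote driver and that `container_xpath` points to an element wrapping the table.

```r
# Hypothetical helper: read the table on each page, clicking "next" in between.
scrape_paginated <- function(remDr, container_xpath, next_xpath,
                             n_pages, pause = 1) {
  pages <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    # Grab only the HTML of the container, not the whole page source
    container <- remDr$findElement(using = "xpath", value = container_xpath)
    html <- container$getElementAttribute("innerHTML")[[1]]
    pages[[i]] <- rvest::html_table(rvest::read_html(html))[[1]]

    if (i < n_pages) {
      # Emulate a click on the pagination control, then let the page update
      next_link <- remDr$findElement(using = "xpath", value = next_xpath)
      next_link$clickElement()
      Sys.sleep(pause)
    }
  }
  pages  # one data frame per page
}
```

The fixed `Sys.sleep()` pause is the simplest way to wait for the JavaScript to redraw the table. A more robust version would poll until the content changes.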
The first step is to download a selenium-server-standalone-xxx.jar file here
(see this vignette),
and run it in the terminal: java -jar selenium-server-standalone-xxx.jar
Then you can inspect the elements of the HTML page precisely in the browser (right click, inspect element) and go back and forth between RStudio and the emulated browser.
Finally, use rvest to parse the HTML tables.
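Once the server is running, connecting from R might look like the sketch below. The helper name `open_session` is hypothetical, and the port (4444) and browser name are assumptions that depend on how the Selenium server was started and which browsers are installed.

```r
# Hypothetical helper: connect to a locally running Selenium server
# and open a page in the emulated browser.
open_session <- function(url, browser = "firefox", port = 4444L) {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = "localhost",
    port = port,            # assumed port; adjust to your server
    browserName = browser   # assumed browser; adjust to your setup
  )
  remDr$open()              # starts a browser session on the server
  remDr$navigate(url)
  remDr
}

# Example (requires the Selenium server started with java -jar ...):
# remDr <- open_session("https://understat.com/")
# remDr$close()
```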
For instance, inspecting the page reveals an id such as league-chemp, which we use with RSelenium: