Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
… one can read in the description of the second talk (by Krzysztof Słomczynski (krzyslom)) at the first meetup of the new Tri-City R Users Group – meet(R) in Tricity!. The meeting will be held this Thursday (12-01-2017) at the Gdańsk University of Technology.
- RSelenium
- Launching RSelenium
- Logging to Ali Express
- Few basic RSelenium functions
- Extracting information from Ali Express allowed only for logged user
- Plotting expenses
RSelenium
Motivated by that upcoming talk, I took a tour through two RSelenium vignettes, RSelenium basics and RSelenium Docker, and launched my first Docker container with Selenium Server. If you are not yet motivated to use Docker containers, have a look at the post R 3.3.0 is another motivation for Docker.
RSelenium is an R interface to Selenium (Server), a project focused on automating web browsers. It lets you create a regular web browser session and control it with commands. Such browser automation is a big enabler for web harvesting, because with RSelenium you can simulate a real user and pass keys to the browser session (such as a user login and password). This makes it possible to log in to a portal automatically (or manually) and to fill in anti-bot CAPTCHA codes (the latter rather manually). You can also interact with web elements that first need to be clicked to reveal the information you want to scrape.
Launching RSelenium
I used the Selenium Docker container, which is available on Docker Hub, to launch Selenium Server. The following command launches Selenium Server and binds it to localhost on port 4445. It also binds the remote desktop to port 5901, so that we can run a VNC viewer (Vinagre) to observe the operations performed in the remote browser session.
When the Docker container is running, we can establish a connection with Selenium Server from R with the following commands.
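As a sketch, the launch command can also be issued from R with system2. The image name and tag below are assumptions (a Selenium standalone "debug" image that ships a VNC server), and selenium_docker_args and launch_selenium are hypothetical helpers, not part of RSelenium. The container-side ports 4444 (Selenium) and 5900 (VNC) are assumed to be the defaults of such an image.

```r
# Build the arguments for `docker run`.
# Assumptions: the image name/tag is illustrative; inside the container
# Selenium listens on 4444 and the VNC server on 5900.
selenium_docker_args <- function(selenium_port = 4445L,
                                 vnc_port = 5901L,
                                 image = "selenium/standalone-firefox-debug:2.53.0") {
  c("run", "-d",
    "-p", paste0(selenium_port, ":4444"),  # Selenium Server on localhost:4445
    "-p", paste0(vnc_port, ":5900"),       # VNC remote desktop on localhost:5901
    image)
}

# Equivalent shell command with the defaults:
#   docker run -d -p 4445:4444 -p 5901:5900 selenium/standalone-firefox-debug:2.53.0
launch_selenium <- function(...) {
  system2("docker", selenium_docker_args(...))
}
```

Calling launch_selenium() requires a working Docker installation; the VNC viewer can then be pointed at localhost:5901.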
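A minimal sketch of that connection, assuming RSelenium is installed and Selenium Server is reachable on localhost:4445. connect_selenium is a hypothetical wrapper; remoteDriver and its open method come from RSelenium.

```r
# Assumption: Selenium Server runs in the Docker container and is bound
# to localhost:4445; the browser name must match the container image.
connect_selenium <- function(port = 4445L, browser = "firefox") {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = "localhost",
    port = port,
    browserName = browser
  )
  remDr$open()  # starts a browser session on the server
  remDr
}

# Usage (needs the running container):
# remDr <- connect_selenium()
```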
Logging to Ali Express
The result of the previous commands is a remoteDriver object, here called remDr. This S4 object can be used to navigate through the web browser session. The command below navigates to the Ali Express login page.
At this point you can see in the remote desktop VNC viewer that we have entered Ali Express.
You can log in in 2 ways: by filling in the fields manually, or by finding the fields by their properties (like id or class name) and sending keys (like password and user name) to them.
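As a sketch of the navigation step: navigate and getCurrentUrl are remoteDriver methods, while open_login_page is a hypothetical helper and the login URL is an assumption that should be checked against the address the site actually uses.

```r
# Assumption: the URL is illustrative, not taken from the original post.
open_login_page <- function(remDr, url = "https://login.aliexpress.com/") {
  remDr$navigate(url)
  remDr$getCurrentUrl()[[1]]  # the address the browser actually ended up at
}

# Usage (with an open session):
# open_login_page(remDr)
```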
Then you can click on the sign in button with clickElement. Use inspect element in any web browser to find out how to reach an element with findElement by its id, by its class, or even by css. For this portal I wasn’t successful with sending keys, so I logged in manually.
However, for Facebook it worked like a charm.
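The key-sending approach described above can be sketched as follows. login_with_keys is a hypothetical helper, and the three element ids are placeholders, not the real ids of any portal; they have to be looked up with inspect element on the actual login form.

```r
# Assumption: the default ids are hypothetical placeholders.
login_with_keys <- function(remDr, user, password,
                            user_id = "login-id",
                            password_id = "login-password",
                            submit_id = "login-submit") {
  # sendKeysToElement expects a list of keys
  remDr$findElement(using = "id", value = user_id)$sendKeysToElement(list(user))
  remDr$findElement(using = "id", value = password_id)$sendKeysToElement(list(password))
  # click the sign in button
  remDr$findElement(using = "id", value = submit_id)$clickElement()
  invisible(remDr)
}

# Usage (with an open session on the login page):
# login_with_keys(remDr, user = "user_name", password = "password")
```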
Few basic RSelenium functions
At this point I think a few comments on the basics of RSelenium commands are in order. With the findElement method you get the first element on the web page that matches the search criterion; with findElements you get all such elements. On each such element (or on the list of elements) you can perform further operations like
- sending data to the element (sendKeysToElement)
- clicking the element (clickElement)
- finding elements within that element (findChildElements)
- extracting the text from the element (getElementText)
- highlighting the element (highlightElement)
and many more!
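A short sketch that combines some of these methods: it finds all elements matching a CSS selector, optionally highlights each one, and returns their text. element_texts is a hypothetical helper; the methods it calls are the RSelenium ones named above.

```r
# Assumption: `remDr` is an open remoteDriver; the selector is up to the caller.
element_texts <- function(remDr, css, highlight = FALSE) {
  elements <- remDr$findElements(using = "css selector", value = css)
  vapply(elements, function(el) {
    if (highlight) el$highlightElement()  # flash the element in the browser
    el$getElementText()[[1]]              # getElementText() returns a list
  }, character(1))
}

# Usage (with an open session):
# element_texts(remDr, "h1", highlight = TRUE)
```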
Extracting information from Ali Express allowed only for logged user
Ali Express is a marketplace where you can order products, mainly from China. They are of low quality but also cheap, which is why the portal is very popular despite the long delivery times (delivery is free in many cases). You can buy earrings or clothes for a few cents!
How much money did I 🙂 spend on all my transactions?
From the My orders panel, available only to a logged-in user, I found out that for each page of orders (I had 32 pages of transaction history) I can extract the whole body of a transaction, and then from that body extract the amount that I paid in dollars, check whether the transaction was cancelled, and get the ID of the operation. I just need to properly specify the class names of the HTML tags/objects I need.
I will use order and transaction interchangeably.
For all 32 pages of the order history, the following for loop extracts information from each sub page and then navigates to the next sub page of the order history.
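A minimal sketch of such a loop. scrape_order_pages is a hypothetical helper, and both CSS selectors are placeholders rather than the portal's real class names; they need to be replaced with the classes found via inspect element.

```r
# Assumptions: `order_css` and `next_css` are hypothetical selectors;
# `pause` gives the next sub page time to load before it is read.
scrape_order_pages <- function(remDr, n_pages = 32L,
                               order_css = ".order-item",
                               next_css = ".pagination-next",
                               pause = 2) {
  bodies <- vector("list", n_pages)
  for (page in seq_len(n_pages)) {
    # whole body (plain text) of every order on the current sub page
    orders <- remDr$findElements(using = "css selector", value = order_css)
    bodies[[page]] <- vapply(orders,
                             function(el) el$getElementText()[[1]],
                             character(1))
    # go to the next sub page, except after the last one
    if (page < n_pages) {
      remDr$findElement(using = "css selector", value = next_css)$clickElement()
      Sys.sleep(pause)
    }
  }
  unlist(bodies)
}

# Usage (logged in, on the first page of My orders):
# order_bodies <- scrape_order_pages(remDr)
```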
Plotting expenses
The extracted information is plain text, so it requires some text manipulation to obtain tidy data that can be plotted. Additionally, I keep only the orders that haven’t been cancelled.
The plot of cumulative expenses can be obtained with the following code. I can’t believe that 450 $ was spent over 8 months! The result of the code is the main photo of this post.
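A sketch of that text manipulation in base R. The layout of an order body is an assumption here: the labels "Order ID:" and "Order date:", the ISO date format, the "$" amount and the word "cancel" marking cancelled orders are all hypothetical and must be adapted to the text the portal really returns. parse_orders is a hypothetical helper.

```r
# Assumed (hypothetical) body layout, one string per order:
#   "Order ID: 101\nOrder date: 2016-05-01\nOrder amount: $ 1.50\n..."
parse_orders <- function(bodies) {
  # first capture group of `pattern` for each body, NA when there is no match
  grab <- function(pattern) {
    hits <- regmatches(bodies, regexec(pattern, bodies))
    vapply(hits,
           function(m) if (length(m) >= 2) m[2] else NA_character_,
           character(1))
  }
  orders <- data.frame(
    id = grab("Order ID: *([0-9]+)"),
    date = as.Date(grab("Order date: *([0-9]{4}-[0-9]{2}-[0-9]{2})"),
                   format = "%Y-%m-%d"),
    amount = as.numeric(grab("\\$ *([0-9]+\\.?[0-9]*)")),
    cancelled = grepl("cancel", bodies, ignore.case = TRUE),
    stringsAsFactors = FALSE
  )
  orders[!orders$cancelled, ]  # keep only orders that were not cancelled
}
```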
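A minimal sketch of such a plot with base graphics, which may differ from the plotting code behind the main photo. It assumes a data frame with a date column (class Date) and an amount column in dollars; both column names and both helpers are hypothetical.

```r
# Assumption: `orders` has columns `date` (Date) and `amount` (numeric, in $).
cumulative_expenses <- function(orders) {
  orders <- orders[order(orders$date), ]  # oldest order first
  orders$total <- cumsum(orders$amount)   # running sum of the amounts
  orders
}

plot_cumulative_expenses <- function(orders) {
  spent <- cumulative_expenses(orders)
  plot(spent$date, spent$total, type = "s",
       xlab = "Order date", ylab = "Cumulative expenses ($)",
       main = "Ali Express expenses")
}

# Usage:
# plot_cumulative_expenses(orders)
```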