How to webscrape in R?
In this lesson you will learn the basics of webscraping with the rvest
R package. To demonstrate how it works, you will extract three speeches by Adolf Hitler from Wikipedia pages and analyze their word frequencies!
1. What is webscraping?
Simply put, webscraping is the process of gathering data from webpages. In its basic form, it consists of downloading the HTML code of a webpage, locating the element of the HTML structure that holds the content of interest and, finally, extracting and storing it locally for further data analysis.
2. How to webscrape in R?
There are several libraries for webscraping in R. In this lesson, we will stick to one of the most popular, rvest. This library is part of the tidyverse set of libraries and allows you to use the pipe operator (%>%). It is inspired by Python’s Beautiful Soup and RoboBrowser. The basic steps for webscraping with rvest involve the following functions:
- read_html: Extracts the HTML source code associated with a URL;
- html_nodes: Extracts the relevant HTML nodes from the HTML code;
- html_text: Extracts the text (content) from the nodes.
To illustrate how this works, we will extract the text of three speeches made by Adolf Hitler during the Second World War. The first step is to save the URLs of these speeches in variables. We also load the necessary libraries; please install them if you haven’t already done so.
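The code block for this step was not captured above. A minimal sketch, assuming placeholder URLs (the actual Wikipedia pages used in the lesson are not reproduced here), could look like this:

```r
# Libraries used throughout the lesson -- install them first with
# install.packages(c("rvest", "dplyr", "tidytext", "ggplot2"))
library(rvest)    # webscraping
library(dplyr)    # pipes and data manipulation
library(tidytext) # tokenization and stopwords
library(ggplot2)  # visualization

# Placeholder URLs: substitute the real pages of the three speeches
url_sep_1940 <- "https://example.org/speech-4-september-1940"
url_may_1941 <- "https://example.org/speech-4-may-1941"
url_dec_1941 <- "https://example.org/speech-11-december-1941"
```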
Since we are going to extract the content of three speeches, repeating the same steps each time, it is a good idea to write a function to perform this task. If you inspect the URLs above, you will see that the text content is located inside <p> (paragraph) tags; therefore, our target is to extract these nodes. Note that in Firefox and Chrome, you can inspect a webpage by right-clicking any area of the page and selecting “Inspect”. For other browsers the procedure should be similar; if you have difficulty finding this option, please check the browser documentation.
Our read_speech function is pretty straightforward. read_html reads the URL of the webpage and returns its HTML. The pipe operator %>% passes the output of one function to the input of the next one. html_nodes extracts only the paragraph tags from the code and, finally, html_text extracts the text from those tags.
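The original snippet is missing above. Based on the description, read_speech could be sketched as follows (the URL and speech variable names are assumptions):

```r
# Download a speech page and return its paragraphs as a character vector
read_speech <- function(url) {
  url %>%
    read_html() %>%     # fetch and parse the page's HTML
    html_nodes("p") %>% # keep only the <p> (paragraph) nodes
    html_text()         # extract the text content of each paragraph
}

speech_sep_1940 <- read_speech(url_sep_1940)
speech_may_1941 <- read_speech(url_may_1941)
speech_dec_1941 <- read_speech(url_dec_1941)
```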
At this point, if you check the results, you will note that the function delivers a text vector in which each element is one paragraph. We still need to make some adjustments, because the first paragraph is only a short presentation of the speech rather than part of it; therefore we should eliminate the first element of the vector. For the speeches of 4th September and 11th December, that is all we need to do. If you print the speech of 4th May, you will see that the last 5 elements are also metadata and need to be excluded. The code below uses indexing to filter the data accordingly. Moreover, we transform all the dataframes into tibbles - a more modern kind of dataframe - to make the data easier to prepare in the next steps.
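The indexing code itself was not captured; a sketch consistent with the description (variable names assumed) is:

```r
# Drop the introductory first paragraph of each speech; the speech of
# 4th May 1941 also carries five trailing metadata paragraphs
speech_sep_1940 <- tibble(text = speech_sep_1940[-1])
speech_dec_1941 <- tibble(text = speech_dec_1941[-1])

n_may <- length(speech_may_1941)
speech_may_1941 <- tibble(text = speech_may_1941[2:(n_may - 5)])
```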
3. Visualizing the most frequent words in Hitler’s speeches
Our next objective is to visualize the top 10 words in each of Hitler’s speeches. In order to do that, we will first prepare the data, transforming the dataframes from the previous step so that each row contains one word and its respective count. Note that we will eliminate stopwords - words that carry little meaning for the analysis, such as articles.
A function called count_words will be created to carry out the data preparation. This function expands the dataframe from the paragraph level to the word level. This is done by unnest_tokens, which transforms the table to one token per row: it takes the “text” column as input and outputs a “word” column. anti_join eliminates rows containing stopwords; if you print stop_words, you can see exactly which words are being eliminated. Finally, count counts how many times each word occurs.
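The function is not shown above; a sketch following the description, using tidytext’s built-in stop_words table, might look like this:

```r
# Expand a speech tibble to one word per row and count word frequencies
count_words <- function(speech_df) {
  speech_df %>%
    unnest_tokens(word, text) %>%          # one token (word) per row
    anti_join(stop_words, by = "word") %>% # drop English stopwords
    count(word, sort = TRUE)               # frequency of each word
}

words_sep_1940 <- count_words(speech_sep_1940)
```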
Great, now we can use ggplot2 to visualize the top 10 words in each speech. Note that we take the dataframe of interest and use index filtering to keep only the top 10 words. Note, as well, that we reorder the bar plot so that the bars run from the most to the least frequent word. We choose a color and eliminate the y-axis label. The same can be done for the other two speeches.
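The plotting code is missing above; a hedged sketch for one speech (variable names and color are assumptions) could be:

```r
# Bar plot of the 10 most frequent words in the speech of 4th September
# 1940; reorder() sorts the bars from most to least frequent
words_sep_1940[1:10, ] %>%
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col(fill = "steelblue") +   # chosen color
  labs(x = "Count", y = NULL,      # drop the y-axis label
       title = "Top 10 words in Hitler's speech of 4th September 1940")
```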
- Top 10 words used in Hitler’s speech of 4th September 1940
- Top 10 words used in Hitler’s speech of 4th May 1941
- Top 10 words used in Hitler’s speech of 11th December 1941
To add the same ggplot2 theme as used in these plots, please check theme_coding_the_past(), our theme that is available here: ‘Climate data visualization with ggplot2’.
Not surprisingly, “war” is among the top 3 words in all of Hitler’s speeches. It is also interesting that other words, referring to Britain, the Balkans and the Americans, reflect the stage the war was at. For example, in the speech of 11th December 1941, Hitler declares war on the US, and accordingly we observe a high frequency of words semantically related to the US. Please leave your comments, questions or thoughts below, and happy coding!
4. Conclusions
- R can be an effective tool to perform webscraping, notably with the rvest package;
- To smoothly clean webscraped content, you may use the tidytext package.