
How to webscrape in R?

[This article was first published on coding-the-past, and kindly contributed to R-bloggers.]


In this lesson you will learn the basics of webscraping with the rvest R package. To demonstrate how it works, you will extract three speeches by Adolf Hitler from Wikisource pages and analyze their word frequencies!


1. What is webscraping?

Simply put, webscraping is the process of gathering data from webpages. In its basic form, it consists of downloading the HTML code of a webpage, locating the element of the HTML structure that contains the content of interest and, finally, extracting and storing it locally for further data analysis.
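Before we dive into the real example, here is a minimal sketch of those three steps with rvest, using the placeholder URL https://example.com as a stand-in target:

library(rvest)

page <- read_html("https://example.com")  # step 1: download the HTML
headings <- html_nodes(page, "h1")        # step 2: locate the elements of interest
html_text(headings)                       # step 3: extract their text for analysis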


Tip: Keep in mind that webscraping can be more complex if the target website uses JavaScript to render content. In that case, consider combining rvest with other libraries, as described here.
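For instance, recent versions of rvest (1.0.4 and later) ship read_html_live(), which renders the page in a headless Chrome session via the chromote package before scraping. A hedged sketch, with a hypothetical JavaScript-heavy URL:

library(rvest)  # assumes rvest >= 1.0.4 and the chromote package installed

# read_html_live() runs the page's JavaScript before returning the HTML
page <- read_html_live("https://example.com/js-rendered-page")  # hypothetical URL
page %>% html_elements("p") %>% html_text()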




2. How to webscrape in R?

There are several libraries developed to webscrape in R. In this lesson, we will stick to one of the most popular, rvest. This library is part of the tidyverse set of libraries and allows you to use the pipe operator (%>%). It is inspired by Python’s Beautiful Soup and RoboBrowser. The basic workflow with rvest involves three functions: read_html() to download and parse a webpage, html_nodes() to select the elements of interest, and html_text() to extract their text content.

Tip: There is a lot of debate about whether webscraping is ethical and legal. The answer depends on your jurisdiction, the kind of content involved and the purpose of your scraping. The robots.txt file of a website usually gives you hints about what is allowed and disallowed there. For more details on this debate, please check this link.
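If you want to check permissions programmatically, the robotstxt package (a separate CRAN package, not part of rvest) can parse a site's robots.txt for you:

library(robotstxt)

# TRUE means the default user agent is allowed to scrape this path
paths_allowed("https://en.wikisource.org/wiki/Main_Page")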


To illustrate how this works, we will extract the text of three speeches made by Adolf Hitler during the Second World War. The first step is to save the URL of each speech in a variable. We also load the necessary libraries; please install them if you haven’t already done that.



library(rvest) # for webscraping
library(tidytext) # for cleaning text data
library(dplyr) # for data preparation
library(ggplot2) # for data viz

speech_01 <- "https://en.wikisource.org/wiki/Adolf_Hitler%27s_Address_at_the_Opening_of_the_Winter_Relief_Campaign_(4_September_1940)"
speech_02 <- "https://en.wikisource.org/wiki/Adolf_Hitler%27s_Address_to_the_Reichstag_(4_May_1941)"
speech_03 <-"https://en.wikisource.org/wiki/Adolf_Hitler%27s_Declaration_of_War_against_the_United_States"


Since we are going to extract the content of three speeches, it is a good idea to write a function for this task, because the same steps will repeat three times. If you inspect the pages above, you will see that the text content sits inside <p> (paragraph) tags, so these nodes are our target. Note that in Firefox and Chrome you can inspect a webpage by right-clicking any area of the page and choosing “Inspect”; for other browsers the procedure should be similar. If you have difficulty finding this option, please check your browser’s documentation.
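If a bare "p" selector picks up too much, you can narrow it with a more specific CSS selector. On MediaWiki sites such as Wikisource, the body text sits inside a div with the class mw-parser-output (an assumption based on the current page structure, which may change):

read_html(speech_01) %>%
  html_nodes("div.mw-parser-output p") %>%  # only paragraphs inside the content div
  html_text() %>%
  head(3)                                   # peek at the first three paragraphs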


Our read_speech function is pretty straightforward. read_html() downloads the URL and parses its HTML; the pipe operator %>% passes the output of one function to the input of the next; html_nodes() keeps only the paragraph tags; and, finally, html_text() extracts the text from those tags.



read_speech <- function(url){
  # download the page, keep the <p> nodes and extract their text
  speech <- read_html(url) %>% 
    html_nodes("p") %>% 
    html_text()
  speech
}

speech_04_Sep_40 <- read_speech(speech_01)
speech_04_May_41 <- read_speech(speech_02)
speech_11_Dec_41 <- read_speech(speech_03)


At this point, if you check the results, you will note that the function delivers a character vector in which each element is one paragraph. We still need to make some adjustments, because the first paragraph is only a short presentation of the speech rather than part of it, so we should eliminate the first element of the vector. For the speeches of 4th of September and 11th of December, that is all we need to do. If you print the speech of 4th of May, you will see that the last five elements are also metadata and need to be excluded. The code below uses indexing to filter the data accordingly. Moreover, we transform the vectors into tibbles – a more modern kind of dataframe – to make the data easier to prepare in the next steps.
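Before relying on the hard-coded index ranges below, it is worth inspecting where the speech proper starts and ends. A quick sketch (the exact lengths reflect the pages at the time of writing and may shift if they are edited):

length(speech_04_May_41)    # total number of paragraphs scraped
head(speech_04_May_41, 2)   # the first element is an introductory note
tail(speech_04_May_41, 6)   # the last five elements are licensing metadata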



speech_04_Sep_40 <- speech_04_Sep_40[2:60]
speech_04_May_41 <- speech_04_May_41[2:60]
speech_11_Dec_41 <- speech_11_Dec_41[2:155]

# tibble creates a modern kind of dataframe with two columns: paragraph and text
speech_04_Sep_40 <- tibble(paragraph = 1:59, text = speech_04_Sep_40) 
speech_04_May_41 <- tibble(paragraph = 1:59, text = speech_04_May_41)
speech_11_Dec_41 <- tibble(paragraph = 1:154, text = speech_11_Dec_41)




3. Visualizing the most frequent words in Hitler’s speeches

Our next objective is to visualize the top 10 words in each of Hitler’s speeches. To do that, we will first prepare the data, transforming the dataframes from the previous step so that each row contains one word and its respective count. Note that we will eliminate stopwords – words with little meaning for the analysis, such as articles.


A function called count_words will be created to carry out the data preparation. This function expands the dataframe from the paragraph level to the word level. This is done by unnest_tokens(), which transforms the table to one token per row; it takes the “text” column as input and outputs a “word” column. anti_join() eliminates rows containing stopwords; if you print stop_words you can see exactly which words are being eliminated. Finally, count() counts how many times each word occurs.
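A quick peek at that stop word list, which ships with tidytext and combines three lexicons (onix, SMART and snowball):

head(stop_words)            # a tibble with columns "word" and "lexicon"
table(stop_words$lexicon)   # how many words each lexicon contributes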



count_words <- function(speech){
  speech %>% 
    unnest_tokens(output = word, input = text) %>%  # one row per word
    anti_join(stop_words) %>%                       # drop the stopwords
    count(word, sort = TRUE)                        # count each word's occurrences
}

speech_04_Sep_40_count <- count_words(speech_04_Sep_40)
speech_04_May_41_count <- count_words(speech_04_May_41)
speech_11_Dec_41_count <- count_words(speech_11_Dec_41)


Great, now we can use ggplot2 to visualize the top 10 words in each speech. Note that we subset the dataframe with row indexing to keep only the top 10 words, and that we reorder the bars so they run from the most to the least frequent word. We also choose a color and remove the y-axis label. The same can be done for the other two speeches – or for all three at once, as sketched after the plot.



# keep the first 10 rows (count() already sorted them by frequency) and
# order the bars from most to least frequent word
ggplot(data = speech_04_Sep_40_count[1:10,], aes(n, reorder(word, n))) +
  geom_col(color = "#FF6885", fill = "#FF6885") +
  labs(y = NULL)  # the word labels make an axis title redundant
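As a possible extension (a sketch that goes beyond the original plot), you can label each count table, stack them with bind_rows() and compare the three speeches side by side with facet_wrap(). reorder_within() and scale_y_reordered() come from tidytext and keep the bars sorted within each facet:

all_counts <- bind_rows(
  mutate(speech_04_Sep_40_count, speech = "4 Sep 1940"),
  mutate(speech_04_May_41_count, speech = "4 May 1941"),
  mutate(speech_11_Dec_41_count, speech = "11 Dec 1941")
) %>%
  group_by(speech) %>%
  slice_max(n, n = 10) %>%   # top 10 words per speech
  ungroup()

ggplot(all_counts, aes(n, reorder_within(word, n, speech))) +
  geom_col(color = "#FF6885", fill = "#FF6885") +
  scale_y_reordered() +
  facet_wrap(~speech, scales = "free_y") +
  labs(y = NULL)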





To apply the same ggplot2 theme used in these plots, check theme_coding_the_past(), our theme that is available here: ‘Climate data visualization with ggplot2’.


Not surprisingly, “war” ranks among the top three words in all of Hitler’s speeches. It is also interesting that words referring to Britain, the Balkans and the Americans reflect the stage the war was in. For example, in the speech of 11th of December, 1941, Hitler declares war on the US, and we accordingly observe a high frequency of words semantically related to the United States. Please leave your comments, questions or thoughts below, and happy coding!




4. Conclusions

With rvest, basic webscraping boils down to three functions: read_html() to download a page, html_nodes() to select elements and html_text() to extract their text. Scraped text usually needs some cleaning – here we dropped metadata paragraphs by indexing and removed stopwords with tidytext – and once the text is tokenized and counted, ggplot2 makes it easy to visualize and compare word frequencies across documents.

To leave a comment for the author, please follow the link and comment on their blog: coding-the-past.
