Practical Introduction to Web Scraping in R
Introduction
Are you trying to compare the prices of products across websites? Are you trying to monitor price changes every hour? Or are you planning to do some text mining or sentiment analysis on reviews of products or services? If yes, how would you do that? How do you get the details available on a website into a format in which you can analyse them?
- Can you copy/paste the data from their website?
- Can you see some save button?
- Can you download the data?
Hmmm... If you have these or similar questions on your mind, you have come to the right place. In this post, we will learn about web scraping using R. Below is a video tutorial which covers the initial part of this post.
The slides used in the above video tutorial can be found here.
The What?
What exactly is web scraping, also known as web mining or web harvesting? It is a technique for extracting data from websites. Remember, websites contain a wealth of useful data, but they are designed for human consumption and not for data analysis. The goal of web scraping is to take advantage of the pattern or structure of web pages to extract and store data in a format suitable for data analysis.
The Why?
Now, let us understand why we may have to scrape data from the web.
- Data Format: As we said earlier, there is a wealth of data on websites, but it is designed for human consumption. As such, we cannot use it for data analysis as it is not in a suitable format/shape/structure.
- No copy/paste: Copying & pasting the data into a local file is rarely practical. Even if we do it, the data will not be in the required format for data analysis.
- No save/download: There are no options to save/download the required data from the websites. We cannot right click and save or click on a download button to extract the required data.
- Automation: With web scraping, we can automate the process of data extraction/harvesting.
The How?
- robots.txt: One of the most important and most overlooked steps is to check the robots.txt file to ensure that we have permission to access the web page without violating any terms or conditions. In R, we can do this using the robotstxt package by rOpenSci.
- Fetch: The next step is to fetch the web page using the xml2 package and store it so that we can extract the required data. Remember to fetch the page once and store it; fetching it repeatedly may lead to your IP address being blocked by the owners of the website.
- Extract/Store/Analyze: Now that we have fetched the web page, we will use rvest to extract the data and store it for further analysis.
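The fetch-once-and-store idea above can be sketched as follows. This is a minimal illustration, not the workflow used later in the post: the HTML string and the .headline class are hypothetical stand-ins for a real page, and the cache location is a temporary file chosen only for the example.

```r
library(xml2)
library(rvest)

# Hypothetical page source standing in for a real website
page_source <- '<html><body><h1 class="headline">Hello</h1></body></html>'

# With a real URL, robotstxt::paths_allowed() would be checked before this step

# Fetch: parse the page once
page <- read_html(page_source)

# Store: keep a local copy so later runs need not request the page again
cache_file <- tempfile(fileext = ".html")
write_html(page, cache_file)

# Extract: work from the stored copy
headline <- read_html(cache_file) %>%
  html_nodes(".headline") %>%
  html_text()
headline
```

With a real website, page_source would be replaced by the URL, and the cached file would live somewhere more permanent than tempfile().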
Use Cases
Below are a few use cases of web scraping:
- Contact Scraping: Locate contact information including email addresses, phone numbers etc.
- Monitoring/Comparing Prices: How your competitors price their products, how your prices fit within your industry, and whether there are any fluctuations that you can take advantage of.
- Scraping Reviews/Ratings: Scrape reviews of products/services and use them for text mining/sentiment analysis etc.
Things to keep in mind…
- Static & Well Structured: Web scraping is best suited for static & well structured web pages. In one of our case studies, we demonstrate how badly structured web pages can hamper data extraction.
- Code Changes: The underlying HTML code of a web page can change at any time due to changes in design or updates to its details. In such cases, your script will stop working. It is important to identify changes to the web page and modify the web scraping script accordingly.
- API Availability: In many cases, an API (application programming interface) is made available by the service provider or organization. It is always advisable to use the API and avoid web scraping. The httr package has a nice introduction on interacting with APIs.
- IP Blocking: Do not flood websites with requests as you run the risk of getting blocked. Leave a time gap between requests so that your IP address is not blocked from accessing the website.
- robots.txt: We cannot emphasize this enough: always review the robots.txt file to ensure you are not violating any terms and conditions.
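The advice on leaving a gap between requests can be sketched with base R alone. The helper fetch_politely() below is hypothetical (it is not part of any package), and the two-second default delay is an arbitrary assumption rather than a recommended value.

```r
# Hypothetical helper: call `fetch` on each URL, pausing between requests
fetch_politely <- function(urls, fetch, delay = 2) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- fetch(urls[[i]])
    # Pause before the next request, but not after the last one
    if (i < length(urls)) Sys.sleep(delay)
  }
  results
}

# Demonstration with a stand-in fetch function and a short delay
pages <- fetch_politely(c("page-1", "page-2"), fetch = toupper, delay = 0.1)
pages
```

In a real script, fetch would be something like xml2::read_html and urls a vector of page addresses.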
Case Studies
- IMDB top 50 movies: In this case study we will scrape the IMDB website to extract the title, year of release, certificate, runtime, genre, rating, votes and revenue of the top 50 movies.
- Most visited websites: In this case study, we will look at the 50 most visited websites in the world including the category to which they belong, average time on site, average pages browsed per visit and bounce rate.
- List of RBI governors: In this final case study, we will scrape the list of RBI Governors from Wikipedia and analyze the background from which they came, i.e. whether there were more economists or bureaucrats.
HTML Basics
To be able to scrape data from websites, we need to understand how the web pages are structured. In this section, we will learn just enough HTML to be able to start scraping data from websites.
HTML, CSS & JAVASCRIPT
A web page typically is made up of the following:
- HTML (Hyper Text Markup Language) takes care of the content. You need to have a basic knowledge of HTML tags as the content is located with these tags.
- CSS (Cascading Style Sheets) takes care of the appearance of the content. While you don’t need to look into the CSS of a web page, you should be able to identify the id or class that manage the appearance of content.
- JS (Javascript) takes care of the behavior of the web page.
HTML Element
An HTML element consists of a start tag and an end tag with content inserted in between. Elements can be nested, and tag names are case insensitive. The tags can have attributes as shown in the above image. The attributes usually come as name/value pairs. In the above image, class is the attribute name while primary is the attribute value. While scraping data from websites in the case studies, we will use a combination of HTML tags and attributes to locate the content we want to extract. Below is a list of basic and important HTML tags you should know before you get started with web scraping.
DOM
DOM (Document Object Model) defines the logical structure of a document and the way it is accessed and manipulated. In the above image, you can see that HTML is structured as a tree and you can trace a path to any node or tag. We will use a similar approach in our case studies. We will try to trace the content we intend to extract using HTML tags and attributes. If the web page is well structured, we should be able to locate the content using a unique combination of tags and attributes.
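To make the idea of tracing a path through the tree concrete, here is a small sketch using rvest. The HTML fragment, the catalog and item classes, and the link targets are all made up for illustration.

```r
library(rvest)

# Hypothetical page fragment used only for illustration
html <- paste0(
  '<div class="catalog">',
  '<div class="item"><h3><a href="/a">First item</a></h3></div>',
  '<div class="item"><h3><a href="/b">Second item</a></h3></div>',
  '</div>'
)
page <- read_html(html)

# Trace the path: section with class "item" -> <h3> -> <a>
links <- page %>% html_nodes(".item h3 a")

link_text <- html_text(links)          # content between the start and end tags
link_href <- html_attr(links, "href")  # value of the href attribute
link_text
link_href
```

The selector ".item h3 a" reads from the outside in, which mirrors walking down the DOM tree from a section to the hyperlink nested inside it.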
HTML Attributes
- all HTML elements can have attributes
- they provide additional information about an element
- they are always specified in the start tag
- usually come in name/value pairs
The class attribute is used to define equal styles for elements with the same class name. HTML elements with the same class name will have the same format and style. The id attribute specifies a unique id for an HTML element. It can be used on any HTML element and is case sensitive. The style attribute sets the style of an HTML element.
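As a rough sketch of how these attributes are used when scraping, the snippet below selects elements by class and by id and reads an attribute value. The markup, the note class and the intro id are invented for the example.

```r
library(rvest)

# Hypothetical markup with class, id and style attributes
html <- paste0(
  '<div>',
  '<p id="intro" class="note">Welcome</p>',
  '<p class="note">Read on</p>',
  '<p style="color:red">Alert</p>',
  '</div>'
)
page <- read_html(html)

# A class can be shared by several elements: select with a leading dot
notes <- page %>% html_nodes(".note") %>% html_text()

# An id is unique within a page: select with a leading hash
intro <- page %>% html_nodes("#intro") %>% html_text()

# Attribute values are read with html_attr(); NA where the attribute is absent
styles <- page %>% html_nodes("p") %>% html_attr("style")
```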
Libraries
We will use the following R packages in this tutorial.
library(robotstxt)
library(rvest)
library(selectr)
library(xml2)
library(dplyr)
library(stringr)
library(forcats)
library(magrittr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(tibble)
library(purrr)
IMDB Top 50
In this case study, we will extract the following details of the top 50 movies from the IMDB website:
- title
- year of release
- certificate
- runtime
- genre
- rating
- votes
- revenue
robotstxt
Let us check if we can scrape the data from the website using paths_allowed() from the robotstxt package.
paths_allowed(
  paths = c("https://www.imdb.com/search/title?groups=top_250&sort=user_rating")
)
## www.imdb.com No encoding supplied: defaulting to UTF-8.
## [1] TRUE
Since it has returned TRUE, we will go ahead and download the web page using read_html() from the xml2 package.
imdb <- read_html("https://www.imdb.com/search/title?groups=top_250&sort=user_rating")
imdb
## {xml_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body id="styleguide-v2" class="fixed">\n\n <img height=" ...
Title
We will look at the HTML code of the IMDB web page and locate the title of the movies in the following way:
- hyperlink inside <h3> tag
- section identified with the class .lister-item-content
In other words, the title of the movie is inside a hyperlink (<a>) which is inside a level 3 heading (<h3>) within a section identified by the class .lister-item-content.
imdb %>%
  html_nodes(".lister-item-content h3 a") %>%
  html_text() -> movie_title
movie_title
##  [1] "The Shawshank Redemption"
##  [2] "The Godfather"
##  [3] "The Dark Knight"
##  [4] "The Godfather: Part II"
##  [5] "The Lord of the Rings: The Return of the King"
##  [6] "Pulp Fiction"
##  [7] "Schindler's List"
##  [8] "Il buono, il brutto, il cattivo"
##  [9] "12 Angry Men"
## [10] "Inception"
## [11] "Fight Club"
## [12] "The Lord of the Rings: The Fellowship of the Ring"
## [13] "Forrest Gump"
## [14] "The Lord of the Rings: The Two Towers"
## [15] "The Matrix"
## [16] "Goodfellas"
## [17] "Star Wars: Episode V - The Empire Strikes Back"
## [18] "One Flew Over the Cuckoo's Nest"
## [19] "Shichinin no samurai"
## [20] "Interstellar"
## [21] "Cidade de Deus"
## [22] "Sen to Chihiro no kamikakushi"
## [23] "Saving Private Ryan"
## [24] "The Green Mile"
## [25] "La vita è bella"
## [26] "The Usual Suspects"
## [27] "Se7en"
## [28] "Léon"
## [29] "The Silence of the Lambs"
## [30] "Star Wars"
## [31] "It's a Wonderful Life"
## [32] "Andhadhun"
## [33] "Dangal"
## [34] "Spider-Man: Into the Spider-Verse"
## [35] "Avengers: Infinity War"
## [36] "Whiplash"
## [37] "The Intouchables"
## [38] "The Prestige"
## [39] "The Departed"
## [40] "The Pianist"
## [41] "Memento"
## [42] "Gladiator"
## [43] "American History X"
## [44] "The Lion King"
## [45] "Terminator 2: Judgment Day"
## [46] "Nuovo Cinema Paradiso"
## [47] "Hotaru no haka"
## [48] "Back to the Future"
## [49] "Raiders of the Lost Ark"
## [50] "Apocalypse Now"
Year of Release
The year in which a movie was released can be located in the following way:
- <span> tag identified by the class .lister-item-year
- nested inside a level 3 heading (<h3>)
- part of section identified by the class .lister-item-content
imdb %>%
  html_nodes(".lister-item-content h3 .lister-item-year") %>%
  html_text()
##  [1] "(1994)" "(1972)" "(2008)" "(1974)" "(2003)" "(1994)" "(1993)"
##  [8] "(1966)" "(1957)" "(2010)" "(1999)" "(2001)" "(1994)" "(2002)"
## [15] "(1999)" "(1990)" "(1980)" "(1975)" "(1954)" "(2014)" "(2002)"
## [22] "(2001)" "(1998)" "(1999)" "(1997)" "(1995)" "(1995)" "(1994)"
## [29] "(1991)" "(1977)" "(1946)" "(2018)" "(2016)" "(2018)" "(2018)"
## [36] "(2014)" "(2011)" "(2006)" "(2006)" "(2002)" "(2000)" "(2000)"
## [43] "(1998)" "(1994)" "(1991)" "(1988)" "(1988)" "(1985)" "(1981)"
## [50] "(1979)"
If you look at the output, the year is enclosed in round brackets and is a character vector. We need to do 2 things now:
- remove the round bracket
- convert year to class Date instead of character
We will use str_sub() to extract the year and convert it to Date using as.Date() with the format %Y. Finally, we use year() from lubridate package to extract the year from the previous step.
imdb %>%
  html_nodes(".lister-item-content h3 .lister-item-year") %>%
  html_text() %>%
  str_sub(start = 2, end = 5) %>%
  as.Date(format = "%Y") %>%
  year() -> movie_year
movie_year
##  [1] 1994 1972 2008 1974 2003 1994 1993 1966 1957 2010 1999 2001 1994 2002
## [15] 1999 1990 1980 1975 1954 2014 2002 2001 1998 1999 1997 1995 1995 1994
## [29] 1991 1977 1946 2018 2016 2018 2018 2014 2011 2006 2006 2002 2000 2000
## [43] 1998 1994 1991 1988 1988 1985 1981 1979
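The str_sub() approach relies on the year always sitting in characters 2 to 5. As a hedged alternative, one could match the first run of four digits instead. The sketch below is not the method used in this post, and the third input value is a hypothetical case with extra text before the year.

```r
library(stringr)

# Example raw values; the last one is a hypothetical case with extra text
raw_year <- c("(1994)", "(1972)", "(I) (2018)")

# Pull out the first run of four digits and convert it to an integer
year_num <- as.integer(str_extract(raw_year, "[0-9]{4}"))
year_num
```

Matching on the digits avoids the round trip through Date, at the cost of assuming that the first four-digit run in the text really is the year.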
Certificate
The certificate given to the movie can be located in the following way:
- <span> tag identified by the class .certificate
- nested inside a paragraph (<p>)
- part of section identified by the class .lister-item-content
imdb %>%
  html_nodes(".lister-item-content p .certificate") %>%
  html_text() -> movie_certificate
movie_certificate
##  [1] "A"     "A"     "UA"    "PG-13" "A"     "A"     "UA"    "A"
##  [9] "PG-13" "PG-13" "PG-13" "A"     "A"     "PG"    "UA"    "R"
## [17] "PG"    "A"     "A"     "PG-13" "A"     "R"     "A"     "A"
## [25] "U"     "PG"    "UA"    "U"     "U"     "UA"    "A"     "UA"
## [33] "PG-13" "A"     "R"     "R"     "R"     "A"     "U"     "U"
## [41] "R"     "U"     "PG"    "R"
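Note that the output above has only 44 values for 50 movies, so some movies have no certificate and the vector no longer lines up with the titles. One possible way around this, sketched below on a made-up three-movie fragment, is to first select each movie's section and then call html_node() (singular) within it, which yields NA where nothing matches. The fragment only imitates the class names used above; it is not the real IMDB markup.

```r
library(rvest)

# Hypothetical fragment: the second movie has no certificate
html <- paste0(
  '<div class="lister-item-content"><h3><a>Movie A</a></h3>',
  '<p><span class="certificate">PG</span></p></div>',
  '<div class="lister-item-content"><h3><a>Movie B</a></h3><p></p></div>',
  '<div class="lister-item-content"><h3><a>Movie C</a></h3>',
  '<p><span class="certificate">R</span></p></div>'
)
items <- read_html(html) %>% html_nodes(".lister-item-content")

# html_node() returns one result per item, so missing values become NA
certificate <- items %>% html_node(".certificate") %>% html_text()
certificate
```

Because the result has one entry per movie, it can sit in the same table as the other columns.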
Runtime
The runtime of the movie can be located in the following way:
- <span> tag identified by the class .runtime
- nested inside a paragraph (<p>)
- part of section identified by the class .lister-item-content
imdb %>%
  html_nodes(".lister-item-content p .runtime") %>%
  html_text()
##  [1] "142 min" "175 min" "152 min" "202 min" "201 min" "154 min" "195 min"
##  [8] "161 min" "96 min"  "148 min" "139 min" "178 min" "142 min" "179 min"
## [15] "136 min" "146 min" "124 min" "133 min" "207 min" "169 min" "130 min"
## [22] "125 min" "169 min" "189 min" "116 min" "106 min" "127 min" "110 min"
## [29] "118 min" "121 min" "130 min" "139 min" "161 min" "117 min" "149 min"
## [36] "106 min" "112 min" "130 min" "151 min" "150 min" "113 min" "155 min"
## [43] "119 min" "88 min"  "137 min" "155 min" "89 min"  "116 min" "115 min"
## [50] "147 min"
If you look at the output, it includes the text min and is of type character. We need to do 2 things here:
- remove the text min
- convert to type numeric
We will try the following:
- use str_split() to split the result using space as a separator
- extract the first element from the resulting list using map_chr()
- use as.numeric() to convert to a number
imdb %>%
  html_nodes(".lister-item-content p .runtime") %>%
  html_text() %>%
  str_split(" ") %>%
  map_chr(1) %>%
  as.numeric() -> movie_runtime
movie_runtime
##  [1] 142 175 152 202 201 154 195 161  96 148 139 178 142 179 136 146 124
## [18] 133 207 169 130 125 169 189 116 106 127 110 118 121 130 139 161 117
## [35] 149 106 112 130 151 150 113 155 119  88 137 155  89 116 115 147
Genre
The genre of the movie can be located in the following way:
- <span> tag identified by the class .genre
- nested inside a paragraph (<p>)
- part of section identified by the class .lister-item-content
imdb %>%
  html_nodes(".lister-item-content p .genre") %>%
  html_text()
##  [1] "\nDrama "
##  [2] "\nCrime, Drama "
##  [3] "\nAction, Crime, Drama "
##  [4] "\nCrime, Drama "
##  [5] "\nAdventure, Drama, Fantasy "
##  [6] "\nCrime, Drama "
##  [7] "\nBiography, Drama, History "
##  [8] "\nWestern "
##  [9] "\nDrama "
## [10] "\nAction, Adventure, Sci-Fi "
## [11] "\nDrama "
## [12] "\nAdventure, Drama, Fantasy "
## [13] "\nDrama, Romance "
## [14] "\nAdventure, Drama, Fantasy "
## [15] "\nAction, Sci-Fi "
## [16] "\nBiography, Crime, Drama "
## [17] "\nAction, Adventure, Fantasy "
## [18] "\nDrama "
## [19] "\nAdventure, Drama "
## [20] "\nAdventure, Drama, Sci-Fi "
## [21] "\nCrime, Drama "
## [22] "\nAnimation, Adventure, Family "
## [23] "\nDrama, War "
## [24] "\nCrime, Drama, Fantasy "
## [25] "\nComedy, Drama, Romance "
## [26] "\nCrime, Mystery, Thriller "
## [27] "\nCrime, Drama, Mystery "
## [28] "\nAction, Crime, Drama "
## [29] "\nCrime, Drama, Thriller "
## [30] "\nAction, Adventure, Fantasy "
## [31] "\nDrama, Family, Fantasy "
## [32] "\nCrime, Thriller "
## [33] "\nAction, Biography, Drama "
## [34] "\nAnimation, Action, Adventure "
## [35] "\nAction, Adventure, Sci-Fi "
## [36] "\nDrama, Music "
## [37] "\nBiography, Comedy, Drama "
## [38] "\nDrama, Mystery, Sci-Fi "
## [39] "\nCrime, Drama, Thriller "
## [40] "\nBiography, Drama, Music "
## [41] "\nMystery, Thriller "
## [42] "\nAction, Adventure, Drama "
## [43] "\nDrama "
## [44] "\nAnimation, Adventure, Drama "
## [45] "\nAction, Sci-Fi "
## [46] "\nDrama "
## [47] "\nAnimation, Drama, War "
## [48] "\nAdventure, Comedy, Sci-Fi "
## [49] "\nAction, Adventure "
## [50] "\nDrama, War "
The output includes \n and white space, both of which will be removed using str_trim().
imdb %>%
  html_nodes(".lister-item-content p .genre") %>%
  html_text() %>%
  str_trim() -> movie_genre
movie_genre
##  [1] "Drama"                        "Crime, Drama"
##  [3] "Action, Crime, Drama"         "Crime, Drama"
##  [5] "Adventure, Drama, Fantasy"    "Crime, Drama"
##  [7] "Biography, Drama, History"    "Western"
##  [9] "Drama"                        "Action, Adventure, Sci-Fi"
## [11] "Drama"                        "Adventure, Drama, Fantasy"
## [13] "Drama, Romance"               "Adventure, Drama, Fantasy"
## [15] "Action, Sci-Fi"               "Biography, Crime, Drama"
## [17] "Action, Adventure, Fantasy"   "Drama"
## [19] "Adventure, Drama"             "Adventure, Drama, Sci-Fi"
## [21] "Crime, Drama"                 "Animation, Adventure, Family"
## [23] "Drama, War"                   "Crime, Drama, Fantasy"
## [25] "Comedy, Drama, Romance"       "Crime, Mystery, Thriller"
## [27] "Crime, Drama, Mystery"        "Action, Crime, Drama"
## [29] "Crime, Drama, Thriller"       "Action, Adventure, Fantasy"
## [31] "Drama, Family, Fantasy"       "Crime, Thriller"
## [33] "Action, Biography, Drama"     "Animation, Action, Adventure"
## [35] "Action, Adventure, Sci-Fi"    "Drama, Music"
## [37] "Biography, Comedy, Drama"     "Drama, Mystery, Sci-Fi"
## [39] "Crime, Drama, Thriller"       "Biography, Drama, Music"
## [41] "Mystery, Thriller"            "Action, Adventure, Drama"
## [43] "Drama"                        "Animation, Adventure, Drama"
## [45] "Action, Sci-Fi"               "Drama"
## [47] "Animation, Drama, War"        "Adventure, Comedy, Sci-Fi"
## [49] "Action, Adventure"            "Drama, War"
Rating
The rating of the movie can be located in the following way:
- part of the section identified by the class .ratings-imdb-rating
- nested within the section identified by the class .ratings-bar
- the rating is present within the <strong> tag as well as in the data-value attribute
- instead of using html_text(), we will use html_attr() to extract the value of the attribute data-value
Try using html_text() and see what happens! You may include the <strong> tag or the classes associated with <span> tag as well.
imdb %>%
  html_nodes(".ratings-bar .ratings-imdb-rating") %>%
  html_attr("data-value")
##  [1] "9.3" "9.2" "9"   "9"   "8.9" "8.9" "8.9" "8.9" "8.9" "8.8" "8.8"
## [12] "8.8" "8.8" "8.7" "8.7" "8.7" "8.7" "8.7" "8.7" "8.6" "8.6" "8.6"
## [23] "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.5" "8.5"
## [34] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5"
## [45] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5"
Since rating is returned as a character vector, we will use as.numeric() to convert it into a number.
imdb %>%
  html_nodes(".ratings-bar .ratings-imdb-rating") %>%
  html_attr("data-value") %>%
  as.numeric() -> movie_rating
movie_rating
##  [1] 9.3 9.2 9.0 9.0 8.9 8.9 8.9 8.9 8.9 8.8 8.8 8.8 8.8 8.7 8.7 8.7 8.7
## [18] 8.7 8.7 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5 8.5 8.5
## [35] 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5
XPATH
To extract votes from the web page, we will use a different technique. In this case, we will use xpath and attributes to locate the total number of votes received by the top 50 movies.
xpath is specified using the following:
- tag
- attribute name
- attribute value
Votes
In case of votes, they are the following:
- tag: <meta>
- attribute name: itemprop
- attribute value: ratingCount
Next, we are not looking to extract the text value as we did in the previous examples using html_text(). Here, we need to extract the value assigned to the content attribute within the <meta> tag using html_attr().
imdb %>%
  html_nodes(xpath = '//meta[@itemprop="ratingCount"]') %>%
  html_attr('content')
##  [1] "2072893" "1422292" "2038787" "987020"  "1475650" "1621033" "1074273"
##  [8] "615219"  "585562"  "1817393" "1658750" "1492209" "1589127" "1334563"
## [15] "1489071" "895033"  "1040130" "822277"  "280024"  "1276946" "637716"
## [22] "549410"  "1096231" "1000909" "545280"  "897576"  "1271530" "913352"
## [29] "1118817" "1109777" "352837"  "39132"   "118413"  "174125"  "617621"
## [36] "605417"  "666327"  "1052901" "1064050" "633675"  "1021511" "1198326"
## [43] "941917"  "823238"  "897607"  "198398"  "192715"  "923178"  "803033"
## [50] "542311"
Finally, we convert the votes to a number using as.numeric().
imdb %>%
  html_nodes(xpath = '//meta[@itemprop="ratingCount"]') %>%
  html_attr('content') %>%
  as.numeric() -> movie_votes
movie_votes
##  [1] 2072893 1422292 2038787  987020 1475650 1621033 1074273  615219
##  [9]  585562 1817393 1658750 1492209 1589127 1334563 1489071  895033
## [17] 1040130  822277  280024 1276946  637716  549410 1096231 1000909
## [25]  545280  897576 1271530  913352 1118817 1109777  352837   39132
## [33]  118413  174125  617621  605417  666327 1052901 1064050  633675
## [41] 1021511 1198326  941917  823238  897607  198398  192715  923178
## [49]  803033  542311
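The tag / attribute name / attribute value pattern can be tried out on a small fragment without touching the website. In the sketch below the markup is made up, and the attribute name itemprop is an assumption about how the vote count is marked up; only the value ratingCount and the content attribute are taken from the text above.

```r
library(rvest)

# Hypothetical fragment; `itemprop` is an assumed attribute name
html <- paste0(
  '<div><meta itemprop="ratingCount" content="2072893">',
  '<meta itemprop="bestRating" content="10"></div>',
  '<div><meta itemprop="ratingCount" content="1422292"></div>'
)

# //tag[@attribute-name="attribute-value"], then read the content attribute
votes <- read_html(html) %>%
  html_nodes(xpath = '//meta[@itemprop="ratingCount"]') %>%
  html_attr("content") %>%
  as.numeric()
votes
```

The leading // means the tag may sit anywhere in the document, and the condition in square brackets keeps only the <meta> tags whose attribute has the requested value.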
Revenue
We wanted to extract both revenue and votes without using xpath but the way in which they are structured in the HTML code forced us to use xpath to extract votes. If you look at the HTML code, both votes and revenue are located inside the same tag with the same attribute name and value i.e. there is no distinct way to identify either of them.
In case of revenue, the xpath details are as follows:
- tag: <span>
- attribute name: name
- attribute value: nv
Next, we will use html_text() to extract the revenue.
imdb %>%
  html_nodes(xpath = '//span[@name="nv"]') %>%
  html_text()
##  [1] "2,072,893" "$28.34M"   "1,422,292" "$134.97M"  "2,038,787"
##  [6] "$534.86M"  "987,020"   "$57.30M"   "1,475,650" "$377.85M"
## [11] "1,621,033" "$107.93M"  "1,074,273" "$96.07M"   "615,219"
## [16] "$6.10M"    "585,562"   "$4.36M"    "1,817,393" "$292.58M"
## [21] "1,658,750" "$37.03M"   "1,492,209" "$315.54M"  "1,589,127"
## [26] "$330.25M"  "1,334,563" "$342.55M"  "1,489,071" "$171.48M"
## [31] "895,033"   "$46.84M"   "1,040,130" "$290.48M"  "822,277"
## [36] "$112.00M"  "280,024"   "$0.27M"    "1,276,946" "$188.02M"
## [41] "637,716"   "$7.56M"    "549,410"   "$10.06M"   "1,096,231"
## [46] "$216.54M"  "1,000,909" "$136.80M"  "545,280"   "$57.60M"
## [51] "897,576"   "$23.34M"   "1,271,530" "$100.13M"  "913,352"
## [56] "$19.50M"   "1,118,817" "$130.74M"  "1,109,777" "$322.74M"
## [61] "352,837"   "39,132"    "$1.19M"    "118,413"   "$12.39M"
## [66] "174,125"   "$190.24M"  "617,621"   "$678.82M"  "605,417"
## [71] "$13.09M"   "666,327"   "$13.18M"   "1,052,901" "$53.09M"
## [76] "1,064,050" "$132.38M"  "633,675"   "$32.57M"   "1,021,511"
## [81] "$25.54M"   "1,198,326" "$187.71M"  "941,917"   "$6.72M"
## [86] "823,238"   "$312.90M"  "897,607"   "$204.84M"  "198,398"
## [91] "$11.99M"   "192,715"   "923,178"   "$210.61M"  "803,033"
## [96] "$248.16M"  "542,311"   "$83.47M"
To extract the revenue as a number, we need to do some string hacking as follows:
- extract values that begin with $
- omit missing values
- convert values to character using as.character()
- append NA where revenue is missing (rank 31 and 47)
- remove $ and M
- convert to number using as.numeric()
imdb %>%
  html_nodes(xpath = '//span[@name="nv"]') %>%
  html_text() %>%
  str_extract(pattern = "^\\$.*") %>%
  na.omit() %>%
  as.character() %>%
  append(values = NA, after = 30) %>%
  append(values = NA, after = 46) %>%
  str_sub(start = 2, end = nchar(.) - 1) %>%
  as.numeric() -> movie_revenue
movie_revenue
##  [1]  28.34 134.97 534.86  57.30 377.85 107.93  96.07   6.10   4.36 292.58
## [11]  37.03 315.54 330.25 342.55 171.48  46.84 290.48 112.00   0.27 188.02
## [21]   7.56  10.06 216.54 136.80  57.60  23.34 100.13  19.50 130.74 322.74
## [31]     NA   1.19  12.39 190.24 678.82  13.09  13.18  53.09 132.38  32.57
## [41]  25.54 187.71   6.72 312.90 204.84  11.99     NA 210.61 248.16  83.47
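The positions 30 and 46 in the append() calls are tied to which movies happen to lack revenue on the day the page is scraped. A possible alternative, sketched below on a made-up two-movie fragment, is to look inside each movie's section for a second span named nv and let missing ones become NA. It assumes that votes always come first and revenue second within a movie's section, which matches the output above but is not guaranteed by the page.

```r
library(rvest)

# Hypothetical fragment: the second movie has votes but no revenue
html <- paste0(
  '<div class="lister-item-content"><p>',
  '<span name="nv">2,072,893</span><span name="nv">$28.34M</span></p></div>',
  '<div class="lister-item-content"><p>',
  '<span name="nv">352,837</span></p></div>'
)
items <- read_html(html) %>% html_nodes(".lister-item-content")

# Second <span name="nv"> within each item; NA where there is none
revenue_text <- items %>%
  html_node(xpath = './/span[@name="nv"][2]') %>%
  html_text()

# Strip the $ and M, then convert to a number (NA stays NA)
movie_revenue_demo <- as.numeric(gsub("[$M]", "", revenue_text))
movie_revenue_demo
```

The result has one value per movie, so no manual patching of positions is needed.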
Putting it all together…
top_50 <- tibble(
  title = movie_title,
  release = movie_year,
  `runtime (mins)` = movie_runtime,
  genre = movie_genre,
  rating = movie_rating,
  votes = movie_votes,
  `revenue ($ millions)` = movie_revenue
)
top_50
## # A tibble: 50 x 7
##    title    release `runtime (mins)` genre   rating  votes `revenue ($ mil~
##    <chr>      <dbl>            <dbl> <chr>    <dbl>  <dbl>            <dbl>
##  1 The Sha~    1994              142 Drama      9.3 2.07e6            28.3
##  2 The God~    1972              175 Crime,~    9.2 1.42e6           135.
##  3 The Dar~    2008              152 Action~    9   2.04e6           535.
##  4 The God~    1974              202 Crime,~    9   9.87e5            57.3
##  5 The Lor~    2003              201 Advent~    8.9 1.48e6           378.
##  6 Pulp Fi~    1994              154 Crime,~    8.9 1.62e6           108.
##  7 Schindl~    1993              195 Biogra~    8.9 1.07e6            96.1
##  8 Il buon~    1966              161 Western    8.9 6.15e5             6.1
##  9 12 Angr~    1957               96 Drama      8.9 5.86e5             4.36
## 10 Incepti~    2010              148 Action~    8.8 1.82e6           293.
## # ... with 40 more rows
Top Websites
Unfortunately, we had to drop this case study as the HTML code changed while we were working on this blog post. Remember the second point in the things to keep in mind, where we warned that the design or underlying HTML code of a website may change at any time? It happened just as we were finalizing this post.
RBI Governors
In this case study, we are going to extract the list of RBI (Reserve Bank of India) Governors. The author of this blog post comes from an Economics background and as such was interested in knowing the professional background of the Governors prior to their taking charge at India’s central bank. We will extract the following details:
- name
- start of term
- end of term
- term (in days)
- background
robotstxt
Let us check if we can scrape the data from the Wikipedia website using paths_allowed() from the robotstxt package.
paths_allowed(
  paths = c("https://en.wikipedia.org/wiki/List_of_Governors_of_Reserve_Bank_of_India")
)
## en.wikipedia.org
## [1] TRUE
Since it has returned TRUE, we will go ahead and download the web page using read_html() from the xml2 package.
rbi_guv <- read_html("https://en.wikipedia.org/wiki/List_of_Governors_of_Reserve_Bank_of_India")
rbi_guv
## {xml_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...
List of Governors
The data in the Wikipedia page is luckily structured as a table and we can extract it using html_table().
rbi_guv %>%
  html_nodes("table") %>%
  html_table()
## [[1]]
##   Governor of the Reserve Bank of India
## 1 IncumbentShaktikanta Das, IASsince 12 December 2018; 3 months ago (2018-12-12)
## 2 Appointer
## 3 Term length
## 4 Constituting instrument
## 5 Inaugural holder
## 6 Formation
## 7 Deputy
## 8 Website
##   Governor of the Reserve Bank of India
## 1 IncumbentShaktikanta Das, IASsince 12 December 2018; 3 months ago (2018-12-12)
## 2 Appointments Committee of the Cabinet
## 3 Three years
## 4 Reserve Bank of India Act, 1934
## 5 Osborne Smith (1935–1937)
## 6 1 April 1935; 84 years ago (1935-04-01)
## 7 Deputy Governors of the Reserve Bank of India
## 8 rbi.org.in
##
## [[2]]
##    No. Officeholder         Portrait Term start        Term end
## 1  1   Osborne Smith        NA       1 April 1935      30 June 1937
## 2  2   James Braid Taylor   NA       1 July 1937       17 February 1943
## 3  3   C. D. Deshmukh       NA       11 August 1943ii  30 May 1949
## 4  4   Benegal Rama Rau     NA       1 July 1949       14 January 1957
## 5  5   K. G. Ambegaonkar    NA       14 January 1957   28 February 1957
## 6  6   H. V. R. Iyengar     NA       1 March 1957      28 February 1962
## 7  7   P. C. Bhattacharya   NA       1 March 1962      30 June 1967
## 8  8   Lakshmi Kant Jha     NA       1 July 1967       3 May 1970
## 9  9   B. N. Adarkar        NA       4 May 1970        15 June 1970
## 10 10  Sarukkai Jagannathan NA       16 June 1970      19 May 1975
## 11 11  N. C. Sen Gupta      NA       19 May 1975       19 August 1975
## 12 12  K. R. Puri           NA       20 August 1975    2 May 1977
## 13 13  M. Narasimham        NA       3 May 1977        30 November 1977
## 14 14  I. G. Patel          NA       1 December 1977   15 September 1982
## 15 15  Manmohan Singh       NA       16 September 1982 14 January 1985
## 16 16  Amitav Ghosh         NA       15 January 1985   4 February 1985
## 17 17  R. N. Malhotra       NA       4 February 1985   22 December 1990
## 18 18  S. Venkitaramanan    NA       22 December 1990  21 December 1992
## 19 19  C. Rangarajan        NA       22 December 1992  21 November 1997
## 20 20  Bimal Jalan          NA       22 November 1997  6 September 2003
## 21 21  Y. Venugopal Reddy   NA       6 September 2003  5 September 2008
## 22 22  D. Subbarao          NA       5 September 2008  4 September 2013
## 23 23  Raghuram Rajan       NA       4 September 2013  4 September 2016
## 24 24  Urjit Patel          NA       4 September 2016  11 December 2018
## 25 25  Shaktikanta Das      NA       12 December 2018  Incumbent
##    Term in office Background
## 1  821 days       Banker
## 2  2057 days      Indian Civil Service (ICS) officer
## 3  2150 days      ICS officer
## 4  2754 days      ICS officer
## 5  45 days        ICS officer
## 6  1825 days      ICS officer
## 7  1947 days      Indian Audit and Accounts Service officer
## 8  1037 days      ICS officer
## 9  42 days        Economist
## 10 1798 days      ICS officer
## 11 92 days        ICS officer
## 12 621 days
## 13 211 days       Career Reserve Bank of India officer
## 14 1749 days      Economist
## 15 851 days       Economist
## 16 20 days        Banker
## 17 2147 days      Indian Administrative Service (IAS) officer
## 18 730 days       IAS officer
## 19 1795 days      Economist
## 20 2114 days      Economist
## 21 1826 days      IAS officer
## 22 1825 days      IAS officer
## 23 1096 days      Economist
## 24 947 days       Economist
## 25 118 days       IAS officer
##    Prior office(s)
## 1  Managing Governor of the Imperial Bank of India
## 2  Deputy Governor of the Reserve Bank of India\n\nController of Currency
## 3  Deputy Governor of the Reserve Bank of India\nCustodian of Enemy Property
## 4  Ambassador of India to the United States\n\nAmbassador of India to Japan\n\nChairman of Bombay Port Trust
## 5  Finance Secretary
## 6  Chairman of the State Bank of India
## 7  Chairman of the State Bank of India\nSecretary in the Ministry of Finance
## 8  Secretary to the Prime Minister of India
## 9  Executive Director at the International Monetary Fund
## 10 Executive Director at the World Bank
## 11 Banking Secretary
## 12 Chairman and Managing Director of the Life Insurance Corporation
## 13 Deputy Governor of the Reserve Bank of India
## 14 Director of the London School of Economics\n\nDeputy Administrator of the United Nations Development Programme\nChief Economic Adviser to the Government of India
## 15 Secretary in the Ministry of Finance\n\nChief Economic Adviser to the Government of India
## 16 Deputy Governor of the Reserve Bank of India\n\nChairman of the Allahabad Bank
## 17 Finance Secretary\n\nExecutive Director at the International Monetary Fund
## 18 Finance Secretary
## 19 Deputy Governor of the Reserve Bank of India
## 20 Finance Secretary\n\nBanking Secretary\n\nChief Economic Adviser to the Government of India
## 21 Executive Director at the International Monetary Fund\n\nDeputy Governor of the Reserve Bank of India
## 22 Finance Secretary\n\nMember-Secretary of the Prime Minister's Economic Advisory Council
## 23 Chief Economic Adviser to the Government of India
## 24 Deputy Governor of the Reserve Bank
## 25 Member of the Fifteenth Finance Commission\nSherpa of India to the G20\nEconomic Affairs Secretary\nRevenue Secretary
##    Reference(s)
## 1  [1]
## 2  [2]
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## 17
## 18
## 19
## 20
## 21
## 22
## 23
## 24
## 25 [3][4][5]
##
## [[3]]
##   vte Governors of the Reserve Bank of India
## 1 NA
##   vte Governors of the Reserve Bank of India
## 1 Osborne Smith (1935–37)\nJames Braid Taylor (1937–43)\nC. D. Deshmukh (1943–49)\nBenegal Rama Rau (1949–57)\nK. G. Ambegaonkar (1957)\nH. V. R. Iyengar (1957–62)\nP. C. Bhattacharya (1962–67)\nLakshmi Kant Jha (1967–70)\nB. N. Adarkar (1970)\nS. Jagannathan (1970–75)\nN. C. Sen Gupta (1975)\nK. R. Puri (1975–77)\nM. Narasimham (1977)\nI. G. Patel (1977–82)\nManmohan Singh (1982–85)\nAmitav Ghosh (1985)\nR. N. Malhotra (1985–90)\nS. Venkitaramanan (1990–92)\nC. Rangarajan (1992–97)\nBimal Jalan (1997–2003)\nY. Venugopal Reddy (2003–08)\nDuvvuri Subbarao (2008–13)\nRaghuram Rajan (2013–16)\nUrjit Patel (2016–2018)\nShaktikanta Das (2018–Incumbent)
##   vte Governors of the Reserve Bank of India
## 1 Osborne Smith (1935–37)\nJames Braid Taylor (1937–43)\nC. D. Deshmukh (1943–49)\nBenegal Rama Rau (1949–57)\nK. G. Ambegaonkar (1957)\nH. V. R. Iyengar (1957–62)\nP. C. Bhattacharya (1962–67)\nLakshmi Kant Jha (1967–70)\nB. N. Adarkar (1970)\nS. Jagannathan (1970–75)\nN. C. Sen Gupta (1975)\nK. R. Puri (1975–77)\nM. Narasimham (1977)\nI. G. Patel (1977–82)\nManmohan Singh (1982–85)\nAmitav Ghosh (1985)\nR. N. Malhotra (1985–90)\nS. Venkitaramanan (1990–92)\nC. Rangarajan (1992–97)\nBimal Jalan (1997–2003)\nY. Venugopal Reddy (2003–08)\nDuvvuri Subbarao (2008–13)\nRaghuram Rajan (2013–16)\nUrjit Patel (2016–2018)\nShaktikanta Das (2018–Incumbent)
##   vte Governors of the Reserve Bank of India
## 1 NA
The output is a list of three tables and we are interested in the second one. Using extract2() from the magrittr package, we will extract the table containing the details of the Governors.
rbi_guv %>%
  html_nodes("table") %>%
  html_table() %>%
  extract2(2) -> profile
Sort
Let us arrange the data by number of days served. The Term in office column contains this information but it also includes the text days. Let us split this column into two columns, term and days, using separate() from tidyr and then select the columns Officeholder and term and arrange it in descending order using desc().
profile %>%
  separate(`Term in office`, into = c("term", "days")) %>%
  select(Officeholder, term) %>%
  arrange(desc(as.numeric(term)))
##    Officeholder         term
## 1  Benegal Rama Rau     2754
## 2  C. D. Deshmukh       2150
## 3  R. N. Malhotra       2147
## 4  Bimal Jalan          2114
## 5  James Braid Taylor   2057
## 6  P. C. Bhattacharya   1947
## 7  Y. Venugopal Reddy   1826
## 8  H. V. R. Iyengar     1825
## 9  D. Subbarao          1825
## 10 Sarukkai Jagannathan 1798
## 11 C. Rangarajan        1795
## 12 I. G. Patel          1749
## 13 Raghuram Rajan       1096
## 14 Lakshmi Kant Jha     1037
## 15 Urjit Patel          947
## 16 Manmohan Singh       851
## 17 Osborne Smith        821
## 18 S. Venkitaramanan    730
## 19 K. R. Puri           621
## 20 M. Narasimham        211
## 21 Shaktikanta Das      118
## 22 N. C. Sen Gupta      92
## 23 K. G. Ambegaonkar    45
## 24 B. N. Adarkar        42
## 25 Amitav Ghosh         20
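The same split-and-sort step can be tried on a small table without scraping anything. The sketch below uses three rows copied from the output above and, as a variation on the code in this post, passes convert = TRUE to separate() so that term comes back as a number and as.numeric() is not needed when sorting. The names profile_demo and sorted are made up for the example.

```r
library(dplyr)
library(tidyr)

# Three rows copied from the table above, enough to show the idea
profile_demo <- tibble(
  Officeholder = c("Osborne Smith", "James Braid Taylor", "K. G. Ambegaonkar"),
  `Term in office` = c("821 days", "2057 days", "45 days")
)

sorted <- profile_demo %>%
  # Split "821 days" into "821" and "days"; convert = TRUE makes term an integer
  separate(`Term in office`, into = c("term", "days"), convert = TRUE) %>%
  select(Officeholder, term) %>%
  arrange(desc(term))
sorted
```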
Backgrounds
What we are really interested in is the background of the Governors. Use count() from dplyr to look at the background of the Governors and the respective counts.
profile %>%
  count(Background)
## # A tibble: 9 x 2
##   Background                                      n
##   <chr>                                       <int>
## 1 ""                                              1
## 2 Banker                                          2
## 3 Career Reserve Bank of India officer            1
## 4 Economist                                       7
## 5 IAS officer                                     4
## 6 ICS officer                                     7
## 7 Indian Administrative Service (IAS) officer     1
## 8 Indian Audit and Accounts Service officer       1
## 9 Indian Civil Service (ICS) officer              1
Let us club some of the categories into Bureaucrats as they belong to the Indian Administrative/Civil Services. The missing data will be renamed as No Info. The category Career Reserve Bank of India officer is renamed as RBI Officer to make it more concise.
profile %>%
  pull(Background) %>%
  fct_collapse(
    Bureaucrats = c("IAS officer", "ICS officer",
                    "Indian Administrative Service (IAS) officer",
                    "Indian Audit and Accounts Service officer",
                    "Indian Civil Service (ICS) officer"),
    `No Info` = c(""),
    `RBI Officer` = c("Career Reserve Bank of India officer")
  ) %>%
  fct_count() %>%
  rename(background = f, count = n) -> backgrounds
backgrounds
## # A tibble: 5 x 2
##   background  count
##   <fct>       <int>
## 1 No Info         1
## 2 Banker          2
## 3 RBI Officer     1
## 4 Economist       7
## 5 Bureaucrats    14
Hmmm.. So there were more bureaucrats than economists.
backgrounds %>%
  ggplot() +
  geom_col(aes(background, count), fill = "blue") +
  xlab("Background") + ylab("Count") +
  ggtitle("Background of RBI Governors")
Summary
- web scraping is the extraction of data from web sites
- best for static & well structured HTML pages
- review robots.txt file
- HTML code can change any time
- if API is available, please use it
- do not overwhelm websites with requests