Tutorial: Web Scraping of Multiple Pages using R
Introduction
In today’s world, data is being generated at an exponential rate. Individuals and tech giants alike depend on this massive amount of data and information in many ways.
So, having access to precise data in abundance will serve you well in any field, both for gaining insights and for performing further analysis. That is why web scraping has become a must-have skill, especially if you are a data scientist.
A great deal of data is available on the Internet today. But how do you scrape the data that might be useful to you? Well, that part is easier than it sounds. With today’s tools and programming languages, scraping data from the web is a fairly cushy job.
Let’s dive straight to the point.
Web Scraping?
Web scraping is a technique for converting the unorganized data that is usually available on the internet into an organized format so that it can be useful to us.
The most basic way of scraping data is the old-school method of COPY AND PASTE. To be honest, this method might sound easy-peasy, but it is taxing, monotonous, time-consuming and not at all fascinating.
But with a few lines of code the same job can be automated. So, let’s see how we can scrape data.
Web Scraping using R
Assuming that you have a basic knowledge of how R works and of its syntax, let’s get straight to this short tutorial, where I’ll show you how to scrape data from multiple pages at once using R.
For general text data scraping, you can visit: Basic Web Scraping
About the Data
From The Numbers, here lies the complete list of movies with their release dates, production budget and gross revenue information. The profit and loss figures are very rough estimates based on domestic and international box office earnings and domestic video sales, extrapolated to estimate worldwide income to the studio, after deducting retail costs.
Note: The movies’ data is in the tabular format.
These are the steps you need to follow:
- Open RStudio. Then, in a new file:
Package Installation
Install the required packages.
- xml2: xml2 is a wrapper around the comprehensive libxml2 C library that makes it easier to work with XML and HTML in R.
- rvest: rvest helps you scrape information from web pages.
- tibble: The tibble package provides utilities for handling tibbles, where “tibble” is a colloquial term for the S3 tbl_df class. The tbl_df class is a special case of the base data.frame.
# install.packages(c("xml2", "rvest", "tibble"))  # run once if the packages are not installed
library(xml2)
library(rvest)   ## very important
library(tibble)
- Storing the url of the first page of the table, with data of about 100 movies, in base_url:
base_url <- "https://www.the-numbers.com/movie/budgets/all"
- Scraping html content from the stored url:
base_webpage <- read_html(base_url)
- Now, if you open the second page of the table, you can see that /101 is present after all in the url. Similarly, there are many more pages with 100 movies each in the table, all with different urls.
So, should we store 100 urls for 100 pages for 10,000 movies? Of course not! We have certain string formatting styles. You can visit the documentation here.
Hence, for strings, we use %s as a placeholder that sprintf() later fills in with the page offset.
new_urls<- "https://www.the-numbers.com/movie/budgets/all/%s"
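As a rough sketch of how the %s placeholder gets filled in, the snippet below builds every page url in one go with sprintf(). The offsets 101 to 5501 are the ones the loop in this tutorial walks through; the variable names offsets and all_urls are made up for this illustration.

```r
new_urls <- "https://www.the-numbers.com/movie/budgets/all/%s"

# page offsets used in this tutorial: 101, 201, ..., 5501
offsets <- seq(101, 5501, by = 100)

# sprintf() is vectorised, so one call yields one url per offset
all_urls <- sprintf(new_urls, offsets)

head(all_urls, 2)
length(all_urls)
```

The same template therefore covers every page, and only the number at the end changes.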
- Creating dataframe of the first 100 movies:
html_table(): converts html tables into dataframes.
table_base <- rvest::html_table(base_webpage)[[1]] %>%
  tibble::as_tibble(.name_repair = "unique") # repair the repeated columns
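To see what html_table() and the name repair do without touching the network, here is a minimal sketch that parses a tiny hand-written table. The HTML string and its contents are invented for illustration and only assume that the real table, like this one, has an empty first header cell.

```r
library(xml2)
library(rvest)
library(tibble)

# a made-up table whose first header cell is empty
html <- '<table>
  <tr><th></th><th>Movie</th><th>Budget</th></tr>
  <tr><td>1</td><td>Film A</td><td>$10</td></tr>
  <tr><td>2</td><td>Film B</td><td>$20</td></tr>
</table>'

page <- read_html(html)       # read_html() also accepts literal html
tables <- html_table(page)    # a list with one element per <table>

# [[1]] picks the first table; "unique" gives the empty header a usable name
first <- as_tibble(tables[[1]], .name_repair = "unique")
names(first)
```

This would explain why the first column of the scraped data shows up under a repaired name rather than a blank one.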
- Creating dataframe of the next set of movies:
# empty dataframe that collects the scraped pages
df <- data.frame()
# iterator: the offset that appears in the url of the second page
i <- 101
# i takes the values 101, 201, ..., 5501, so the loop runs 55 times
# and collects about 5,500 movies (100 per page)
while (i < 5502) {
  new_webpage <- read_html(sprintf(new_urls, i))
  table_new <- rvest::html_table(new_webpage)[[1]] %>%
    tibble::as_tibble(.name_repair = "unique") # repair the repeated columns
  df <- rbind(df, table_new)
  i <- i + 100
}
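One possible variation on the loop, sketched here under a few assumptions: collect each page in a list, pause briefly between requests, and bind everything once at the end. The helper scrape_pages() and the offline stand-in fake_fetch() are hypothetical names invented for this example, and the rows they produce are placeholder data, not scraped results.

```r
# Collect one dataframe per url, pausing between requests, then bind once.
# `fetch` is any function that takes a url and returns a dataframe.
scrape_pages <- function(urls, fetch, pause = 1) {
  pages <- vector("list", length(urls))
  for (k in seq_along(urls)) {
    pages[[k]] <- fetch(urls[k])
    if (k < length(urls)) Sys.sleep(pause)  # be gentle with the server
  }
  do.call(rbind, pages)
}

# Offline stand-in for a real fetch (hypothetical, returns made-up rows)
fake_fetch <- function(url) {
  data.frame(source = url, row = 1:3, stringsAsFactors = FALSE)
}

urls <- sprintf("https://www.the-numbers.com/movie/budgets/all/%s",
                c(101, 201, 301))
all_pages <- scrape_pages(urls, fake_fetch, pause = 0)
nrow(all_pages)
```

For real scraping, fetch could be something like function(u) rvest::html_table(xml2::read_html(u))[[1]], which mirrors the calls used in the loop.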
- Merge table_base and df:
df_movies <- merge(table_base, df, all = TRUE)
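A side note on this step, shown with toy data: when both dataframes share all their columns, merge() with all = TRUE keeps every row but reorders them by the values of the shared columns, whereas rbind() simply stacks them in their original order. The two small dataframes below are invented for illustration; rbind() is offered only as an alternative you might consider, not as the method used in this tutorial.

```r
first_page <- data.frame(rank = c("1", "2"), movie = c("A", "B"),
                         stringsAsFactors = FALSE)
later_pages <- data.frame(rank = c("1,000", "101"), movie = c("C", "D"),
                          stringsAsFactors = FALSE)

# full join on all shared columns: rows come back sorted by their text values
merged <- merge(first_page, later_pages, all = TRUE)

# plain stacking: rows stay in the order they were scraped
stacked <- rbind(first_page, later_pages)

merged$rank
stacked$rank
```

Because the rank column is text, sorting it does not follow numeric order, which is worth keeping in mind when reading the merged result.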
- Let us see what our dataframe looks like:
head(df_movies)
##    ...1  ReleaseDate                              Movie ProductionBudget
## 1     1 Apr 23, 2019                  Avengers: Endgame     $400,000,000
## 2 1,000 Apr 28, 2000 The Flintstones in Viva Rock Vegas      $58,000,000
## 3 1,001  Apr 4, 2008                       Leatherheads      $58,000,000
## 4 1,002 Mar 22, 2017                               Life      $58,000,000
## 5 1,003 Dec 18, 2009    Did You Hear About the Morgans?      $58,000,000
## 6 1,004 Dec 12, 2008         Che, Part 1: The Argentine      $58,000,000
##   DomesticGross WorldwideGross
## 1  $858,373,000 $2,797,800,564
## 2   $35,231,365    $59,431,365
## 3   $31,373,938    $41,348,628
## 4   $30,234,022   $100,929,666
## 5   $29,580,087    $80,480,566
## 6    $1,802,521    $31,627,370
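As the output shows, the rank and money columns come back as text such as "1,000" and "$400,000,000". If you want numbers instead, a small helper along these lines could work; to_number() is a hypothetical name made up for this sketch, and the sample values are copied from the output above.

```r
# strip dollar signs and thousands separators, then convert to numeric
to_number <- function(x) as.numeric(gsub("[$,]", "", x))

budget <- c("$400,000,000", "$58,000,000")
rank <- c("1", "1,000")

to_number(budget)
to_number(rank)
```

You could then apply it to the relevant columns of df_movies before sorting or doing arithmetic on them.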
Voilà! We have accomplished our task.
- Now, if you want, you can create a csv file of this dataframe to store it on your system using:
write.csv(df_movies,"moviesData_tutorial.csv")
Conclusion
See, it is done. With a few lines of code, we were able to extract data from multiple pages using one single loop. The key idea of this tutorial is string formatting: a single url template with %s covers every page.
Stay tuned for more tutorials!
Thank You!