Web Scraping Product Data in R with rvest and purrr
This article comes from Joon Im, a student in Business Science University. Joon has completed both the 201 (Advanced Machine Learning with H2O) and 102 (Shiny Web Applications) courses. Joon shows off his progress in this Web Scraping Tutorial with rvest.
R Packages Covered:
- rvest & jsonlite: Web scraping HTML and working with JSON data
- purrr: Iteration through lists using map() and safely()
- stringr: Text manipulation
- ggplot2: Data visualization and understanding data
Scraping Website Data and Analyzing Specialized Bicycles
by Joon Im, Data Analyst with Instacart
Happy Monday everyone! I recently completed Part 2 of the Shiny Web Applications Course, DS4B 102-R, and decided to make my own price prediction app. The app predicts prices for potential new bike models based on existing product data.
Using techniques gleaned from Matt Dancho’s Learning Lab 8 on web-scraping with rvest to get data, I took on the challenge he mentioned there and scraped product data on bicycles from Specialized.com to create my own data set. (I highly encourage you to sign up for Learning Labs Pro: web-scraping with rvest has fundamentally changed the way I understand the Internet.) I also tried to match the website’s styling with some CSS tweaking, but I’m new to all that, so please bear with me if there are issues (e.g. fonts).
I welcome any questions and would appreciate any feedback. Thank you for your time, BSU community!
My Workflow
Here’s a diagram of the workflow I used to web scrape the Specialized Data and create an application:
- Start with the URL of Specialized Bicycles
- Use rvest and jsonlite to extract product data
- Clean up data into “tidy” format using purrr and stringr
- Visualize product prices with ggplot2
- Make a Shiny Web App using the Business Science 102 Course.
My Code Workflow for Web Scraping with rvest
My Shiny App
I built a shiny web application to recommend product prices of new bicycles, which you can try out: Specialized Product Price Recommendation Application.
I explain more details about how I built my shiny app in Section 5 – Predictive Web App.
Try out my Shiny App that Recommends Specialized Bicycle Prices using XGBoost
Tutorial – Web Scraping with rvest
This tutorial showcases how to web scrape websites using rvest and purrr. I’ll show how to collect data on the 2020 Specialized Bicycles Product Collection, a useful task in building a strategic database of product and competitive information for an organization.
1. Set Up
1.1 Introduction
Specialized® is a bicycle company founded by Mike Sinyard in 1974 from his hometown of Morgan Hill, California. They became known for creating the first production mountain bike back in 1981, called the Stumpjumper. Now they are building professional-grade bikes for riders around the world. Here’s a nice breakdown of different models on Bike Radar if you are interested in learning more.
Business Science is an online learning company founded by Matt Dancho in 2017 and is my favorite place to learn data science skills with R.
One great offering is their ongoing Learning Labs Pro series, which teaches additional skills such as time series forecasting, customer churn survival analysis, web-scraping and more.
In Learning Lab 8: Web Scraping — Build A Strategic Database With Product Data from Business Science, a challenge for students was issued to scrape product data on bikes from Specialized’s website. Today, we’re going to do just that.
1.2 Check Robots
Always look at the website’s robots.txt to check crawling permissions. Here’s Specialized’s robots.txt.
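As a rough sketch of this check using only base R: the helper name `robots_url()` is hypothetical, the example page path is made up, and the rules in `sample_rules` are illustrative text, not the contents of Specialized’s actual robots.txt.

```r
# Hypothetical helper: build the robots.txt address from any page URL on a site.
robots_url <- function(page_url) {
  site_root <- sub("^(https?://[^/]+).*$", "\\1", page_url)
  paste0(site_root, "/robots.txt")
}

robots_url("https://www.specialized.com/some/page")

# Illustrative rules only (NOT the real file). Against a live site you could
# read the file with readLines(robots_url(...)) instead of this vector.
sample_rules <- c(
  "User-agent: *",
  "Disallow: /cart",
  "Disallow: /checkout"
)

# Keep the paths that crawlers are asked to avoid.
disallow_lines <- grep("^Disallow:", sample_rules, value = TRUE)
disallowed <- sub("^Disallow:\\s*", "", disallow_lines)
disallowed
```

Reading the file in a browser works just as well; the point is simply to confirm that the pages you plan to scrape are not listed.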
1.3 Load Libraries
Let’s start with loading libraries that we know we will need.
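The original code listing is not reproduced here, so the following is a plausible set based on the packages named in this article. Loading tibble separately is an assumption (as_tibble() is used later), and the author may well have loaded a different combination.

```r
library(rvest)     # read_html(), html_nodes(), html_attr()
library(jsonlite)  # fromJSON()
library(purrr)     # map(), safely(), map_dfr(), pluck()
library(stringr)   # str_replace(), str_detect()
library(tibble)    # as_tibble()
library(ggplot2)   # plotting
```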
1.4 Check Out the Products
Let’s navigate to the “Bikes” Page for Specialized.
We can click “View All” to view all 399 bikes on a single page. This makes things a bit easier when it comes time to scrape so we don’t have to iterate over multiple pages.
Save the URL.
You can then use xopen() to open the URL in your default web browser.
1.5 Read HTML
Load the HTML code into an object using read_html(). We’ve just grabbed all of the HTML from that page.
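A minimal sketch of these two steps follows. The `url` value is a placeholder rather than the exact “View All” address, and a tiny stand-in page is parsed from a string so the example runs without a network connection.

```r
library(rvest)

# Placeholder: paste the "View All" address from your own browser here.
url <- "https://www.specialized.com/"

# Optional: open the page in your default browser (needs the xopen package).
# xopen::xopen(url)

# For a live scrape you would call read_html(url). To keep this sketch
# runnable offline, parse a small stand-in page supplied as a string.
stand_in_page <- "<html><body><h1>Bikes</h1><p>View All</p></body></html>"
html <- read_html(stand_in_page)

# Quick check that the document was parsed.
page_title <- html %>% html_nodes("h1") %>% html_text()
page_title
```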
2. Get the Raw Data
Use Chrome DevTools to locate the product information. In our case, there is a JSON-like dictionary containing what we need.
2.1 Locate Data with Chrome DevTools
Find the data by using the hover tool.
2.2 Find Product Data Nodes
Find the nodes where the product data lives.
2.3 Filter HTML to Isolate Nodes
Copy and paste the class into the html_nodes() function from the rvest library.
2.4 Find the Attribute That Contains the Data
2.5 Extract the Attribute Data
Extract the attributes with the html_attr() function and store them as a JSON object. Note that we’ll need to convert the JSON into a better format for analysis (more on this in a minute).
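The filter-then-extract pattern can be sketched on a small stand-in page. The class `product-link` and the attribute `data-product` are made-up names for illustration; use whatever class and attribute DevTools shows on the real page.

```r
library(rvest)

# Stand-in page: two product links carrying JSON in a data attribute.
# "product-link" and "data-product" are hypothetical names.
page <- read_html('
  <html><body>
    <a class="product-link" data-product=\'{"name":"Fuse Expert 29","price":"2150"}\'>Fuse</a>
    <a class="product-link" data-product=\'{"name":"S-Works Epic AXS","price":"11020"}\'>Epic</a>
    <a class="other-link">Not a product</a>
  </body></html>')

bike_data_list <- page %>%
  html_nodes(".product-link") %>%   # keep only nodes with the product class
  html_attr("data-product")         # pull the JSON string out of each node

bike_data_list
```

The result is a character vector with one JSON string per product node, which is why a conversion step is still needed.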
3. Format as Tidy Data with purrr
Tidy data is a tibble (data frame) that has one row for each of the Specialized Bike Models and columns for each of the features like model name, price, and various categories (denoted as dimensions).
3.1 Make a Function that Converts JSON to Tibble
This function is just a wrapper for fromJSON() from the jsonlite package. The only addition is converting the result to a tibble using as_tibble().
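A minimal version of such a wrapper might look like this. The function name comes from this article; the short JSON string is a trimmed example built from values shown in the tables here, not the full attribute from the site.

```r
library(jsonlite)
library(tibble)

# Convert one JSON string into a one-row tibble.
from_json_to_tibble <- function(json) {
  as_tibble(fromJSON(json))
}

# Trimmed example: the real attribute carries more fields than this.
json <- '{"name":"Fuse Expert 29","id":"171068","brand":"Specialized","price":"2150"}'
bike_tbl <- from_json_to_tibble(json)
bike_tbl
```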
We can run this on the first element of the list.
name | id | brand | price | currencyCode | position | variant | dimension1 | dimension2 | dimension3 | dimension4 | dimension5 | dimension6 | dimension7 | dimension8 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
S-Works Roubaix - SRAM Red eTap AXS | 171042 | Specialized | 11500 | USD | | 61 | Bikes | Road | Roubaix | Performance Road | S-Works | Men/Women | | |
3.2 Iterate to All JSON Elements
We’ll use map() to iteratively apply our from_json_to_tibble() function. If we just run this, the iterative conversion errors out, which is common in long-running iterative scripts. We can get around this using the safely() function, which isolates the errors and allows the iteration to continue (instead of grinding to a halt).
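Here is a small sketch of the map() plus safely() pattern. The two JSON strings are made up for illustration: the second contains an unescaped quote, so it fails to parse in the same way the problem bikes do.

```r
library(jsonlite)
library(tibble)
library(purrr)

from_json_to_tibble <- function(json) as_tibble(fromJSON(json))

# Made-up inputs: the second string has a stray quote and will not parse.
bike_data_list <- c(
  '{"name":"Fuse Expert 29","price":"2150"}',
  '{"name":"Example 20" bike"}'
)

# safely() wraps the function so each call returns list(result, error)
# instead of stopping the whole map() at the first failure.
bike_features_list <- map(bike_data_list, safely(from_json_to_tibble))

bike_features_list[[1]]$result   # a tibble
bike_features_list[[2]]$error    # the captured parse error
```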
3.3 Inspect First Converted Element
We can see that a list is returned with 2 elements for each item:
- $result: Contains the result. If conversion succeeds, we get a tibble. If error, we get NULL.
- $error: Contains the error message (if error). Otherwise, we get NULL.
3.4 Inspect for Errors
We are bound to get errors in this JSON conversion process for 399 bikes. Let’s check to see where errors occurred.
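One way to locate the failures is to test each item’s error slot. This sketch uses log() on a toy list as a stand-in for the JSON conversion, so the inputs are illustrative and not the bike data.

```r
library(purrr)

# Toy stand-in for the conversion step: log("a") fails, the others succeed.
results <- map(list(1, "a", 10), safely(log))

# TRUE where an error was captured.
has_error <- map_lgl(results, ~ !is.null(.x$error))

# Positions of the failing items.
error_positions <- which(has_error)
error_positions

# Inspect the first captured error message.
first_error <- pluck(results, error_positions[1], "error")
conditionMessage(first_error)
```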
3.5 What happened?
We got two errors: Bikes 222 and 288. We can use pluck() to grab the first error in the “value” column. It’s the result of an errant " symbol that represents inches.
We can get around this by replacing the ". Let’s re-run the code using the str_replace() function to replace the quote.
We get another error. There is an errant set of quotes around “BMX / Dirt Jump”. We can use str_replace() again to resolve. Success!
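To see why a stray quote breaks the conversion, here is a sketch on a made-up JSON string. The product name and the replacement text are illustrative; the exact pattern the author passed to str_replace() is not shown in this article.

```r
library(jsonlite)
library(stringr)

# Made-up JSON string: the inch mark after "20" ends the value early.
bad_json <- '{"name":"Example 20" bike"}'

# Parsing fails, so capture the error instead of stopping.
parse_attempt <- tryCatch(fromJSON(bad_json), error = function(e) e)
inherits(parse_attempt, "error")

# Replace the offending text literally (fixed() turns off regex matching),
# then parse again.
fixed_json <- str_replace(bad_json, fixed('20" bike'), "20in bike")
fixed_name <- fromJSON(fixed_json)$name
fixed_name
```

Note that str_replace() only changes the first match in each string; str_replace_all() is the variant to reach for if a string could contain several stray quotes.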
3.6 Run Again - Success - Errors Fixed!
We can try one more time, now using str_replace() to remove the quotes causing conversion errors, and map_dfr() to return a data frame stacked row-wise.
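The clean-then-convert pipeline can be sketched as follows. The inputs are made up, and the replacement pattern is an assumption standing in for whatever the real problem strings require. map_dfr() needs the dplyr package to be installed.

```r
library(jsonlite)
library(tibble)
library(purrr)
library(stringr)

from_json_to_tibble <- function(json) as_tibble(fromJSON(json))

# Made-up inputs: the second string contains an inch mark that breaks parsing.
bike_data_list <- c(
  '{"name":"Fuse Expert 29","price":"2150"}',
  '{"name":"Example 20" bike"}'
)

bike_features_tbl <- bike_data_list %>%
  str_replace(fixed('20" bike'), "20in bike") %>%  # clean before converting
  map_dfr(from_json_to_tibble)                      # stack the tibbles row-wise

bike_features_tbl
```

Fields missing from an item (here, `price` in the second string) come back as NA after the rows are stacked.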
name | id | brand | price | currencyCode | position | variant | dimension1 | dimension2 | dimension3 | dimension4 | dimension5 | dimension6 | dimension7 | dimension8 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
S-Works Roubaix - SRAM Red eTap AXS | 171042 | Specialized | 11500 | USD | | 61 | Bikes | Road | Roubaix | Performance Road | S-Works | Men/Women | | |
S-Works Roubaix - Shimano Dura-Ace Di2 | 170241 | Specialized | 11000 | USD | | 56 | Bikes | Road | Roubaix | Performance Road | S-Works | Men/Women | | |
S-Works Epic AXS | 171229 | Specialized | 11020 | USD | | S | Bikes | Mountain | Epic FSR/Epic | Cross Country | S-Works | Men/Women | | |
Stumpjumper EVO Comp Carbon 29 | 173494 | Specialized | 4520 | USD | | S3 | Bikes | Mountain | Stumpjumper EVO | Trail | | Men/Women | | |
Stumpjumper EVO Comp Carbon 27.5 | 173495 | Specialized | 4520 | USD | | S2 | Bikes | Mountain | Stumpjumper EVO | Trail | | Men/Women | | |
Fuse Expert 29 | 171068 | Specialized | 2150 | USD | | XS | Bikes | Mountain | Fuse | Trail | | Men/Women | | |
4. Explore Bike Models
I want to understand how price depends on various features like model, type of bike (electric, mountain, road), and other features that will eventually be used in my XGBoost Machine Learning model inside of my Shiny Web App.
4.1 Most and Least Expensive Bike Models
There’s a clear relationship between price and “Dimension 3” (bike model). We can see this visually.
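A sketch of such a plot, using only the six bikes shown in the table above rather than all 399. Assigning the model names to `dimension3` assumes the `position` column is blank in that table, and the plot styling here is a guess, not the author’s original chart.

```r
library(tibble)
library(ggplot2)

# The six bikes from the table above (name, price, and assumed dimension3).
bikes_tbl <- tribble(
  ~name,                                    ~price, ~dimension3,
  "S-Works Roubaix - SRAM Red eTap AXS",     11500, "Roubaix",
  "S-Works Roubaix - Shimano Dura-Ace Di2",  11000, "Roubaix",
  "S-Works Epic AXS",                        11020, "Epic FSR/Epic",
  "Stumpjumper EVO Comp Carbon 29",           4520, "Stumpjumper EVO",
  "Stumpjumper EVO Comp Carbon 27.5",         4520, "Stumpjumper EVO",
  "Fuse Expert 29",                           2150, "Fuse"
)

# Order models by median price so the most expensive sits at the top.
bikes_tbl$model <- reorder(bikes_tbl$dimension3, bikes_tbl$price, FUN = median)

p <- ggplot(bikes_tbl, aes(x = model, y = price)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "Model (dimension3)", y = "Price (USD)")

p
```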
4.2 S-Works Effect
I also noticed that “S-Works” is Specialized’s Premium Brand. We can update the ggplot2 visualization to segment bikes with “S-Works” in the model name to visually compare the “S-Works Effect”. I see that the S-Works bikes tend to have a higher median price than “non-S-Works”.
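The segmentation itself comes down to a string match on the model name. This sketch uses the six bikes shown in the table above, so the medians describe that small sample only and not the full data set.

```r
library(stringr)

# Names and prices of the six bikes from the table above.
bike_names <- c(
  "S-Works Roubaix - SRAM Red eTap AXS",
  "S-Works Roubaix - Shimano Dura-Ace Di2",
  "S-Works Epic AXS",
  "Stumpjumper EVO Comp Carbon 29",
  "Stumpjumper EVO Comp Carbon 27.5",
  "Fuse Expert 29"
)
prices <- c(11500, 11000, 11020, 4520, 4520, 2150)

# Flag bikes with "S-Works" in the model name.
is_s_works <- str_detect(bike_names, "S-Works")
segment <- ifelse(is_s_works, "S-Works", "non-S-Works")

# Median price per segment.
median_price <- tapply(prices, segment, median)
median_price
```

The same `segment` flag can be mapped to fill or facet in ggplot2 to compare the two groups visually.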
5. Predictive Web Application
I made and deployed a Product Price Recommendation Application for Specialized Bicycles using the web-scraped Specialized Data. Here’s how I built it:
- The Shiny app uses the web-scraped data from 2019 Specialized Models (this tutorial covers web-scraping 2020 models), which I learned in Learning Lab 8.
- I built the Shiny app using Part 2 of the Shiny Web Applications Course (DS4B 102-R), the 2nd course in the 3-Course R-Track.
- The shiny application uses an XGBoost Machine Learning model to recommend product prices based on the existing product portfolio.
- The code is available in my GitHub Repo Here.
Try out my Shiny App that Recommends Specialized Bicycle Prices using XGBoost
Parting Thoughts
Web-scraping with rvest has fundamentally changed the way I understand the Internet. Once I realized that the entire Internet (well, most of it) is basically just one big database, it rocked my world. I highly encourage you to sign up for Learning Labs Pro. Learning Lab 8 - Web Scraping - Build A Strategic Database With Product Data with rvest was what opened my eyes to the power of web scraping.
Using the data, I was able to make and deploy a Shiny web application that uses an XGBoost Machine Learning model to predict and recommend bicycle prices. This is just one way that businesses can use the strategic database. If you want to learn shiny, I highly recommend the Shiny Web Applications Course by Business Science. You can take it as part of the 3-Course R-Track Bundle offered by Business Science.
Other Student Articles You Might Enjoy
Here are two more Student Success Tutorials related to scraping data and building shiny applications.
- PDF Scraping in R with tabulizer - By Jennifer Cooper
- Build An R Shiny App - Wedding Risk Model - By Bryan Clark