R for SEO Part 9: Web Scraping With R & Rvest
Hello, and welcome back. We’re (finally) in the home stretch of our R for SEO series with part nine, where we’re talking about scraping the web using R, particularly using the rvest package.
Today, we’re going to discuss the rvest package, look at the different scraping methods available, pull data from multiple pages and then look at how we can do so without bringing our target sites down. This is going to be fairly important for our final piece, so it’s worth paying attention today.
As always, this is a long piece, so feel free to use the table of contents below and do please sign up for my email list, where you’ll get updates of fresh content for free.
The Rvest Package
The rvest package for R – another Hadley Wickham creation – is the most commonly used web scraping package for the R language, and it’s easy to see why. It brings a lot of the Tidyverse’s tidy data outputs and notation functionality to what can be a complex element of data analysis and SEO. I’m a fan.
It’s not included in the Tidyverse package, so you’ll need to install it separately. You can do that like so:
install.packages("rvest") library(rvest)
This will install the rvest package and get it initialised. It’s also worth installing the Tidyverse as always.
library(tidyverse)
Now we have rvest installed, we can get scraping. But how do we find what to scrape from a page? That’s where we start needing to understand a bit about how scraping works and how to identify our data points.
XPath & CSS Selectors
XPath and CSS selectors are the most widely used ways of identifying elements on a page, and are both crucial to understand when it comes to scraping the web.
Personally, since I’m a little bit more old-school, I tend to default to XPath, but that’s not to say CSS selectors aren’t brilliant – they certainly help you write much more efficient code.
Let’s look at the differences between the two.
What Is XPath?
XPath – or XML Path Language – is a syntax used for selecting nodes in a document. Think of it like a way to pinpoint specific parts of an XML tree structure, like finding particular branches or leaves. XPath has been around a long time and is very powerful, and rvest has great support for it.
There are a number of ways that you can find the relevant XPath query that you’ll need to scrape your chosen elements from your pages, and we’ll cover that very shortly.
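To give a flavour of the syntax before we go any further, here's a small, self-contained sketch that runs a couple of common XPath queries against a made-up HTML snippet with rvest – the snippet and queries are purely illustrative, not taken from a real site:

exampleDoc <- read_html('
  <html>
    <head><title>Example Page</title></head>
    <body>
      <h1 class="entry-title">Hello</h1>
      <p>First paragraph</p>
      <p>Second paragraph</p>
    </body>
  </html>')

# "//title" finds the title node wherever it sits in the document
html_element(exampleDoc, xpath = "//title") %>% html_text()

# "//p" matches every paragraph; html_elements() returns them all
html_elements(exampleDoc, xpath = "//p") %>% html_text()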
What Are CSS Selectors?
CSS is the visual language of the web, used to style and structure web pages, and CSS selectors are the patterns that target which elements those styles apply to – but they can do much more than help create pretty sites. In R, you can leverage CSS selectors to extract almost any data you need from web pages.
Again, rvest has a lot of native support for CSS selectors, and they tend to be a lot more efficient to write queries with, as well as needing less debugging. Both methods are completely valid and, truthfully, while I tend to default to XPath, CSS selectors are finding their way into my work a lot more due to having to write less code. But what are the key differences?
XPath Vs CSS Selectors
Using Vs is a bit of a misnomer here – it’s not a fight, because they’re both worth using, but it’s worth understanding a little bit more about the differences between the two and when to use them.
In general terms – there are always exceptions – CSS selectors are:
- Easier & more efficient to write: A CSS selector is much shorter than an XPath query, and usually more intuitive, especially if you’ve been doing a lot of technical SEO work over your career
- Faster to run: Especially if you’re using browser-native CSS, which can make large-scale scraping projects much quicker
- Primarily designed for HTML pages: They’re not great at navigating up the DOM tree, and they’re not always so good for specific text elements
Conversely, XPath tends to be:
- More powerful: You can do very complex queries with XPath, which can allow you to navigate the DOM tree in both directions, as well as scraping elements from parts of the site that don’t have CSS attached – more on that shortly
- More complex: Due to the power, an XPath query will generally be longer and more complex to write and maintain – but don’t let that put you off
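To make that difference concrete – particularly the point about navigating back up the DOM tree – here's a quick illustrative sketch against a made-up HTML fragment, not a real site:

cardDoc <- read_html('
  <div class="card"><span class="price">£10</span></div>
  <div class="card featured"><span class="price">£20</span></div>')

# CSS selectors go down the tree easily enough
html_elements(cardDoc, ".card .price") %>% html_text()

# But only XPath can step back up from a match, e.g. from each price
# to its parent div, so we can see which card it belongs to
html_elements(cardDoc, xpath = "//span[@class='price']/parent::div") %>%
  html_attr("class")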
So now we know the two key methods we’re going to be using to identify the elements we’ll be scraping today, let’s talk about how to find them.
Finding CSS Selectors & XPath Queries
Obviously, it’s all well and good saying you want to use R to scrape a certain part of a page, such as the H1 or image alt text, but how do you actually find the elements you need? How do you identify the CSS selector without spending ages digging through the code and how do you go about putting an XPath query together?
Fortunately, there are a few very quick and easy ways to do that using Chrome. Let’s take a look at a couple of them.
The SelectorGadget Chrome Extension
SelectorGadget is a handy, free Chrome extension that I use a lot. It’s very quick and easy to use and can help you find your CSS selector or XPath at the click of a button. It’s not perfect – few things are – but I find it hits more than it misses.
Install it into your Chrome browser and then go to the page you want to scrape an element from. Let’s look at my last post in this series.
Now let’s say we want to scrape the article title.
Click the SelectorGadget extension and hover over the article title like so:
You’ll see that it’s highlighted the selector.
If you look at the SelectorGadget bar after clicking the title, you'll see it's picked up .entry-title.
And there we go, we have our CSS selector.
Told you it was easy.
Now if you click on the XPath button to the right of the extension, like so:
You’ll get a popup with the XPath query you need.
Again, these aren’t always perfect and sometimes won’t work the way they should, but they’ll give you a good starting point.
But what if, for some reason, you can’t install the extension, or you find it doesn’t work the way it should on certain sites? Fortunately, there’s another option that’s built right into Chrome and most other browsers.
Finding CSS Selectors & XPath Queries With Chrome Developer Tools
I’m sure if you’ve been working in SEO for a while, you’ve become very familiar with Chrome or other browsers’ Developer Tools window. I don’t think I go a day without it, but did you know that you can also use it to find the CSS Selectors or create an XPath query based on a specific element? You probably did, but let’s talk through it anyway.
Personally, I tend to use this more when I’m trying to build a scraper for an area that either doesn’t have CSS to it, such as something in the head, or when something stops SelectorGadget working effectively, but it’s definitely worth learning how to use it to help you as you get more familiar with scraping in R.
Go back to the page we discussed in the last section. Now highlight a part of the article title and right-click. Now click “Inspect”, like so:
This will bring up the familiar “Elements” window:
Now if we right click on the element we care about – our article title, in this case, you’ll see the following dialogue. If you hover over “Copy”, it’ll pop out with a few very handy options:
You can copy your CSS selector or your XPath query directly from here, ready to paste in your code. I told you it was handy!
Again, it’s also not perfect, but between developer tools and SelectorGadget, I can pretty much always find the right element notation for whatever I’m trying to scrape.
I’m conscious that we’re getting scarily close to 1,500 words and we haven’t written a single line of actual code yet! Let’s fix that by using R to scrape the article title with the CSS selector.
Scraping Article Titles With R & CSS Selectors
One of the things that is always worth remembering about using CSS selectors is that they will typically differ between websites, due to the fact that CSS styling tends to be individual to the site in question. That’s why tools like SelectorGadget or Chrome Dev Tools are so useful. Let’s write our first scraper using R’s rvest package to scrape the article title from my last post.
First, create an object of the URL to that post, like so:
scrapeURL <- "https://www.ben-johnston.co.uk/r-for-seo-part-8-apply-methods-in-r/"
Now we want to find our CSS selector. Do that with SelectorGadget, and our selector will be as follows:
.entry-title
Alright, we’re ready. Let’s create a very simple scraping call:
articleTitle <- read_html(scrapeURL) %>%
  html_element(".entry-title") %>%
  html_text()
Run that in your console, then type articleTitle and you should see the following output:
Easy, right? Let’s take a look at how it works.
Our First Scraper Command
Rvest was created by the same brain behind the tidyverse (you may have guessed that I’m a fan of Mr Wickham’s work by now), and you can see certain similarities in other commands that we’ve written throughout this series.
As always, let’s break it down:
- articleTitle <-: We're giving our object the incredibly inventive name of "articleTitle"
- read_html(scrapeURL): Now we're calling the read_html() function from rvest to download the HTML of the article URL we stored in scrapeURL
- %>%: We're using the tidyverse's pipe operator to chain multiple commands together – you'll have seen me use this throughout the series, but it's always good to remind ourselves
- html_element(".entry-title"): We're telling rvest that we want to look specifically for the element matching our CSS selector, ".entry-title"
- %>% html_text(): Finally, we're chaining in the html_text() command to show us only the text from the element we've selected
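One quick aside before we move on: html_element() only ever returns the first match for a selector. If you expect several matches – every H2 subheading in a post, say – the plural html_elements() should give you them all back as a character vector once you pipe into html_text(). Something like this, assuming the page uses standard h2 tags:

articleSubheadings <- read_html(scrapeURL) %>%
  html_elements("h2") %>%
  html_text()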
And there you have it – we’ve built our very first scraper. Now let’s look at how we can use XPath from R’s rvest package to scrape different elements.
Scraping Meta Titles With R & XPath
Now let’s look at how we can use XPath with R and the rvest package to pull the meta title from the same page. This is a nice, simple command, but hopefully it’ll give you an idea of how the power of XPath can be added to your web scraping work.
As before, we're going to use my last post as the target page, so our scrapeURL object is still valid.
Since we’re going for the meta title in this part rather than something visible on the page, SelectorGadget isn’t going to help us. Fortunately, we know how to use Chrome Dev Tools to find this.
Go to your target URL, right-click anywhere on the page and click "Inspect" to bring up the Elements window. Now find and expand the head element – the title element inside it is the one we want to focus on. Right-click it, hover over "Copy", select "Copy XPath" and paste it somewhere.
Now we want to use the following command:
articleMetaTitle <- read_html(scrapeURL) %>%
  html_element(, "//title") %>%
  html_text()
Not too dissimilar to our previous scraping command, is it? But there are a couple of key differences, which we’ll break down shortly.
After this runs, type articleMetaTitle into your console, and you should see the following:
And there we have it – our meta title pulled into our R environment. As always, let’s investigate the command.
Our Meta Title Scraping Command Broken Down
As you can see, using XPath isn't too dissimilar to using CSS selectors when we're scraping the web with R – albeit I've used the simplest XPath query I could for this example – but hopefully you're starting to see how it can be used in your SEO work.
Let’s break this command down:
- read_html(scrapeURL) %>%: As before, we’ve created our articleMetaTitle object and used read_html on our scrapeURL page object and then used the pipe command to link our command up with the next
- html_element(, "//title") %>%: This is where the key difference between using XPath and CSS selectors comes in – the comma. html_element() expects a CSS selector first and an XPath query second, so leaving the first argument empty with that comma means "//title" is treated as XPath rather than a CSS selector. In this case, we're using a very basic XPath query to scrape the title element, and then we're chaining to our next command
- html_text(): As before, we’re using html_text() to only show us the text of our scraped data
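If that bare comma feels a little cryptic, an equivalent approach is to name the argument explicitly – it does the same job, and arguably makes the intent clearer when you come back to the code later:

articleMetaTitle <- read_html(scrapeURL) %>%
  html_element(xpath = "//title") %>%
  html_text()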
This is quite easy, right? Building scrapers in R isn’t actually as complicated as it sounds, but obviously all we’ve done so far is pull one element at a time from a specific page.
Now let’s look at how we can pull multiple elements from a page.
Scraping Titles, Descriptions & H1s From A Page Using R
Obviously, when we think about scraping the web, and using a programming language like R to do it, we’re thinking about scraping more than one thing at a time and getting them to a useful dataframe. So far, we’ve looked at using specific elements, but the point is scale – so now, let’s take a look at how we can scrape multiple elements from a page using R and rvest.
We’re going to build a function to scrape the meta title, meta description and the H1 from my last post. After that, we’ll look at scraping multiple elements from multiple pages.
Firstly, we need to get our elements. You can do that with a combination of Chrome Dev Tools and SelectorGadget, and with the power of regular expressions, we can end up with the following list of elements:
- //title: The meta title XPath
- meta[name='description'], content: The CSS selector for the meta description tag, plus the content attribute that holds its text
- .entry-title: The selector for gathering the article title, the H1
Now we’ve got our list of elements, let’s get to work on our function. It’ll look a little something like this:
pageScrape <- function(x){
  pageContent <- read_html(x)
  metaTitle <- html_element(pageContent, , "//title") %>% html_text()
  metaDescription <- html_attr(html_element(pageContent, "meta[name='description']"), "content")
  heading <- html_element(pageContent, ".entry-title") %>% html_text()
  output <- data.frame(metaTitle, metaDescription, heading)
}
You can run this function like so:
pageElements <- pageScrape(scrapeURL)
And it’ll give you the following output:
Still pretty simple, right? As always, let’s break it down.
Our Multiple-Element Scraper Broken Down
As you can see, we’ve used a function to pull some different elements from the page using the various nodes we’ve identified, and it’s given us some useful SEO data. Let’s dig in to how this function works.
- pageScrape <- function(x){: We're creating our function, called pageScrape, which takes a single argument, x – the URL we want to scrape
- pageContent <- read_html(x): Here, we're getting the HTML of our target page into our environment using rvest's read_html() function
- metaTitle <- html_element(pageContent,, “//title”) %>% html_text(): As we saw with our earlier meta title scraping command, we’re using XPath to pull the title with “//title” and using html_text() to just get the text
- metaDescription <- html_attr(html_element(pageContent, “meta[name=’description’]”), “content”): Meta descriptions can often be one of the most annoying parts to scrape, and they sometimes differ between sites. In this case, we’re using html_attr() instead of html_text() because the meta description is stored within the content attribute
- heading <- html_element(pageContent, “.entry-title”) %>% html_text(): We’re re-using our H1 scraper from earlier, pulling the H1 using the .entry-title CSS selector
- output <- data.frame(metaTitle, metaDescription, heading): Finally, we’re creating our output dataframe that puts the meta title, meta description and H1 into separate columns
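The same pattern extends to any other element you care about. For example, if you also wanted each page's canonical URL (assuming the site declares one in a standard link rel="canonical" tag – this is a sketch rather than part of the function above), you'd just add another html_attr() line and an extra column:

pageScrapeExtended <- function(x){
  pageContent <- read_html(x)
  metaTitle <- html_element(pageContent, xpath = "//title") %>% html_text()
  metaDescription <- html_attr(html_element(pageContent, "meta[name='description']"), "content")
  heading <- html_element(pageContent, ".entry-title") %>% html_text()
  # The canonical URL lives in the href attribute of the link tag
  canonical <- html_attr(html_element(pageContent, "link[rel='canonical']"), "href")
  output <- data.frame(metaTitle, metaDescription, heading, canonical)
}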
Now let’s think about how we can run this across multiple pages.
Applying R Scrapers To Multiple Pages
If you’ve read my previous posts on using loops and apply methods in R, you’ll have an idea of how we can run this across multiple pages.
For consistency’s sake, let’s run it across all the pages on my site, since the elements will all be the same.
Firstly, we want to get all of our target URLs into an object. The easiest way to do this is to scrape my pages sitemap.
We can do that like so:
pagesSitemap <- read_html("https://www.ben-johnston.co.uk/page-sitemap.xml") %>%
  html_elements(, "//loc") %>%
  html_text()
Now if we use the reduce() function from the tidyverse as we did in part 8, we can scrape all of the elements we discussed previously from all of my pages like so:
pagesElements <- reduce(lapply(pagesSitemap, pageScrape), bind_rows)
I won’t break this particular one down, as it’s covered in depth in part 8.
You’ll see a few NAs in there because featured images are included in the sitemap, but you get the idea. For a more robust method, you could subset using the methods we learned all the way back in part 1, like so:
pagesSitemap <- subset(pagesSitemap, str_detect(pagesSitemap, "wp-content") == FALSE)
Now we’ll only have the html pages. If we run our scraper again, we’ll get the following:
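One practical caveat before we move on: with a whole sitemap's worth of URLs, a single failed request (a timeout, a deleted page) will stop the entire lapply() run. A small defensive sketch using purrr's possibly() – which comes with the tidyverse – wraps our scraper so that any failing URL is simply skipped rather than killing the whole job:

safePageScrape <- possibly(pageScrape, otherwise = NULL)

pagesElements <- reduce(lapply(pagesSitemap, safePageScrape), bind_rows)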
So now we know how to scrape multiple elements from multiple pages, let’s talk about how we do that politely.
Using The Polite R Package To Reduce Scrape Load
Scraping websites isn't always the most popular thing with site owners. Sometimes they don't want their content being used that way (I personally block ChatGPT for precisely that reason), and scraping lots of pages in quick succession can put a large amount of load on a server – sometimes costing the owner money, or even making them think their site is under attack.
The Polite package for R ensures that your scraper respects robots.txt and also helps you reduce the amount of load you’re putting on a server. I’m giving you the techniques to scrape websites with R here, but it would be remiss of me not to tell you that you should do so politely and respect the owners of the sites.
Lecture over, let’s get the Polite package installed.
install.packages("polite") library(polite)
Using The Polite R Package’s Bow Function
The Polite package has two key functions: bow and scrape. It is very polite. bow introduces your R session to the server and asks permission to scrape by checking the robots.txt file, while scrape runs the scraper itself.
The three tenets of a polite R scraping session are defined by the authors as “seeking permission, taking slowly and never asking twice” and that’s as good a definition of scraping the web as we’ll find.
Let’s introduce ourselves to the target site using bow:
session <- bow("https://www.ben-johnston.co.uk")
If we inspect our session object now, we’ll see the following:
This has introduced us to the website and checked the robots.txt. Now let’s update our previous multiple page scraper to do so politely.
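Before we do, it's worth knowing that bow() also lets you identify yourself and slow things down further via its user_agent and delay arguments – the values below are just placeholders, so swap in your own details:

session <- bow(
  "https://www.ben-johnston.co.uk",
  user_agent = "my-r-seo-scraper (you@example.com)",  # placeholder – identify yourself
  delay = 10                                          # wait at least 10 seconds between requests
)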
Using The Polite R Package’s Scrape Function On Multiple URLs
Now that we understand the principles of scraping politely, let's update our previous pageScrape function to work through our multiple URLs politely.
This isn’t overly complicated and our updated function is as follows:
politeScrape <- function(x){
  session <- bow(x)
  pageContent <- scrape(session)
  metaTitle <- html_element(pageContent, , "//title") %>% html_text()
  metaDescription <- html_attr(html_element(pageContent, "meta[name='description']"), "content")
  heading <- html_element(pageContent, ".entry-title") %>% html_text()
  output <- data.frame(metaTitle, metaDescription, heading)
}
It's similar to our previous scraper function, isn't it? But there's one key difference: rather than using read_html(), we're using the polite package's scrape() function and creating a new bow() for every page, ensuring that we introduce ourselves to each page, respect robots.txt and don't hit the site too hard.
We can run it like so:
politeElements <- reduce(lapply(pagesSitemap, politeScrape), bind_rows)
And if we inspect our politeElements object in the console, we should see the following:
So there we have it, that’s how you can use R’s rvest and polite packages to scrape multiple pages while being a good internet user.
Wrapping Up
This was quite a long piece, and we're reaching the end of my R for SEO series (although there will definitely be a couple of bonus entries). I hope you've enjoyed today's article on using R to scrape the web, that you've learned how to do so politely, and that you're seeing some applications for this in your SEO work.
Until next time, when I'll be talking about how we can build a reporting dashboard with R, Google Sheets and Google Looker Studio, using everything we've learned throughout this series.
Our Code From Today
# Install Packages

install.packages("rvest")
library(rvest)
library(tidyverse)

# Scrape Article Title With CSS Selector

scrapeURL <- "https://www.ben-johnston.co.uk/r-for-seo-part-8-apply-methods-in-r/"

articleTitle <- read_html(scrapeURL) %>%
  html_element(".entry-title") %>%
  html_text()

# Scrape Meta Titles With XPath

articleMetaTitle <- read_html(scrapeURL) %>%
  html_element(, "//title") %>%
  html_text()

# Scrape Multiple Elements

pageScrape <- function(x){
  pageContent <- read_html(x)
  metaTitle <- html_element(pageContent, , "//title") %>% html_text()
  metaDescription <- html_attr(html_element(pageContent, "meta[name='description']"), "content")
  heading <- html_element(pageContent, ".entry-title") %>% html_text()
  output <- data.frame(metaTitle, metaDescription, heading)
}

pageElements <- pageScrape(scrapeURL)

# Scrape Multiple Pages

pagesSitemap <- read_html("https://www.ben-johnston.co.uk/page-sitemap.xml") %>%
  html_elements(, "//loc") %>%
  html_text()

pagesSitemap <- subset(pagesSitemap, str_detect(pagesSitemap, "wp-content") == FALSE)

pagesElements <- reduce(lapply(pagesSitemap, pageScrape), bind_rows)

# Scraping Politely

install.packages("polite")
library(polite)

session <- bow("https://www.ben-johnston.co.uk")

politeScrape <- function(x){
  session <- bow(x)
  pageContent <- scrape(session)
  metaTitle <- html_element(pageContent, , "//title") %>% html_text()
  metaDescription <- html_attr(html_element(pageContent, "meta[name='description']"), "content")
  heading <- html_element(pageContent, ".entry-title") %>% html_text()
  output <- data.frame(metaTitle, metaDescription, heading)
}

politeElements <- reduce(lapply(pagesSitemap, politeScrape), bind_rows)
This post was written by Ben Johnston.