R for SEO Part 9: Web Scraping With R & Rvest

Hello, and welcome back. We’re (finally) in the home stretch of our R for SEO series with part nine, where we’re talking about scraping the web using R, particularly using the rvest package.

Today, we’re going to discuss the rvest package, look at the different scraping methods available, pull data from multiple pages and then look at how we can do so without bringing our target sites down. This is going to be fairly important for our final piece, so it’s worth paying attention today.

As always, this is a long piece, so feel free to use the table of contents below and do please sign up for my email list, where you’ll get updates of fresh content for free.

The Rvest Package

The rvest package for R – another Hadley Wickham creation – is the most commonly used web scraping package for the R language, and it’s easy to see why. It brings a lot of the Tidyverse’s tidy data outputs and notation functionality to what can be a complex element of data analysis and SEO. I’m a fan.

It’s not included in the Tidyverse package, so you’ll need to install it separately. You can do that like so:

install.packages("rvest")

library(rvest)

This will install the rvest package and get it initialised. It’s also worth loading the Tidyverse, as always.

library(tidyverse)

Now we have rvest installed, we can get scraping. But how do we find what to scrape from a page? That’s where we start needing to understand a bit about how scraping works and how to identify our data points.

XPath & CSS Selectors

XPath and CSS selectors are the most widely used ways of identifying elements on a page, and are both crucial to understand when it comes to scraping the web.

Personally, since I’m a little bit more old-school, I tend to default to XPath, but that’s not to say CSS selectors aren’t brilliant – they certainly help you write much more efficient code.

Let’s look at the differences between the two.

What Is XPath?

XPath – or XML Path Language – is a syntax used for selecting nodes in a document. Think of it like a way to pinpoint specific parts of an XML tree structure, like finding particular branches or leaves. XPath has been around a long time and is very powerful, and rvest has great support for it.

There are a number of ways that you can find the relevant XPath query that you’ll need to scrape your chosen elements from your pages, and we’ll cover that very shortly.
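XPath queries are just strings, so before we touch any R functions, here’s a quick feel for the syntax. These are illustrative patterns rather than queries we’ll need later in this piece:

# A few common XPath shapes (illustrative examples):
# //title                       every title element, wherever it sits in the document
# /html/head/title              an absolute path from the document root
# //meta[@name='description']   elements filtered by an attribute value
# //div[@id='content']//a       all links anywhere inside a matching div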

What Are CSS Selectors?

CSS selectors are the visual language of the web, used to style and structure web pages, but they can do much more than just create pretty sites. In R, you can leverage CSS selectors to extract almost any data you need from web pages.

Again, rvest has a lot of native support for CSS selectors, and they tend to be a lot more efficient to write queries with, as well as needing less debugging. Both methods are completely valid and, truthfully, while I tend to default to XPath, CSS selectors are finding their way into my work a lot more due to having to write less code. But what are the key differences?

XPath Vs CSS Selectors

Using “Vs” is a bit of a misnomer here – it’s not a fight, because they’re both worth using – but it helps to understand a little more about the differences between the two and when to use each.

In general terms – there are always exceptions – CSS selectors are:

  • Easier & more efficient to write: A CSS selector is much shorter than an XPath query, and usually more intuitive, especially if you’ve been doing a lot of technical SEO work over your career
  • Faster to run: Especially if you’re using browser-native CSS, which can make large-scale scraping projects much quicker
  • Primarily designed for HTML pages: They’re not great at navigating up the DOM tree, and they’re not always so good for specific text elements

Conversely, XPath tends to be:

  • More powerful: You can do very complex queries with XPath, which can allow you to navigate the DOM tree in both directions, as well as scraping elements from parts of the site that don’t have CSS attached – more on that shortly
  • More complex: Due to the power, an XPath query will generally be longer and more complex to write and maintain – but don’t let that put you off
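To make the trade-off concrete, here’s the same hypothetical element grabbed both ways, using rvest commands we’ll cover properly in a moment (the URL and class name are made up for illustration):

page <- read_html("https://example.com/post")

# CSS selector: short and readable
pageTitle <- page %>% html_element("h1.entry-title") %>% html_text()

# A roughly equivalent XPath query: longer, but the same language can also
# walk back up the tree or filter on things CSS can't reach
pageTitle <- page %>% html_element(xpath = "//h1[@class='entry-title']") %>% html_text()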

So now we know the two key methods we’re going to be using to identify the elements we’ll be scraping today, let’s talk about how to find them.

Finding CSS Selectors & XPath Queries

Obviously, it’s all well and good saying you want to use R to scrape a certain part of a page, such as the H1 or image alt text, but how do you actually find the elements you need? How do you identify the CSS selector without spending ages digging through the code and how do you go about putting an XPath query together?

Fortunately, there are a few very quick and easy ways to do that using Chrome. Let’s take a look at a couple of them.

The SelectorGadget Chrome Extension

SelectorGadget is a handy, free Chrome extension that I use a lot. It’s very quick and easy to use and can help you find your CSS selector or XPath at the click of a button. It’s not perfect – few things are – but I find it hits more than it misses.

Install it into your Chrome browser and then go to the page you want to scrape an element from. Let’s look at my last post in this series.

Now let’s say we want to scrape the article title.

Click the SelectorGadget extension and hover over the article title like so:

SelectorGadget highlighting article title

You’ll see that it’s highlighted the selector.

If you look into the SelectorGadget bar and click the title, it’ll put the following in there:

SelectorGadget CSS selector highlight

And there we go, we have our CSS selector.

Told you it was easy.

Now if you click on the XPath button to the right of the extension, like so:

XPath highlight in SelectorGadget

You’ll get a popup with the XPath query you need.

XPath popup from SelectorGadget

Again, these aren’t always perfect and sometimes won’t work the way they should, but they’ll give you a good starting point.

But what if, for some reason, you can’t install the extension, or you find it doesn’t work the way it should on certain sites? Fortunately, there’s another option that’s built right into Chrome and most other browsers.

Finding CSS Selectors & XPath Queries With Chrome Developer Tools

I’m sure if you’ve been working in SEO for a while, you’ve become very familiar with Chrome or other browsers’ Developer Tools window. I don’t think I go a day without it, but did you know that you can also use it to find the CSS Selectors or create an XPath query based on a specific element? You probably did, but let’s talk through it anyway.

Personally, I tend to use this more when I’m trying to build a scraper for an area that doesn’t have CSS attached, such as something in the head, or when something stops SelectorGadget working effectively, but it’s definitely worth learning how to use it to help you as you get more familiar with scraping in R.

Go back to the page we discussed in the last section, highlight a part of the article title and right-click. Then click “Inspect”, like so:

Using Chrome Developer Tools on an article title

This will bring up the familiar “Elements” window:

Chrome Developer Tools highlights article title

Now right-click on the element we care about – our article title, in this case – and you’ll see the following dialogue. If you hover over “Copy”, it’ll pop out with a few very handy options:

Chrome Developer Tools find CSS Selector or XPath

You can copy your CSS selector or your XPath query directly from here, ready to paste in your code. I told you it was handy!

Again, it’s also not perfect, but between developer tools and SelectorGadget, I can pretty much always find the right element notation for whatever I’m trying to scrape.

I’m conscious that we’re getting scarily close to 1,500 words and we haven’t written a single line of actual code yet! Let’s fix that by using R to scrape the article title with the CSS selector.

Scraping Article Titles With R & CSS Selectors

One thing that is always worth remembering about CSS selectors is that they typically differ between websites, since CSS styling tends to be individual to the site in question. That’s why tools like SelectorGadget or Chrome Dev Tools are so useful. Let’s write our first scraper using R’s rvest package to scrape the article title from my last post.

First, create an object of the URL to that post, like so:

scrapeURL <- "https://www.ben-johnston.co.uk/r-for-seo-part-8-apply-methods-in-r/"

Now we want to find our CSS selector. Do that with SelectorGadget, and our selector will be as follows:

.entry-title

Alright, we’re ready. Let’s create a very simple scraping call:

articleTitle <- read_html(scrapeURL) %>% html_element(".entry-title") %>% 
  html_text()

Run that in your console, then type articleTitle and you should see the following output:

R scraping article title output

Easy, right? Let’s take a look at how it works.

Our First Scraper Command

Rvest was created by the same brain behind the tidyverse (you may have guessed that I’m a fan of Mr Wickham’s work by now), and you can see certain similarities to other commands that we’ve written throughout this series.

As always, let’s break it down:

  • articleTitle <-: We’re giving our object the incredibly inventive name of “articleTitle”
  • read_html(scrapeURL): Now we’re calling the read_html() function from rvest to download the html of our article that we put into scrapeURL
  • %>%: We’re using the tidyverse’s pipe operator to chain multiple commands into one – you’ll have seen me use this throughout this series, but it’s always good to remind ourselves
  • html_element(“.entry-title”): We’re telling rvest that we want to look specifically for the element (our CSS selector in this case) “.entry-title”
  • %>% html_text(): Finally, we’re chaining in the html_text() command to only show us the text from the element we’ve selected
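One thing worth flagging before we move on: html_element() only ever returns the first match. Its plural sibling, html_elements(), returns every match, which is handy if, say, you wanted all of the subheadings from the post – a quick sketch, assuming they’re plain h2 tags:

articleH2s <- read_html(scrapeURL) %>% html_elements("h2") %>% 
  html_text()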

And there you have it – we’ve built our very first scraper. Now let’s look at how we can use XPath from R’s rvest package to scrape different elements.

Scraping Meta Titles With R & XPath

Now let’s look at how we can use XPath with R and the rvest package to pull the meta title from the same page. This is a nice, simple command, but hopefully it’ll give you an idea of how the power of XPath can be added to your web scraping work.

As before, we’re going to use my last post as the target page, so our scrapeURL object is still valid.

Since we’re going for the meta title in this part rather than something visible on the page, SelectorGadget isn’t going to help us. Fortunately, we know how to use Chrome Dev Tools to find this.

Go to your target URL and right click anywhere on the page to bring up the elements window. Now find and expand the head element.

We want to focus on the title element here, so find that and right click on it, like so:

Copying meta title XPath in Chrome Developer Tools

Select “Copy XPath” and paste it somewhere.

Now we want to use the following command:

articleMetaTitle <- read_html(scrapeURL) %>% html_element(, "//title") %>% 
  html_text()

Not too dissimilar to our previous scraping command, is it?
But there are a couple of key differences, which we’ll break down shortly.

After this runs, type articleMetaTitle into your console, and you should see the following:

Article meta title scraped in R console

And there we have it – our meta title pulled into our R environment. As always, let’s investigate the command.

Our Meta Title Scraping Command Broken Down

As you can see, using XPath isn’t too dissimilar to using CSS selectors when we’re scraping the web with R – albeit I’ve used the simplest XPath query I could for this example – but hopefully you’re starting to see how it can be added to your SEO work.

Let’s break this command down:

  • read_html(scrapeURL) %>%: As before, we’ve created our articleMetaTitle object, used read_html() on our scrapeURL page object and then used the pipe operator to link this command up with the next
  • html_element(, “//title”) %>%: This is where the key difference between using XPath and CSS selectors comes in – the comma. Leaving the first (css) argument empty tells rvest that “//title” is an XPath query rather than a CSS selector. In this case, it’s a very basic query that grabs the title element, and then we chain to our next command
  • html_text(): As before, we’re using html_text() to only show us the text of our scraped data

This is quite easy, right? Building scrapers in R isn’t actually as complicated as it sounds, but so far all we’ve done is pull one element at a time from a specific page.
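As an aside, the bare comma is easy to mistype. html_element() also takes a named xpath argument that does exactly the same thing – a quick sketch:

articleMetaTitle <- read_html(scrapeURL) %>% 
  html_element(xpath = "//title") %>% 
  html_text()

Either form works; the named argument just makes the intent obvious when you come back to the code later.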
Now let’s look at how we can pull multiple elements from a page.

Scraping Titles, Descriptions & H1s From A Page Using R

Obviously, when we think about scraping the web with a programming language like R, we’re thinking about scraping more than one thing at a time and getting the results into a useful dataframe. So far, we’ve scraped specific elements one at a time, but the point is scale – so now, let’s take a look at how we can scrape multiple elements from a page using R and rvest.

We’re going to build a function to scrape the meta title, meta description and the H1 from my last post. After that, we’ll look at scraping multiple elements from multiple pages.

Firstly, we need to get our elements. With a combination of Chrome Dev Tools and SelectorGadget, we end up with the following list:

  • //title: The XPath query for the meta title
  • meta[name='description'], with the content attribute: The selector and attribute for the meta description
  • .entry-title: The CSS selector for the article title, the H1

Now we’ve got our list of elements, let’s get to work on our function. It’ll look a little something like this:

pageScrape <- function(x){
  
  pageContent <- read_html(x)
  
  metaTitle <- html_element(pageContent, , "//title") %>% html_text()
  
  metaDescription <- html_attr(html_element(pageContent, "meta[name='description']"), "content")
  
  heading <- html_element(pageContent, ".entry-title") %>% html_text()
  
  output <- data.frame(metaTitle, metaDescription, heading)
  
  output
  
}

You can run this function like so:

pageElements <- pageScrape(scrapeURL)

And it’ll give you the following output:

Multiple elements from a page scraped with R

Still pretty simple, right? As always, let’s break it down.

Our Multiple-Element Scraper Broken Down

As you can see, we’ve used a function to pull several different elements from the page using the nodes we identified, and it’s given us some useful SEO data.
Let’s dig into how this function works.

  • pageScrape <- function(x){: We’re creating our function called pageScrape with the x variable
  • pageContent <- read_html(x): Here, we’re getting the html of our target page into our environment using rvest’s read_html() function
  • metaTitle <- html_element(pageContent, , “//title”) %>% html_text(): As we saw with our earlier meta title scraping command, we’re using XPath to pull the title with “//title” and using html_text() to just get the text
  • metaDescription <- html_attr(html_element(pageContent, “meta[name=’description’]”), “content”): Meta descriptions can often be one of the most annoying parts to scrape, and they sometimes differ between sites. In this case, we’re using html_attr() instead of html_text() because the meta description is stored within the content attribute
  • heading <- html_element(pageContent, “.entry-title”) %>% html_text(): We’re re-using our H1 scraper from earlier, pulling the H1 using the .entry-title CSS selector
  • output <- data.frame(metaTitle, metaDescription, heading): Finally, we’re creating our output dataframe, which puts the meta title, meta description and H1 into separate columns

Now let’s think about how we can run this across multiple pages.

Applying R Scrapers To Multiple Pages

If you’ve read my previous posts on using loops and apply methods in R, you’ll have an idea of how we can run this across multiple pages.

For consistency’s sake, let’s run it across all the pages on my site, since the elements will all be the same.

Firstly, we want to get all of our target URLs into an object. The easiest way to do this is to scrape my pages sitemap, like so:

pagesSitemap <- read_html("https://www.ben-johnston.co.uk/page-sitemap.xml") %>% 
  html_elements(, "//loc") %>% html_text()

Now if we use the reduce() function from the tidyverse as we did in part 8, we can scrape all of the elements we discussed previously from all of my pages like so:

pagesElements <- reduce(lapply(pagesSitemap, pageScrape), bind_rows)

I won’t break this particular one down, as it’s covered in depth in part 8.

You’ll see a few NAs in there because featured images are included in the sitemap, but you get the idea. For a more robust method, you could subset using the methods we learned all the way back in part 1, like so:

pagesSitemap <- subset(pagesSitemap, str_detect(pagesSitemap, "wp-content") == FALSE)

Now we’ll only have the HTML pages, and if we run our scraper again, we’ll get the following:

Output of our multiple-element scraper across multiple pages
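One caveat worth knowing before you point this at a big sitemap: if a single URL fails – a 404, a timeout – the error will abort the whole lapply() run. Here’s a minimal defensive sketch (safeScrape is my own wrapper name, not part of rvest) that returns a row of NAs for a failed page instead:

safeScrape <- function(x){
  
  # Fall back to an all-NA row if pageScrape() errors on this URL
  tryCatch(pageScrape(x), 
           error = function(e) data.frame(metaTitle = NA, metaDescription = NA, heading = NA))
  
}

pagesElements <- reduce(lapply(pagesSitemap, safeScrape), bind_rows)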
So now we know how to scrape multiple elements from multiple pages, let’s talk about how we do that politely.

Using The Polite R Package To Reduce Scrape Load

Scraping websites isn’t always the most popular thing with site owners. Sometimes they don’t want their content being used that way (I personally block ChatGPT for precisely that reason), and scraping multiple pages can put a large amount of load on a server, sometimes costing the owner money or even making them think their site is under attack.

The Polite package for R ensures that your scraper respects robots.txt and also helps you reduce the amount of load you’re putting on a server. I’m giving you the techniques to scrape websites with R here, but it would be remiss of me not to tell you that you should do so politely and respect the owners of the sites.

Lecture over, let’s get the Polite package installed.

install.packages("polite")

library(polite)

Using The Polite R Package’s Bow Function

The Polite package has two key functions: bow and scrape. It is very polite. Bow introduces your R environment to the server and asks permission to scrape by checking the robots.txt file, and scrape runs the scraper.

The three tenets of a polite R scraping session are defined by the authors as “seeking permission, taking slowly and never asking twice”, and that’s as good a definition of polite web scraping as we’ll find.

Let’s introduce ourselves to the target site using bow:

session <- bow("https://www.ben-johnston.co.uk")

If we inspect our session object now, we’ll see the following:

A polite R scraping session from the R console

This has introduced us to the website and checked the robots.txt.
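bow() also lets you identify your scraper and slow your crawl rate. The user_agent and delay arguments are part of polite’s bow() function, although the values below are purely illustrative:

session <- bow("https://www.ben-johnston.co.uk", 
               user_agent = "bens-seo-scraper", 
               delay = 10)

The delay is the desired number of seconds between requests, and polite will use whichever is larger out of your value and the crawl delay mandated by the site’s robots.txt.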
Now let’s update our previous multiple-page scraper to do the same.

Using The Polite R Package’s Scrape Function On Multiple URLs

Updating our pageScrape function to scrape our multiple URLs politely isn’t overly complicated, and the new version is as follows:

politeScrape <- function(x){
  
  session <- bow(x)
  
  pageContent <- scrape(session)
  
  metaTitle <- html_element(pageContent, , "//title") %>% html_text()
  
  metaDescription <- html_attr(html_element(pageContent, "meta[name='description']"), "content")
  
  heading <- html_element(pageContent, ".entry-title") %>% html_text()
  
  output <- data.frame(metaTitle, metaDescription, heading)
  
  output
  
}

It’s similar to our previous scraper function, isn’t it?
But there’s one key difference: rather than using read_html(), we’re using the polite package’s scrape() function and creating a new bow() for every page, ensuring that we introduce ourselves to each page, respect robots.txt and don’t hit the site too hard.

We can run it like so:

politeElements <- reduce(lapply(pagesSitemap, politeScrape), bind_rows)

And if we inspect our politeElements object in the console, we should see the following:

Output of a polite R scraping session on multiple pages

So there we have it – that’s how you can use R’s rvest and polite packages to scrape multiple pages while being a good internet user.

Wrapping Up

This was quite a long piece, and we’re reaching the end of my R for SEO series (although there will definitely be a couple of bonus entries).
I hope you’ve enjoyed today’s article on using R to scrape the web, that you’ve learned to do so politely and that you’re seeing some applications for this in your SEO work.

Until next time, where I’ll be talking about how we can build a reporting dashboard with R, Google Sheets and Google Looker Studio, using everything we’ve learned throughout this series.

Our Code From Today

# Install Packages

install.packages("rvest")
library(rvest)
library(tidyverse)

# Scrape Article Title With CSS Selector

scrapeURL <- "https://www.ben-johnston.co.uk/r-for-seo-part-8-apply-methods-in-r/"

articleTitle <- read_html(scrapeURL) %>% html_element(".entry-title") %>% 
  html_text()

# Scrape Meta Titles With XPath

articleMetaTitle <- read_html(scrapeURL) %>% html_element(, "//title") %>% 
  html_text()

# Scrape Multiple Elements

pageScrape <- function(x){
  
  pageContent <- read_html(x)
  
  metaTitle <- html_element(pageContent, , "//title") %>% html_text()
  
  metaDescription <- html_attr(html_element(pageContent, "meta[name='description']"), "content")
  
  heading <- html_element(pageContent, ".entry-title") %>% html_text()
  
  output <- data.frame(metaTitle, metaDescription, heading)
  
  output
  
}

pageElements <- pageScrape(scrapeURL)

# Scrape Multiple Pages

pagesSitemap <- read_html("https://www.ben-johnston.co.uk/page-sitemap.xml") %>% 
  html_elements(, "//loc") %>% html_text()

pagesSitemap <- subset(pagesSitemap, str_detect(pagesSitemap, "wp-content") == FALSE)

pagesElements <- reduce(lapply(pagesSitemap, pageScrape), bind_rows)

# Scraping Politely

install.packages("polite")
library(polite)

session <- bow("https://www.ben-johnston.co.uk")

politeScrape <- function(x){
  
  session <- bow(x)
  
  pageContent <- scrape(session)
  
  metaTitle <- html_element(pageContent, , "//title") %>% html_text()
  
  metaDescription <- html_attr(html_element(pageContent, "meta[name='description']"), "content")
  
  heading <- html_element(pageContent, ".entry-title") %>% html_text()
  
  output <- data.frame(metaTitle, metaDescription, heading)
  
  output
  
}

politeElements <- reduce(lapply(pagesSitemap, politeScrape), bind_rows)
title="X" rel="nofollow" target="_blank"></a><a class="a2a_button_reddit" href="https://www.addtoany.com/add_to/reddit?linkurl=https%3A%2F%2Fwww.ben-johnston.co.uk%2Fr-for-seo-part-9-web-scraping-with-r-rvest%2F&linkname=R%20for%20SEO%20Part%209%3A%20Web%20Scraping%20With%20R%20%26%20Rvest" title="Reddit" rel="nofollow" target="_blank"></a><a class="a2a_button_facebook" href="https://www.addtoany.com/add_to/facebook?linkurl=https%3A%2F%2Fwww.ben-johnston.co.uk%2Fr-for-seo-part-9-web-scraping-with-r-rvest%2F&linkname=R%20for%20SEO%20Part%209%3A%20Web%20Scraping%20With%20R%20%26%20Rvest" title="Facebook" rel="nofollow" target="_blank"></a><a class="a2a_button_bluesky" href="https://www.addtoany.com/add_to/bluesky?linkurl=https%3A%2F%2Fwww.ben-johnston.co.uk%2Fr-for-seo-part-9-web-scraping-with-r-rvest%2F&linkname=R%20for%20SEO%20Part%209%3A%20Web%20Scraping%20With%20R%20%26%20Rvest" title="Bluesky" rel="nofollow" target="_blank"></a><a class="a2a_dd addtoany_share_save addtoany_share" href="https://www.addtoany.com/share#url=https%3A%2F%2Fwww.ben-johnston.co.uk%2Fr-for-seo-part-9-web-scraping-with-r-rvest%2F&%23038;title=R%20for%20SEO%20Part%209%3A%20Web%20Scraping%20With%20R%20%26%20Rvest" data-a2a-url="https://www.ben-johnston.co.uk/r-for-seo-part-9-web-scraping-with-r-rvest/" data-a2a-title="R for SEO Part 9: Web Scraping With R & Rvest" rel="nofollow" target="_blank"></a></p><p>This post was written by Ben Johnston on <a href="https://www.ben-johnston.co.uk/" rel="nofollow" target="_blank">Ben Johnston</a></p> <div id='jp-relatedposts' class='jp-relatedposts' > <h3 class="jp-relatedposts-headline"><em>Related</em></h3> </div><aside class="mashsb-container mashsb-main mashsb-stretched"><div class="mashsb-box"><div class="mashsb-buttons"><a class="mashicon-facebook mash-large mash-center mashsb-noshadow" href="https://www.facebook.com/sharer.php?u=https%3A%2F%2Fwww.r-bloggers.com%2F2025%2F01%2Fr-for-seo-part-9-web-scraping-with-r-rvest%2F" target="_blank" rel="nofollow"><span class="icon"></span><span class="text">Share</span></a><a class="mashicon-twitter mash-large mash-center mashsb-noshadow" href="https://twitter.com/intent/tweet?text=R%20for%20SEO%20Part%209%3A%20Web%20Scraping%20With%20R%20%26%20Rvest&url=https://www.r-bloggers.com/2025/01/r-for-seo-part-9-web-scraping-with-r-rvest/&via=Rbloggers" target="_blank" rel="nofollow"><span class="icon"></span><span class="text">Tweet</span></a><div class="onoffswitch2 mash-large mashsb-noshadow" style="display:none"></div></div> </div> <div style="clear:both"></div></aside> <!-- Share buttons by mashshare.net - Version: 4.0.47--> <div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;"> <div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://www.ben-johnston.co.uk/r-for-seo-part-9-web-scraping-with-r-rvest/"> R | Ben Johnston</a></strong>.</div> <hr /> <a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. 
<a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>. <hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't. </div> </div> </article><nav class="post-navigation clearfix" role="navigation"> <div class="post-nav left"> <a href="https://www.r-bloggers.com/2025/01/decomposing-within-and-between-person-effects-in-longitudinal-data-with-sem-in-r-workshop/" rel="prev">← Previous post</a></div> <div class="post-nav right"> </div> </nav> </div> <aside class="mh-sidebar sb-right"> <div id="custom_html-2" class="widget_text sb-widget widget_custom_html"><div class="textwidget custom-html-widget"> <div class="top-search" style="padding-left: 0px;"> <form id="searchform" action="http://www.google.com/cse" target="_blank"> <div> <input type="hidden" name="cx" value="005359090438081006639:paz69t-s8ua" /> <input type="hidden" name="ie" value="UTF-8" /> <input type="text" value="" name="q" id="q" autocomplete="on" style="font-size:16px;" placeholder="Search R-bloggers.." /> <input type="submit" id="searchsubmit2" name="sa" value="Go" style="font-size:16px;" /> </div> </form> </div> <!-- thanks: https://stackoverflow.com/questions/14981575/google-cse-with-a-custom-form https://stackoverflow.com/questions/10363674/change-size-of-text-in-text-input-tag --></div></div><div id="text-6" class="sb-widget widget_text"> <div class="textwidget"><div style="min-height:26px;border:1px solid #ccc;padding:3px;text-align:left; background: none repeat scroll 0 0 #FDEADA;"> <form style="width:202px; float:left;" action="https://r-bloggers.com/phplist/?p=subscribe&id=1" method="post" target="popupwindow"> <input type="text" style="width:110px" onclick="if (this.value == 'Your e-mail here') this.value = '';" value='Your e-mail here' name="email"/> <input type="hidden" value="RBloggers" name="uri"/><input type="hidden" name="loc" value="en_US"/><input type="submit" value="Subscribe" /> </form> <div> <a href="https://feeds.feedburner.com/RBloggers"><img src="https://www.r-bloggers.com/wp-content/plugins/jetpack/modules/lazy-images/images/1x1.trans.gif" style="height:17px;min-width:80px;class:skip-lazy;" alt data-recalc-dims="1" data-lazy-src="https://i1.wp.com/www.r-bloggers.com/wp-content/uploads/2020/07/RBloggers_feedburner_count_2020_07_01-e1593671704447.gif?w=578&ssl=1"><noscript><img src="https://i1.wp.com/www.r-bloggers.com/wp-content/uploads/2020/07/RBloggers_feedburner_count_2020_07_01-e1593671704447.gif?w=578&ssl=1" style="height:17px;min-width:80px;class:skip-lazy;" alt="" data-recalc-dims="1" /></noscript></a> </div> </div> <br/> <div> <script> function init() { var vidDefer = document.getElementsByTagName('iframe'); for (var i=0; i<vidDefer.length; i++) { if(vidDefer[i].getAttribute('data-src')) { vidDefer[i].setAttribute('src',vidDefer[i].getAttribute('data-src')); } } } window.onload = init; </script> <iframe allowtransparency="true" frameborder="0" scrolling="no" src="" data-src="//platform.twitter.com/widgets/follow_button.html?screen_name=rbloggers&data-show-count" style="width:100%; height:30px;"></iframe> <div id="fb-root"></div> <script async defer crossorigin="anonymous" src="https://connect.facebook.net/en_GB/sdk.js#xfbml=1&version=v7.0&appId=124112670941750&autoLogAppEvents=1" nonce="RysU23SE"></script> <div style="min-height: 154px;" 
class="fb-page" data-href="https://www.facebook.com/rbloggers/" data-tabs="" data-width="300" data-height="154" data-small-header="true" data-adapt-container-width="true" data-hide-cover="false" data-show-facepile="true"><blockquote cite="https://www.facebook.com/rbloggers/" class="fb-xfbml-parse-ignore"><a href="https://www.facebook.com/rbloggers/">R bloggers Facebook page</a></blockquote></div> <!-- <iframe src="" data-src="//www.facebook.com/plugins/likebox.php?href=http%3A%2F%2Fwww.facebook.com%2Fpages%2FR-bloggers%2F191414254890&width=300&height=155&show_faces=true&colorscheme=light&stream=false&border_color&header=false&appId=400430016676958" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:100%; height:140px;" allowTransparency="true"></iframe> --> <!-- <br/> <strong>If you are an R blogger yourself</strong> you are invited to <a href="https://www.r-bloggers.com/add-your-blog/">add your own R content feed to this site</a> (<strong>Non-English</strong> R bloggers should add themselves- <a href="https://www.r-bloggers.com/lang/add-your-blog">here</a>) --> </div></div> </div><div id="wppp-3" class="sb-widget widget_wppp"><h4 class="widget-title">Most viewed posts (weekly)</h4> <ul class='wppp_list'> <li><a href='https://www.r-bloggers.com/2022/01/how-to-install-and-update-r-and-rstudio/' title='How to install (and update!) R and RStudio'>How to install (and update!) R and RStudio</a></li> <li><a href='https://www.r-bloggers.com/2025/01/how-are-p-values-distributed-under-the-null/' title='How Are P-values Distributed Under The Null?'>How Are P-values Distributed Under The Null?</a></li> <li><a href='https://www.r-bloggers.com/2025/01/a-trip-from-variance-covariance-to-correlation-and-back/' title='A trip from variance-covariance to correlation and back'>A trip from variance-covariance to correlation and back</a></li> <li><a href='https://www.r-bloggers.com/2015/09/how-to-perform-a-logistic-regression-in-r/' title='How to perform a Logistic Regression in R'>How to perform a Logistic Regression in R</a></li> <li><a href='https://www.r-bloggers.com/2018/07/pca-vs-autoencoders-for-dimensionality-reduction/' title='PCA vs Autoencoders for Dimensionality Reduction'>PCA vs Autoencoders for Dimensionality Reduction</a></li> <li><a href='https://www.r-bloggers.com/2013/08/date-formats-in-r/' title='Date Formats in R'>Date Formats in R</a></li> <li><a href='https://www.r-bloggers.com/2025/01/shiny-in-production-2025/' title='Shiny in Production 2025'>Shiny in Production 2025</a></li> </ul> </div><div id="text-18" class="sb-widget widget_text"><h4 class="widget-title">Sponsors</h4> <div class="textwidget"><div style="min-height: 2055px;"> <script data-cfasync="false" type="text/javascript"> // https://support.cloudflare.com/hc/en-us/articles/200169436-How-can-I-have-Rocket-Loader-ignore-my-script-s-in-Automatic-Mode- // this must be placed higher. Otherwise it doesn't work. 
// data-cfasync="false" is for making sure cloudflares' rocketcache doesn't interfeare with this // in this case it only works because it was used at the original script in the text widget function createCookie(name,value,days) { var expires = ""; if (days) { var date = new Date(); date.setTime(date.getTime() + (days*24*60*60*1000)); expires = "; expires=" + date.toUTCString(); } document.cookie = name + "=" + value + expires + "; path=/"; } function readCookie(name) { var nameEQ = name + "="; var ca = document.cookie.split(';'); for(var i=0;i < ca.length;i++) { var c = ca[i]; while (c.charAt(0)==' ') c = c.substring(1,c.length); if (c.indexOf(nameEQ) == 0) return c.substring(nameEQ.length,c.length); } return null; } function eraseCookie(name) { createCookie(name,"",-1); } // no longer use async because of google // async async function readTextFile(file) { // Helps people browse between pages without the need to keep downloading the same // ads txt page everytime. This way, it allows them to use their browser's cache. var random_number = readCookie("ad_random_number_cookie"); if(random_number == null) { var random_number = Math.floor(Math.random()*100*(new Date().getTime()/10000000000)); createCookie("ad_random_number_cookie",random_number,1) } file += '?t='+random_number; var rawFile = new XMLHttpRequest(); rawFile.onreadystatechange = function () { if(rawFile.readyState === 4) { if(rawFile.status === 200 || rawFile.status == 0) { // var allText = rawFile.responseText; // document.write(allText); document.write(rawFile.responseText); } } } rawFile.open("GET", file, false); rawFile.send(null); } // readTextFile('https://raw.githubusercontent.com/Raynos/file-store/master/temp.txt'); readTextFile("https://www.r-bloggers.com/wp-content/uploads/text-widget_anti-cache.txt"); </script> </div></div> </div> <div id="recent-posts-3" class="sb-widget widget_recent_entries"> <h4 class="widget-title">Recent Posts</h4> <ul> <li> <a href="https://www.r-bloggers.com/2025/01/r-for-seo-part-9-web-scraping-with-r-rvest/" aria-current="page">R for SEO Part 9: Web Scraping With R & Rvest</a> </li> <li> <a href="https://www.r-bloggers.com/2025/01/decomposing-within-and-between-person-effects-in-longitudinal-data-with-sem-in-r-workshop/">Decomposing within and between person effects in longitudinal data with SEM in R workshop</a> </li> <li> <a href="https://www.r-bloggers.com/2025/01/ropensci-news-digest-january-2025/">rOpenSci News Digest, January 2025</a> </li> <li> <a href="https://www.r-bloggers.com/2025/01/december-2024-top-40-new-cran-packages/">December 2024 Top 40 New CRAN Packages</a> </li> <li> <a href="https://www.r-bloggers.com/2025/01/gradient-boosting-and-boostrap-aggregating-anything-alert-high-performance-part5-easier-install-and-rust-backend/">Gradient-Boosting and Boostrap aggregating anything (alert: high performance): Part5, easier install and Rust backend</a> </li> <li> <a href="https://www.r-bloggers.com/2025/01/cpp11-pull-requests-to-improve-the-integration-of-r-and-c/">Cpp11 pull requests to improve the integration of R and C++</a> </li> <li> <a href="https://www.r-bloggers.com/2025/01/coworking-mini-hackathon-for-first-time-contributors-2/">Coworking Mini-Hackathon for First-Time Contributors</a> </li> <li> <a href="https://www.r-bloggers.com/2025/01/a-trip-from-variance-covariance-to-correlation-and-back/">A trip from variance-covariance to correlation and back</a> </li> <li> <a href="https://www.r-bloggers.com/2025/01/dysons-algorithm-the-general-case/">Dyson’s Algorithm: The General 
Case</a> </li> <li> <a href="https://www.r-bloggers.com/2025/01/shiny-in-production-2025/">Shiny in Production 2025</a> </li> <li> <a href="https://www.r-bloggers.com/2025/01/how-are-p-values-distributed-under-the-null/">How Are P-values Distributed Under The Null?</a> </li> <li> <a href="https://www.r-bloggers.com/2025/01/ropensci-2024-highlights/">rOpenSci 2024 Highlights</a> </li> <li> <a href="https://www.r-bloggers.com/2025/01/monte-carlo-exam/">Monte Carlo [exam]</a> </li> <li> <a href="https://www.r-bloggers.com/2025/01/remembering-friedrich-fritz-leisch/">Remembering Friedrich “Fritz” Leisch</a> </li> <li> <a href="https://www.r-bloggers.com/2025/01/coworking-mini-hackathon-for-first-time-contributors/">Coworking Mini-Hackathon for First-Time Contributors</a> </li> </ul> </div><div id="rss-7" class="sb-widget widget_rss"><h4 class="widget-title"><a class="rsswidget" href="https://feeds.feedburner.com/Rjobs"><img class="rss-widget-icon" style="border:0" width="14" height="14" src="https://www.r-bloggers.com/wp-includes/images/rss.png" alt="RSS" /></a> <a class="rsswidget" href="https://www.r-users.com/">Jobs for R-users</a></h4><ul><li><a class='rsswidget' href='https://www.r-users.com/jobs/r-adoption-lead/'>R Adoption Lead</a></li><li><a class='rsswidget' href='https://www.r-users.com/jobs/r-system-developer-for-the-institute-of-marine-research-imr-bergen-vestland-norway/'>R System Developer for The Institute of Marine Research (IMR) @ Bergen, Vestland, Norway</a></li><li><a class='rsswidget' href='https://www.r-users.com/jobs/principal-machine-learning-engineer-new-york-united-states/'>Principal Machine Learning Engineer @ New York, United States</a></li><li><a class='rsswidget' href='https://www.r-users.com/jobs/statistical-programmer-for-i360-arlington-virginia-united-states/'>Statistical Programmer for i360 @ Arlington, Virginia, United States</a></li><li><a class='rsswidget' href='https://www.r-users.com/jobs/biostatistician-ii/'>Biostatistician II</a></li></ul></div><div id="rss-9" class="sb-widget widget_rss"><h4 class="widget-title"><a class="rsswidget" href="https://feeds.feedburner.com/Python-bloggers"><img class="rss-widget-icon" style="border:0" width="14" height="14" src="https://www.r-bloggers.com/wp-includes/images/rss.png" alt="RSS" /></a> <a class="rsswidget" href="https://python-bloggers.com/">python-bloggers.com (python/data-science news)</a></h4><ul><li><a class='rsswidget' href='https://python-bloggers.com/2024/12/exploratory-and-predictive-data-analysis-turning-passive-data-into-actionable-insights/'>Exploratory and Predictive Data Analysis: Turning Passive Data into Actionable Insights</a></li><li><a class='rsswidget' href='https://python-bloggers.com/2024/12/mastering-playwright-for-modern-browser-automation/'>Mastering Playwright for Modern Browser Automation</a></li><li><a class='rsswidget' href='https://python-bloggers.com/2024/12/scraping-and-not-modified-responses/'>Scraping and Not Modified Responses</a></li><li><a class='rsswidget' href='https://python-bloggers.com/2024/12/mastering-idempotency-in-data-analytics-ensuring-reliable-pipelines/'>Mastering Idempotency in Data Analytics: Ensuring Reliable Pipelines</a></li><li><a class='rsswidget' href='https://python-bloggers.com/2024/12/understanding-why-sklearn-pca-differs-from-scratch-implementations/'>Understanding Why Sklearn PCA Differs from Scratch Implementations</a></li><li><a class='rsswidget' href='https://python-bloggers.com/2024/11/how-to-calculate-z-scores-in-python/'>How to calculate 
In general terms – there are always exceptions – CSS selectors are:

- Easier & more efficient to write: A CSS selector is much shorter than an XPath query, and usually more intuitive, especially if you've been doing a lot of technical SEO work over your career
- Faster to run: Especially if you're using browser-native CSS, which can make large-scale scraping projects much quicker
- Primarily designed for HTML pages: They're not great at navigating up the DOM tree, and they're not always so good for specific text elements

Conversely, XPath tends to be:

- More powerful: You can do very complex queries with XPath, which allow you to navigate the DOM tree in both directions, as well as scrape elements from parts of the page that don't have CSS attached – more on that shortly
- More complex: Because of that power, an XPath query will generally be longer and trickier to write and maintain – but don't let that put you off; there's a quick side-by-side sketch below to make the difference concrete
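Here's that minimal sketch, selecting the same element both ways with rvest. The URL, class name and XPath here are placeholder assumptions about a typical blog page rather than anything we'll rely on later:

# Both calls target the same H1 – first by CSS selector, then by XPath
# (the URL and ".entry-title" class are placeholders – swap in your own)
page <- read_html("https://www.ben-johnston.co.uk/")

cssTitle <- page %>% html_element(".entry-title") %>% html_text()

xpathTitle <- page %>% html_element(xpath = "//h1[contains(@class, 'entry-title')]") %>% html_text()

Same result, two notations: the CSS version is shorter, while the XPath version spells out the tag it's matching and can be extended to walk the tree in either direction.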
So now we know the two key methods we're going to be using to identify the elements we'll be scraping today, let's talk about how to find them.

Finding CSS Selectors & XPath Queries

Obviously, it's all well and good saying you want to use R to scrape a certain part of a page, such as the H1 or image alt text, but how do you actually find the elements you need? How do you identify the CSS selector without spending ages digging through the code, and how do you go about putting an XPath query together?

Fortunately, there are a few quick and easy ways to do that using Chrome. Let's take a look at a couple of them.

The SelectorGadget Chrome Extension

SelectorGadget is a handy, free Chrome extension that I use a lot. It's very quick and easy to use and can help you find your CSS selector or XPath at the click of a button. It's not perfect – few things are – but I find it hits more than it misses.

Install it into your Chrome browser and then go to the page you want to scrape an element from. Let's look at my last post in this series.

Now let's say we want to scrape the article title. Click the SelectorGadget extension and hover over the article title like so:

You'll see that it's highlighted the selector. If you look into the SelectorGadget bar and click the title, it'll put the following in there:

And there we go, we have our CSS selector. Told you it was easy.

Now if you click on the XPath button to the right of the extension, like so:

You'll get a popup with the XPath query you need. Again, these aren't always perfect and sometimes won't work the way they should, but they'll give you a good starting point.

But what if, for some reason, you can't install the extension, or you find it doesn't work the way it should on certain sites? Fortunately, there's another option that's built right into Chrome and most other browsers.

Finding CSS Selectors & XPath Queries With Chrome Developer Tools

I'm sure if you've been working in SEO for a while, you've become very familiar with Chrome or other browsers' Developer Tools window. I don't think I go a day without it, but did you know that you can also use it to find the CSS selector or create an XPath query for a specific element? You probably did, but let's talk through it anyway.

Personally, I tend to use this more when I'm trying to build a scraper for an area that doesn't have CSS attached, such as something in the <head>, or when something stops SelectorGadget working effectively, but it's definitely worth learning how to use it as you get more familiar with scraping in R.

Go back to the page we discussed in the last section, highlight a part of the article title and right-click. Now click "Inspect", like so:

This will bring up the familiar "Elements" window:

Now if we right-click on the element we care about – our article title, in this case – you'll see the following dialogue. If you hover over "Copy", it'll pop out with a few very handy options:

You can copy your CSS selector or your XPath query directly from here, ready to paste into your code. I told you it was handy!

Again, it's not perfect either, but between Developer Tools and SelectorGadget, I can pretty much always find the right element notation for whatever I'm trying to scrape.

I'm conscious that we're getting scarily close to 1,500 words and we haven't written a single line of actual code yet! Let's fix that by using R to scrape the article title with the CSS selector.

Scraping Article Titles With R & CSS Selectors

One thing always worth remembering about CSS selectors is that they will typically differ between websites, because CSS styling tends to be individual to the site in question. That's why tools like SelectorGadget and Chrome Dev Tools are so useful.

Let's write our first scraper using R's rvest package to scrape the article title from my last post. First, create an object with the URL of that post, like so:

scrapeURL <- "https://www.ben-johnston.co.uk/r-for-seo-part-8-apply-methods-in-r/"

Now we want to find our CSS selector. Do that with SelectorGadget, and our selector will be as follows:

.entry-title

Alright, we're ready. Let's create a very simple scraping call:

articleTitle <- read_html(scrapeURL) %>% html_element(".entry-title") %>% html_text()

Run that in your console, then type articleTitle and you should see the following output:

Easy, right? Let's take a look at how it works.

Our First Scraper Command

Rvest was created by the same brain behind the Tidyverse (you may have guessed that I'm a fan of Mr Wickham's work by now), and you can see certain similarities to other commands we've written throughout this series. As always, let's break it down:

- articleTitle <-: We're giving our object the incredibly inventive name of "articleTitle"
- read_html(scrapeURL): We're calling the read_html() function from rvest to download the HTML of the article URL we stored in scrapeURL
- %>%: We're using the Tidyverse pipe to chain multiple commands into one – you'll have seen me use this throughout the series, but it's always good to remind ourselves
- html_element(".entry-title"): We're telling rvest to look specifically for the element matching our CSS selector, ".entry-title"
- %>% html_text(): Finally, we're chaining in the html_text() command to return only the text from the element we've selected

And there you have it – we've built our very first scraper.
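One quick aside while we're here: html_element() only ever returns the first match on the page. If you want every match – say, all the H2 subheadings from that same post – rvest's plural html_elements() returns the lot, and html_text() then gives you a character vector. A quick sketch (that the subheadings are plain h2 tags is an assumption about my markup):

# html_elements() (plural) returns every matching node, not just the first
subHeadings <- read_html(scrapeURL) %>% html_elements("h2") %>% html_text()

We'll lean on html_elements() again when we scrape a sitemap later on.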
Now let's look at how we can use XPath from R's rvest package to scrape different elements.

Scraping Meta Titles With R & XPath

Now let's use XPath with R and the rvest package to pull the meta title from the same page. This is a nice, simple command, but hopefully it'll give you an idea of how the power of XPath can be added to your web scraping work.

As before, we're going to use my last post as the target page, so our scrapeURL object is still valid.

Since we're going for the meta title this time rather than something visible on the page, SelectorGadget isn't going to help us. Fortunately, we know how to use Chrome Dev Tools to find it. Go to your target URL, right-click anywhere on the page and bring up the Elements window. Now find and expand the <head> element. We want the <title> element here, so find that and right-click on it, like so:

Select "Copy XPath" and paste it somewhere. Now we want to use the following command:

articleMetaTitle <- read_html(scrapeURL) %>% html_element(, "//title") %>% html_text()

Not too dissimilar to our previous scraping command, is it? But there are a couple of key differences, which we'll break down shortly. After this runs, type articleMetaTitle into your console, and you should see the following:

And there we have it – our meta title pulled into our R environment. As always, let's investigate the command.

Our Meta Title Scraping Command Broken Down

As you can see, using XPath isn't too dissimilar to CSS selectors when we're using R to scrape the web – albeit I've used the simplest XPath query I could for this example – but hopefully you're starting to see how this can be used in your SEO work. Let's break the command down:

- read_html(scrapeURL) %>%: As before, we've created our articleMetaTitle object, used read_html() on our scrapeURL page object and piped into the next command
- html_element(, "//title") %>%: This is where the key difference between XPath and CSS selectors comes in – the comma. Leaving the first (css) argument empty tells rvest that "//title" is an XPath query rather than a CSS selector. In this case, it's a very basic query that selects the title element, and then we chain to our next command (there's a more explicit way to write this, shown below)
- html_text(): As before, we're using html_text() to return only the text of our scraped data
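If the bare comma feels a little cryptic, rvest also lets you name the argument instead, which does exactly the same thing and can be easier to read back later:

# Equivalent to the comma version – naming the argument makes the XPath explicit
articleMetaTitle <- read_html(scrapeURL) %>% html_element(xpath = "//title") %>% html_text()

Either style is fine; I'll stick with the comma for the rest of this piece, but it's worth knowing both.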
This is quite easy, right? Building scrapers in R isn't actually as complicated as it sounds, but so far all we've done is pull one element at a time from a specific page. Now let's look at how we can pull multiple elements from a page.

Scraping Titles, Descriptions & H1s From A Page Using R

Obviously, when we think about scraping the web with a programming language like R, we're thinking about scraping more than one thing at a time and getting it all into a useful dataframe. So far, we've targeted individual elements, but the point is scale – so now let's take a look at how we can scrape multiple elements from a page using R and rvest.

We're going to build a function to scrape the meta title, meta description and H1 from my last post. After that, we'll look at scraping multiple elements from multiple pages.

Firstly, we need to get our elements. With a combination of Chrome Dev Tools and SelectorGadget, we end up with the following list:

- //title: the XPath query for the meta title
- The first <meta> tag's content attribute: where the meta description lives on my site
- .entry-title: the CSS selector for the article title, which is also the H1

Now we've got our list of elements, let's get to work on our function. It'll look a little something like this:

pageScrape <- function(x){
  pageContent <- read_html(x)
  metaTitle <- html_element(pageContent, , "//title") %>% html_text()
  metaDescription <- html_attr(html_element(pageContent, "meta"), "content")
  heading <- html_element(pageContent, ".entry-title") %>% html_text()
  output <- data.frame(metaTitle, metaDescription, heading)
}

You can run this function like so:

pageElements <- pageScrape(scrapeURL)

And it'll give you the following output:

Still pretty simple, right? As always, let's break it down.

Our Multiple-Element Scraper Broken Down

As you can see, we've used a function to pull several different elements from the page using the various nodes we've identified, and it's given us some useful SEO data. Let's dig into how this function works:

- pageScrape <- function(x){: We're creating our function, called pageScrape, with x as its variable
- pageContent <- read_html(x): We're reading the HTML of our target page into our environment using rvest's read_html() function
- metaTitle <- html_element(pageContent, , "//title") %>% html_text(): As with our earlier meta title command, we're using the "//title" XPath query, with html_text() to get just the text
- metaDescription <- html_attr(html_element(pageContent, "meta"), "content"): Meta descriptions can be one of the most annoying elements to scrape, and they sometimes differ between sites. Here we're using html_attr() instead of html_text() because the meta description is stored in the tag's content attribute rather than in its text – more on a sturdier approach below
- heading <- html_element(pageContent, ".entry-title") %>% html_text(): We're re-using our title scraper from earlier, pulling the H1 with the .entry-title CSS selector
- output <- data.frame(metaTitle, metaDescription, heading): Finally, we're building the output dataframe, with the meta title, meta description and H1 in separate columns
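One caveat on that meta description line before we scale up: html_element(pageContent, "meta") grabs whichever <meta> tag appears first in the page source. That happens to be the description on my site, but on plenty of sites a charset or viewport tag comes first. If you run into that, an XPath query that targets the tag's name attribute directly is a sturdier option – this is a suggested variation, not what we'll use in the rest of the examples:

# Target the description tag by its name attribute, not its position
metaDescription <- html_attr(html_element(pageContent, xpath = "//meta[@name='description']"), "content")

Drop that in place of the metaDescription line in pageScrape if the first-meta-tag shortcut lets you down.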
Now let's think about how we can run this across multiple pages.

Applying R Scrapers To Multiple Pages

If you've read my previous posts on using loops and apply methods in R, you'll have an idea of how we can run this across multiple pages. For consistency's sake, let's run it across all the pages on my site, since the elements will all be the same.

Firstly, we want to get all of our target URLs into an object. The easiest way to do this is to scrape my pages sitemap. We can do that like so:

pagesSitemap <- read_html("https://www.ben-johnston.co.uk/page-sitemap.xml") %>% html_elements(, "//loc") %>% html_text()

Now if we use the reduce() function from the Tidyverse as we did in part 8, we can scrape all of the elements we discussed previously from all of my pages, like so:

pagesElements <- reduce(lapply(pagesSitemap, pageScrape), bind_rows)

I won't break this one down, as it's covered in depth in part 8. You'll see a few NAs in there because featured images are included in the sitemap, but you get the idea. For a more robust approach, you could subset the URLs using the methods we learned all the way back in part 1, like so:

pagesSitemap <- subset(pagesSitemap, str_detect(pagesSitemap, "wp-content") == FALSE)

Now we'll only have the HTML pages. If we run our scraper again, we'll get the following:

So now we know how to scrape multiple elements from multiple pages, let's talk about how we do that politely.

Using The Polite R Package To Reduce Scrape Load

Scraping websites isn't always the most popular thing with site owners. Sometimes they don't want their content being used that way (I personally block ChatGPT for precisely that reason), and scraping multiple pages can also put a large amount of load on a server, sometimes costing the owner money or even making them think their site is under attack.

The polite package for R ensures that your scraper respects robots.txt and helps you reduce the load you put on a server. I'm giving you the techniques to scrape websites with R here, but it would be remiss of me not to tell you to do so politely and respect the owners of the sites. Lecture over; let's get the polite package installed.

install.packages("polite")

library(polite)

Using The Polite R Package's Bow Function

The polite package has two key functions: bow() and scrape(). It is very polite. bow() introduces your R session to the server and asks permission to scrape, checking the robots.txt file, while scrape() runs the scraper. The three tenets of a polite scraping session are defined by the package's authors as "seeking permission, taking slowly and never asking twice", and that's as good a definition of how to scrape the web as we'll find.

Let's introduce ourselves to the target site using bow():

session <- bow("https://www.ben-johnston.co.uk")

If we inspect our session object now, we'll see the following:

This has introduced us to the website and checked the robots.txt.
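A couple of bow() niceties worth knowing before we carry on – the argument values below are illustrative assumptions, not something you need to copy. By default, bow() uses a generic user agent and a five-second delay between requests; identifying yourself properly is good manners, and polite will still honour any stricter crawl delay declared in robots.txt:

# Introduce yourself properly and slow the crawl down further
# (the user agent string here is a placeholder – use your own details)
session <- bow("https://www.ben-johnston.co.uk",
               user_agent = "bens-seo-scraper (hello@example.com)",
               delay = 10)

The package also has a nod() function for moving to a new path on the same host without bowing again, which covers the "never asking twice" tenet, though bowing per page as we do below keeps the example simple.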
Now let's update our previous multiple-page scraper to work politely.

Using The Polite R Package's Scrape Function On Multiple URLs

Now we understand scraping politely, let's update our pageScrape function to fetch our multiple URLs the polite way. This isn't overly complicated, and our updated function is as follows:

politeScrape <- function(x){
  session <- bow(x)
  pageContent <- scrape(session)
  metaTitle <- html_element(pageContent, , "//title") %>% html_text()
  metaDescription <- html_attr(html_element(pageContent, "meta"), "content")
  heading <- html_element(pageContent, ".entry-title") %>% html_text()
  output <- data.frame(metaTitle, metaDescription, heading)
}

It's similar to our previous scraper function, isn't it? But there's one key difference: rather than using read_html(), we're using the polite package's scrape() function and creating a new bow() for every page, ensuring that we introduce ourselves to each page, respect robots.txt and don't hit the site too hard.

We can run it like so:

politeElements <- reduce(lapply(pagesSitemap, politeScrape), bind_rows)

And if we inspect our politeElements object in the console, we should see the following:

So there we have it – that's how you can use R's rvest and polite packages to scrape multiple pages while being a good internet citizen.

Wrapping Up

This was quite a long piece, and we're reaching the end of my R for SEO series (although there will definitely be a couple of bonus entries). I hope you've enjoyed today's article on using R to scrape the web, that you've learned to do so politely, and that you're seeing applications for this in your SEO work.

Until next time, when I'll be talking about how we can build a reporting dashboard with R, Google Sheets and Google Looker Studio, using everything we've learned throughout this series.

Our Code From Today

# Install packages
install.packages("rvest")
library(rvest)
library(tidyverse)

# Scrape article title with CSS selector
scrapeURL <- "https://www.ben-johnston.co.uk/r-for-seo-part-8-apply-methods-in-r/"

articleTitle <- read_html(scrapeURL) %>% html_element(".entry-title") %>% html_text()

# Scrape meta titles with XPath
articleMetaTitle <- read_html(scrapeURL) %>% html_element(, "//title") %>% html_text()

# Scrape multiple elements
pageScrape <- function(x){
  pageContent <- read_html(x)
  metaTitle <- html_element(pageContent, , "//title") %>% html_text()
  metaDescription <- html_attr(html_element(pageContent, "meta"), "content")
  heading <- html_element(pageContent, ".entry-title") %>% html_text()
  output <- data.frame(metaTitle, metaDescription, heading)
}

pageElements <- pageScrape(scrapeURL)

# Scrape multiple pages
pagesSitemap <- read_html("https://www.ben-johnston.co.uk/page-sitemap.xml") %>% html_elements(, "//loc") %>% html_text()

pagesSitemap <- subset(pagesSitemap, str_detect(pagesSitemap, "wp-content") == FALSE)

pagesElements <- reduce(lapply(pagesSitemap, pageScrape), bind_rows)

# Scraping politely
install.packages("polite")
library(polite)

session <- bow("https://www.ben-johnston.co.uk")

politeScrape <- function(x){
  session <- bow(x)
  pageContent <- scrape(session)
  metaTitle <- html_element(pageContent, , "//title") %>% html_text()
  metaDescription <- html_attr(html_element(pageContent, "meta"), "content")
  heading <- html_element(pageContent, ".entry-title") %>% html_text()
  output <- data.frame(metaTitle, metaDescription, heading)
}

politeElements <- reduce(lapply(pagesSitemap, politeScrape), bind_rows)

This post was written by Ben Johnston on Ben Johnston.
title","@id":"https://www.r-bloggers.com/2025/01/r-for-seo-part-9-web-scraping-with-r-rvest/#primaryimage"},{"@type":"ImageObject","url":"https://www.ben-johnston.co.uk/wp-content/uploads/2025/01/Screenshot-2025-01-27-173813.png","width":409,"height":39,"caption":"SelectorGadget CSS selector highlight"},{"@type":"ImageObject","url":"https://www.ben-johnston.co.uk/wp-content/uploads/2025/01/Screenshot-2025-01-27-174051.png","width":279,"height":45,"caption":"XPath highlight in SelectorGadget"},{"@type":"ImageObject","url":"https://www.ben-johnston.co.uk/wp-content/uploads/2025/01/Screenshot-2025-01-27-174104.png","width":425,"height":186,"caption":"XPath popup from SelectorGadget"},{"@type":"ImageObject","url":"https://www.ben-johnston.co.uk/wp-content/uploads/2025/01/Screenshot-2025-01-27-174134.png","width":622,"height":697,"caption":"Using Chrome Developer Tools on an article title"},{"@type":"ImageObject","url":"https://www.ben-johnston.co.uk/wp-content/uploads/2025/01/Screenshot-2025-01-27-174151-1024x250.png","width":1024,"height":250,"caption":"Chrome Developer Tools highlights article title"},{"@type":"ImageObject","url":"https://www.ben-johnston.co.uk/wp-content/uploads/2025/01/Screenshot-2025-01-27-174211.png","width":653,"height":589,"caption":"Chrome Developer Tools find CSS Selector or XPath"},{"@type":"ImageObject","url":"https://www.ben-johnston.co.uk/wp-content/uploads/2025/01/Screenshot-2025-01-27-174245.png","width":439,"height":48,"caption":"R scraping article title output"},{"@type":"ImageObject","url":"https://www.ben-johnston.co.uk/wp-content/uploads/2025/01/Screenshot-2025-01-27-174428.png","width":979,"height":390,"caption":"Copying meta title XPath in Chrome Developer Tools"},{"@type":"ImageObject","url":"https://www.ben-johnston.co.uk/wp-content/uploads/2025/01/Screenshot-2025-01-27-174449.png","width":674,"height":40,"caption":"Article meta title scraped in R console"},{"@type":"ImageObject","url":"https://www.ben-johnston.co.uk/wp-content/uploads/2025/01/Screenshot-2025-01-27-174730.png","width":1004,"height":189,"caption":"Multiple elements from a page scraped with R"},{"@type":"ImageObject","url":"https://www.ben-johnston.co.uk/wp-content/uploads/2025/01/Screenshot-2025-01-27-174858.png","width":990,"height":314},{"@type":"ImageObject","url":"https://www.ben-johnston.co.uk/wp-content/uploads/2025/01/Screenshot-2025-01-27-174922.png","width":485,"height":107,"caption":"A polite R scraping session from the R console"},{"@type":"ImageObject","url":"https://www.ben-johnston.co.uk/wp-content/uploads/2025/01/Screenshot-2025-01-27-174953.png","width":999,"height":318,"caption":"Output of a polite R scraping session on multiple pages"}],"isPartOf":{"@id":"https://www.r-bloggers.com/2025/01/r-for-seo-part-9-web-scraping-with-r-rvest/#webpage"}}]}] </script> <script> var snp_f = []; var snp_hostname = new RegExp(location.host); var snp_http = new RegExp("^(http|https)://", "i"); var snp_cookie_prefix = ''; var snp_separate_cookies = false; var snp_ajax_url = 'https://www.r-bloggers.com/wp-admin/admin-ajax.php'; var snp_domain_url = 'https://www.r-bloggers.com'; var snp_ajax_nonce = 'ad2fbdae54'; var snp_ajax_ping_time = 1000; var snp_ignore_cookies = false; var snp_enable_analytics_events = true; var snp_is_mobile = false; var snp_enable_mobile = false; var snp_use_in_all = false; var snp_excluded_urls = []; var snp_close_on_esc_key = false; </script> <div class="snp-root"> <input type="hidden" id="snp_popup" value="" /> <input type="hidden" id="snp_popup_id" value="" /> 
<input type="hidden" id="snp_popup_theme" value="" /> <input type="hidden" id="snp_exithref" value="" /> <input type="hidden" id="snp_exittarget" value="" /> <div id="snppopup-welcome" class="snp-pop-109583 snppopup"><input type="hidden" class="snp_open" value="scroll" /><input type="hidden" class="snp_close" value="manual" /><input type="hidden" class="snp_show_on_exit" value="2" /><input type="hidden" class="snp_exit_js_alert_text" value="" /><input type="hidden" class="snp_exit_scroll_down" value="10" /><input type="hidden" class="snp_exit_scroll_up" value="10" /><input type="hidden" class="snp_open_scroll" value="45" /><input type="hidden" class="snp_close_scroll" value="10" /><input type="hidden" class="snp_optin_redirect_url" value="" /><input type="hidden" class="snp_optin_form_submit" value="single" /><input type="hidden" class="snp_show_cb_button" value="yes" /><input type="hidden" class="snp_popup_id" value="109583" /><input type="hidden" class="snp_popup_theme" value="theme6" /><input type="hidden" class="snp_overlay" value="disabled" /><input type="hidden" class="snp_cookie_conversion" value="0" /><input type="hidden" class="snp_cookie_close" value="180" /><div class="snp-fb snp-theme6"> <div class="snp-subscribe-inner"> <h1 class="snp-header"><i>Never miss an update! </i> <br/> <strong>Subscribe to R-bloggers</strong> to receive <br/>e-mails with the latest R posts.<br/> <small>(You will not see this message again.)</small></h1> <div class="snp-form"> <form action="https://r-bloggers.com/phplist/?p=subscribe&id=1&email=" method="post" class=" snp-subscribeform snp_subscribeform" target="_blank"> <input type="hidden" name="np_custom_name1" value="" /> <input type="hidden" name="np_custom_name2" value="" /> <fieldset> <div class="snp-field"> <input type="text" name="email" id="snp_email" placeholder="Your E-mail..." 
class="snp-field snp-field-email" /> </div> <button type="submit" class="snp-submit">Submit</button> </fieldset> </form> </div> <a href="#" class="snp_nothanks snp-close">Click here to close (This popup will not appear again)</a> </div> </div> <style>.snp-pop-109583 .snp-theme6 { max-width: 700px;} .snp-pop-109583 .snp-theme6 h1 {font-size: 17px;} .snp-pop-109583 .snp-theme6 { color: #a0a4a9;} .snp-pop-109583 .snp-theme6 .snp-field ::-webkit-input-placeholder { color: #a0a4a9;} .snp-pop-109583 .snp-theme6 .snp-field :-moz-placeholder { color: #a0a4a9;} .snp-pop-109583 .snp-theme6 .snp-field :-ms-input-placeholder { color: #a0a4a9;} .snp-pop-109583 .snp-theme6 .snp-field input { border: 1px solid #a0a4a9;} .snp-pop-109583 .snp-theme6 .snp-field { color: #000000;} .snp-pop-109583 .snp-theme6 { background: #f2f2f2;} </style></div> </div> <script type="text/javascript">/* <![CDATA[ */!function(e,n){var r={"selectors":{"block":"pre","inline":"code"},"options":{"indent":4,"ampersandCleanup":true,"linehover":true,"rawcodeDbclick":false,"textOverflow":"scroll","linenumbers":false,"theme":"enlighter","language":"r","retainCssClasses":false,"collapse":false,"toolbarOuter":"","toolbarTop":"{BTN_RAW}{BTN_COPY}{BTN_WINDOW}{BTN_WEBSITE}","toolbarBottom":""},"resources":["https:\/\/www.r-bloggers.com\/wp-content\/plugins\/enlighter\/cache\/enlighterjs.min.css?4WJrVky+dDEQ83W","https:\/\/www.r-bloggers.com\/wp-content\/plugins\/enlighter\/resources\/enlighterjs\/enlighterjs.min.js"]},o=document.getElementsByTagName("head")[0],t=n&&(n.error||n.log)||function(){};e.EnlighterJSINIT=function(){!function(e,n){var r=0,l=null;function c(o){l=o,++r==e.length&&(!0,n(l))}e.forEach(function(e){switch(e.match(/\.([a-z]+)(?:[#?].*)?$/)[1]){case"js":var n=document.createElement("script");n.onload=function(){c(null)},n.onerror=c,n.src=e,n.async=!0,o.appendChild(n);break;case"css":var r=document.createElement("link");r.onload=function(){c(null)},r.onerror=c,r.rel="stylesheet",r.type="text/css",r.href=e,r.media="all",o.appendChild(r);break;default:t("Error: invalid file extension",e)}})}(r.resources,function(e){e?t("Error: failed to dynamically load EnlighterJS resources!",e):"undefined"!=typeof EnlighterJS?EnlighterJS.init(r.selectors.block,r.selectors.inline,r.options):t("Error: EnlighterJS resources not loaded yet!")})},(document.querySelector(r.selectors.block)||document.querySelector(r.selectors.inline))&&e.EnlighterJSINIT()}(window,console); /* ]]> */</script><script type='text/javascript' src='https://www.r-bloggers.com/wp-includes/js/jquery/ui/core.min.js?ver=1.11.4' id='jquery-ui-core-js'></script> <script type='text/javascript' src='https://www.r-bloggers.com/wp-includes/js/jquery/ui/datepicker.min.js?ver=1.11.4' id='jquery-ui-datepicker-js'></script> <script type='text/javascript' id='jquery-ui-datepicker-js-after'> jQuery(document).ready(function(jQuery){jQuery.datepicker.setDefaults({"closeText":"Close","currentText":"Today","monthNames":["January","February","March","April","May","June","July","August","September","October","November","December"],"monthNamesShort":["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"nextText":"Next","prevText":"Previous","dayNames":["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"],"dayNamesShort":["Sun","Mon","Tue","Wed","Thu","Fri","Sat"],"dayNamesMin":["S","M","T","W","T","F","S"],"dateFormat":"MM d, yy","firstDay":1,"isRTL":false});}); </script> <script type='text/javascript' 
src='https://www.r-bloggers.com/wp-content/plugins/arscode-ninja-popups/assets/js/cookie.js?ver=5.5.15' id='js-cookie-js'></script> <script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/arscode-ninja-popups/assets/js/tooltipster.bundle.min.js?ver=5.5.15' id='jquery-np-tooltipster-js'></script> <script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/arscode-ninja-popups/assets/js/jquery.material.form.min.js?ver=5.5.15' id='material-design-js-js'></script> <script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/arscode-ninja-popups/assets/vendor/intl-tel-input/js/intlTelInput-jquery.min.js?ver=5.5.15' id='jquery-intl-phone-input-js-js'></script> <script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/arscode-ninja-popups/assets/js/dialog_trigger.js?ver=5.5.15' id='js-dialog_trigger-js'></script> <script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/arscode-ninja-popups/assets/js/ninjapopups.min.js?ver=5.5.15' id='js-ninjapopups-js'></script> <script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/arscode-ninja-popups/fancybox2/jquery.fancybox.min.js?ver=5.5.15' id='fancybox2-js'></script> <script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/jetpack/_inc/build/photon/photon.min.js?ver=20130122' id='jetpack-photon-js'></script> <script type='text/javascript' id='flying-pages-js-before'> window.FPConfig= { delay: 0, ignoreKeywords: ["\/wp-admin","\/wp-login.php","\/cart","add-to-cart","logout","#","?",".png",".jpeg",".jpg",".gif",".svg"], maxRPS: 3, hoverDelay: 50 }; </script> <script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/flying-pages/flying-pages.min.js?ver=2.4.6' id='flying-pages-js' defer></script> <script type='text/javascript' src='https://s0.wp.com/wp-content/js/devicepx-jetpack.js?ver=202505' id='devicepx-js'></script> <script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/jetpack/_inc/build/lazy-images/js/lazy-images.min.js?ver=5.9.4' id='jetpack-lazy-images-js'></script> <script type='text/javascript' src='https://www.r-bloggers.com/wp-includes/js/wp-embed.min.js?ver=5.5.15' id='wp-embed-js'></script> <script type='text/javascript' src='https://stats.wp.com/e-202505.js' async='async' defer='defer'></script> <script type='text/javascript'> _stq = window._stq || []; _stq.push([ 'view', {v:'ext',j:'1:5.9.4',blog:'11524731',post:'390192',tz:'-6',srv:'www.r-bloggers.com'} ]); _stq.push([ 'clickTrackerInit', '11524731', '390192' ]); </script> <script type="text/javascript"> jQuery(document).ready(function ($) { for (let i = 0; i < document.forms.length; ++i) { let form = document.forms[i]; if ($(form).attr("method") != "get") { $(form).append('<input type="hidden" name="XQpuSiekaFA_" value="E*2m.J3" />'); } if ($(form).attr("method") != "get") { $(form).append('<input type="hidden" name="wUrVyDtxPhBRlzu" value="q7O62_l*RvZ" />'); } } $(document).on('submit', 'form', function () { if ($(this).attr("method") != "get") { $(this).append('<input type="hidden" name="XQpuSiekaFA_" value="E*2m.J3" />'); } if ($(this).attr("method") != "get") { $(this).append('<input type="hidden" name="wUrVyDtxPhBRlzu" value="q7O62_l*RvZ" />'); } return true; }); jQuery.ajaxSetup({ beforeSend: function (e, data) { if (data.type !== 'POST') return; if (typeof data.data === 'object' && data.data !== null) { data.data.append("XQpuSiekaFA_", "E*2m.J3"); 
data.data.append("wUrVyDtxPhBRlzu", "q7O62_l*RvZ"); } else { data.data = data.data + '&XQpuSiekaFA_=E*2m.J3&wUrVyDtxPhBRlzu=q7O62_l*RvZ'; } } }); }); </script> </body> </html> <!-- Dynamic page generated in 5.304 seconds. --> <!-- Cached page generated by WP-Super-Cache on 2025-01-27 19:51:35 --> <!-- Compression = gzip --><script src="/cdn-cgi/scripts/7d0fa10a/cloudflare-static/rocket-loader.min.js" data-cf-settings="d4de2399d46f1041ae12886d-|49" defer></script>