Asynchronous API calls with postlightmercury
In this post I’ll tell you about a new package I’ve built and also take you under the hood to show you a really cool thing that’s going on: asynchronous API calls.
The package: postlightmercury
I created a package called postlightmercury, which is now on CRAN. The package is a wrapper for the Mercury web parser by Postlight.
Basically, you sign up for free, get an API key, and with that you can send it URLs that it then parses for you. This is pretty clever if you are scraping a lot of different websites and don’t want to write a web parser for each and every one of them.
Here is how the package works
Installation
Since the package is on CRAN, it’s very straightforward to install:
install.packages("postlightmercury")
Load libraries
We’ll need the postlightmercury, dplyr and stringr libraries:
library(postlightmercury)
library(dplyr)
library(stringr)
Get an API key
Before you can use the package, you need to get an API key from Postlight (get yours at https://mercury.postlight.com/web-parser/). Then replace the XXXX’s below with your new API key.
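A small aside that isn’t part of the package: rather than pasting the key into every script, I like to keep it in an environment variable. The MERCURY_API_KEY name below is just a placeholder I made up, so use whatever name you prefer:

# In your ~/.Renviron file (the variable name is arbitrary):
# MERCURY_API_KEY=your-real-key-goes-here

# Then, in your scripts, read the key instead of hard-coding it:
api_key <- Sys.getenv("MERCURY_API_KEY")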
Parse a URL
We will use this extremely sad story from the BBC about Gangnam Style no longer being the #1 most viewed video on YouTube. Sad to see a masterpiece like that get dethroned.
# Then run the code below, replacing the X's with your API key:
parsed_url <- web_parser(page_urls = "http://www.bbc.co.uk/news/entertainment-arts-40566816",
                         api_key = "XXXXXXXXXXXXXXXXXXXXXXX")
As you can see below, the result is a tibble (data frame) with 14 different variables:
glimpse(parsed_url)
## Observations: 1
## Variables: 14
## $ title          <chr> "Gangnam Style is no longer the most-played vid...
## $ author         <chr> "Mark Savage BBC Music reporter"
## $ date_published <chr> NA
## $ dek            <chr> NA
## $ lead_image_url <chr> "https://ichef.bbci.co.uk/news/1024/cpsprodpb/9...
## $ content        <chr> "<div><p class=\"byline\"> <span class=\"byline...
## $ next_page_url  <chr> NA
## $ url            <chr> "http://www.bbc.co.uk/news/entertainment-arts-4...
## $ domain         <chr> "www.bbc.co.uk"
## $ excerpt        <chr> "Psy's megahit was the most-played video for fi...
## $ word_count     <int> 685
## $ direction      <chr> "ltr"
## $ total_pages    <int> 1
## $ rendered_pages <int> 1
Parse more than one URL
You can also parse more than one URL. Instead of one, let’s try giving it three URLs: two about Gangnam Style and one about sauerkraut. With all that dancing, proper nutrition is important, after all.
urls <- c("http://www.bbc.co.uk/news/entertainment-arts-40566816",
          "http://www.bbc.co.uk/news/world-asia-30288542",
          "https://www.bbcgoodfood.com/howto/guide/health-benefits-sauerkraut")

# Then run the code below, replacing the X's with your API key:
parsed_url <- web_parser(page_urls = urls,
                         api_key = "XXXXXXXXXXXXXXXXXXXXXXX")
Just like before, the result is a tibble (data frame) with 14 different variables, but this time with 3 observations instead of one:
glimpse(parsed_url)
## Observations: 3
## Variables: 14
## $ title          <chr> "Gangnam Style is no longer the most-played vid...
## $ author         <chr> "Mark Savage BBC Music reporter", NA, "Nicola S...
## $ date_published <chr> NA, NA, NA
## $ dek            <chr> NA, NA, NA
## $ lead_image_url <chr> "https://ichef.bbci.co.uk/news/1024/cpsprodpb/9...
## $ content        <chr> "<div><p class=\"byline\"> <span class=\"byline...
## $ next_page_url  <chr> NA, NA, NA
## $ url            <chr> "http://www.bbc.co.uk/news/entertainment-arts-4...
## $ domain         <chr> "www.bbc.co.uk", "www.bbc.co.uk", "www.bbcgoodf...
## $ excerpt        <chr> "Psy's megahit was the most-played video for fi...
## $ word_count     <int> 685, 305, 527
## $ direction      <chr> "ltr", "ltr", "ltr"
## $ total_pages    <int> 1, 1, 1
## $ rendered_pages <int> 1, 1, 1
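Since the result is an ordinary tibble, the usual dplyr verbs work on it. For example (a small aside of my own, not part of the package output above):

parsed_url %>%
  select(title, domain, word_count) %>%
  arrange(desc(word_count))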
Clean the content of HTML
The content column still contains the raw HTML of the page:
str_trunc(parsed_url$content[1], 500, "right")

## [1] "<div><p class=\"byline\"> <span class=\"byline__name\">By Mark Savage</span> <span class=\"byline__title\">BBC Music reporter</span> </p><div class=\"story-body__inner\"> <figure class=\"media-landscape has-caption full-width lead\"> <span class=\"image-and-copyright-container\"> <img class=\"js-image-replace\" alt=\"Still image from Gangnam Style\" src=\"https://ichef-1.bbci.co.uk/news/320/cpsprodpb/9C7A/production/_96885004_gangnam.jpg\" width=\"1024\"> <span class=\"off-screen\">Image copyright</span> <span cla..."
We can clean that quite easily:
parsed_url$content <- remove_html(parsed_url$content)

str_trunc(parsed_url$content[1], 500, "right")

## [1] "By Mark Savage BBC Music reporter Image copyright Schoolboy/Universal Republic Records Image caption Gangnam Style had been YouTube's most-watched video for five years Psy's Gangnam Style is no longer the most-watched video on YouTube.The South Korean megahit had been the site's most-played clip for the last five years.The surreal video became so popular that it \"broke\" YouTube's play counter, exceeding the maximum possible number of views (2,147,483,647), and forcing the company to rewr..."
And that is basically what the package does!
Under the hood: asynchronous API calls
Originally I wrote the package using the httr package, which I normally use for my everyday API calling business. But after reading about the crul package on R-bloggers and how it can handle asynchronous API calls, I rewrote the web_parser() function so that it uses the crul package instead.
This means that instead of calling each URL sequentially, it calls them in parallel. That makes a real difference if you want to call a lot of URLs and can speed up your analysis significantly.
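To get a feel for the speed-up, here is a rough sketch (not part of the package) that compares sequential and asynchronous requests against httpbin.org, a test service whose /delay/2 endpoint simply waits two seconds before answering:

library(crul)

urls <- rep("https://httpbin.org/delay/2", 3)

# Sequential: each request waits for the previous one, so roughly 3 x 2 seconds
system.time(
  res_seq <- lapply(urls, function(u) HttpClient$new(url = u)$get())
)

# Asynchronous: all three requests are in flight at once, so roughly 2 seconds
system.time(
  res_async <- Async$new(urls = urls)$get()
)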
The web_parser function looks like this under the hood (look for where the magic happens):
web_parser <- function(page_urls, api_key) {

  if (missing(page_urls))
    stop("One or more urls must be provided")

  if (missing(api_key))
    stop("API key must be provided. Get one here: https://mercury.postlight.com/web-parser/")

  ### THIS IS WHERE THE MAGIC HAPPENS
  # Build one (not yet executed) GET request per URL...
  async <- lapply(page_urls, function(page_url) {
    crul::HttpRequest$new(
      url = "https://mercury.postlight.com/parser",
      headers = list(`x-api-key` = api_key)
    )$get(query = list(url = page_url))
  })

  # ...and hand the whole list to AsyncVaried, which sends them in parallel
  res <- crul::AsyncVaried$new(.list = async)
  ### END OF MAGIC

  output <- res$request()

  # Parse the JSON responses and bind them into a tibble, one row per URL
  api_content <- lapply(output, function(x) x$parse("UTF-8"))
  api_content <- lapply(api_content, jsonlite::fromJSON)
  api_content <- null_to_na(api_content)
  df <- purrr::map_df(api_content, tibble::as_tibble)

  return(df)
}
As you can see from the above code, I create a list, async, that holds one request per URL (three in our example). I then add these to the res object. When I ask res for the results, it fetches the data in parallel if there is more than one URL. That is pretty smart!
You can use this basic template for your own API calls if you have a function that routinely calls several URLs sequentially.
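As a sketch of what that could look like, here is the same pattern applied to a fictional JSON API (the api.example.com URL, the items/ path and the Authorization header are all made up for illustration): build one HttpRequest per URL, hand the list to AsyncVaried, and parse the responses.

fetch_all <- function(ids, token) {

  # One (not yet executed) request per id
  reqs <- lapply(ids, function(id) {
    crul::HttpRequest$new(
      url = "https://api.example.com",
      headers = list(Authorization = token)
    )$get(path = paste0("items/", id))
  })

  # Fire them all off in parallel
  res <- crul::AsyncVaried$new(.list = reqs)
  res$request()

  # One parsed result per id
  lapply(res$parse("UTF-8"), jsonlite::fromJSON)
}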
Note: In this case the “surrounding conditions” are all the same; only the URL changes between requests. But you can also do asynchronous requests that call different endpoints. Check out the crul package documentation for more on that.
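As a quick, hedged illustration (again using httpbin.org purely as a stand-in), you can mix completely different requests, even different HTTP verbs, in one AsyncVaried object:

library(crul)

req1 <- HttpRequest$new(url = "https://httpbin.org")$get(path = "get")
req2 <- HttpRequest$new(url = "https://httpbin.org")$post(path = "post",
                                                          body = list(a = 1))

res <- AsyncVaried$new(req1, req2)
res$request()       # both requests go out in parallel
res$status_code()   # one status per request
res$parse("UTF-8")  # one response body per request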