
An R AWS Lambda function to download Tidytuesday datasets

[This article was first published on R | Discindo, and kindly contributed to R-bloggers.]

Use {r2lambda} to download a Tidytuesday dataset

In this exercise, we’ll create an AWS Lambda function that downloads the tidytuesday data set for the most recent Tuesday (or most recent Tuesday from a date of interest).

Required packages

library(r2lambda)
library(jsonlite)
library(magrittr)

Runtime function

The first step is to write the runtime function. This is the function that will be executed when we invoke the Lambda function after it has been deployed. To download the Tidytuesday data set, we will use the {tidytuesdayR} package. In the runtime script, we define a function called tidytuesday_lambda that takes one optional argument, date. If date is omitted, the function returns the data set(s) for the most recent Tuesday; otherwise, it looks up the most recent Tuesday from the date of interest and returns the corresponding data set(s).

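library(tidytuesdayR)

# Return the Tidytuesday data set(s) for the most recent Tuesday relative to `date`
tidytuesday_lambda <- function(date = NULL) {
  # Default to today when no date is supplied
  if (is.null(date)) {
    date <- Sys.Date()
  }
  most_recent_tuesday <- tidytuesdayR::last_tuesday(date = date)
  tt_data <- tidytuesdayR::tt_load(x = most_recent_tuesday)
  # Collect each data set into a plain, unnamed list
  data_names <- names(tt_data)
  data_list <- lapply(data_names, function(x) tt_data[[x]])
  return(data_list)
}

# Quick local check with a date of interest
tidytuesday_lambda("2022-02-02")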
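
To make the "most recent Tuesday" lookup concrete, here is a minimal base-R sketch of the idea. It is an illustration only, not the implementation used by tidytuesdayR::last_tuesday. The helper name most_recent_tuesday_sketch is hypothetical, and the sketch assumes that a date which is itself a Tuesday counts as the most recent one.

```r
# Hypothetical helper (not part of tidytuesdayR): find the Tuesday on or
# before a given date using only base R.
most_recent_tuesday_sketch <- function(date = Sys.Date()) {
  date <- as.Date(date)
  # POSIXlt weekdays run from 0 (Sunday) to 6 (Saturday), so Tuesday is 2
  weekday <- as.POSIXlt(date)$wday
  # Days elapsed since the latest Tuesday; 0 when `date` is itself a Tuesday
  days_since_tuesday <- (weekday - 2) %% 7
  date - days_since_tuesday
}

# A Wednesday maps back to the day before
format(most_recent_tuesday_sketch("2022-02-02"))
```

Using the numeric weekday rather than weekdays() keeps the sketch independent of the session locale.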

R script to build the lambda

To build the lambda image, we need an R script that sources any required code, loads any needed libraries, defines a runtime function, and ends with a call to lambdr::start_lambda(). The runtime function does not have to be defined in this file. We could, for example, source another script, or load a package and set a loaded function as the runtime function in the subsequent call to r2lambda::build_lambda (see below). We save this script to a file and record the path:

r_code <- "
  library(tidytuesdayR)
  tidytuesday_lambda <- function(date = NULL) {
    if (is.null(date))
      date <- Sys.Date()
    most_recent_tuesday <- tidytuesdayR::last_tuesday(date = date)
    tt_data <- tidytuesdayR::tt_load(x = most_recent_tuesday)
    data_names <- names(tt_data)
    data_list <- lapply(data_names, function(x) tt_data[[x]])
    return(data_list)
  }
  lambdr::start_lambda()
"
tmpfile <- tempfile(pattern = "ttlambda_", fileext = ".R")
write(x = r_code, file = tmpfile)
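
Because the script is assembled from a string, a typo only surfaces once the image is built. As a cheap sanity check before building, one could parse the saved file without running it. The sketch below uses only base R; the helper script_defines and the stand-in script are hypothetical and not part of {r2lambda} or {lambdr}.

```r
# Hypothetical helper: TRUE if the script at `path` parses and assigns
# `fn_name` at the top level. parse() reads the code without evaluating it.
script_defines <- function(path, fn_name) {
  exprs <- as.list(parse(file = path))  # stops with an error on invalid syntax
  is_assignment_to <- function(e) {
    is.call(e) &&
      (identical(e[[1]], as.name("<-")) || identical(e[[1]], as.name("="))) &&
      identical(e[[2]], as.name(fn_name))
  }
  any(vapply(exprs, is_assignment_to, logical(1)))
}

# Stand-in script for illustration; in this post the file would be `tmpfile`
demo_file <- tempfile(fileext = ".R")
writeLines(c(
  "my_runtime <- function(x = NULL) x",
  "lambdr::start_lambda()"
), demo_file)

script_defines(demo_file, "my_runtime")
```

Nothing in the script is executed, so the check works even when {lambdr} or {tidytuesdayR} is not installed locally.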

Build, test, and deploy the lambda function

1. Build

runtime_function <- "tidytuesday_lambda"
runtime_path <- tmpfile
dependencies <- "tidytuesdayR"
r2lambda::build_lambda(
  tag = "tidytuesday3",
  runtime_function = runtime_function,
  runtime_path = runtime_path,
  dependencies = dependencies
)
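
The build produces a docker image, so it can help to confirm that docker is reachable from the R session first. This is a small sketch under the assumption that the build relies on a docker executable on the PATH; the helper name docker_on_path is hypothetical.

```r
# Assumption: the image build needs a `docker` executable on the PATH.
# Sys.which() returns an empty string when the executable is not found.
docker_on_path <- function() {
  unname(nzchar(Sys.which("docker")))
}

if (!docker_on_path()) {
  message("docker was not found on the PATH; the image build is likely to fail.")
}
```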

2. Test

To make sure our Lambda docker container works as intended, we start it locally and invoke it to test the response. The response is a list of three elements, two of which are the stdout and stderr slots:

response <- r2lambda::test_lambda(tag = "tidytuesday3", payload = list(date = Sys.Date()))

stdout and stderr are raw vectors that we need to parse, for example:

rawToChar(response$stdout)

If the stdout slot of the response returns the correct output of our function, we are good to deploy to AWS.
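
The decoding step can be tried without a running container. The sketch below uses a simulated response whose contents are made up for illustration; only the stdout and stderr slot names come from the text above.

```r
# Simulated response for illustration only; a real one comes from
# r2lambda::test_lambda(). The slot contents here are made up.
fake_response <- list(
  stdout = charToRaw("[{\"x\":1}]"),
  stderr = raw(0)
)

# Decode both raw slots into character strings
decoded <- lapply(fake_response[c("stdout", "stderr")], rawToChar)

cat(decoded$stdout, "\n")

# A non-empty stderr is a hint that something went wrong in the container
if (nzchar(decoded$stderr)) {
  message("stderr: ", decoded$stderr)
}
```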

3. Deploy

The deployment step is simple: all we need to do is specify the name (tag) of the Lambda function we wish to push to AWS ECR. The deploy_lambda function also accepts ..., which are named arguments ultimately passed on to paws.compute:::lambda_create_function. This is the function that calls the Lambda API. To see all available arguments, run ?paws.compute:::lambda_create_function.

The most important arguments are probably Timeout and MemorySize, which set the time our function will be allowed to run (in seconds) and the amount of memory it will have available (in MB). In many cases it will make sense to increase the defaults of 3 seconds and 128 MB.

r2lambda::deploy_lambda(tag = "tidytuesday3", Timeout = 30)
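
When setting both Timeout and MemorySize, a quick range check can catch an out-of-range value before the API call is made. The sketch below is one way to do that; the helper lambda_settings is hypothetical, and the upper bounds (900 seconds and 10,240 MB) are the AWS Lambda limits as I understand them, so treat them as assumptions to verify against the current AWS documentation.

```r
# Hypothetical helper: validate Timeout (seconds) and MemorySize (MB) against
# the assumed AWS Lambda limits before passing them on.
lambda_settings <- function(timeout = 3, memory_size = 128) {
  stopifnot(
    is.numeric(timeout), length(timeout) == 1,
    timeout >= 1, timeout <= 900,
    is.numeric(memory_size), length(memory_size) == 1,
    memory_size >= 128, memory_size <= 10240
  )
  list(Timeout = as.integer(timeout), MemorySize = as.integer(memory_size))
}

settings <- lambda_settings(timeout = 30, memory_size = 512)
str(settings)

# With AWS credentials configured, the settings could then be spliced into the
# deploy call (not run here):
# do.call(r2lambda::deploy_lambda, c(list(tag = "tidytuesday3"), settings))
```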

4. Invoke

If all goes well, our function should now be available in the cloud, awaiting requests. We can invoke it from R using invoke_lambda, passing the function name, the invocation type, the payload, and a flag for whether to include the logs:

response <- r2lambda::invoke_lambda(
  function_name = "tidytuesday3",
  invocation_type = "RequestResponse",
  payload = list(),
  include_logs = TRUE
)
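
An empty payload makes the runtime function fall back to today's date. To request a specific date instead, the payload needs a date element matching the argument of tidytuesday_lambda. A minimal sketch of building such a payload follows; the helper make_payload is hypothetical, and sending the date as a "YYYY-MM-DD" string is a choice made here to keep the serialized value unambiguous.

```r
# Hypothetical helper: build the payload for tidytuesday_lambda. An empty list
# means "use today's date"; otherwise the date is sent as a "YYYY-MM-DD" string.
make_payload <- function(date = NULL) {
  if (is.null(date)) {
    return(list())
  }
  list(date = format(as.Date(date), "%Y-%m-%d"))
}

make_payload()
make_payload("2022-02-02")

# Passed to the deployed function (not run here, needs AWS credentials):
# r2lambda::invoke_lambda(
#   function_name = "tidytuesday3",
#   invocation_type = "RequestResponse",
#   payload = make_payload("2022-02-02"),
#   include_logs = TRUE
# )
```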

Just like in the local test, the response payload comes as a raw vector that needs to be parsed into a data.frame:

tidytuesday_dataset <- response$Payload %>%
  rawToChar() %>%
  jsonlite::fromJSON(simplifyDataFrame = TRUE)
tidytuesday_dataset[[1]][1:5, 1:5]
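
The parsing steps can be reproduced offline with a simulated payload, which also shows why the result is a list with one data.frame per data set. The data below are made up for illustration, and the JSON is produced locally with jsonlite::toJSON rather than by a real Lambda invocation, so the actual wire format may differ in detail.

```r
library(jsonlite)

# Made-up stand-in for the list of data sets returned by the runtime function
data_list <- list(
  data.frame(id = 1:3, name = c("a", "b", "c")),
  data.frame(year = c(2021, 2022), n = c(10, 20))
)

# Simulate the response payload: JSON text stored as a raw vector
payload_raw <- charToRaw(as.character(toJSON(data_list)))

# Same parsing steps as above, written without the pipe
parsed <- fromJSON(rawToChar(payload_raw), simplifyDataFrame = TRUE)

length(parsed)
parsed[[1]]
```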

Summary

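In this post, we went over how to write a runtime function with {tidytuesdayR}, build and test the Lambda image locally with {r2lambda}, deploy it to AWS, and invoke it from R.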

Stay tuned for a follow-up post where we set this Lambda function to run on a weekly schedule!
