Vetiver: First steps in MLOps

[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers].

This is part one of a two-part series on {vetiver}. Future posts will be linked here as they are released.

  • Part 1: Vetiver: First steps in MLOps (This post)
  • Part 2: Vetiver: Model Deployment (Coming soon)

Most R users are familiar with the classic workflow popularised by R for Data Science. Data scientists begin by importing and cleaning the data, then iteratively transform, model, and visualise it. Visualisation drives the modeling process, which in turn prompts new visualisations, and periodically, they summarise their work and report results.

Traditional data science workflow diagram. Stages are import, tidy, then transform, visualise, model in a loop, then communicate.

This workflow stems partly from classical statistical modeling, where we are interested in a limited number of models and understanding the system behind the data. In contrast, machine learning prioritises prediction, necessitating the consideration and updating of many models. Machine Learning Operations (MLOps) expands the modeling component of the traditional data science workflow, providing a framework to continuously build, deploy, and maintain machine learning models in production.

Machine learning cycle diagram. Stages are import + tidy, model, version, deploy, monitor, looping back round to import and tidy. Version, deploy and monitor are all gathered under the logo for vetiver.

Data: Importing and Tidying

The first step in deploying your model is automating data importation and tidying. Although this step is a standard part of the data science workflow, a few considerations are worth highlighting.

File formats: Consider moving from large CSV files to a more efficient format like Parquet, which reduces storage costs and simplifies the tidying step.

Moving to packages: As your analysis matures, consider creating an R package to encourage proper documentation, testing, and dependency management.

Tidying & cleaning: With your code in a package and tests in place, optimise bottlenecks to improve efficiency.

Versioning data: Ensure reproducibility by including timestamps in your database queries or otherwise ensuring you can retrieve the same dataset in the future.
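
The versioning idea can be illustrated with a small sketch. It assumes each row carries a load timestamp; the table, the column name loaded_at, and the helper snapshot_as_of() are all hypothetical and exist only for illustration. The Parquet step is optional and only runs if the {arrow} package is installed.

```r
# Hypothetical raw table with a load timestamp (all names are illustrative)
raw = data.frame(
  id = 1:4,
  value = c(10.5, 11.2, 9.8, 12.1),
  loaded_at = as.POSIXct(
    c("2024-05-01 09:00:00", "2024-05-10 09:00:00",
      "2024-05-20 09:00:00", "2024-05-27 09:00:00"),
    tz = "UTC"
  )
)

# Keep only the rows that existed at a given point in time,
# so the same call returns the same dataset in the future
snapshot_as_of = function(data, as_of) {
  data[data$loaded_at <= as_of, , drop = FALSE]
}

as_of = as.POSIXct("2024-05-15 00:00:00", tz = "UTC")
snapshot = snapshot_as_of(raw, as_of)
nrow(snapshot)

# Optionally store the snapshot as Parquet rather than CSV
if (requireNamespace("arrow", quietly = TRUE)) {
  path = file.path(tempdir(), "snapshot-2024-05-15.parquet")
  arrow::write_parquet(snapshot, path)
  restored = arrow::read_parquet(path)
}
```

In a real pipeline the filter would normally live in the database query itself; the in-memory version here just keeps the example self-contained.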




Modelling

This post isn’t focused on modeling frameworks, so we’ll use {tidymodels} and the {palmerpenguins} dataset for brevity.

library("palmerpenguins")
library("tidymodels")
# Remove missing values
penguins_data = tidyr::drop_na(penguins, flipper_length_mm)

We aim to predict penguin species using island, flipper_length_mm, and body_mass_g. A scatter plot indicates this should be feasible.

Plot of body mass (g) vs flipper length (mm). The species of penguin is shown by the colour and the island is shown by the shape. There is a visible split between the Gentoo penguins and the others, with Gentoo being larger overall on both measures.

The scatter plot points to an obvious separation of Gentoo from the other species, but pulling apart Adelie and Chinstrap looks a little trickier.

Modelling-wise, we’ll again keep things simple – a straightforward nearest neighbour model, where we use the island, flipper length and body mass to predict species type:

model = recipe(species ~ island + flipper_length_mm + body_mass_g,
 data = penguins_data) |>
 workflow(nearest_neighbor(mode = "classification")) |>
 fit(penguins_data)

The model object can now be used to predict species. Predicting on the same data used to fit the model gives an accuracy of around 95%. Because no data was held out, treat this as an optimistic estimate.

model_pred = predict(model, penguins_data)
mean(model_pred$.pred_class == as.character(penguins_data$species))
#> [1] 0.9474
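
A more conservative estimate would come from a held-out test set. The following is a minimal sketch, not part of the original workflow: it assumes an 80/20 split stratified by species using initial_split() from {rsample} (attached by {tidymodels}), and the seed is arbitrary. No expected accuracy is shown because it depends on the split.

```r
library("palmerpenguins")
library("tidymodels")

penguins_data = tidyr::drop_na(penguins, flipper_length_mm)

# Arbitrary seed, used only so the split is reproducible
set.seed(2024)
split = initial_split(penguins_data, prop = 0.8, strata = species)
train = training(split)
test = testing(split)

# Same recipe and model specification as above, fitted on the training rows only
holdout_model = recipe(species ~ island + flipper_length_mm + body_mass_g,
  data = train) |>
  workflow(nearest_neighbor(mode = "classification")) |>
  fit(train)

# Accuracy on rows the model has not seen
holdout_pred = predict(holdout_model, test)
holdout_accuracy = mean(
  as.character(holdout_pred$.pred_class) == as.character(test$species)
)
holdout_accuracy
```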

Vetiver Model

Now that we have a model, we can start with MLOps and {vetiver}. First, collate all the necessary information to store, deploy, and version the model.

v_model = vetiver::vetiver_model(model,
 model_name = "k-nn",
 description = "blog-test")
v_model
#> 
#> ── k-nn ─ <bundled_workflow> model for deployment 
#> blog-test using 3 features

The v_model object is a list with six elements, including our description.

names(v_model)
#> [1] "model" "model_name" "description" "metadata" "prototype" 
#> [6] "versioned"

v_model$description
#> [1] "blog-test"

The metadata contains various model-related components.

v_model$metadata
#> $user
#> list()
#> 
#> $version
#> NULL
#> 
#> $url
#> NULL
#> 
#> $required_pkgs
#> [1] "kknn" "parsnip" "recipes" "workflows"
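
The empty $user slot is where custom metadata ends up. As a small sketch, the metadata argument of vetiver_model() accepts a list of your own fields. To keep the example light, it uses a linear regression on the built-in mtcars data as a stand-in for the k-nn model; the model name and the metadata fields are hypothetical.

```r
library("vetiver")

# Stand-in model: linear regression on the built-in mtcars data
cars_lm = lm(mpg ~ wt + hp, data = mtcars)

v_cars = vetiver_model(
  cars_lm,
  model_name = "cars-lm",           # hypothetical name
  description = "metadata-demo",
  # Hypothetical user-defined fields
  metadata = list(trained_on = "mtcars", owner = "data-team")
)

# Custom fields are stored under the user element of the metadata
v_cars$metadata$user
```

Fields such as the training data source or evaluation metrics are natural candidates here, since they travel with the model when it is stored.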

Storing your Model

To deploy a {vetiver} model object, we use a pin from the {pins} package. A pin is simply an R (or Python!) object that is stored for reuse at a later date. The most common use case of the {pins} package (at least for me) is caching data for a Shiny application or Quarto document.

However, we can pin any R object – including a pre-built model. We pin objects to “boards” – boards can exist in many places, including Azure, Google Drive, or a simple S3 bucket. For this example, I’m using Posit Connect:

vetiver::vetiver_pin_write(board = pins::board_connect(), v_model)

To retrieve the object, use:

# Not something you would normally do with a {vetiver} model
pins::pin_read(pins::board_connect(), "colin/k-nn")
#> $model
#> bundled workflow object.
#> 
#> $prototype
#> # A tibble: 0 × 3
#> # ℹ 3 variables: island <fct>, flipper_length_mm <int>, body_mass_g <int>
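
If you do not have access to Posit Connect, the same round trip can be tried on a temporary local board. This is a sketch under a few assumptions: it uses pins::board_temp() instead of board_connect(), a stand-in linear model on mtcars rather than the k-nn model, and a hypothetical model name. It reads the model back with vetiver_pin_read(), which returns a vetiver model object rather than the raw pin contents.

```r
library("vetiver")
library("pins")

# Stand-in model on built-in data; "cars-lm" is a hypothetical name
cars_lm = lm(mpg ~ wt + hp, data = mtcars)
v_cars = vetiver_model(cars_lm, model_name = "cars-lm")

# A temporary, versioned board that is removed when the session ends
board = board_temp(versioned = TRUE)
vetiver_pin_write(board, v_cars)

# Each write creates a new version of the pin
versions = pin_versions(board, "cars-lm")
versions

# Read the model back as a vetiver model object
v_restored = vetiver_pin_read(board, "cars-lm")
```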

Deploying as an API

The final step is to construct an API around your stored model. This is achieved using the {plumber} package. To deploy locally, i.e. on your own computer, we create a plumber instance and pass the model using {vetiver}:

plumber::pr() |>
 vetiver::vetiver_api(v_model) |>
 plumber::pr_run()
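
The router can also be built first and started separately, which makes it possible to pin the port. The sketch below assumes a stand-in linear model on mtcars with a hypothetical name; 7764 is simply the port used in this post. The call to pr_run() blocks the R session, so it is wrapped in interactive().

```r
library("vetiver")
library("plumber")

# Stand-in model on built-in data; "cars-lm" is a hypothetical name
cars_lm = lm(mpg ~ wt + hp, data = mtcars)
v_cars = vetiver_model(cars_lm, model_name = "cars-lm")

# Build the router without starting it
api = pr() |>
  vetiver_api(v_cars)

# Start the API on a fixed port; this blocks until interrupted
if (interactive()) {
  pr_run(api, port = 7764)
}
```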

This deploys the API locally. When you run the code, a browser window will likely open. If it doesn’t, simply navigate to http://127.0.0.1:7764/__docs__/, substituting the port printed in your console if it differs.

If the API has successfully deployed, then

base_url = "http://127.0.0.1:7764/"
url = paste0(base_url, "ping")
r = httr::GET(url)
metadata = httr::content(r, as = "text", encoding = "UTF-8")
jsonlite::fromJSON(metadata)

should return

#> $status
#> [1] "online"
#> 
#> $time
#> [1] "2024-05-27 17:15:39"

The API also has the endpoints metadata and pin-url, allowing you to programmatically query the model. The key endpoint for MLOps is predict. This endpoint allows you to pass new data to your model and predict the outcome:

url = paste0(base_url, "predict")
endpoint = vetiver::vetiver_endpoint(url)
pred_data = penguins_data |>
 dplyr::select("island", "flipper_length_mm", "body_mass_g") |>
 dplyr::slice_sample(n = 10)
predict(endpoint, pred_data)
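
For readers curious about what travels over the wire: my understanding is that predict() on a vetiver endpoint serialises the new data to JSON with {jsonlite} and sends it in a POST request. The sketch below only builds and inspects such a payload; it does not contact the API. The two rows are hypothetical values, not taken from the dataset.

```r
library("jsonlite")

# Hypothetical new observations with the three predictor columns
new_rows = data.frame(
  island = c("Biscoe", "Dream"),
  flipper_length_mm = c(210L, 190L),
  body_mass_g = c(4500L, 3600L)
)

# Data frames are serialised as a JSON array with one object per row
payload = toJSON(new_rows, na = "string")
payload

# Parsing the payload recovers the same columns
parsed = fromJSON(payload)
names(parsed)
```

Seeing the payload makes it easier to call the predict endpoint from other languages or tools, since any client that can POST this JSON can request predictions.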

Summary

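This post introduced the first steps in MLOps with {vetiver}: creating a vetiver model object, pinning it to a board, and serving it through a local API. In the next post, we’ll discuss deploying models in production.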
