This is part one of a two-part series on {vetiver}. Future blogs will
be linked here as they are released.
- Part 1: Vetiver: First steps in MLOps (This post)
- Part 2: Vetiver: Model Deployment (Coming soon)
Most R users are familiar with the classic workflow popularised by R for
Data Science. Data scientists begin by importing and cleaning the data,
then iteratively transform, model, and visualise it. Visualisation
drives the modelling process, which in turn prompts new visualisations,
and periodically, they summarise their work and report results.
This workflow stems partly from classical statistical modelling, where we
are interested in a limited number of models and in understanding the
system behind the data. In contrast, machine learning prioritises
prediction, which means many models need to be considered and updated.
Machine Learning Operations (MLOps) expands the modelling component of
the traditional data science workflow, providing a framework to
continuously build, deploy, and maintain machine learning models in
production.
Data: Importing and Tidying
The first step in deploying your model is automating data importation
and tidying. Although this step is a standard part of the data science
workflow, a few considerations are worth highlighting.
- File formats: Consider moving from large CSV files to a more
  efficient format like Parquet, which reduces storage costs and
  simplifies the tidying step (see the sketch after this list).
- Moving to packages: As your analysis matures, consider creating an R
  package to encourage proper documentation, testing, and dependency
  management.
- Tidying & cleaning: With your code in a package and tests in place,
  optimise bottlenecks to improve efficiency.
- Versioning data: Ensure reproducibility by including timestamps in
  your database queries or otherwise ensuring you can retrieve the same
  dataset in the future.
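As a rough sketch of the file-format point above, {arrow} can convert a CSV file to Parquet. The dated file names are purely illustrative (they also hint at the versioning point), and are not from the original post:

library("arrow")
# One-off conversion: read the legacy CSV (hypothetical file name)...
penguins_raw = arrow::read_csv_arrow("penguins-2024-05-27.csv")
# ...and write it back out as Parquet, which is smaller and faster to re-read
arrow::write_parquet(penguins_raw, "penguins-2024-05-27.parquet")
# Subsequent pipeline runs can then load the Parquet file directly
penguins_raw = arrow::read_parquet("penguins-2024-05-27.parquet")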
Modelling
This post isn’t focused on modelling frameworks, so for brevity we’ll
use {tidymodels} and the penguins data from {palmerpenguins}.
library("palmerpenguins") library("tidymodels") # Remove missing values penguins_data = tidyr::drop_na(penguins, flipper_length_mm)
We aim to predict penguin species using island, flipper_length_mm, and
body_mass_g. A scatter plot indicates this should be feasible.
The scatter plot shows an obvious separation between Gentoo and the
other two species, but pulling apart Adelie and Chinstrap looks a
little trickier.
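The original plot isn't reproduced here, but a minimal {ggplot2} sketch along the same lines (the exact aesthetics are an assumption) would be:

library("ggplot2")
# Flipper length against body mass, coloured by species and split by island
ggplot(penguins_data, aes(x = flipper_length_mm, y = body_mass_g, colour = species)) +
  geom_point() +
  facet_wrap(vars(island))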
Modelling-wise, we’ll again keep things simple, with a straightforward
nearest-neighbour model that uses island, flipper length, and body mass
to predict the species:
model = recipe(species ~ island + flipper_length_mm + body_mass_g, data = penguins_data) |>
  workflow(nearest_neighbor(mode = "classification")) |>
  fit(penguins_data)
The model object can now be used to predict species. Reusing the same
data as before, we have an accuracy of around 95%.
model_pred = predict(model, penguins_data)
mean(model_pred$.pred_class == as.character(penguins_data$species))
#> [1] 0.9474
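As an optional aside (not from the original post), the same accuracy can be computed with {yardstick}, which is loaded as part of {tidymodels}:

# Combine the predictions with the observed species...
results = dplyr::bind_cols(penguins_data, model_pred)
# ...then compute accuracy with {yardstick}
yardstick::accuracy(results, truth = species, estimate = .pred_class)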
Vetiver Model
Now that we have a model, we can start with MLOps and {vetiver}. First,
collate all the necessary information to store, deploy, and version the
model.
v_model = vetiver::vetiver_model(model, model_name = "k-nn", description = "blog-test")
v_model
#>
#> ── k-nn ─ <bundled_workflow> model for deployment
#> blog-test using 3 features
The v_model object is a list with six elements, including our
description.
names(v_model)
#> [1] "model"       "model_name"  "description" "metadata"    "prototype"
#> [6] "versioned"
v_model$description
#> [1] "blog-test"
The metadata contains various model-related components.
v_model$metadata
#> $user
#> list()
#>
#> $version
#> NULL
#>
#> $url
#> NULL
#>
#> $required_pkgs
#> [1] "kknn"      "parsnip"   "recipes"   "workflows"
Storing your Model
To deploy a {vetiver} model object, we use a pin from the {pins}
package. A pin is simply an R (or Python!) object that is stored for
reuse at a later date. The most common use case of {pins} (at least for
me) is caching data for a Shiny application or Quarto document:
basically, an easy way to cache data.
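For example, a minimal caching sketch (using a throwaway temporary board; the pin name is arbitrary and not from the original post) might look like:

library("pins")
# A temporary board for illustration; in practice this might be Posit Connect or S3
board = pins::board_temp()
# Cache an R object under a name...
pins::pin_write(board, penguins_data, name = "penguins-data")
# ...and read it back later, e.g. inside a Shiny app or Quarto document
pins::pin_read(board, "penguins-data")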
However, we can pin any R object, including a pre-built model. We pin
objects to “boards”, which can live in many places, including Azure,
Google Drive, or a simple S3 bucket. For this example, I’m using
Posit Connect:
vetiver::vetiver_pin_write(board = pins::board_connect(), v_model)
To retrieve the object, use:
# Not something you would normally do with a {vetiver} model
pins::pin_read(pins::board_connect(), "colin/k-nn")
#> $model
#> bundled workflow object.
#>
#> $prototype
#> # A tibble: 0 × 3
#> # ℹ 3 variables: island <fct>, flipper_length_mm <int>, body_mass_g <int>
Deploying as an API
The final step is to construct an API around your stored model. This is
achieved using the {plumber} package. To deploy locally, i.e. on your
own computer, we create a plumber instance and pass in the model using
{vetiver}:
plumber::pr() |>
  vetiver::vetiver_api(v_model) |>
  plumber::pr_run()
This deploys the API locally. When you run the code, a browser window
will likely open. If it doesn’t, simply navigate to
http://127.0.0.1:7764/__docs__/.
If the API has successfully deployed, then
base_url = "127.0.0.1:7764/"
url = paste0(base_url, "ping")
r = httr::GET(url)
metadata = httr::content(r, as = "text", encoding = "UTF-8")
jsonlite::fromJSON(metadata)
should return
#$status
#[1] "online"
#
#$time
#[1] "2024-05-27 17:15:39"
The API also has metadata and pin-url endpoints, allowing you to
programmatically query the model. The key endpoint for MLOps is
predict. This endpoint allows you to pass new data to your model and
predict the outcome:
url = paste0(base_url, "predict")
endpoint = vetiver::vetiver_endpoint(url)
pred_data = penguins_data |>
  dplyr::select("island", "flipper_length_mm", "body_mass_g") |>
  dplyr::slice_sample(n = 10)
predict(endpoint, pred_data)
Summary
This post introduces MLOps and its applications. In the next post, we’ll
discuss deploying models in production.