This post is the latest in our three-part series on MLOps with Vetiver, following on from:

- Part 1: Vetiver: First steps in MLOps
- Part 2: Vetiver: Model Deployment
In those blogs, we introduced the {vetiver} package and its use as a
tool for streamlined MLOps. Using the {palmerpenguins} dataset as an
example, we outlined the steps of training a model using {tidymodels}
then converting this into a {vetiver} model. We then demonstrated the
steps of versioning our trained model and deploying it into production.
Getting your first model into production is great! But it’s really only
the beginning, as you will now have to carefully monitor it over time to
ensure that it continues to perform as expected on the latest data.
Thankfully, {vetiver} comes with a suite of functions for this exact
purpose!
Preparing the data
A crucial step in the monitoring process is the introduction of a time
component. We will be tracking key scoring metrics over time as new data
is collected, therefore our analysis will now depend on a time dimension
even if our deployed model has no explicit time dependence.
To demonstrate the monitoring steps, we will be working with the World Health Organisation Life Expectancy data, which tracks the average life expectancy in various countries over a number of years. We start by loading the data:
```r
download.file(
  "https://www.kaggle.com/api/v1/datasets/download/kumarajarshi/life-expectancy-who",
  "archive.zip"
)
unzip("archive.zip")
life_expectancy = readr::read_csv("./Life Expectancy Data.csv")
```
We will attempt to predict the life expectancy using the percentage
expenditure, total expenditure, population, body-mass-index (BMI) and
schooling. Let’s select the columns of interest, tidy up the variable
names and drop any missing values:
```r
life_expectancy = life_expectancy |>
  janitor::clean_names(case = "snake", abbreviations = c("BMI")) |>
  dplyr::select(
    "year", "life_expectancy", "percentage_expenditure",
    "total_expenditure", "population", "bmi", "schooling"
  ) |>
  tidyr::drop_na()
life_expectancy
#> # A tibble: 2,111 × 7
#>     year life_expectancy percentage_expenditure total_expenditure population
#>    <dbl>           <dbl>                  <dbl>             <dbl>      <dbl>
#>  1  2015            65                     71.3              8.16   33736494
#>  2  2014            59.9                   73.5              8.18     327582
#>  3  2013            59.9                   73.2              8.13   31731688
#>  4  2012            59.5                   78.2              8.52    3696958
#>  5  2011            59.2                    7.10             7.87    2978599
#>  6  2010            58.8                   79.7              9.2     2883167
#>  7  2009            58.6                   56.8              9.42     284331
#>  8  2008            58.1                   25.9              8.33    2729431
#>  9  2007            57.5                   10.9              6.73   26616792
#> 10  2006            57.3                   17.2              7.43    2589345
#> # ℹ 2,101 more rows
#> # ℹ 2 more variables: bmi <dbl>, schooling <dbl>
```
The data contains a numeric year column which will come in handy for monitoring the model performance over time. However, the {vetiver} monitoring functions require this column to use <date> ("YYYY-MM-DD") formatting, and it will have to be sorted in ascending order:
```r
life_expectancy = life_expectancy |>
  dplyr::mutate(
    year = lubridate::ymd(year, truncated = 2L)
  ) |>
  dplyr::arrange(year)
life_expectancy
#> # A tibble: 2,111 × 7
#>    year       life_expectancy percentage_expenditure total_expenditure
#>    <date>               <dbl>                  <dbl>             <dbl>
#>  1 2000-01-01            54.8                   10.4              8.2
#>  2 2000-01-01            72.6                   91.7              6.26
#>  3 2000-01-01            71.3                  154.               3.49
#>  4 2000-01-01            45.3                   15.9              2.79
#>  5 2000-01-01            74.1                 1349.               9.21
#>  6 2000-01-01            72                     32.8              6.25
#>  7 2000-01-01            79.5                  347.               8.8
#>  8 2000-01-01            78.1                 3557.               1.6
#>  9 2000-01-01            66.6                   35.1              4.67
#> 10 2000-01-01            65.3                    3.70             2.33
#> # ℹ 2,101 more rows
#> # ℹ 3 more variables: population <dbl>, bmi <dbl>, schooling <dbl>
```
Finally, let’s imagine the year is currently 2002, so our historical
training data should only cover the years 2000 to 2002:
```r
historic_life_expectancy = life_expectancy |>
  dplyr::filter(year <= "2002-01-01")
```
Later in this post we will check how our model performs on more recent
data to illustrate the effects of model drift.
Training our model
Before we start training our model, we should split the data into
“train” and “test” sets:
library("tidymodels") data_split = rsample::initial_split( historic_life_expectancy, prop = 0.7 ) train_data = rsample::training(data_split) test_data = rsample::testing(data_split)
The test set makes up 30% of the original data and will be used to score
the model on unseen data following training.
The code cell below handles the steps of setting up a trained model in
{vetiver} and versioning it using {pins}. For a more detailed
explanation of what this code is doing, we refer the reader back to
Part 1.
We will again use a basic K-nearest-neighbour model, although this
time we have set up the workflow as a regression model since we are
predicting a continuous quantity. Note that this requires the {kknn}
package to be installed.
```r
# Train the model with {tidymodels}
model = recipe(
  life_expectancy ~ percentage_expenditure + total_expenditure +
    population + bmi + schooling,
  data = train_data
) |>
  workflow(nearest_neighbor(mode = "regression")) |>
  fit(train_data)

# Convert to a {vetiver} model
v_model = vetiver::vetiver_model(
  model,
  model_name = "k-nn",
  description = "life-expectancy"
)

# Store the model using {pins}
model_board = pins::board_temp(versioned = TRUE)
vetiver::vetiver_pin_write(model_board, v_model)
```
Here the model {pins} board is created using pins::board_temp(), which generates a temporary local folder.
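A temporary board is convenient for a demonstration, but it disappears at the end of the R session. In a real deployment you would typically use a persistent board instead; a minimal sketch, assuming a local folder (the path is purely illustrative, and a cloud board such as pins::board_s3() would work equally well):

```r
# Store model versions in a persistent local folder rather than a temp dir
# (the path is illustrative; any writable location will do)
persistent_board = pins::board_folder("~/vetiver-pins", versioned = TRUE)
vetiver::vetiver_pin_write(persistent_board, v_model)
```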
At this point we should check how our model performs on the unseen test data. The mean absolute error (mae), root-mean-squared error (rmse) and R² (rsq) can be computed over a specified time period using vetiver::vetiver_compute_metrics():
```r
metrics = augment(v_model, new_data = test_data) |>
  vetiver::vetiver_compute_metrics(
    date_var = year,
    period = "year",
    truth = life_expectancy,
    estimate = .pred
  )
metrics
#> # A tibble: 9 × 5
#>   .index        .n .metric .estimator .estimate
#>   <date>     <int> <chr>   <chr>          <dbl>
#> 1 2000-01-01    46 rmse    standard       4.06
#> 2 2000-01-01    46 rsq     standard       0.836
#> 3 2000-01-01    46 mae     standard       3.05
#> 4 2001-01-01    44 rmse    standard       4.61
#> 5 2001-01-01    44 rsq     standard       0.844
#> 6 2001-01-01    44 mae     standard       3.43
#> 7 2002-01-01    36 rmse    standard       4.14
#> 8 2002-01-01    36 rsq     standard       0.853
#> 9 2002-01-01    36 mae     standard       3.04
```
The first line of code here sends new data (in this case the unseen test data) to our model and generates a .pred column containing the model predictions. This output is then piped to vetiver::vetiver_compute_metrics(), which includes the following arguments:

- date_var: the name of the date column to use for monitoring the model performance over time.
- period: the period ("hour", "day", "week", etc.) over which the scoring metrics should be computed. We are restricted by our data to using "year"; for more granular data it may be more sensible to monitor the model over shorter timescales.
- truth: the actual values of the target variable (in our example this is the life_expectancy column of the test data).
- estimate: the predictions of the target variable to compare the actual values against (in our example this is the .pred column computed in the previous step).
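The reported metrics come from {yardstick}, and vetiver_compute_metrics() also accepts a metric_set argument if you want to track a different set of scores. A short sketch, where the extra mape metric is included purely as an illustration:

```r
# Track a custom set of regression metrics via the metric_set argument
# (the particular metrics chosen here are just an example)
custom_metrics = yardstick::metric_set(
  yardstick::rmse,
  yardstick::mae,
  yardstick::mape # mean absolute percentage error
)

custom_scores = augment(v_model, new_data = test_data) |>
  vetiver::vetiver_compute_metrics(
    date_var = year,
    period = "year",
    truth = life_expectancy,
    estimate = .pred,
    metric_set = custom_metrics
  )
```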
We will come back to these metrics later in this post, so for now let’s
store them along with our model using {pins}:
```r
pins::pin_write(model_board, metrics, "k-nn")
```
We will skip over the details of deploying our model since this is
already covered in Part 2.
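As a quick reminder of what that step involves, the pinned model can be served as a REST API with {plumber}; a minimal sketch along the lines of Part 2 (the port is arbitrary):

```r
# Serve the {vetiver} model as a local REST API (see Part 2 for details)
library("plumber")

pr() |>
  vetiver::vetiver_api(v_model) |>
  pr_run(port = 8080)
```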
Monitoring our model
Over time we may notice our model start to drift, where its
predictions gradually diverge from the truth as the data evolves. There
are two common causes of this:
- Data drift: the statistical distribution of an input variable changes.
- Concept drift: the relationship between the target and an input variable changes.
Taking the example of life expectancy data:
- A country’s expenditure is expected to vary over time due to changes in government policy and unexpected events like pandemics and economic crashes. This is data drift.
- Advances in medicine may mean that life expectancy can improve even if BMI remains unchanged. This is concept drift.
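Tracking the scoring metrics (below) will surface the effects of both kinds of drift, but you can also look for data drift directly by comparing the distribution of an input variable across time periods. A rough sketch using a two-sample Kolmogorov-Smirnov test; this test is not part of {vetiver} and the choice of variable and windows is purely illustrative:

```r
# Compare percentage_expenditure in the training window (2000-2002)
# with a later window (2008 onwards)
old_values = life_expectancy |>
  dplyr::filter(year <= "2002-01-01") |>
  dplyr::pull(percentage_expenditure)
new_values = life_expectancy |>
  dplyr::filter(year >= "2008-01-01") |>
  dplyr::pull(percentage_expenditure)

# A small p-value suggests the distribution has shifted, i.e. possible data drift
stats::ks.test(old_values, new_values)
```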
Going back to our model which was trained using data from 2000 to 2002,
let’s now check how it would perform on “future” data up to 2010:
```r
# Generate "new" data from 2003 to 2010
new_life_expectancy = life_expectancy |>
  dplyr::filter(year > "2002-01-01" & year <= "2010-01-01")

# Score the model performance on the new data
new_metrics = augment(v_model, new_data = new_life_expectancy) |>
  vetiver::vetiver_compute_metrics(
    date_var = year,
    period = "year",
    truth = life_expectancy,
    estimate = .pred
  )
new_metrics
#> # A tibble: 24 × 5
#>    .index        .n .metric .estimator .estimate
#>    <date>     <int> <chr>   <chr>          <dbl>
#>  1 2003-01-01   141 rmse    standard       5.21
#>  2 2003-01-01   141 rsq     standard       0.760
#>  3 2003-01-01   141 mae     standard       3.64
#>  4 2004-01-01   141 rmse    standard       5.14
#>  5 2004-01-01   141 rsq     standard       0.761
#>  6 2004-01-01   141 mae     standard       3.60
#>  7 2005-01-01   141 rmse    standard       5.83
#>  8 2005-01-01   141 rsq     standard       0.684
#>  9 2005-01-01   141 mae     standard       4.19
#> 10 2006-01-01   141 rmse    standard       6.23
#> # ℹ 14 more rows
```
Let’s now store the new metrics in the model {pins} board (along with
the original metrics):
```r
vetiver::vetiver_pin_metrics(
  model_board,
  new_metrics,
  "k-nn"
)
```
We can now load both the original and new metrics, then visualise these with vetiver::vetiver_plot_metrics():
```r
# Load the metrics
monitoring_metrics = pins::pin_read(model_board, "k-nn")

# Plot the metrics
vetiver::vetiver_plot_metrics(monitoring_metrics) +
  scale_size(name = "Number of\nobservations", range = c(2, 4)) +
  theme_minimal()
```
The size of the data points represents the number of observations used
to compute the metrics at each period. Up to 2002 we are using the
unseen test data to score our model; after this we are using the full
available data set.
We observe an increasing model error over time, suggesting that the deployed model should be retrained on more recent data rather than left running on its original training window. For this particular data set it would be sensible to retrain and redeploy the model annually.
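Retraining follows exactly the same steps as before, just with more recent data, and writing to the same versioned board keeps every previous model retrievable. A minimal sketch of what an annual refresh could look like (the cut-off date is illustrative):

```r
# Refresh the training data up to the "current" year
latest_data = life_expectancy |>
  dplyr::filter(year <= "2010-01-01")

# Retrain the same workflow on the refreshed data
model = recipe(
  life_expectancy ~ percentage_expenditure + total_expenditure +
    population + bmi + schooling,
  data = latest_data
) |>
  workflow(nearest_neighbor(mode = "regression")) |>
  fit(latest_data)

# Store the retrained model as a new version on the same board
v_model = vetiver::vetiver_model(model, model_name = "k-nn")
vetiver::vetiver_pin_write(model_board, v_model)
```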
Summary
In this blog we have introduced the idea of monitoring models in
production using the Vetiver framework. Using the life expectancy data
from the World Health Organisation as an example, we have outlined how
to track key model metrics over time and identify model drift.
As you start to retire your old models and replace these with new models
trained on the latest data, make sure to keep ALL of your models (old
and new) versioned and stored. That way you can retrieve any historical
model and establish why it gave a particular prediction on a particular
date.
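With the versioned {pins} board used in this post, that is straightforward: every write creates a new version, and old versions remain readable. A short sketch, assuming the model was pinned under the name "k-nn" as above:

```r
# List every stored version of the "k-nn" pin
versions = pins::pin_versions(model_board, "k-nn")
versions

# Read back a specific historical version as a {vetiver} model
# (the first listed version is used here purely as an example)
old_model = vetiver::vetiver_pin_read(
  model_board,
  "k-nn",
  version = versions$version[1]
)
```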
The {vetiver} framework also includes an R Markdown template for
creating a model monitoring dashboard. For more on this, check out the
{vetiver} documentation.
The next post in our Vetiver series will provide an outline of the
Python framework. Stay tuned for that sometime in the new year!
Long-term implications and future developments: A deep dive into MLOps with Vetiver
This article centers on the third part of the MLOps with Vetiver series, focusing on the importance of monitoring models after they have been deployed. Through a detailed discussion of how to prepare the data, train a model, and monitor it in production, several key points emerge.
The importance of model monitoring
Getting your first model into production is only the starting point. Ensuring that it continues to perform well on the latest data becomes crucial once it is up and running. The Vetiver framework provides a suite of functions specifically designed for this purpose, allowing you to monitor your model’s performance over time as the input data and its relationship to the target evolve.
Concepts of data drift and concept drift
The piece also introduces the concepts of data drift and concept drift. Data drift refers to the statistical distribution of an input variable changing over time. Concept drift, meanwhile, means that the relationship between the target and an input variable shifts. Ensuring that a model continues to perform well despite these changes is a primary goal of monitoring.
Actionable advice on how to approach monitoring
The detailed explanation leaves us with some useful actionable insights:
- Adopt a continuous approach to model monitoring, tracking key scoring metrics as data evolves.
- Introduce a time component into the monitoring process, since the analysis of model performance depends on it.
- Ensure that your models are retrained with the latest data periodically, or as often as necessary depending on the application.
- Maintain rigorous versioning of your models; this facilitates better tracking of changes and understanding of the potential cause of issues.
- If the model comes to a point where it errs consistently, it’s wiser to retire old models and replace them with newer ones trained on the latest data.
Future developments with Vetiver
In terms of what’s next for Vetiver, an outline of the Python framework is on the horizon. This could potentially open up new opportunities and make Vetiver useful to a wider audience, not just those using R.
Additionally, Vetiver also provides an R Markdown template for creating model monitoring dashboards. This will provide a more hands-on, visual tool for monitoring models and should aid better decision-making for data scientists in the future.
The ongoing development of Vetiver signifies a focus on closing the gap between model development and deployment, providing robust tools for the maintenance and optimization of models in production.