[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers].

Introduction

Yesterday I discussed the internal_make_wflw_predictions() function in the tidyAML R package. Today I will cover extract_wflw_pred() and the brand new extract_regression_residuals(), also in tidyAML. Yesterday we briefly saw the output of internal_make_wflw_predictions(): a list of tibbles that typically sits inside a list column in the final output of fast_regression() and fast_classification(). The function extract_wflw_pred() pulls those tibbles out of that output. The function extract_regression_residuals() extracts the same tibbles and additionally returns the residuals. Let’s see how these functions work.

The new function

First, we will go over the syntax of the new function extract_regression_residuals().

extract_regression_residuals(.model_tbl, .pivot_long = FALSE)

The function takes two arguments. The first, .model_tbl, is the output of fast_regression() or fast_classification(). The second, .pivot_long, is a logical argument that defaults to FALSE; if TRUE, the output is returned in long format, and if FALSE, in wide format. Let’s see how this works.
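To make the two shapes concrete, here is a minimal base-R sketch using made-up numbers (the column names mirror the ones tidyAML returns, but this is not tidyAML's internal code) showing how a wide residuals table maps to the long form:

```r
# Hypothetical wide-format residuals table with made-up numbers;
# the column names mirror what extract_regression_residuals() returns
wide <- data.frame(
  .model_type = "lm - linear_reg",
  .actual     = c(15.2, 10.4, 33.9),
  .predicted  = c(17.3, 11.9, 30.8)
)
wide$.resid <- wide$.actual - wide$.predicted

# Pivot to long: each row becomes three (name, value) rows,
# which is the shape .pivot_long = TRUE produces
measures <- c(".actual", ".predicted", ".resid")
long <- do.call(rbind, lapply(seq_len(nrow(wide)), function(i) {
  data.frame(
    .model_type = wide$.model_type[i],
    name        = measures,
    value       = unlist(wide[i, measures], use.names = FALSE)
  )
}))

nrow(long)  # 3 rows x 3 measure columns = 9
```

So a wide table with n rows becomes a long table with 3n rows, one row per (observation, measure) pair.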

Example

# Load packages
library(tidyAML)
library(tidymodels)
library(tidyverse)
library(multilevelmod) # for the gee model

tidymodels_prefer() # good practice when using tidyAML

rec_obj <- recipe(mpg ~ ., data = mtcars)
frt_tbl <- fast_regression(
  .data = mtcars,
  .rec_obj = rec_obj,
  .parsnip_eng = c("lm","glm","stan","gee"),
  .parsnip_fns = "linear_reg"
  )

Let’s break down the R code step by step:

  1. Loading Libraries:
library(tidyAML)
library(tidymodels)
library(tidyverse)
library(multilevelmod) # for the gee model

Here, the code is loading several R packages. These packages provide functions and tools for data analysis, modeling, and visualization. tidyAML and tidymodels are particularly relevant for modeling, while tidyverse is a collection of packages for data manipulation and visualization. multilevelmod is included for the Generalized Estimating Equations (gee) model.

  2. Setting Preferences:

    tidymodels_prefer() # good practice when using tidyAML

This line calls tidymodels_prefer(), which resolves function-name conflicts in favor of the tidymodels packages (for example, preferring the tidymodels version of a function over a same-named function from another loaded package). This is good practice when using tidyAML, since it keeps the modeling workflow consistent and predictable.

  3. Creating a Recipe Object:

    rec_obj <- recipe(mpg ~ ., data = mtcars)

Here, a recipe object (rec_obj) is created using the recipe function from the tidymodels package. The formula mpg ~ . specifies that we want to predict the mpg variable based on all other variables in the dataset (mtcars).

  4. Performing Fast Regression:

    frt_tbl <- fast_regression(
      .data = mtcars,
      .rec_obj = rec_obj,
      .parsnip_eng = c("lm","glm","stan","gee"),
      .parsnip_fns = "linear_reg"
    )

This part involves using the fast_regression function. It performs a fast regression analysis using various engines specified by .parsnip_eng and specific functions specified by .parsnip_fns. In this case, it includes linear models (lm), generalized linear models (glm), Stan models (stan), and the Generalized Estimating Equations model (gee). The results are stored in the frt_tbl table.

In summary, the code is setting up a tidy modeling workflow, creating a recipe for predicting mpg based on other variables in the mtcars dataset, and then performing a fast regression using different engines and functions. The choice of engines and functions allows flexibility in exploring different modeling approaches.
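For intuition, the "lm" engine above fits the same kind of model as base R's lm(). A rough base-R sanity check (note this fits on the full mtcars data, not tidyAML's train/test split, so the numbers will not match the tidyAML output exactly):

```r
# Base-R analogue of the "lm" engine used above: regress mpg on all
# other columns of mtcars (full data, no train/test split)
fit <- lm(mpg ~ ., data = mtcars)

# In-sample predictions and residuals
preds  <- predict(fit, newdata = mtcars)
resids <- mtcars$mpg - preds

length(preds)  # one prediction per row of mtcars: 32
```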

Now that we have the output of fast_regression() stored in frt_tbl, we can use the function extract_wflw_pred() to extract the predictions from the output. Let’s see how this works. First, the syntax:

extract_wflw_pred(.data, .model_id = NULL)

The function takes two arguments. The first, .data, is the output of fast_regression() or fast_classification(). The second, .model_id, is a numeric vector that defaults to NULL. If NULL, the function extracts no predictions; if a numeric vector is supplied, it extracts the predictions for the models specified by that vector. Let’s see how this works.
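Conceptually, the extraction is just indexing into a list of per-model prediction tables and row-binding the result. A toy base-R sketch (the structure and the extract_preds() helper here are illustrative stand-ins, not tidyAML's actual internals):

```r
# Toy stand-in for the predictions list column: one data.frame per model
pred_list <- list(
  data.frame(.model_type = "lm - linear_reg",  .value = c(15.2, 10.4)),
  data.frame(.model_type = "glm - linear_reg", .value = c(15.1, 10.6))
)

# Hypothetical helper mimicking the .model_id behaviour described above
extract_preds <- function(pred_list, model_id = NULL) {
  if (is.null(model_id)) return(NULL)  # NULL id -> nothing extracted
  do.call(rbind, pred_list[model_id])  # row-bind the selected models
}

nrow(extract_preds(pred_list, 1))    # 2 rows (first model only)
nrow(extract_preds(pred_list, 1:2))  # 4 rows (both models stacked)
```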

extract_wflw_pred(frt_tbl, 1)
# A tibble: 64 × 4
   .model_type     .data_category .data_type .value
   <chr>           <chr>          <chr>       <dbl>
 1 lm - linear_reg actual         actual       15.2
 2 lm - linear_reg actual         actual       10.4
 3 lm - linear_reg actual         actual       33.9
 4 lm - linear_reg actual         actual       32.4
 5 lm - linear_reg actual         actual       16.4
 6 lm - linear_reg actual         actual       21.5
 7 lm - linear_reg actual         actual       15.8
 8 lm - linear_reg actual         actual       15
 9 lm - linear_reg actual         actual       14.7
10 lm - linear_reg actual         actual       10.4
# ℹ 54 more rows
extract_wflw_pred(frt_tbl, 1:2)
# A tibble: 128 × 4
   .model_type     .data_category .data_type .value
   <chr>           <chr>          <chr>       <dbl>
 1 lm - linear_reg actual         actual       15.2
 2 lm - linear_reg actual         actual       10.4
 3 lm - linear_reg actual         actual       33.9
 4 lm - linear_reg actual         actual       32.4
 5 lm - linear_reg actual         actual       16.4
 6 lm - linear_reg actual         actual       21.5
 7 lm - linear_reg actual         actual       15.8
 8 lm - linear_reg actual         actual       15
 9 lm - linear_reg actual         actual       14.7
10 lm - linear_reg actual         actual       10.4
# ℹ 118 more rows
extract_wflw_pred(frt_tbl, 1:nrow(frt_tbl))
# A tibble: 256 × 4
   .model_type     .data_category .data_type .value
   <chr>           <chr>          <chr>       <dbl>
 1 lm - linear_reg actual         actual       15.2
 2 lm - linear_reg actual         actual       10.4
 3 lm - linear_reg actual         actual       33.9
 4 lm - linear_reg actual         actual       32.4
 5 lm - linear_reg actual         actual       16.4
 6 lm - linear_reg actual         actual       21.5
 7 lm - linear_reg actual         actual       15.8
 8 lm - linear_reg actual         actual       15
 9 lm - linear_reg actual         actual       14.7
10 lm - linear_reg actual         actual       10.4
# ℹ 246 more rows

The first line of code extracts the predictions for the first model in the output. The second line of code extracts the predictions for the first two models in the output. The third line of code extracts the predictions for all models in the output.

Now, let’s visualize the predictions for the models in the output and the actual values. We will use the ggplot2 package for visualization. First, we will extract the predictions for all models in the output and store them in a table called pred_tbl. Then, we will use ggplot2 to visualize the predictions and actual values.

pred_tbl <- extract_wflw_pred(frt_tbl, 1:nrow(frt_tbl))

pred_tbl |>
  group_split(.model_type) |>
  map(\(x) x |>
        group_by(.data_category) |>
        mutate(x = row_number()) |>
        ungroup() |>
        pivot_wider(names_from = .data_type, values_from = .value) |>
        ggplot(aes(x = x, y = actual, group = .data_category)) +
        geom_line(color = "black") +
        geom_line(aes(x = x, y = training), linetype = "dashed", color = "red",
                  linewidth = 1) +
        geom_line(aes(x = x, y = testing), linetype = "dashed", color = "blue",
                  linewidth = 1) +
        theme_minimal() +
        labs(
          x = "",
          y = "Observed/Predicted Value",
          title = "Observed vs. Predicted Values by Model Type",
          subtitle = x$.model_type[1]
        )
      )
[[1]]

[[2]]

[[3]]

[[4]]

Or we can facet them by model type:

pred_tbl |>
  group_by(.model_type, .data_category) |>
  mutate(x = row_number()) |>
  ungroup() |>
  ggplot(aes(x = x, y = .value)) +
  geom_line(data = . %>% filter(.data_type == "actual"), color = "black") +
  geom_line(data = . %>% filter(.data_type == "training"),
            linetype = "dashed", color = "red") +
  geom_line(data = . %>% filter(.data_type == "testing"),
            linetype = "dashed", color = "blue") +
  facet_wrap(~ .model_type, ncol = 2, scales = "free") +
  labs(
    x = "",
    y = "Observed/Predicted Value",
    title = "Observed vs. Predicted Values by Model Type"
  ) +
  theme_minimal()

Ok, so what about the new function I talked about above? We have already covered its syntax, so there is no need to go over it again; let’s jump right into an example. This function returns the residuals for all models. We will slice off just the first model for demonstration purposes.

extract_regression_residuals(.model_tbl = frt_tbl, .pivot_long = FALSE)[[1]]
# A tibble: 32 × 4
   .model_type     .actual .predicted .resid
   <chr>             <dbl>      <dbl>  <dbl>
 1 lm - linear_reg    15.2       17.3 -2.09
 2 lm - linear_reg    10.4       11.9 -1.46
 3 lm - linear_reg    33.9       30.8  3.06
 4 lm - linear_reg    32.4       28.0  4.35
 5 lm - linear_reg    16.4       15.0  1.40
 6 lm - linear_reg    21.5       22.3 -0.779
 7 lm - linear_reg    15.8       17.2 -1.40
 8 lm - linear_reg    15         15.1 -0.100
 9 lm - linear_reg    14.7       10.9  3.85
10 lm - linear_reg    10.4       10.8 -0.445
# ℹ 22 more rows
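The three value columns obey the simple identity .resid = .actual - .predicted. Checking it against the first three rows printed above (using the rounded values as displayed, so the results differ slightly from the table's .resid column, which is computed from the unrounded predictions):

```r
# First three rows of the residuals table, as printed (rounded values)
actual    <- c(15.2, 10.4, 33.9)
predicted <- c(17.3, 11.9, 30.8)

resid <- actual - predicted
round(resid, 2)  # close to the printed -2.09, -1.46, 3.06
```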

Now let’s set .pivot_long = TRUE:

extract_regression_residuals(.model_tbl = frt_tbl, .pivot_long = TRUE)[[1]]
# A tibble: 96 × 3
   .model_type     name       value
   <chr>           <chr>      <dbl>
 1 lm - linear_reg .actual    15.2
 2 lm - linear_reg .predicted 17.3
 3 lm - linear_reg .resid     -2.09
 4 lm - linear_reg .actual    10.4
 5 lm - linear_reg .predicted 11.9
 6 lm - linear_reg .resid     -1.46
 7 lm - linear_reg .actual    33.9
 8 lm - linear_reg .predicted 30.8
 9 lm - linear_reg .resid      3.06
10 lm - linear_reg .actual    32.4
# ℹ 86 more rows

Now let’s visualize the data:

resid_tbl <- extract_regression_residuals(frt_tbl, TRUE)

resid_tbl |>
  map(\(x) x |>
        group_by(name) |>
        mutate(x = row_number()) |>
        ungroup() |>
        mutate(plot_group = ifelse(name == ".resid", "Residuals", "Actual and Predictions")) |>
        ggplot(aes(x = x, y = value, group = name, color = name)) +
        geom_line() +
        theme_minimal() +
        facet_wrap(~ plot_group, ncol = 1, scales = "free") +
        labs(
          x = "",
          y = "Value",
          title = "Actual, Predicted, and Residual Values by Model Type",
          subtitle = x$.model_type[1],
          color = "Data Type"
        )
      )
[[1]]

[[2]]

[[3]]

[[4]]

And that’s it!

Thank you for reading and I would love to hear your feedback. Please feel free to reach out to me.


Understanding the New Functions in the tidyAML R Package

The post discusses two new functions that have been introduced in the tidyAML R package, namely extract_wflw_pred() and extract_regression_residuals(). These functions are useful for extracting certain data from the output of some specific predictive analysis models.

Function Overview

The function extract_wflw_pred() takes the list of tibbles typically found in the final output of fast_regression() and fast_classification() and extracts them. The function extract_regression_residuals() performs a similar task but additionally returns the residuals, which aid further analysis by revealing the difference between predicted and actual values.

Using the New Functions

These functions are invoked after fitting the predictive models: after running fast_regression() or fast_classification(), you can apply them directly to the output. In extract_regression_residuals(), users can also choose between wide and long output formats.

Implementation Example

The blog post provides a comprehensive example of utilizing these functions with the mtcars dataset, walking from preparing the data with a tidymodels and tidyAML workflow through performing the regression and extracting the predictions.

Three scenarios demonstrated the use of extract_wflw_pred(). The first saw an extraction of predictions for the first model in the output, while the second extracted predictions for the first two models. The third example extracted predictions of all models in the output.

The Long-term Implications and Possible Future Developments

With the addition of these new functions, users can expect a smoother data extraction process for their predictive models. The ability of extract_wflw_pred() to target specific models, together with extract_regression_residuals()’s optional output formatting and provision of residuals, represents a significant stride in predictive data analysis.

As for future developments, more functions may be introduced to further enhance the extraction process or provide additional insights, and the existing functions may see further refinement. An interesting prospect for upcoming additions would be a function that automatically selects or optimizes the best predictive model based on certain criteria.

Actionable Advice

In light of these insights, it is crucial for users working with predictive data analysis models to understand how these new functions operate and hence gain maximum benefit out of them:

  1. Be sure to understand both the extract_wflw_pred() and extract_regression_residuals() functions and what they can offer.
  2. Explore different scenarios for using these functions, such as extracting predictions from different models or varying the output format.
  3. Use the residuals provided by extract_regression_residuals() to better understand and improve your models.

By doing so, you will be able to harness these additions fully to optimize your work with the tidyAML R package.
