Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Introduction
Yesterday I discussed the use of the function internal_make_wflw_predictions() in the tidyAML R package. Today I will discuss the function extract_wflw_pred() and the brand new function extract_regression_residuals(), also in tidyAML. We briefly saw yesterday the output of internal_make_wflw_predictions(), which is a list of tibbles that typically sits inside a list column in the final output of fast_regression() and fast_classification(). The function extract_wflw_pred() takes this list of tibbles and extracts them from that output. The function extract_regression_residuals() also extracts those tibbles and additionally returns the residuals. Let’s see how these functions work.
The new function
First, we will go over the syntax of the new function extract_regression_residuals().
extract_regression_residuals(.model_tbl, .pivot_long = FALSE)
The function takes two arguments. The first argument is .model_tbl, which is the output of fast_regression() or fast_classification(). The second argument is .pivot_long, a logical argument that defaults to FALSE. If TRUE, the output will be in a long format. If FALSE, the output will be in a wide format. Let’s see how this works.
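To make the wide versus long distinction concrete, here is a small sketch on a toy table. The values are illustrative only, and the use of tidyr::pivot_longer() is an assumption about how such a reshape can be done, not a statement about tidyAML's internals.

```r
library(tidyr)

# Toy "wide" table: one row per observation, one column per quantity
wide <- data.frame(
  .model_type = "lm - linear_reg",
  .actual     = c(15.2, 10.4),
  .predicted  = c(17.3, 11.9)
)
wide$.resid <- wide$.actual - wide$.predicted

# "Long" version: one row per observation/quantity pair,
# with the default column names `name` and `value`
long <- pivot_longer(wide, cols = -.model_type)

wide
long
```

The wide table keeps each observation on one row, which is convenient for reading. The long table stacks the three quantities, which is the shape ggplot2 usually wants when mapping a column to color or facets.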
Example
# Load packages
library(tidyAML)
library(tidymodels)
library(tidyverse)
library(multilevelmod) # for the gee model

tidymodels_prefer() # good practice when using tidyAML

rec_obj <- recipe(mpg ~ ., data = mtcars)

frt_tbl <- fast_regression(
  .data = mtcars,
  .rec_obj = rec_obj,
  .parsnip_eng = c("lm", "glm", "stan", "gee"),
  .parsnip_fns = "linear_reg"
)
Let’s break down the R code step by step:
- Loading Libraries:
library(tidyAML)
library(tidymodels)
library(tidyverse)
library(multilevelmod) # for the gee model
Here, the code is loading several R packages. These packages provide functions and tools for data analysis, modeling, and visualization. tidyAML and tidymodels are particularly relevant for modeling, while tidyverse is a collection of packages for data manipulation and visualization. multilevelmod is included for the Generalized Estimating Equations (gee) model.
- Setting Preferences:
tidymodels_prefer() # good practice when using tidyAML
This line calls tidymodels_prefer(), which resolves function-name conflicts between the tidymodels packages and other loaded packages in favor of the tidymodels (and tidyverse) versions. With tidyAML, tidymodels, and tidyverse all loaded, this helps ensure that the functions you call are the ones the modeling workflow expects.
- Creating a Recipe Object:
rec_obj <- recipe(mpg ~ ., data = mtcars)
Here, a recipe object (rec_obj) is created using the recipe() function from the recipes package, which is loaded as part of tidymodels. The formula mpg ~ . specifies that we want to predict the mpg variable based on all other variables in the dataset (mtcars).
- Performing Fast Regression:
frt_tbl <- fast_regression(
  .data = mtcars,
  .rec_obj = rec_obj,
  .parsnip_eng = c("lm", "glm", "stan", "gee"),
  .parsnip_fns = "linear_reg"
)
This part uses the fast_regression() function. It fits regression models using the engines specified by .parsnip_eng and the parsnip model functions specified by .parsnip_fns. In this case, the linear_reg specification is fitted with four engines: linear models (lm), generalized linear models (glm), Stan models (stan), and the Generalized Estimating Equations model (gee). The results are stored in the frt_tbl table, with one row per model.
In summary, the code is setting up a tidy modeling workflow, creating a recipe for predicting mpg based on the other variables in the mtcars dataset, and then performing a fast regression using different engines and functions. The choice of engines and functions allows flexibility in exploring different modeling approaches.
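To make the formula shorthand mpg ~ . concrete, here is a small base R sketch (not part of the original example) that lists the predictors the dot expands to for mtcars. It uses terms() from the stats package rather than the recipe object, so it illustrates the formula notation only, not how the recipes package processes it internally.

```r
# Expand the `.` in the formula against the columns of mtcars
f <- mpg ~ .
predictors <- attr(terms(f, data = mtcars), "term.labels")

predictors          # every column of mtcars except the outcome, mpg
length(predictors)  # number of predictors the dot stands for
```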
Now that we have the output of fast_regression() stored in frt_tbl, we can use the function extract_wflw_pred() to extract the predictions from the output. Let’s see how this works. First, the syntax:
extract_wflw_pred(.data, .model_id = NULL)
The function takes two arguments. The first argument is .data, which is the output of fast_regression() or fast_classification(). The second argument is .model_id, a numeric vector that defaults to NULL. If NULL, the function will extract none of the predictions from the output. If a numeric vector is provided, the function will extract the predictions for the models specified by that vector. Let’s see how this works.
extract_wflw_pred(frt_tbl, 1)
# A tibble: 64 × 4
   .model_type     .data_category .data_type .value
   <chr>           <chr>          <chr>       <dbl>
 1 lm - linear_reg actual         actual       15.2
 2 lm - linear_reg actual         actual       10.4
 3 lm - linear_reg actual         actual       33.9
 4 lm - linear_reg actual         actual       32.4
 5 lm - linear_reg actual         actual       16.4
 6 lm - linear_reg actual         actual       21.5
 7 lm - linear_reg actual         actual       15.8
 8 lm - linear_reg actual         actual       15
 9 lm - linear_reg actual         actual       14.7
10 lm - linear_reg actual         actual       10.4
# ℹ 54 more rows
extract_wflw_pred(frt_tbl, 1:2)
# A tibble: 128 × 4
   .model_type     .data_category .data_type .value
   <chr>           <chr>          <chr>       <dbl>
 1 lm - linear_reg actual         actual       15.2
 2 lm - linear_reg actual         actual       10.4
 3 lm - linear_reg actual         actual       33.9
 4 lm - linear_reg actual         actual       32.4
 5 lm - linear_reg actual         actual       16.4
 6 lm - linear_reg actual         actual       21.5
 7 lm - linear_reg actual         actual       15.8
 8 lm - linear_reg actual         actual       15
 9 lm - linear_reg actual         actual       14.7
10 lm - linear_reg actual         actual       10.4
# ℹ 118 more rows
extract_wflw_pred(frt_tbl, 1:nrow(frt_tbl))
# A tibble: 256 × 4
   .model_type     .data_category .data_type .value
   <chr>           <chr>          <chr>       <dbl>
 1 lm - linear_reg actual         actual       15.2
 2 lm - linear_reg actual         actual       10.4
 3 lm - linear_reg actual         actual       33.9
 4 lm - linear_reg actual         actual       32.4
 5 lm - linear_reg actual         actual       16.4
 6 lm - linear_reg actual         actual       21.5
 7 lm - linear_reg actual         actual       15.8
 8 lm - linear_reg actual         actual       15
 9 lm - linear_reg actual         actual       14.7
10 lm - linear_reg actual         actual       10.4
# ℹ 246 more rows
The first line of code extracts the predictions for the first model in the output. The second line of code extracts the predictions for the first two models in the output. The third line of code extracts the predictions for all models in the output.
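The idea of selecting models by a numeric vector can be sketched in base R with a toy table that holds one prediction table per model in a list column. Everything here is hypothetical: the column name preds, the helper functions make_preds() and extract_preds(), and the values are stand-ins for illustration and are not tidyAML's actual implementation.

```r
# Hypothetical helper: build a tiny prediction table for one model
make_preds <- function(model_type) {
  data.frame(
    .model_type = model_type,
    .data_type  = c("actual", "training", "testing"),
    .value      = c(21.0, 20.4, 22.1)
  )
}

# Toy model table: one row per model, predictions stored in a list column
model_tbl <- data.frame(.model_id = 1:3)
model_tbl$preds <- lapply(
  c("lm - linear_reg", "glm - linear_reg", "stan - linear_reg"),
  make_preds
)

# Hypothetical extractor: keep the requested models and stack their tables
extract_preds <- function(tbl, ids) {
  do.call(rbind, tbl$preds[tbl$.model_id %in% ids])
}

one_model  <- extract_preds(model_tbl, 1)
two_models <- extract_preds(model_tbl, 1:2)
all_models <- extract_preds(model_tbl, seq_len(nrow(model_tbl)))
```

In this sketch the row count grows in proportion to the number of models selected, which is the same pattern seen in the 64, 128, and 256 row outputs above.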
Now, let’s visualize the predictions for the models in the output and the actual values. We will use the ggplot2 package for visualization. First, we will extract the predictions for all models in the output and store them in a table called pred_tbl. Then, we will use ggplot2 to visualize the predictions and actual values.
pred_tbl <- extract_wflw_pred(frt_tbl, 1:nrow(frt_tbl))

# One plot per model type
pred_tbl |>
  group_split(.model_type) |>
  map(\(x) x |>
    group_by(.data_category) |>
    mutate(x = row_number()) |>
    ungroup() |>
    pivot_wider(names_from = .data_type, values_from = .value) |>
    ggplot(aes(x = x, y = actual, group = .data_category)) +
    geom_line(color = "black") +
    geom_line(aes(x = x, y = training), linetype = "dashed", color = "red",
              linewidth = 1) +
    geom_line(aes(x = x, y = testing), linetype = "dashed", color = "blue",
              linewidth = 1) +
    theme_minimal() +
    labs(
      x = "",
      y = "Observed/Predicted Value",
      title = "Observed vs. Predicted Values by Model Type",
      subtitle = x$.model_type[1]
    )
  )
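[Figures: four line plots, one per model type, each titled "Observed vs. Predicted Values by Model Type", showing actual values in black with training predictions in dashed red and testing predictions in dashed blue.]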
Or we can facet them by model type:
pred_tbl |>
  group_by(.model_type, .data_category) |>
  mutate(x = row_number()) |>
  ungroup() |>
  ggplot(aes(x = x, y = .value)) +
  geom_line(data = . %>% filter(.data_type == "actual"), color = "black") +
  geom_line(data = . %>% filter(.data_type == "training"),
            linetype = "dashed", color = "red") +
  geom_line(data = . %>% filter(.data_type == "testing"),
            linetype = "dashed", color = "blue") +
  facet_wrap(~ .model_type, ncol = 2, scales = "free") +
  labs(
    x = "",
    y = "Observed/Predicted Value",
    title = "Observed vs. Predicted Values by Model Type"
  ) +
  theme_minimal()
Ok, so what about this new function I talked about above? Well, let’s go over it here. We have already discussed its syntax, so there is no need to go over it again. Let’s just jump right into an example. This function returns a list with one tibble of residuals per model. We will slice off just the first model for demonstration purposes.
extract_regression_residuals(.model_tbl = frt_tbl, .pivot_long = FALSE)[[1]]
# A tibble: 32 × 4
   .model_type     .actual .predicted .resid
   <chr>             <dbl>      <dbl>  <dbl>
 1 lm - linear_reg    15.2       17.3 -2.09
 2 lm - linear_reg    10.4       11.9 -1.46
 3 lm - linear_reg    33.9       30.8  3.06
 4 lm - linear_reg    32.4       28.0  4.35
 5 lm - linear_reg    16.4       15.0  1.40
 6 lm - linear_reg    21.5       22.3 -0.779
 7 lm - linear_reg    15.8       17.2 -1.40
 8 lm - linear_reg    15         15.1 -0.100
 9 lm - linear_reg    14.7       10.9  3.85
10 lm - linear_reg    10.4       10.8 -0.445
# ℹ 22 more rows
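The output above is consistent with the residual being the actual value minus the predicted value (for example, 15.2 - 17.3 is about -2.09 after rounding). As a minimal base R sketch of that relationship, the code below fits a plain lm() on all of mtcars and rebuilds the same three columns by hand. It is an illustration only: it does not use tidyAML, and it skips the training and testing split that fast_regression() applies, so its numbers will not match the table above.

```r
# Fit an ordinary linear model on the full dataset (no train/test split)
fit <- lm(mpg ~ ., data = mtcars)

# Rebuild the three columns by hand
manual_resid <- data.frame(
  .actual    = mtcars$mpg,
  .predicted = unname(predict(fit, newdata = mtcars))
)
manual_resid$.resid <- manual_resid$.actual - manual_resid$.predicted

# The hand-computed residuals agree with what lm() reports
all.equal(manual_resid$.resid, unname(residuals(fit)))

head(manual_resid)
```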
Now let’s set .pivot_long = TRUE:
extract_regression_residuals(.model_tbl = frt_tbl, .pivot_long = TRUE)[[1]]
# A tibble: 96 × 3
   .model_type     name        value
   <chr>           <chr>       <dbl>
 1 lm - linear_reg .actual     15.2
 2 lm - linear_reg .predicted  17.3
 3 lm - linear_reg .resid      -2.09
 4 lm - linear_reg .actual     10.4
 5 lm - linear_reg .predicted  11.9
 6 lm - linear_reg .resid      -1.46
 7 lm - linear_reg .actual     33.9
 8 lm - linear_reg .predicted  30.8
 9 lm - linear_reg .resid       3.06
10 lm - linear_reg .actual     32.4
# ℹ 86 more rows
Now let’s visualize the data:
resid_tbl <- extract_regression_residuals(frt_tbl, TRUE)

# One plot per model: actual and predicted on top, residuals below
resid_tbl |>
  map(\(x) x |>
    group_by(name) |>
    mutate(x = row_number()) |>
    ungroup() |>
    mutate(plot_group = ifelse(name == ".resid", "Residuals",
                               "Actual and Predictions")) |>
    ggplot(aes(x = x, y = value, group = name, color = name)) +
    geom_line() +
    theme_minimal() +
    facet_wrap(~ plot_group, ncol = 1, scales = "free") +
    labs(
      x = "",
      y = "Value",
      title = "Actual, Predicted, and Residual Values by Model Type",
      subtitle = x$.model_type[1],
      color = "Data Type"
    )
  )
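[Figures: four plots, one per model type, each titled "Actual, Predicted, and Residual Values by Model Type", with one panel for actual and predicted values and one panel for residuals.]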
And that’s it!
Thank you for reading and I would love to hear your feedback. Please feel free to reach out to me.
Continue reading: The new function on the block with tidyAML extract_regression_residuals()
Understanding the New Functions in the tidyAML R Package
The post discusses two functions in the tidyAML R package: extract_wflw_pred() and the newly introduced extract_regression_residuals(). Both extract data from the output of tidyAML's fast_regression() and fast_classification() modeling functions.
Function Overview
The function extract_wflw_pred() extracts the list of prediction tibbles that is typically stored in a list column of the output of fast_regression() and fast_classification(). The function extract_regression_residuals() performs a similar task and additionally returns the residuals, that is, the differences between actual and predicted values, which supports further analysis of model fit.
Using the New Functions
These functions are used after the models have been fitted: once fast_regression() or fast_classification() has run, you can call them directly on its output. In extract_regression_residuals(), the .pivot_long argument selects between wide and long output formats.
Implementation Example
The blog post provides a worked example using the mtcars dataset, from preparing the data for modeling with a tidymodels and tidyAML workflow through fitting the regressions and extracting the predictions.
Three scenarios demonstrated the use of extract_wflw_pred(). The first saw an extraction of predictions for the first model in the output, while the second extracted predictions for the first two models. The third example extracted predictions of all models in the output.
The Long-term Implications and Possible Future Developments
With these functions, users can extract data from their fitted models more easily. The ability of extract_wflw_pred() to select which models to extract predictions from, together with the optional output format and the residuals provided by extract_regression_residuals(), makes the model output easier to work with.
As for future developments, more functions may be introduced that further enhance the extraction process or provide more insight into the data, and the existing functions may be improved or updated. Another interesting prospect would be a function that automatically selects the best predictive model based on certain criteria.
Actionable Advice
In light of these insights, users working with predictive models should understand how these functions operate in order to get the most out of them:
- Be sure to understand both the extract_wflw_pred() and extract_regression_residuals() functions and what they can offer.
- Explore different scenarios for using these functions, such as extracting predictions from different models or varying the output format.
- Use the residuals provided by extract_regression_residuals() to better understand and improve your models' predictions.
By doing so, you will be able to harness these additions fully to optimize your work with the tidyAML R package.