[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers].



Introduction

The newest release of tidyAML adds a new parameter to the functions fast_classification() and fast_regression(): .drop_na, a logical value that defaults to TRUE. It determines whether the function should drop rows from the output tibble when a model cannot be built for some reason. Let’s take a look at the function and its arguments.

fast_regression(
  .data,
  .rec_obj,
  .parsnip_fns = "all",
  .parsnip_eng = "all",
  .split_type = "initial_split",
  .split_args = NULL,
  .drop_na = TRUE
)

Arguments

  • .data – The data being passed to the function for the regression problem.
  • .rec_obj – The recipe object being passed.
  • .parsnip_fns – The default is “all”, which will create all possible regression model specifications supported.
  • .parsnip_eng – The default is “all”, which will create all possible regression engines supported.
  • .split_type – The default is “initial_split”; you can pass any type of split supported by rsample.
  • .split_args – The default is NULL; when NULL, the default parameters of the chosen rsample split type are used.
  • .drop_na – The default is TRUE, which will drop from the output any rows where a model could not be built.
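As a quick illustration of .split_args, options for the chosen rsample split can be passed as a named list. A minimal sketch, assuming rsample's initial_split() and its prop argument (the 80/20 proportion here is just an example value):

```r
library(tidyAML)
library(tidymodels)

tidymodels::tidymodels_prefer()

# Pass split options through .split_args as a named list,
# e.g. an 80/20 training/testing split via initial_split(prop = 0.8)
rec_obj <- recipe(mpg ~ ., data = mtcars)
frt_tbl <- fast_regression(
  mtcars,
  rec_obj,
  .parsnip_eng = "lm",
  .parsnip_fns = "linear_reg",
  .split_type  = "initial_split",
  .split_args  = list(prop = 0.8)
)
```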

Now let’s see this in action.

Example

We are going to use the mtcars dataset for this example, creating a regression problem where we try to predict mpg using all other variables in the dataset. We will deliberately not load all of the supported libraries, causing the function to return NULL for some models, and we will set the .drop_na parameter to FALSE.

library(tidyAML)
library(tidymodels)
library(tidyverse)

tidymodels::tidymodels_prefer()

# Create regression problem
rec_obj <- recipe(mpg ~ ., data = mtcars)
frt_tbl <- fast_regression(
  mtcars,
  rec_obj,
  .parsnip_eng = c("lm","glm","gee"),
  .parsnip_fns = "linear_reg",
  .drop_na = FALSE
  )

glimpse(frt_tbl)
Rows: 3
Columns: 8
$ .model_id       <int> 1, 2, 3
$ .parsnip_engine <chr> "lm", "gee", "glm"
$ .parsnip_mode   <chr> "regression", "regression", "regression"
$ .parsnip_fns    <chr> "linear_reg", "linear_reg", "linear_reg"
$ model_spec      <list> [~NULL, ~NULL, NULL, regression, TRUE, NULL, lm, TRUE]…
$ wflw            <list> [cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, mp…
$ fitted_wflw     <list> [cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, mp…
$ pred_wflw       <list> [<tbl_df[64 x 3]>], <NULL>, [<tbl_df[64 x 3]>]
extract_wflw(frt_tbl, 1:nrow(frt_tbl))
[[1]]
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm


[[2]]
NULL

[[3]]
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: glm 

Here we can see that the function returned NULL for the gee model because we did not load the multilevelmod library. We can also see that the model was not dropped from the output, because .drop_na was set to FALSE. Now let’s set it back to TRUE.

frt_tbl <- fast_regression(
  mtcars,
  rec_obj,
  .parsnip_eng = c("lm","glm","gee"),
  .parsnip_fns = "linear_reg",
  .drop_na = TRUE
  )

glimpse(frt_tbl)
Rows: 2
Columns: 8
$ .model_id       <int> 1, 3
$ .parsnip_engine <chr> "lm", "glm"
$ .parsnip_mode   <chr> "regression", "regression"
$ .parsnip_fns    <chr> "linear_reg", "linear_reg"
$ model_spec      <list> [~NULL, ~NULL, NULL, regression, TRUE, NULL, lm, TRUE]…
$ wflw            <list> [cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, mp…
$ fitted_wflw     <list> [cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, mp…
$ pred_wflw       <list> [<tbl_df[64 x 3]>], [<tbl_df[64 x 3]>]
extract_wflw(frt_tbl, 1:nrow(frt_tbl))
[[1]]
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm


[[2]]
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: glm 

Here we can see that the gee model was dropped from the output, because the function could not build it while the multilevelmod library was not loaded. This is a convenient way to exclude models that cannot be built due to missing libraries or other reasons.
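For intuition, the effect of .drop_na = TRUE on the output tibble is roughly equivalent to filtering out the rows whose fitted workflow is NULL. A sketch, assuming the list-column names shown in the glimpse() output above (this is an illustration of the behavior, not tidyAML's internal implementation):

```r
# Build the full table, keeping failed models (.drop_na = FALSE),
# then drop rows whose fitted workflow is NULL by hand
frt_tbl_all <- fast_regression(
  mtcars,
  rec_obj,
  .parsnip_eng = c("lm", "glm", "gee"),
  .parsnip_fns = "linear_reg",
  .drop_na = FALSE
)

frt_tbl_kept <- frt_tbl_all |>
  dplyr::filter(!purrr::map_lgl(fitted_wflw, is.null))
```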

Conclusion

The .drop_na parameter provides a simple way to exclude models that cannot be built, whether because of missing libraries or other reasons, and is a welcome addition to the fast_classification() and fast_regression() functions.

Happy coding!


Long-Term Implications and Future Developments of .drop_na in Fast Classification and Regression

A significant addition to tidyAML in its new release is a new parameter, .drop_na. This logical value, which defaults to TRUE, is a salient feature in the functions fast_classification() and fast_regression(). Its role is to determine whether the function should drop rows with missing values from the output if a model cannot be built.

What Will It Mean for Future Data Science?

With this update, handling missing values that might pose a problem for creating a regression or classification model becomes more straightforward. The .drop_na parameter is now an integral part of the functionality of tidyAML that makes the process of fast classification and regression even smoother.

The long-term implications are twofold:

  • It boosts efficiency by simplifying the model-building process in significant ways.
  • It keeps results tidy, since output rows for models that could not be built are neatly excluded.

New Horizons: Handling NULL Models

In terms of possible future developments, the .drop_na parameter has opened up new avenues for data handling in R. An exciting potential development could be a modification or an extension of this parameter, allowing it to handle not just rows with NA values, but more complex data discrepancies – such as NULL models.

Actionable Advice: Leveraging This New Addition

Based on this insightful advancement, the following actions are recommended:

  1. Explore and understand: The first step is to gain understanding of how the .drop_na parameter operates. This can be achieved by experimenting with different datasets and seeing its efficacy in action.
  2. Identify cases where it can be used: Once familiar with its functionalities, identify problematic datasets where this parameter can be applied to eliminate rows with NA values.
  3. Explore potential for further development: Building on the existing parameter, data scientists should explore other potential scenarios where similar functionality can be developed, such as handling NULL models.
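As a starting point for step 2, a small sketch (assuming a result table built earlier with .drop_na = FALSE, and the column names shown in the glimpse() output) that reports which engine/function combinations failed to build:

```r
# Identify models that could not be built (their fitted workflow is NULL)
# so you can decide whether to load a missing engine package or drop them
failed_models <- frt_tbl |>
  dplyr::filter(purrr::map_lgl(fitted_wflw, is.null)) |>
  dplyr::select(.model_id, .parsnip_engine, .parsnip_fns)
```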

Note: As with all new features, it is crucial to carefully test this function within your environment and data before implementing it in production.
