Out-of-sample Imputation with {missRanger}

[This article was first published on R – Michael's and Christian's Blog, and kindly contributed to R-bloggers.]

{missRanger} is a multivariate imputation algorithm based on random forests, and a fast version of the original missForest algorithm of Stekhoven and Buehlmann (2012). Surprise, surprise: it uses {ranger} to fit random forests. Especially combined with predictive mean matching (PMM), the imputations are often quite realistic.
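To make the idea of PMM concrete, here is a minimal sketch in base R. It is a simplified illustration, not the code used inside {missRanger}: the helper pmm_sketch() is hypothetical, and a linear model stands in for the random forest. For each new prediction, the sketch finds the k closest predictions among the observed cases and draws one of their observed values, so every imputed value is a value that actually occurred in the data.

```r
# Hypothetical helper illustrating predictive mean matching (PMM).
# pred_train: model predictions for rows with observed response
# y_train:    the corresponding observed responses
# pred_new:   model predictions for rows whose response is missing
pmm_sketch <- function(pred_train, y_train, pred_new, k = 5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  k <- min(k, length(pred_train))
  vapply(
    pred_new,
    function(p) {
      # Indices of the k observed cases with the closest predictions
      donors <- order(abs(pred_train - p))[seq_len(k)]
      # Draw the observed value of one randomly chosen donor
      y_train[donors[sample.int(k, 1)]]
    },
    numeric(1)
  )
}

# Stand-in model (an assumption for this sketch): lm() instead of a random forest
obs <- iris[1:140, ]
new <- iris[141:150, ]
fit <- lm(Sepal.Width ~ Sepal.Length + Species, data = obs)

imputed <- pmm_sketch(
  pred_train = fitted(fit),
  y_train = obs$Sepal.Width,
  pred_new = predict(fit, new),
  k = 5,
  seed = 1
)
imputed
```

Because the result is drawn from observed values, it never contains values outside the observed range or with impossible decimals, which is one reason PMM imputations tend to look realistic.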

Out-of-sample application

The newest CRAN release, 2.6.0, offers out-of-sample application. This is useful for avoiding leakage between train and test data or during cross-validation. Furthermore, it allows you to fill missing values in user-provided data. By default, it uses the same number of PMM donors as during training, but you can change this by setting pmm.k to a value of your choice.
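As a rough sketch of how this could look during cross-validation, the following code fits missRanger() on the training part of each fold and fills the holdout part with predict(). The three-fold assignment, the object names, and the choice pmm.k = 3 are assumptions made for illustration only.

```r
library(missRanger)

# Data with 10% missing values
ir <- generateNA(iris, p = 0.1, seed = 11)

# Hypothetical 3-fold assignment, for illustration only
set.seed(1)
n_folds <- 3
fold <- sample(rep(seq_len(n_folds), length.out = nrow(ir)))

filled <- vector("list", n_folds)
for (k in seq_len(n_folds)) {
  train_k <- ir[fold != k, ]
  valid_k <- ir[fold == k, ]

  # Fit the imputation model on the training part only;
  # keep_forests = TRUE is required to use predict() afterwards
  mr_k <- missRanger(
    train_k, pmm.k = 5, keep_forests = TRUE, seed = k, verbose = 0
  )

  # Fill the holdout part, here with a different number of PMM donors
  filled[[k]] <- predict(mr_k, valid_k, pmm.k = 3, seed = k)
}

# Check that no missing values remain in any holdout part
sapply(filled, anyNA)
```

Since the holdout rows never enter the fitting step, their imputations cannot leak information into the model that is later evaluated on them.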

We distinguish two types of observations to be imputed:

Easy case: Only a single value is missing. Here, we simply apply the corresponding random forest to fill the one missing value.
Hard case: Multiple values are missing. Here, we first fill the values univariately, and then repeatedly apply the corresponding random forests, in the hope that the effect of the univariate imputation vanishes. If values of two highly correlated features are missing, then the imputations can be nonsensical. There is no way to mend this.
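To see which rows of a dataset fall into which case, one can count the missing values per row. The following base R sketch uses a small made-up data frame; the labels "complete", "easy", and "hard" simply mirror the terms used above and are not part of the package.

```r
# Small illustrative data frame (values made up for this sketch)
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 1, 2),
  c = c("x", "y", NA, "z")
)

# Number of missing values per row
n_missing <- rowSums(is.na(df))

# 0 missing -> complete, 1 missing -> easy, 2 or more -> hard
case <- cut(
  n_missing,
  breaks = c(-Inf, 0, 1, Inf),
  labels = c("complete", "easy", "hard")
)

data.frame(df, n_missing, case)
```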

Example

To illustrate the technique with a simple example, we use the iris data.

1. First, we randomly add 10% missing values.
2. Then, we make a train/test split.
3. Next, we “fit” missRanger() to the training data.
4. Finally, we use its new predict() method to fill the test data.

library(missRanger)

# 10% missings
ir <- iris |>
  generateNA(p = 0.1, seed = 11)

# Train/test split stratified by Species
oos <- c(1:10, 51:60, 101:110)
train <- ir[-oos, ]
test <- ir[oos, ]

head(test)

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3          NA  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4          NA          1.7          NA  setosa

mr <- missRanger(train, pmm.k = 5, keep_forests = TRUE, seed = 1)
test_filled <- predict(mr, test, seed = 1)
head(test_filled)

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         4.0          1.7         0.4  setosa

# Original
head(iris)

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

The results look reasonable, in this case even for the “hard case” row 6 with missing values in two variables. Here, it is probably the strong association with Species that helped to create good values.
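Beyond eyeballing the first rows, one could also compare all imputed test values with the true values from iris. The sketch below repeats the setup from above to stay self-contained; the error summary is an illustrative check chosen for this example, not something provided by the package.

```r
library(missRanger)

# Same setup as in the example
ir <- generateNA(iris, p = 0.1, seed = 11)
oos <- c(1:10, 51:60, 101:110)
train <- ir[-oos, ]
test <- ir[oos, ]

mr <- missRanger(train, pmm.k = 5, keep_forests = TRUE, seed = 1)
test_filled <- predict(mr, test, seed = 1)

# True values of the test rows
truth <- iris[oos, ]
num <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Logical matrix marking the numeric cells that had to be imputed
was_na <- is.na(test[num])

# Absolute errors of the imputed numeric cells
abs_err <- abs(as.matrix(test_filled[num]) - as.matrix(truth[num]))[was_na]
summary(abs_err)

# Share of correctly imputed Species values (NaN if none were missing)
sp_na <- is.na(test$Species)
mean(test_filled$Species[sp_na] == truth$Species[sp_na])
```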

The new predict() also works with single row input.
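A minimal sketch of the single-row case could look as follows. The new observation is made up for illustration; its columns need to match the training data in name and type, which is why Species is created as a factor with the same levels.

```r
library(missRanger)

# Fit on data with 10% missing values, keeping the forests
ir <- generateNA(iris, p = 0.1, seed = 11)
mr <- missRanger(ir, pmm.k = 5, keep_forests = TRUE, seed = 1)

# Hypothetical new observation with two missing values
new_row <- data.frame(
  Sepal.Length = 5.4,
  Sepal.Width = NA_real_,
  Petal.Length = 1.7,
  Petal.Width = NA_real_,
  Species = factor("setosa", levels = levels(iris$Species))
)

new_row_filled <- predict(mr, new_row, seed = 1)
new_row_filled
```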

Learn more about {missRanger}

Basics: https://mayer79.github.io/missRanger/articles/missRanger.html
Multiple imputation: https://mayer79.github.io/missRanger/articles/multiple_imputation.html
Working with survival data: https://mayer79.github.io/missRanger/articles/working_with_censoring.html

Long-Term Implications and Future Developments of the missRanger Algorithm

The original article presents {missRanger} as a multivariate imputation algorithm based on random forests. It also notes that {missRanger} is a faster version of the original missForest algorithm, which suggests that there is room for further advances in this area.

Long-Term Implications

The out-of-sample application introduced in CRAN release 2.6.0 helps to avoid data leakage between train and test data and during cross-validation. Since both are standard steps when validating models, this is a practical improvement.

“The results look reasonable, in this case even for the ‘hard case’ row 6 with missing values in two variables.”

This quotation suggests that out-of-sample imputation can work well even for rows with multiple missing values. However, when the values of two highly correlated features are missing, the imputations can be nonsensical. This is an area for potential future development.

Possible Future Developments

New imputation methodologies: Nonsensical imputations for highly correlated missing features could be tackled with new imputation methods that handle several missing values in a row jointly.
Faster versions: Given that {missRanger} is already a faster version of the original missForest algorithm, even faster implementations may follow.
Better predictions: The combination of random forests with predictive mean matching (PMM) could be refined further, so advances in PMM could be a key area of future work.

Actionable Steps

Based on the original article, the following steps suggest themselves:

Explore new imputation methods that handle the "hard case" effectively.
Invest in performance optimization to make the current algorithm more efficient.
Improve how well random forests combined with predictive mean matching (PMM) impute missing values.
Benchmark missRanger against other imputation algorithms.
