Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
My last post was using {hstats}, {kernelshap} and {shapviz} to explain a binary classification random forest. Here, we use the same package combo to improve a Poisson GLM with insights from a boosted trees model.
Insurance pricing data
This time, we work with a synthetic, but quite realistic dataset. It describes 1 Mio insurance policies and their corresponding claim counts. A reference for the data is:
Mayer, M., Meier, D. and Wuthrich, M.V. (2023),
SHAP for Actuaries: Explain any Model.
http://dx.doi.org/10.2139/ssrn.4389797
library(OpenML) library(lightgbm) library(splines) library(ggplot2) library(patchwork) library(hstats) library(kernelshap) library(shapviz) #=================================================================== # Load and describe data #=================================================================== df <- getOMLDataSet(data.id = 45106)$data dim(df) # 1000000 7 head(df) # year town driver_age car_weight car_power car_age claim_nb # 2018 1 51 1760 173 3 0 # 2019 1 41 1760 248 2 0 # 2018 1 25 1240 111 2 0 # 2019 0 40 1010 83 9 0 # 2018 0 43 2180 169 5 0 # 2018 1 45 1170 149 1 1 summary(df) # Response ggplot(df, aes(claim_nb)) + geom_bar(fill = "chartreuse4") + ggtitle("Distribution of the response") # Features xvars <- c("year", "town", "driver_age", "car_weight", "car_power", "car_age") df[xvars] |> stack() |> ggplot(aes(values)) + geom_histogram(fill = "chartreuse4", bins = 19) + facet_wrap(~ind, scales = "free", ncol = 2) + ggtitle("Distribution of the features") # car_power and car_weight are correlated 0.68, car_age and driver_age 0.28 df[xvars] |> cor() |> round(2) # year town driver_age car_weight car_power car_age # year 1 0.00 0.00 0.00 0.00 0.00 # town 0 1.00 -0.16 0.00 0.00 0.00 # driver_age 0 -0.16 1.00 0.09 0.10 0.28 # car_weight 0 0.00 0.09 1.00 0.68 0.00 # car_power 0 0.00 0.10 0.68 1.00 0.09 # car_age 0 0.00 0.28 0.00 0.09 1.00
Modeling
- We fit a naive additive linear GLM and a tuned Boosted Trees model.
- We combine the models and specify their predict function.
# Train/test split set.seed(8300) ix <- sample(nrow(df), 0.9 * nrow(df)) train <- df[ix, ] valid <- df[-ix, ] # Naive additive linear Poisson regression model (fit_glm <- glm(claim_nb ~ ., data = train, family = poisson())) # Boosted trees with LightGBM. The parameters (incl. number of rounds) have been # by combining early-stopping with random search CV (not shown here) dtrain <- lgb.Dataset(data.matrix(train[xvars]), label = train$claim_nb) params <- list( learning_rate = 0.05, objective = "poisson", num_leaves = 7, min_data_in_leaf = 50, min_sum_hessian_in_leaf = 0.001, colsample_bynode = 0.8, bagging_fraction = 0.8, lambda_l1 = 3, lambda_l2 = 5 ) fit_lgb <- lgb.train(params = params, data = dtrain, nrounds = 300) # {hstats} works for multi-output predictions, # so we can combine all models to a list, which simplifies the XAI part. models <- list(GLM = fit_glm, LGB = fit_lgb) # Custom predictions on response scale pf <- function(m, X) { cbind( GLM = predict(m$GLM, X, type = "response"), LGB = predict(m$LGB, data.matrix(X[xvars])) ) } pf(models, head(valid, 2)) # GLM LGB # 0.1082285 0.08580529 # 0.1071895 0.09181466 # And on log scale pf_log <- function(m, X) { log(pf(m = m, X = X)) } pf_log(models, head(valid, 2)) # GLM LGB # -2.223510 -2.455675 # -2.233157 -2.387983 -2.346350
Traditional XAI
Performance
Comparing average Poisson deviance on the validation data shows that the LGB model is clearly better than the naively built GLM, so there is room for improvent!
perf <- average_loss( models, X = valid, y = "claim_nb", loss = "poisson", pred_fun = pf ) perf # GLM LGB # 0.4362407 0.4331857
Feature importance
Next, we calculate permutation importance on the validation data with respect to mean Poisson deviance loss. The results make sense, and we note that year
and car_weight
seem to be negligile.
imp <- perm_importance( models, v = xvars, X = valid, y = "claim_nb", loss = "poisson", pred_fun = pf ) plot(imp)
Main effects
Next, we visualize estimated main effects by partial dependence plots on log link scale. The differences between the models are quite small, with one big exception: Investing more parameters into driver_age
via spline will greatly improve the performance and usefulness of the GLM.
partial_dep(models, v = "driver_age", train, pred_fun = pf_log) |> plot(show_points = FALSE) pdp <- function(v) { partial_dep(models, v = v, X = train, pred_fun = pf_log) |> plot(show_points = FALSE) } wrap_plots(lapply(xvars, pdp), guides = "collect") & ylim(-2.8, -1.7)
Interaction effects
Friedman’s H-squared (per feature and feature pair) and on log link scale shows that – unsurprisingly – our GLM does not contain interactions, and that the strongest relative interaction happens between town
and car_power
. The stratified PDP visualizes this interaction. Let’s add a corresponding interaction effect to our GLM later.
system.time( # 5 sec H <- hstats(models, v = xvars, X = train, pred_fun = pf_log) ) H plot(H) # Visualize strongest interaction by stratified PDP partial_dep(models, v = "car_power", X = train, pred_fun = pf_log, BY = "town") |> plot(show_points = FALSE)
SHAP
As an elegant alternative to studying feature importance, PDPs and Friedman’s H, we can simply run a SHAP analysis on the LGB model.
set.seed(22) X_explain <- train[sample(nrow(train), 1000), xvars] shap_values_lgb <- shapviz(fit_lgb, data.matrix(X_explain)) sv_importance(shap_values_lgb) sv_dependence(shap_values_lgb, v = xvars) & ylim(-0.35, 0.8)
Here, we would come to the same conclusions:
car_weight
andyear
might be dropped.- Add a regression spline for
driver_age
- Add an interaction between
car_power
andtown
.
Pimp the GLM
In the final section, we apply the three insights from above with very good results.
fit_glm2 <- glm( claim_nb ~ car_power * town + ns(driver_age, df = 7) + car_age, data = train, family = poisson() # Performance now as good as LGB perf_glm2 <- average_loss( fit_glm2, X = valid, y = "claim_nb", loss = "poisson", type = "response" ) perf_glm2 # 0.432962 # Effects similar as LGB, and smooth partial_dep(fit_glm2, v = "driver_age", X = train) |> plot(show_points = FALSE) partial_dep(fit_glm2, v = "car_power", X = train, BY = "town") |> plot(show_points = FALSE)
Or even via permutation or kernel SHAP:
set.seed(1) bg <- train[sample(nrow(train), 200), ] xvars2 <- setdiff(xvars, c("year", "car_weight")) system.time( # 4 sec ks_glm2 <- permshap(fit_glm2, X = X_explain[xvars2], bg_X = bg) ) shap_values_glm2 <- shapviz(ks_glm2) sv_dependence(shap_values_glm2, v = xvars2) & ylim(-0.3, 0.8)
Final words
- Improving naive GLMs with insights from ML + XAI is fun.
- In practice, the gap between GLM and a boosted trees model can’t be closed that easily. (The true model behind our synthetic dataset contains a single interaction, unlike real data/models that typically have much more interactions.)
- {hstats} can work with multiple regression models in parallel. This helps to keep the workflow smooth. Similar for {kernelshap}.
- A SHAP analysis often brings the same qualitative insights as multiple other XAI tools together.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: ML + XAI -> Strong GLM
Summary and Key Points
The original text explores the application of several packages – {hstats}, {kernelshap} and {shapviz} in improving a Poisson GLM (Generalized Linear Model) using insights from a boosted trees model. The project worked with a synthetic dataset that outlines 1 million insurance policies and their corresponding claim counts.
Long-term Implications and Future Developments
Improving naive GLMs (Generalized Linear Models) with insights from Machine Learning and XAI (Explainable Artificial Intelligence) might bear significant implications for actuarial science and insurance pricing. By applying machine learning and XAI, refined and more accurate models can be developed. This means improved risk assessment, premium calculation, underwriter performance, and ultimately more profitable insurance policies.
Interest in XAI is growing as machine learning becomes more complex. Today’s opaque ML models often fail to provide transparency about their decision-making processes, making their application problematic in regulatory or high-stake areas such as insurance. Therefore, the future may see the development of more tools implementing XAI methodology for better interpretability and trust in machine learning models.
Actionable Advice
Experiment with multiple XAI techniques: The author found that SHAP analysis brought similar qualitative insights as multiple other XAI tools used combined. Therefore, practising different methods might be beneficial to glean better understanding of the models.
Factor in problem inherently complex nature: Slight skepticism was expressed for the applicability of the developed GLM to real-world datasets. Real datasets often contain much more interactions than the utilized synthetic dataset. Take this into account when expanding a technique tested on synthetic data to real use cases.
Continuous Learning: With the rise of XAI and continuous advancements in data science, it is crucial to keep oneself updated with new methodologies and tools to enhance their models over time.
Regulatory Compliance: As systems become more explainable, regulatory bodies might introduce new rules necessitating explicit transparency in machine learning applications. Companies implementing these systems should keep an eye on the regulatory landscape.
Continuous improvement of models: Insights from ML and XAI showed that the GLM model could be significantly improved. This indicates the importance of continuous tweaking and refining of models.
Conclusion
The work exemplifies the potentials of combining traditional statistical models such as GLM with views from machine learning, especially tree-based methods, to achieve better predictions. Likely, utilizing such methods could lead to an enormous revolution in the insurance industry by improving their models to predict risk better, manage claims, and calculate premiums more accurately.