[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers].



Yesterday, with Laurence Barry, we posted a blog post, “Who benefits from data sharing?”, explaining why data sharing, in insurance, could end mutualization. It can also be problematic in the context of discrimination. Consider here the same dataset, with claim occurrence, from a real insurance portfolio,

library(InsurFair)      # contains the frenchmotor dataset
library(randomForest)

Consider a version of this dataset without the gender variable, and use variable importance from a random forest to rank the variables we can use in a predictive model

subfrenchmotor = frenchmotor[, -which(names(frenchmotor) == "sensitive")]  # drop the gender variable
RF = randomForest(y ~ ., data = subfrenchmotor)
vi = varImpPlot(RF, sort = TRUE)

We sort variables based on variable importance (the first one is the “most important” one), and add splines for three continuous variables

dfvi = data.frame(nom = names(subfrenchmotor)[-15], g = as.numeric(vi))  # remove the response y (column 15)
dfvi = dfvi[rev(order(dfvi$g)), ]   # sort by decreasing importance
nom = dfvi$nom
nom[1] = "bs(LicAge)"       # spline transforms for the three continuous variables
nom[3] = "bs(DrivAge)"
nom[7] = "bs(BonusMalus)"

Then, the idea is simple: at stage k, we keep the k most important variables, and run a logistic regression on those k variables. Again, I should stress that the gender of the driver is not among those k variables. We then compute the average predicted claim frequency for men and for women.

n = nrow(subfrenchmotor)
library(splines)   # for bs()
idx_F = which(frenchmotor$sensitive == "Female")
idx_M = which(frenchmotor$sensitive == "Male")

# average predicted claim frequency (with 10% and 90% quantiles) by gender,
# from a logistic regression on the k most important variables
metric_gender = function(k = 3){
  if(k == 0){
    reg = glm(y ~ 1, family = binomial, data = subfrenchmotor)
  } else {
    fm = paste("y ~ ", paste(nom[1:k], collapse = " + "), sep = "")
    reg = glm(fm, family = binomial, data = subfrenchmotor)
  }
  yp   = predict(reg, type = "response")
  yp_F = yp[idx_F]
  yp_M = yp[idx_M]
  sortie = c(mean(yp_F), mean(yp_M), quantile(yp_F, c(.1, .9)), quantile(yp_M, c(.1, .9)))
  names(sortie)[1:2] = c("mean_F", "mean_M")
  sortie
}

Let us now compute it for all possible numbers of variables

N = 0:15
M = Vectorize(metric_gender)(N)

and plot it

plot(N, M[1,]*100, xlab = "Number of predictive variables (without gender)",
     ylab = "Average predicted claims frequency (%)",
     type = "b", pch = 19, col = COLORS[2], ylim = c(8.12, 9))
lines(N, M[2,]*100, type = "b", pch = 15, col = COLORS[3])

Interestingly, we can clearly see that with 15 explanatory variables, even if our model is gender-blind (since gender is not in the training dataset), it reproduces the difference we observe in the data: the predicted annual claim frequency is almost 9% for men and about 8.2% for women.
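To make that gap explicit (a small addition, not in the original post), it can be read directly from the matrix M computed above, whose first two rows are mean_F and mean_M:

# additional check, not in the original post: male-female gap (in percentage
# points) in average predicted claim frequency, for each number of variables k
gap_pts = (M["mean_M", ] - M["mean_F", ]) * 100
round(gap_pts, 2)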

Actually, it is not possible to predict the gender from our 15 variables (below is the ROC curve of the logistic regression used to predict the gender)

library(ROCR)   # for prediction() and performance()
# ROC curve of a logistic regression predicting the gender from the k most important variables
metric_gender_2 = function(k = 3){
  if(k == 0){
    reg = glm((sensitive == "Female") ~ 1, family = binomial, data = frenchmotor)
  } else {
    fm_genre = paste('(sensitive=="Female") ~ ', paste(nom[1:k], collapse = " + "), sep = "")
    reg = glm(fm_genre, family = binomial, data = frenchmotor)
  }
  pred = prediction(predict(reg, type = "response"), (frenchmotor$sensitive == "Female"))
  performance(pred, "tpr", "fpr")
}

plot(metric_gender_2(15))
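As a complementary check (an addition, not in the original code), the area under that ROC curve can be extracted with ROCR, to quantify how poorly gender is predicted from the 15 rating variables:

# additional check, not in the original post: AUC of the gender-prediction model
reg_gender  = glm(paste('(sensitive=="Female") ~', paste(nom[1:15], collapse = " + ")),
                  family = binomial, data = frenchmotor)
pred_gender = prediction(predict(reg_gender, type = "response"),
                         frenchmotor$sensitive == "Female")
performance(pred_gender, "auc")@y.values[[1]]   # an AUC close to .5 means gender is essentially unpredictable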

but still, when using 15 variables, we observe discrimination in our portfolio, since the average predictions for men and women are significantly different (even if our models are, per se, gender-blind).



Long-term Implications and Future Developments

The topic of data sharing in the insurance industry has been a subject of critical analysis. The blog post, written with Laurence Barry, presents a scenario where a real insurance portfolio dataset is used to analyze the possible consequences of data sharing. Particular attention is given to explaining why such practices could end mutualization and potentially promote discrimination.

The core of the issue stems from using variable importance to build predictive models on insurance datasets from which sensitive variables, such as gender, have been removed. The intention is that omitting a sensitive variable should prevent discrimination based on it. Interestingly, the authors found that with 15 explanatory variables, discrimination was still present in the predictions: despite the model being nominally “gender-blind”, it replicated the differences already observed in the dataset.

Possible Future Developments

Data-sharing practices in insurance are constantly evolving. As technology advances, more refined and complex algorithms can be developed, potentially increasing the risk of more sophisticated forms of disguised discrimination. At the same time, improved technology and machine-learning methods may provide tools for detecting and eliminating such disparities.

Actionable Advice

Considering this context, here are some actionable recommendations for insurance companies and professionals in this field:

  1. Work towards increased transparency: Insurers should strive to make their predictive models more transparent. This will create a better understanding of how these models work, among both consumers and regulatory authorities.
  2. Invest in bias-detection methodology: With growing concern about unfair discrimination, investing in methods to detect biases should be a priority for companies that use big data (a minimal sketch of such a check is given after this list).
  3. Consult with regulators: It is crucial for insurers to consult with their regulators about their use of these models and any potential discriminatory ramifications, so that they remain compliant with existing laws and regulations.
  4. Establish strict fairness criteria: Insurance companies should define strict fairness criteria for their models which ensure equal opportunities for different groups of people. It is also advisable to communicate these criteria transparently to customers and stakeholders.
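To illustrate the second point, a very basic bias-detection check simply compares average predictions across groups; the helper below is a hypothetical sketch (its name and arguments are not from the original post), assuming a fitted model and a dataset with a sensitive attribute, such as the frenchmotor data used above.

# hypothetical helper, for illustration only: demographic-parity gap, i.e. the
# difference in average predicted probabilities between two groups
parity_gap = function(model, data, sensitive, level = "Female"){
  yp = predict(model, newdata = data, type = "response")
  mean(yp[data[[sensitive]] == level]) - mean(yp[data[[sensitive]] != level])
}
# e.g. parity_gap(glm(y ~ bs(BonusMalus), family = binomial, data = subfrenchmotor),
#                 frenchmotor, "sensitive")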

In conclusion, while big data presents significant opportunities for the insurance industry, it also poses challenges regarding fair practices and discrimination. Therefore, it’s vital for companies to take active measures towards making their predictive models as transparent, fair, and nondiscriminatory as possible.
