My last post used {hstats}, {kernelshap} and {shapviz} to explain a binary classification random forest. Here, we use the same package combo to improve a Poisson GLM with insights from a boosted trees model.
Insurance pricing data
This time, we work with a synthetic, but quite realistic dataset. It describes one million insurance policies and their corresponding claim counts. A reference for the data is:
Mayer, M., Meier, D. and Wuthrich, M.V. (2023), SHAP for Actuaries: Explain any Model. http://dx.doi.org/10.2139/ssrn.4389797
We fit a naive additive linear GLM and a tuned Boosted Trees model.
We combine the models and specify their predict function.
# Train/test split
set.seed(8300)
ix <- sample(nrow(df), 0.9 * nrow(df))
train <- df[ix, ]
valid <- df[-ix, ]
# Naive additive linear Poisson regression model
(fit_glm <- glm(claim_nb ~ ., data = train, family = poisson()))
# Boosted trees with LightGBM. The parameters (incl. number of rounds) have been chosen
# by combining early stopping with random search CV (not shown here)
# xvars (the feature names) and library(lightgbm) are assumed to be defined/loaded
# earlier in the original post (not shown in this excerpt)
dtrain <- lgb.Dataset(data.matrix(train[xvars]), label = train$claim_nb)

params <- list(
  learning_rate = 0.05,
  objective = "poisson",
  num_leaves = 7,
  min_data_in_leaf = 50,
  min_sum_hessian_in_leaf = 0.001,
  colsample_bynode = 0.8,
  bagging_fraction = 0.8,
  lambda_l1 = 3,
  lambda_l2 = 5
)

fit_lgb <- lgb.train(params = params, data = dtrain, nrounds = 300)
# {hstats} works for multi-output predictions,
# so we can combine all models into a list, which simplifies the XAI part.
models <- list(GLM = fit_glm, LGB = fit_lgb)
# Custom predictions on response scale
pf <- function(m, X) {
  cbind(
    GLM = predict(m$GLM, X, type = "response"),
    LGB = predict(m$LGB, data.matrix(X[xvars]))
  )
}
pf(models, head(valid, 2))
# GLM LGB
# 0.1082285 0.08580529
# 0.1071895 0.09181466
# And on log scale
pf_log <- function(m, X) {
  log(pf(m = m, X = X))
}
pf_log(models, head(valid, 2))
#       GLM       LGB
# -2.223510 -2.455675
# -2.233157 -2.387983
Traditional XAI
Performance
Comparing the average Poisson deviance on the validation data shows that the LGB model is clearly better than the naively built GLM, so there is room for improvement!
perf <- average_loss(
models, X = valid, y = "claim_nb", loss = "poisson", pred_fun = pf
)
perf
# GLM LGB
# 0.4362407 0.4331857
Feature importance
Next, we calculate permutation importance on the validation data with respect to mean Poisson deviance loss. The results make sense, and we note that year and car_weight seem to be negligible.
imp <- perm_importance(
models, v = xvars, X = valid, y = "claim_nb", loss = "poisson", pred_fun = pf
)
plot(imp)
Main effects
Next, we visualize estimated main effects by partial dependence plots on the log link scale. The differences between the models are quite small, with one big exception: investing more parameters into driver_age via a spline will greatly improve the performance and usefulness of the GLM.
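The PDP code itself is not shown in this excerpt; a minimal hedged sketch of how such a plot can be produced with {hstats}, here for driver_age (the other features work the same way):
# Hedged sketch, not necessarily the author's exact code: main-effect PDP of
# driver_age for both models on the log link scale.
partial_dep(models, v = "driver_age", X = train, pred_fun = pf_log) |>
  plot(show_points = FALSE)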
Friedman’s H-squared (per feature and feature pair), calculated on the log link scale, shows that – unsurprisingly – our GLM does not contain interactions, and that the strongest relative interaction happens between town and car_power. The stratified PDP visualizes this interaction. Let’s add a corresponding interaction effect to our GLM later.
system.time( # 5 sec
  H <- hstats(models, v = xvars, X = train, pred_fun = pf_log)
)
H
plot(H)
# Visualize strongest interaction by stratified PDP
partial_dep(models, v = "car_power", X = train, pred_fun = pf_log, BY = "town") |>
plot(show_points = FALSE)
SHAP
As an elegant alternative to studying feature importance, PDPs and Friedman’s H, we can simply run a SHAP analysis on the LGB model.
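The SHAP code itself is not part of this excerpt. Below is a hedged sketch of what such an analysis could look like with {kernelshap} and {shapviz}; the sample sizes and the seed are my own assumptions, not the author's:
# Hedged sketch, not the author's exact code: Kernel SHAP for the LightGBM
# model on a sample of rows, with a small background data set.
library(kernelshap)
library(shapviz)

set.seed(10)
X_explain <- train[sample(nrow(train), 1000), xvars]
bg <- train[sample(nrow(train), 200), xvars]

ks <- kernelshap(fit_lgb, X = data.matrix(X_explain), bg_X = data.matrix(bg))
sv <- shapviz(ks)

sv_importance(sv)                    # SHAP feature importance
sv_dependence(sv, v = "driver_age")  # SHAP dependence plot for one feature
For a tree model like this one, shapviz() can also compute TreeSHAP directly from the booster, which is typically much faster than Kernel SHAP.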
In the final section, we apply the three insights from above with very good results.
library(splines)  # for ns()

fit_glm2 <- glm(
  claim_nb ~ car_power * town + ns(driver_age, df = 7) + car_age,
  data = train,
  family = poisson()
)

# Performance now as good as LGB
perf_glm2 <- average_loss(
  fit_glm2, X = valid, y = "claim_nb", loss = "poisson", type = "response"
)
perf_glm2 # 0.432962
# Effects similar to LGB, and smooth
partial_dep(fit_glm2, v = "driver_age", X = train) |>
plot(show_points = FALSE)
partial_dep(fit_glm2, v = "car_power", X = train, BY = "town") |>
plot(show_points = FALSE)
Improving naive GLMs with insights from ML + XAI is fun.
In practice, the gap between a GLM and a boosted trees model can’t be closed that easily. (The true model behind our synthetic dataset contains a single interaction, unlike real data/models that typically have many more interactions.)
{hstats} can work with multiple regression models in parallel. This helps to keep the workflow smooth. The same holds for {kernelshap}.
A SHAP analysis often brings the same qualitative insights as multiple other XAI tools together.
The original text explores how several packages – {hstats}, {kernelshap} and {shapviz} – can be used to improve a Poisson GLM (Generalized Linear Model) with insights from a boosted trees model. The project worked with a synthetic dataset describing 1 million insurance policies and their corresponding claim counts.
Long-term Implications and Future Developments
Improving naive GLMs (Generalized Linear Models) with insights from Machine Learning and XAI (Explainable Artificial Intelligence) might bear significant implications for actuarial science and insurance pricing. By applying machine learning and XAI, refined and more accurate models can be developed. This means improved risk assessment, premium calculation, underwriter performance, and ultimately more profitable insurance policies.
Interest in XAI is growing as machine learning becomes more complex. Today’s opaque ML models often fail to provide transparency about their decision-making processes, making their application problematic in regulatory or high-stake areas such as insurance. Therefore, the future may see the development of more tools implementing XAI methodology for better interpretability and trust in machine learning models.
Actionable Advice
Experiment with multiple XAI techniques: The author found that a SHAP analysis brought the same qualitative insights as several other XAI tools combined. Therefore, practising different methods can help you gain a better understanding of your models.
Factor in the problem’s inherent complexity: The author expressed some skepticism about how easily the gap between a GLM and a boosted trees model can be closed in practice. Real datasets often contain many more interactions than the synthetic dataset used here. Take this into account when extending a technique tested on synthetic data to real use cases.
Continuous Learning: With the rise of XAI and continuous advancements in data science, it is crucial to keep up with new methodologies and tools to enhance your models over time.
Regulatory Compliance: As systems become more explainable, regulatory bodies might introduce new rules necessitating explicit transparency in machine learning applications. Companies implementing these systems should keep an eye on the regulatory landscape.
Continuous improvement of models: Insights from ML and XAI showed that the GLM model could be significantly improved. This indicates the importance of continuous tweaking and refining of models.
Conclusion
The work exemplifies the potential of combining traditional statistical models such as GLMs with insights from machine learning, especially tree-based methods, to achieve better predictions. Utilizing such methods could substantially improve how the insurance industry predicts risk, manages claims, and calculates premiums.
2024 promises to be a breakout year for Generative AI (GenAI) and AI. However, there are two challenges that organizations will face in 2024 to “leverage AI to get value from their data.” Challenge #1: Too much focus is on “implementing AI” and not enough on gaining organizational alignment regarding where and how value will… (Excerpt from “GenAI: Beware the Productivity Trap; It’s About Cultural Empowerment – Part 3”)
Long-Term Implications and Possible Future Developments of Generative AI
The prospect that 2024 will be a groundbreaking year for Generative AI is thrilling for businesses worldwide. The power that GenAI carries in transforming business operations, enhancing productivity, and creating value cannot be overstated. However, successfully boarding the GenAI train will require strategic preparation.
In dealing with challenges such as the tendency to focus more on AI implementation than on aligning the organization to where and how value can be derived, organizations need to be clever. It is important to remember that embracing GenAI is not just about improving productivity but also about promoting a shift in culture to maximize the technology’s benefits.
Understanding the Challenges
Too Much Focus on AI Implementation
While it’s true that executing AI systems properly is essential for them to work, putting too much emphasis on this process can divert attention from the broader picture. The purpose of deploying such technologies is, after all, to get real value from data. If an organization overlooks the process of recognizing where and how to extract this value, then they may fail to reap full benefits from their AI tools.
Lack of Cultural Empowerment
An ‘AI culture’ goes beyond just using the technology; it includes fostering an environment where data-driven decision making is encouraged, where failures are seen as learning opportunities, and where continuous learning and adaptation are the norm. The lack of such a culture constrains the application and growth potential of GenAI technology.
Future Developments and Implications
As more organizations become familiar with GenAI, a cultural shift is likely to occur – one that sees organizations not just implementing AI but also integrating it into all aspects of their work. This means the focus will not only be on implementation, but more importantly, on where and how AI can add value to the organization. As this approach germinates and matures, it could lead to a broad transformation in which the majority of organizations are data-focused.
Actionable Advice
To leverage GenAI and AI’s full potential, organizations should consider the following steps:
During the planning stages, ensure you identify where and how the AI technology can add value to your organization. This will ensure a return on your investment.
Promote an AI-friendly culture, where employees understand and appreciate the benefits AI can offer. This culture should support continuous learning, data literacy, and encourage data-driven decision-making.
Secure a buy-in from all levels of the organization. Achieving real success requires full commitment and cooperation from all stakeholders involved.
Remember: To fully embrace Generative AI, focus not only on the technology itself but also on identifying where value can be created, and fostering an environment that promotes creativity, agility, adaptability and continuous learning.
[This article was first published on R Consortium, and kindly contributed to R-bloggers.]
Last year, Antonio Hegar of the R Glasgow user group shared the challenges of organizing an R user group in Glasgow. The group now regularly hosts events, attracting local R users and experts. Antonio shared with the R Consortium the group’s journey and anecdotes that have helped it to build momentum. He also shared his hopes for maintaining this momentum, with speakers lined up for the next three events.
Antonio also discussed his work with R for his PhD research in data analysis for healthcare. He spoke about the ever-evolving nature of R and some of the new developments that have been useful for his research.
What’s new with the Glasgow R User Group since we last talked?
As we discussed the last time, one of the most pressing issues we faced as a local R user group was our lack of engagement with the community. This is particularly interesting given that both Glasgow and Edinburgh have their own R user groups. Both cities are only an hour apart, yet we weren’t seeing the same level of engagement as other groups in the UK.
To address this issue, we have been strategizing and holding several meetings. To summarize, we discussed improving our marketing and engagement with our audience. We also decided to hold one final meeting at the end of the year.
Besides our internal meetings, we also hosted two R events. One of the group’s founders, Andrew Baxter, a postgraduate researcher at the University of Glasgow, has been instrumental in organizing these events. Because he works at the University of Glasgow, he has access to many resources, including physical venues and fellow academics, and this has been a major plus in facilitating our engagement.
Previously, I had been trying to do what other groups have done: finding random venues and hosting events there. However, this was not as effective as we had hoped.
From the discussions that we had, as well as listening to our audience, we learned that people who are interested in working with R have very specific wants and needs. If these needs are not being met, then it is unlikely that people will be attracted to the group, and as such, we had to reframe our approach to attracting people.
We recognized it is key to have a specific venue. We now hold the vast majority of our meetings at the University of Glasgow. This seems to be very appealing to people, as they enjoy the academic setting. Furthermore, the University of Glasgow is well known and respected, not just in Scotland but across the world, and this adds weight to the appeal, and the reputation helps to draw people in.
The second thing that proved essential was consistency. Having a meeting for one month and then having a gap breaks the flow, and sends the wrong message to your audience. When people see that you are committed to what you want to do, they respond to that and are more likely to be engaged in the community.
We had a final meeting in December, and Andrew Baxter contacted Mike Smith, one of the local R Consortium representatives. He is based in Dublin, Ireland, but frequently travels back and forth to Scotland. He leveraged this network to recommend speakers and topics for the conference. This was particularly helpful in attracting people from industry, who are often interested in the latest developments in R. Mike has been a tremendous asset to the group since our meeting in December.
A venue, people on the inside of the industry, and a consistent schedule have been the three key components. Three speakers have been lined up for early 2024: one each for January, February, and March.
We will not have much difficulty finding additional speakers based on the academic and industrial contacts. At most, we must determine who will speak on which topic and when they will be available, which is not difficult. Based on the current situation, it does not appear that we will have any trouble maintaining momentum and keeping the meetings going.
What industry are you currently in?
I am a PhD student at Glasgow Caledonian University. My PhD research focuses on data science applied to health, specifically using machine learning to predict disease outcomes.
I am interested in understanding why some people who experience an acute illness, such as COVID-19, develop long-term health problems. In some countries, up to 10% of people who contract COVID-19 never fully recover. These individuals may experience permanent shortness of breath, headaches, brain fog, joint pain, and other symptoms.
I am currently researching how data science can be used to answer questions such as these, using large data sets from, for example, the NHS. R is the primary tool used for this research.
When we last spoke, I was in the second year of my PhD. I am now in my third and final year. I should be submitting my dissertation before the end of this year. Balancing my commitments to R, my PhD work, and other activities is challenging, but I managed to pull it off.
How do you use R for your work?
I extensively use R. One of R’s most beneficial aspects is that it’s constantly evolving and expanding. As a result, it is impossible to master everything. You do not master R; rather, you master certain R areas relevant to your research or area of expertise. In my research, I found several medical statistics and biostatistics packages extremely useful. I was aware of a few of them but unaware of how many there were.
For instance, consider a brief example of a task that I began working on yesterday. In the context of medical data, particularly when analyzing health conditions, it is common for individuals to have multiple health conditions that are often linked. This often makes it more difficult for doctors to treat and for individuals to recover fully.
If I were to apply classical statistics using base R, this would be very time-consuming. However, I recently discovered that there are also medical statistical packages specifically designed for analyzing data for individuals with comorbidities. For example, if I wanted to analyze individuals suffering from diabetes, hypertension, cancer, obesity, or a combination of different diseases, I could do so using these packages.
In addition, it is possible to create a score that can be used to estimate the likelihood that a person who becomes ill and goes to the hospital will stay for a long time or die. It is possible to perform this task with regular statistics and programming in R, but it would be very tedious. In my case, I am working on a tight deadline and need to submit my work by a specific date. I believe the package I am speaking of is the comorbidity package in R. It was developed recently by researchers at the London School of Hygiene & Tropical Medicine and is an invaluable tool.
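As an illustration only (this code is not from the interview), here is a sketch of how such a score might be computed with the {comorbidity} package; the data object, column names, and ICD-10 map are assumptions, and argument names may differ between package versions:
# Hypothetical sketch: Charlson comorbidity index from ICD-10 codes.
library(comorbidity)

cmb <- comorbidity(
  x = icd_data,          # assumed: one row per patient-diagnosis pair
  id = "patient_id",     # assumed column name
  code = "icd10_code",   # assumed column name
  map = "charlson_icd10_quan",
  assign0 = FALSE
)
cmb$charlson_score <- score(cmb, weights = "charlson", assign0 = FALSE)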
I work with NHS data through a third-party organization that controls it and allows me access to it. Last year in December, they provided me with brief training and taught me how to access their data on a DBS SQL server using SQL queries embedded in R code.
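For illustration only (again, not the interviewee’s actual setup), SQL queries embedded in R typically look something like this with {DBI} and {odbc}; every connection detail below is a placeholder:
# Illustrative sketch with placeholder connection details and a hypothetical table.
library(DBI)

con <- dbConnect(
  odbc::odbc(),
  driver   = "ODBC Driver 18 for SQL Server",  # assumed driver name
  server   = "example-server",
  database = "example_db",
  uid      = Sys.getenv("DB_USER"),
  pwd      = Sys.getenv("DB_PASS")
)

admissions <- dbGetQuery(con, "SELECT TOP 100 * FROM admissions")
dbDisconnect(con)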
Learning about very niche packages, which are very content- or topic-specific, is very useful for researchers like myself. Integrating different programming languages is also useful because their ecosystems are increasingly converging. Python, Julia, R, and Java show a lot of cross-fertilization between their programming and software development packages. If R continues to streamline its integration with other languages and packages, it will be a win-win situation for everyone.
What is the R Community like in Glasgow? What efforts are you putting in to keep your group inclusive for all participants?
We are not trying to cater to one specific level of expertise. The last meeting had a good mix of participants, including PhD students, undergraduates, people who have worked in finance and tech, software developers, and an individual from the R Consortium in Dublin, Ireland.
The group is open to everyone, and we are trying to mix participants with different needs, wants, and interests. It is understood that attendees will choose which events they would like to attend. Certain events will focus more on entry-level individuals beginning their R learning journey. For example, they are interested in learning what they can do with ggplot and the tidyverse.
Mid-level individuals, including graduate students, will also be targeted. A portion of these students are novices, but many are more experienced. They have a strong foundation in R and RStudio or Posit. However, they are now seeking to learn more advanced techniques, such as how to perform specific calculations. For instance, they may be working with quantitative or qualitative data and are now at the analysis stage of their research and wonder what to do next.
Finally, there are a small number of highly experienced programmers who are interested in learning more about integrating specific features into a package. They may want to know how to create their packages and launch them. They are also interested in learning about Shiny and Quarto and how they can use these tools for their businesses or companies.
Most individuals fall into the beginner or intermediate levels, but there are a few who are highly advanced and still interested in attending. As a result, most of the talks will be geared toward individuals with intermediate-level experience. This will ensure that the material is not too advanced for beginners but also not too basic for advanced learners.
Can you tell us about a recent event that received a good response from the audience?
Of the recent events that were particularly successful, I would like to highlight the one held in November last year. It was titled “Flex Dashboard: Displaying data with high impact using minimal code.” Erik Igelström, a researcher from the University of Glasgow, presented his use of R Shiny to display data from the Scottish government. The presentation was highly informative and demonstrated the potential of Shiny to present data in a user-friendly manner.
The meeting was attended by a representative from R Software in Ireland, who provided us with a wealth of information about industry developments, including the latest trends and upcoming projects. As a result of this meeting, 2023 was the most productive year for our R meetup.
The preceding meetups were not entirely unproductive, but the most recent one, held in November last year, laid the groundwork for the current initiatives.
Professor Lisa DeBruine will be presenting at this Meetup. She is a professor of psychology at the University of Glasgow in the School of Psychology and Neuroscience. She is a member of the UK Reproducibility Network and works in PsyTeachR. She has used the psych package extensively and many other good packages in R to conduct her psychological research. Her presentation will be on how to simulate data to prepare analyses for pre-registration.
As those who work with data know, it is sometimes counterproductive to work directly with the data itself. For example, if one is building a model, it is not advisable to use all of the data to build the model, especially if the data set is small. This is because there is a risk of over-fitting.
Generating dummy data for quantitative data is a well-known technique. However, generating dummy data for qualitative data is rare. This is because qualitative data is often unstructured and difficult to quantify. Professor Lisa DeBruine is an expert in generating dummy data for qualitative data.
SPSS is a popular statistical software package used by sociologists, anthropologists, and psychologists. However, R is a more powerful and flexible tool that can perform a wider range of analyses. Learning to use R and the psych package can greatly simplify the process of conducting factor analysis. Additionally, R can be used to perform calculations and analyses that are impossible in SPSS.
Our team is highly capable, and we have another team member who is particularly skilled in generating graphics and designing flyers. He has been responsible for creating the promotional material and has done an excellent job.
Long-Term Implications and Future Developments Based on R Glasgow User Group’s Experience
From the perspective of Antonio Hegar of the R Glasgow User Group, it’s clear that establishing a thriving, local user group for the R language presents unique challenges and opportunities. These insights can offer valuable takeaways for leaders in tech communities everywhere, aimed at facilitating knowledge exchange, increasing engagement and driving innovation.
Understanding The Importance of Venue
The R Glasgow User Group’s journey suggests that a recognized and respected venue can play a key role in attracting participants. Holding events at the University of Glasgow provided attendees with an academic setting where they felt comfortable and engaged. The university’s reputation also served to lend credibility to the group and its events. Therefore, investing time in selecting an appropriate venue could pay dividends in the group’s growth and success.
Maintaining Consistency
Antonio emphasized that consistency is crucial to sustain user engagement and community participation. Regular meetings without significant breaks between them give attendees a sense of continuity and dedication, fostering greater involvement within the community. This consistency can also help in enhancing the group’s visibility and boosting its reputation within the wider tech community.
Utilizing Diverse Skill Sets
The active involvement of experts like Andrew Baxter (a postgraduate researcher at the University of Glasgow) and Mike Smith (a local R Consortium representative) demonstrated how diversity of knowledge and skills can significantly influence a user group’s success. Purposely employing individuals with different backgrounds, experiences, and areas of expertise can lead to a broad range of topics, comprehensive discussions, and an inclusive learning environment.
Skill-Level Inclusivity
The Glasgow R User Group aims to cater to varying levels of expertise, from beginners to advanced programmers. Tailoring event content to different learning levels contributes to an inclusive community environment and better serves the diverse needs of its members. This approach can also promote a richer exchange of ideas across different experience levels, potentially leading to unexpected insights and innovation.
Innovative Use of R for Data Science in Healthcare
Antonio’s use of R for his PhD research in healthcare data analytics signifies the potential future applications of R in different industries. The evolution of R-specific packages for varied niche analysis can streamline research processes and further drive the adoption of R in various fields, from medical statistics to software development.
Looking Ahead
Based on current developments, the R Glasgow user group has a promising future, starting with speakers lined up for the early part of 2024. The group plans to tap into its growing network of both academic and industrial contacts, ensuring a diversity of topics and expertise.
Actionable Advice
Choose a venue known and respected by your target audience to enhance participation
Maintain a consistent event schedule to foster community growth
Use diverse skill sets within the group to enrich event content and discussions
Create content tailored to different experience levels to cater to all members
Keep an eye on evolving R-specific packages that can significantly streamline analysis processes in various fields
In Conclusion
The success of the R Glasgow user group is indeed a testament to the importance of community in the tech world. The shared experiences, resources, and insights foster learning, inspire new ideas, and catalyze innovation. As the ever-evolving nature of R continues to open up opportunities, groups like the one in Glasgow provide invaluable support and opportunities for collaboration around the world.
[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers.]
About
Capybara is fast software with a small footprint that provides efficient functions for demeaning variables before conducting a GLM estimation via Iteratively Weighted Least Squares (IWLS). This technique is particularly useful when estimating linear models with multiple group fixed effects.
The software can estimate GLMs from the Exponential Family and also Negative Binomial models, but the focus will be the Poisson estimator because it is the one used for structural counterfactual analysis in International Trade. It is relevant to add that the IWLS estimator is equivalent to the PPML estimator of Santos Silva and Tenreyro (2006).
Traditional QR estimation can be unfeasible due to additional memory requirements. The method, which is based on Halperin’s 1962 article on vector projections, offers important time and memory savings without compromising numerical stability in the estimation process.
The software borrows heavily from the work of Gaure (2013) and Stammann (2018) on the OLS and IWLS estimators with large k-way fixed effects (i.e., the lfe and alpaca packages). The difference is that Capybara takes an elementary approach and uses minimal C++ code without parallelization, which achieves very good results considering its simplicity. I hope it is easy to maintain.
The summary tables are nothing like R’s default and borrow from the broom package and Stata outputs. The default summary from this package is a Markdown table that you can insert in R Markdown/Quarto or copy and paste into Jupyter.
Demo
Estimating the coefficients of a gravity model with importer-time and exporter-time fixed effects.
library(capybara)

mod <- feglm(
  trade ~ dist + lang + cntg + clny | exp_year + imp_year,
  trade_panel,
  family = poisson(link = "log")
)

summary(mod)
Capybara represents a key advancement in software for estimating Generalized Linear Models (GLMs) from the Exponential Family and Negative Binomial models. Its key points of differentiation hinge on time and memory savings, minimal C++ code usage, and ease of maintenance. These facets of development have significant implications for its long-term adoption and use within the context of structural counterfactual analysis in International Trade, as well as other research fields that make broad use of GLMs.
Implications of Capybara’s Approaches
One of the most noteworthy underpinnings of Capybara is the efficiency it provides for demeaning variables before conducting a GLM estimation via Iteratively Weighted Least Squares (IWLS). This is highly advantageous when estimating linear models with multiple group fixed effects.
The speed and small memory footprint are particularly impressive, underlining the benefits of the vector-projection method based on Halperin (1962). Traditional QR estimation, which can become unfeasible due to additional memory requirements, is thus surpassed by Capybara. In doing so, Capybara establishes itself as a tool that could provide significant benefits for researchers going forward.
Future Developments: A Potential Goldmine of Enhancements
The present differentiators between Capybara and other packages such as alpaca and base R suggest promising potential for enhancements. Given its comparatively lower need for memory allocation and faster processing times, Capybara could evolve to cater to more elaborate statistical analyses without the risk of compromising numerical stability.
The output quality of summary tables generated by Capybara is also an advantage, being similar to those from the Broom package and Stata outputs. This feature might encourage adoption by those who prefer cleaner, easily interpretable outputs. Future additions to this feature could be more customizations and improvements in formatting functions.
Actionable Advice
For Researchers: If you deal with estimation of GLMs from the Exponential Family and Negative Binomial models, or are directly involved in structural counterfactual analysis, adopting Capybara can likely enhance your productivity. Its succinct and efficient approach can save time, and because it relies on iteration rather than large memory allocations, it can function exceptionally well even on low-memory systems.
For Developers: Although Capybara already performs well, it still uses elementary C++ code without parallelization. Work on integrating parallel workflows, or on optimizing the C++ code further for faster execution, could bring even better results.
Conclusion
As a compact and memory-efficient tool, Capybara has much to offer in GLM estimations and beyond. Its novelty lies in its lean nature, the amenability of its processes and its systematic simplicity. Adapting it for mainstream use and tailoring it meticulously could reshape the way we view GLM estimations – for the better.
As mentioned about a million times on this blog, last year I read Git in practice by Mike McQuaid and it changed my life – not only giving me bragging rights about the reading itself. I decided to give Pro Git by Scott Chacon a go too. It is listed in the resources section of the excellent “Happy Git with R” by Jenny Bryan, Jim Hester and others. For unclear reasons I bought the first edition instead of the second one.
Git as an improved filesystem
In Chapter 1 (Getting Started), I underlined:
“[a] mini filesystem with some incredibly powerful tools built on top of it”.
Awesome diagrams
One of my favorite parts of the book were the diagrams such as the one illustrating “Git stores data as snapshots of the project over time”.
A reminder of why we use Git
“after you commit a snapshot into Git, it is very difficult to lose, especially if you regularly push your database to another repository.”
The last chapter in the book, “Git internals”, includes a “Data recovery” section about git reflog and git fsck.
One can negate patterns in .gitignore
I did not know about this pattern format. In hindsight it is not particularly surprising.
git log options
The format option lets one tell Git how to, well, format the log. I mostly interact with the Git log through a GUI or the gert R package, but that’s good to know.
The book also describes how to filter commits in the log (by date, author, committer). I also never do that with Git itself, but who knows when it might become useful.
A better understanding of branches
I remember reading “branches are cheap” years ago and accepting this as fact without questioning the reason behind the statement. Now thanks to reading the “Git Branching” chapter, but also Julia Evans’ blog post “git branches: intuition & reality”, I know they are cheap because they are just a pointer to a commit.
Likewise, the phrase “fast forward” makes more sense after reading “Git moves the pointer forward”.
The “Git Branching” chapter is also a place where diagrams really shine.
Is rebase better than merge?
“(…) rebasing makes for a cleaner history. If you examine the log of a rebased branch, it looks like a linear history.”
“Rebasing replays changes from one line of work onto another in the order they were introduced, whereas merging takes the endpoints and merges them together.”
Reading this reminds me of the (newish?) option for merging pull requests on GitHub, “Rebase and merge your commits” – as opposed to merge or squash&merge.
Don’t rebase commits and force push to a shared branch
The chance of getting the same SHA-1 twice in your repository
“A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.”
Ancestry references
The name of references with “^” or “~” (or both!) is “ancestry references”. Both HEAD~3 and HEAD^^^ are “the first parent of the first parent of the first parent”.
Double dot and triple dot
I do not intend to try and learn this but…
The double-dot syntax “asks Git to resolve a range of commits that are reachable from one commit but aren’t reachable from another”.
The triple-dot syntax “specifies all the commits that are reachable by either of the two references but not by both of them”.
New-to-me aspects of git rebase -i
git rebase -i lists commits in the reverse order compared to git log. I am not sure why I did not make a note of this before.
I had not really realized one could edit single commits in git rebase -i. When writing “edit”, rebasing will stop at the commit one wants to edit. That strategy is actually featured in the GitHub blog post “Write Better Commits, Build Better Projects”.
Plumbing vs porcelain
I had seen these terms before but never taken the time to look them up. Plumbing commands are the low-level commands, porcelain commands are the more user-friendly commands. At this stage, I do not think I need to be super familiar with plumbing commands, although I did click around in a .git folder out of curiosity.
Removing objects
There is a section in the “Git internals” chapter called “Removing objects”. I might come back to it if I ever need to do that… Or I’d use git obliterate from the git-extras utilities!
Conclusion
Pro Git was a good read, although I do wish I had bought the second edition. I probably missed good stuff because of this! My next (and last?) Git book purchase will be Julia Evans’ new Git zine when it’s finished. I can’t wait!
Understanding Git: From Reading Notes on ‘Pro Git’ by Scott Chacon
Git, the all-important version control system, is often noted for its learning curve. A recent blog post delves into insights gained from reading ‘Pro Git’ by Scott Chacon, expanding our understanding of Git’s capabilities and tips for using it efficiently.
Git as an Improved Filesystem
“[a] mini filesystem with some incredibly powerful tools built on top of it.”
This sentiment resonates through the book, emphasizing the unique features of Git, most notably the way it stores data as snapshots over time. This nuanced way of data handling ensures it’s difficult to lose a commit and furthermore strengthens the structure with regular pushes to another repository.
The Importance of Good Comprehension
Branches: The blog post highlights the importance of understanding Git-specific terms like ‘branches,’ referring to pointers to a commit.
Fast Forward: This term makes more sense once you know that, in a fast-forward merge, Git simply moves the branch pointer forward.
Rebase vs Merge: ‘Rebase’ is preferable for a cleaner history, resulting in a linear look when examining the log of a rebased branch. On the other hand, ‘merge’ takes the endpoints and merges them together.
The blog post urged caution towards rebasing commits and force pushing to a shared branch, stating that there could be potential risks involved in such tasks.
Familiarizing Oneself with Ancestry References
Ancestry references entail advanced usage of Git involving symbols like “^” or “~”. For instance, both HEAD~3 and HEAD^^^ point to “the first parent of the first parent of the first parent”.
Understanding Commit Syntax
Differentiating between double-dot and triple-dot syntax is important in Git. The former instructs Git to resolve a range of commits reachable from one commit but not from another, while the latter involves all commits reachable by either of two references, but not by both.
Editing with git rebase -i
‘git rebase -i’ enables editing of individual commits – this strategy is featured in the GitHub blog post “Write Better Commits, Build Better Projects”. The command also lists commits in reverse order compared to git log.
Plumbing vs Porcelain Commands
These are Git’s low-level and user-friendly commands, respectively. Knowing the distinction helps build a more complete mental model of Git.
Conclusion
Reading ‘Pro Git’ offers invaluable insights into Git management, highlighting aspects like rebasing, branching, commit editing and understanding various commands. Such knowledge can deeply improve one’s ability to navigate through version control systems when programming. As part of constant improvement and knowledge expansion, it’s advisable to keep exploring new resources and guides available in this area.
Introduction
Yesterday I discussed the use of the function internal_make_wflw_predictions() in the tidyAML R package. Today I will discuss the use of the function extract_wflw_pred() and the brand new function extract_regression_residuals() in the tidyAML R package. We briefly saw yesterday that the output of internal_make_wflw_predictions() is a list of tibbles that typically sits inside a list column in the final output of fast_regression() and fast_classification(). The function extract_wflw_pred() takes this list of tibbles and extracts them from that output. The function extract_regression_residuals() also extracts those tibbles and has the added feature of also returning the residuals. Let’s see how these functions work.
The new function
First, we will go over the syntax of the new function extract_regression_residuals().
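Based on the argument description below, its signature is presumably:
extract_regression_residuals(.model_tbl, .pivot_long = FALSE)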
The function takes two arguments. The first argument is .model_tbl, which is the output of fast_regression() or fast_classification(). The second argument is .pivot_long, a logical argument that defaults to FALSE. If TRUE, the output will be in long format. If FALSE, the output will be in wide format. Let’s see how this works.
Example
# Load packages
library(tidyAML)
library(tidymodels)
library(tidyverse)
library(multilevelmod) # for the gee model

tidymodels_prefer() # good practice when using tidyAML

rec_obj <- recipe(mpg ~ ., data = mtcars)

frt_tbl <- fast_regression(
  .data = mtcars,
  .rec_obj = rec_obj,
  .parsnip_eng = c("lm", "glm", "stan", "gee"),
  .parsnip_fns = "linear_reg"
)
Let’s break down the R code step by step:
Loading Libraries:
library(tidyAML)
library(tidymodels)
library(tidyverse)
library(multilevelmod) # for the gee model
Here, the code is loading several R packages. These packages provide functions and tools for data analysis, modeling, and visualization. tidyAML and tidymodels are particularly relevant for modeling, while tidyverse is a collection of packages for data manipulation and visualization. multilevelmod is included for the Generalized Estimating Equations (gee) model.
Setting Preferences:
tidymodels_prefer() # good practice when using tidyAML
This line of code is setting preferences for the tidy modeling workflow using tidymodels_prefer(). It ensures that when using tidyAML, the tidy modeling conventions are followed. Tidy modeling involves an organized and consistent approach to modeling in R.
Creating a Recipe Object:
rec_obj <- recipe(mpg ~ ., data = mtcars)
Here, a recipe object (rec_obj) is created using the recipe function from the tidymodels package. The formula mpg ~ . specifies that we want to predict the mpg variable based on all other variables in the dataset (mtcars).
Performing the Fast Regression: This part uses the fast_regression() function. It performs a fast regression analysis using various engines specified by .parsnip_eng and specific functions specified by .parsnip_fns. In this case, it includes linear models (lm), generalized linear models (glm), Stan models (stan), and the Generalized Estimating Equations model (gee). The results are stored in the frt_tbl table.
In summary, the code is setting up a tidy modeling workflow, creating a recipe for predicting mpg based on other variables in the mtcars dataset, and then performing a fast regression using different engines and functions. The choice of engines and functions allows flexibility in exploring different modeling approaches.
Now that we have the output of fast_regression() stored in frt_tbl, we can use the function extract_wflw_pred() to extract the predictions from the output. Let’s see how this works. First, the syntax:
extract_wflw_pred(.data, .model_id = NULL)
The function takes two arguments. The first argument is .data, which is the output of fast_regression() or fast_classification(). The second argument is .model_id, a numeric vector that defaults to NULL. If NULL, the function extracts none of the predictions from the output. If a numeric vector is provided, the function extracts the predictions for the models specified by that vector. Let’s see how this works.
extract_wflw_pred(frt_tbl, 1)
# A tibble: 64 × 4
.model_type .data_category .data_type .value
<chr> <chr> <chr> <dbl>
1 lm - linear_reg actual actual 15.2
2 lm - linear_reg actual actual 10.4
3 lm - linear_reg actual actual 33.9
4 lm - linear_reg actual actual 32.4
5 lm - linear_reg actual actual 16.4
6 lm - linear_reg actual actual 21.5
7 lm - linear_reg actual actual 15.8
8 lm - linear_reg actual actual 15
9 lm - linear_reg actual actual 14.7
10 lm - linear_reg actual actual 10.4
# ℹ 54 more rows
extract_wflw_pred(frt_tbl, 1:2)
# A tibble: 128 × 4
.model_type .data_category .data_type .value
<chr> <chr> <chr> <dbl>
1 lm - linear_reg actual actual 15.2
2 lm - linear_reg actual actual 10.4
3 lm - linear_reg actual actual 33.9
4 lm - linear_reg actual actual 32.4
5 lm - linear_reg actual actual 16.4
6 lm - linear_reg actual actual 21.5
7 lm - linear_reg actual actual 15.8
8 lm - linear_reg actual actual 15
9 lm - linear_reg actual actual 14.7
10 lm - linear_reg actual actual 10.4
# ℹ 118 more rows
extract_wflw_pred(frt_tbl, 1:nrow(frt_tbl))
# A tibble: 256 × 4
.model_type .data_category .data_type .value
<chr> <chr> <chr> <dbl>
1 lm - linear_reg actual actual 15.2
2 lm - linear_reg actual actual 10.4
3 lm - linear_reg actual actual 33.9
4 lm - linear_reg actual actual 32.4
5 lm - linear_reg actual actual 16.4
6 lm - linear_reg actual actual 21.5
7 lm - linear_reg actual actual 15.8
8 lm - linear_reg actual actual 15
9 lm - linear_reg actual actual 14.7
10 lm - linear_reg actual actual 10.4
# ℹ 246 more rows
The first line of code extracts the predictions for the first model in the output. The second line of code extracts the predictions for the first two models in the output. The third line of code extracts the predictions for all models in the output.
Now, let’s visualize the predictions for the models in the output and the actual values. We will use the ggplot2 package for visualization. First, we will extract the predictions for all models in the output and store them in a table called pred_tbl. Then, we will use ggplot2 to visualize the predictions and actual values.
pred_tbl <- extract_wflw_pred(frt_tbl, 1:nrow(frt_tbl))
pred_tbl |>
  group_split(.model_type) |>
  map(\(x) x |>
    group_by(.data_category) |>
    mutate(x = row_number()) |>
    ungroup() |>
    pivot_wider(names_from = .data_type, values_from = .value) |>
    ggplot(aes(x = x, y = actual, group = .data_category)) +
    geom_line(color = "black") +
    geom_line(aes(x = x, y = training), linetype = "dashed", color = "red",
              linewidth = 1) +
    geom_line(aes(x = x, y = testing), linetype = "dashed", color = "blue",
              linewidth = 1) +
    theme_minimal() +
    labs(
      x = "",
      y = "Observed/Predicted Value",
      title = "Observed vs. Predicted Values by Model Type",
      subtitle = x$.model_type[1]
    )
  )
(The code above returns a list of four plots, one per model type; the plots are not shown here.)
Or we can facet them by model type:
pred_tbl |>
  group_by(.model_type, .data_category) |>
  mutate(x = row_number()) |>
  ungroup() |>
  ggplot(aes(x = x, y = .value)) +
  geom_line(data = . %>% filter(.data_type == "actual"), color = "black") +
  geom_line(data = . %>% filter(.data_type == "training"),
            linetype = "dashed", color = "red") +
  geom_line(data = . %>% filter(.data_type == "testing"),
            linetype = "dashed", color = "blue") +
  facet_wrap(~ .model_type, ncol = 2, scales = "free") +
  labs(
    x = "",
    y = "Observed/Predicted Value",
    title = "Observed vs. Predicted Values by Model Type"
  ) +
  theme_minimal()
Ok, so what about this new function I talked about above? Well, let’s go over it here. We have already discussed its syntax, so no need to go over it again. Let’s just jump right into an example. This function will return the residuals for all models. We will slice off just the first model for demonstration purposes.
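The worked example is cut off in this excerpt. Here is a hedged sketch of what it presumably looks like, assuming the function returns one set of residuals per model that can be sliced like a list:
# Hedged sketch; the original example is not shown in this excerpt.
resid_list <- extract_regression_residuals(frt_tbl)
resid_list[[1]]  # residuals of the first model only

# Long format, convenient for plotting residual distributions:
extract_regression_residuals(frt_tbl, .pivot_long = TRUE)[[1]]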
Understanding the New Functions in the tidyAML R Package
The post discusses two new functions that have been introduced in the tidyAML R package, namely extract_wflw_pred() and extract_regression_residuals(). These functions are useful for extracting predictions and residuals from the output of tidyAML’s regression and classification workflows.
Function Overview
The function extract_wflw_pred() takes the list of tibbles typically found in the final output of the fast_regression() and fast_classification() functions and extracts them. The function extract_regression_residuals() performs a similar task but has the additional feature of returning residuals, which aid further analysis by revealing the difference between predicted and actual values.
Using the New Functions
These functions are meant to be invoked after fitting the models: once fast_regression() or fast_classification() has been run, you can apply them directly to the output. In extract_regression_residuals(), users can also choose the output format, selecting between wide and long formats.
Implementation Example
The blog post provides a comprehensive example of using these functions with the mtcars dataset, starting from preparing the data for modeling in a tidymodels and tidyAML workflow and ending with performing the regression and extracting the predictions.
Three scenarios demonstrated the use of extract_wflw_pred(). The first saw an extraction of predictions for the first model in the output, while the second extracted predictions for the first two models. The third example extracted predictions of all models in the output.
The Long-term Implications and Possible Future Developments
With the addition of these new functions, users can expect a smoother data extraction process for their predictive models. The ability of extract_wflw_pred() to target the model from which you wish to extract predictions, and extract_regression_residuals()’s optional output formatting and provision of residuals, are useful strides in predictive data analysis.
As for future developments, more functions that further enhance the extraction process or provide additional insights may be introduced, and improvements and updates to these functions can also be expected. An interesting prospect for upcoming additions could be a function that automatically selects the best predictive model based on certain criteria.
Actionable Advice
In light of these insights, it is crucial for users working with predictive data analysis models to understand how these new functions operate and hence gain maximum benefit out of them:
Be sure to understand both the extract_wflw_pred() and extract_regression_residuals() functions and what they can offer.
Explore different scenarios for using these functions, such as extracting predictions from different models or varying the output format.
Exploit the residuals provided by extract_regression_residuals() to better understand your models and improve their predictive capability.
By doing so, you will be able to harness these additions fully to optimize your work with the tidyAML R package.