R-bloggers
https://www.r-bloggers.com: R news and tutorials contributed by hundreds of R bloggers
Using {ellmer} for Dynamic Alt Text Generation in {shiny} Apps
https://www.r-bloggers.com/2026/01/using-ellmer-for-dynamic-alt-text-generation-in-shiny-apps/
Thu, 22 Jan 2026 23:59:00 +0000 https://www.jumpingrivers.com/blog/ellmer-dynamic-alt-text/
[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers.]
Alt Text
First things first, if you haven’t heard of or used alt text before, it
is a brief written description of an image that explains its context and
purpose. It is used to improve accessibility by allowing screen readers
to describe images, or to provide context if an image fails to load. For
writing good alt text see this article by Harvard, but some good rules
of thumb are:
Keep it concise and relevant to the context of why the image is being used.
Screen readers will already say “Image of …”, so we don’t need to
include this unless the style is important (drawing, cartoon, etc.).
Alt Text within Apps and Dashboards
I don’t need to list all the positives of interactive apps and
dashboards, but one of the main ones is interactivity: letting users
explore data in their own way. This is a great thing most of the time,
but one pitfall that is often overlooked is that interactivity can
overshadow accessibility, whether it’s a fancy widget that’s hard (or
impossible) to use via keyboard or an interactive visualisation without
meaningful alternative text.
In this post, we’ll look at a new approach to generating dynamic alt
text for ggplot2 charts using {ellmer},
Posit’s new R package for querying large language models (LLMs) from R.
If you are using Shiny for Python then
chatlas will be of interest to
you.
Why Dynamic Alt Text Needs Care
Automatically generating alt text is appealing, but production Shiny
apps have constraints:
Plots may re-render frequently
API calls can fail or be rate-limited
Accessibility should degrade gracefully, not break the app
A good implementation should be consistent, fault-tolerant, and cheap
to run.
Using {ellmer} in a Shiny App
The first step is setting up a connection to your chosen LLM. I am using
Google Gemini 2.5 Flash as there is a generous free tier, but other
models and providers are available. In a Shiny app, this can be done
outside the reactive context:
library(ellmer)
gemini <- chat_google_gemini()
## Using model = "gemini-2.5-flash".
Note: You should have a Google Gemini key saved in your .Renviron file
as GEMINI_API_KEY; this way the {ellmer} function will be able to find
it. More information on generating a Gemini API key can be found in the
Gemini docs.
Then we have the function for generating the alt text:
library(ggplot2)
generate_alt_text = function(ggplot_obj, model) {
  temp <- tempfile(fileext = ".png")
  on.exit(unlink(temp))
  ggsave(
    temp,
    ggplot_obj,
    width = 6,
    height = 4,
    dpi = 150
  )
  tryCatch(
    model$chat(
      "
      Generate concise alt text for this plot image.
      Describe the chart type, variables shown,
      key patterns or trends, and value ranges where visible.
      ",
      content_image_file(temp)
    ),
    error = function(e) {
      "Data visualisation showing trends and comparisons."
    }
  )
}
The function has a few features that will keep the output more reliable:
Consistent image size and resolution – helps model reliability when
reading axes and labels.
Explicit cleanup of temporary files – we don’t need to save the images
once text is generated.
Error handling – if the model call fails, the app still returns usable
alt text. We kept our fallback text simple for demonstration purposes,
but you can attempt to add more detail.
External model initialisation – only created once and passed in,
rather than re-created on every reactive update.
Examples
In this section we’ll just create a few example plots and then see what
the LLM generates.
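The plot definitions themselves were not included in this excerpt; the
following is a sketch that matches the generated descriptions below,
assuming the iris dataset and the object names used in the calls that
follow:
# Sketch of the example plots (assumed; the original definitions were not
# included in this excerpt). Names match the calls below.
simple_plot = ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point()
plot = ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, colour = Species)) +
  geom_point()
complicated_plot = ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, colour = Species)) +
  geom_point() +
  geom_smooth(method = "lm")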
simple_plot_alt = generate_alt_text(simple_plot, gemini)
paste("Alt text generated by AI: ", simple_plot_alt)
Alt text generated by AI:
Scatter plot showing Sepal.Length on the y-axis (ranging from
approximately 4.5 to 8.0) versus Sepal.Width on the x-axis (ranging
from approximately 2.0 to 4.5). The data points appear to form two
distinct clusters: one with Sepal.Width between 2.0 and 3.0 and
Sepal.Length between 5.0 and 8.0, and another with Sepal.Width between
3.0 and 4.5 and Sepal.Length between 4.5 and 6.5.
plot_alt =
generate_alt_text(plot, gemini)
paste("Alt text generated by AI: ", plot_alt)
Alt text generated by AI:
Scatter plot showing Sepal.Length on the y-axis (range 4.5-8.0) versus
Sepal.Width on the x-axis (range 2.0-4.5), with points colored by
Species. Red points, labeled “setosa”, form a distinct cluster with
higher Sepal.Width (3.0-4.5) and lower Sepal.Length (4.5-5.8). Blue
points, “virginica”, tend to have higher Sepal.Length (5.5-8.0) and
moderate Sepal.Width (2.5-3.8). Green points, “versicolor”, are in
between, with moderate Sepal.Length (5.0-7.0) and Sepal.Width
(2.0-3.5), overlapping with virginica.
complicated_plot_alt =
generate_alt_text(complicated_plot, gemini)
paste("Alt text generated by AI: ", complicated_plot_alt)
Alt text generated by AI:
Scatter plot showing Sepal.Length on the y-axis (range 4.0-8.0) versus
Sepal.Width on the x-axis (range 2.0-4.5). Points and linear
regression lines are colored by Iris species. Red points, “setosa”,
cluster with lower Sepal.Length (4.0-5.8) and higher Sepal.Width
(2.8-4.4). Green points, “versicolor”, and blue points, “virginica”,
largely overlap, showing higher Sepal.Length (5.0-8.0) and moderate
Sepal.Width (2.0-3.8), with “virginica” generally having the longest
sepals. All three species exhibit a positive linear correlation,
indicated by their respective regression lines and shaded confidence
intervals, where increasing sepal width corresponds to increasing
sepal length.
As we can see, the alt text can be very good and informative when using
LLMs. One alternative I want to point out is including a summary of the
data behind the plot alongside it; this way screen reader users can
still gain insight from the plot.
Using Dynamic Alt Text in Shiny
Once generated, the alt text can be supplied directly to the UI:
Via the alt argument of plotOutput()
Or injected into custom HTML for more complex layouts
Because the text is generated from the rendered plot, it stays in sync
with user inputs and filters.
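Here is a minimal server-side sketch of one way to wire this up;
plot_data() is a hypothetical reactive, and the alt text is supplied via
renderPlot()’s alt argument (which, per the Shiny docs, accepts a string
or a function/reactive returning one):
library(shiny)
# Sketch only; assumes plotOutput("plot") in the UI, plus `gemini` and
# generate_alt_text() from earlier in this post.
server <- function(input, output, session) {
  current_plot <- reactive({
    ggplot(plot_data(), aes(x = x, y = y)) + geom_point()
  })
  alt_text <- reactive({
    generate_alt_text(current_plot(), gemini)
  })
  output$plot <- renderPlot(
    current_plot(),
    # alt can be a reactive, so it stays in sync with the rendered plot
    alt = reactive(alt_text())
  )
}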
Other Considerations
Some apps may be more complicated and/or have a high number of users.
These types of apps will need a bit more consideration to include
features like:
Caching alt text for unchanged plots to reduce API usage (see the sketch
after this list)
Prompt augmentation with known variable names or units
Manual overrides for critical visuals
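A minimal caching sketch, assuming the {digest} package and keying on the
plot’s underlying data, so unchanged plots skip the API call:
# Assumes the {digest} package; the cache key is a hash of the plot's data.
alt_cache <- new.env(parent = emptyenv())
cached_alt_text <- function(ggplot_obj, model) {
  key <- digest::digest(ggplot_obj$data)
  if (!exists(key, envir = alt_cache, inherits = FALSE)) {
    assign(key, generate_alt_text(ggplot_obj, model), envir = alt_cache)
  }
  get(key, envir = alt_cache, inherits = FALSE)
}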
Conclusion
AI-generated alt text works best as a supporting tool, not a replacement
for accessibility review. I have also found it helpful to let users know
that the alt text is AI generated so they know to take it with a pinch
of salt.
Dynamic alt text is a small feature with a big impact on inclusion. By
combining Shiny’s reactivity with consistent rendering, error handling,
and modern LLMs, we can make interactive data apps more accessible by
default whilst not increasing developer burden.
For updates and revisions to this article, see the original post
futurize: Parallelize Common Functions via a “Magic” Touch 🪄
https://www.r-bloggers.com/2026/01/futurize-parallelize-common-functions-via-a-magic-touch-%f0%9f%aa%84/
Thu, 22 Jan 2026 00:00:00 +0000 https://www.jottr.org/2026/01/22/futurize-0.1.0/
[This article was first published on JottR on R, and kindly contributed to R-bloggers.]
I am incredibly excited to announce the release of the futurize package. This launch marks a major milestone in the decade-long journey of the Futureverse project.
Since the inception of the future ecosystem, I (and others) have envisioned a tool that would make concurrent execution as simple as possible with minimal change to your existing code – no refactoring, no new function names to memorize – it should just work and work the same everywhere. I’m proud to say that with futurize this is now possible – take your map-reduce call of choice and pipe it into futurize(), e.g.
y <- lapply(x, fcn) |> futurize()
That’s it – a “magic” touch by one function! Easy!
(*) Yeah, there’s no magic going on here – it’s just the beauty of R in action.
Unifying the ecosystem
[Diagram: sequential calls such as lapply(…), purrr::map(…), foreach(…) %do%, and plyr::llply(…) flow into a central box labeled |> futurize(), which fans the work out to multiple parallel workers (Worker 1, Worker 2, Worker 3, …) whose outputs are combined into a single ‘Results’ node.]
One of the biggest hurdles in concurrent R programming has been the fragmentation of APIs and behavior. Packages such future.apply, furrr, and doFuture have partly addressed this. While they have simplified it for developers and users, they all require us to use slightly different function names and different parallelization arguments for controlling standard output, messages, warnings, and random number generation (RNG). futurize() changes this by providing one unified interface for all of them. It currently supports:
base: lapply(), sapply(), apply(), replicate(), etc.
The futurize() function works as a transpiler. The term “transpilation” describes the process of transforming source code from one form into another, a.k.a. source-to-source translation. It captures the original expression without evaluating it, converts it into the concurrent equivalent, and finally executes the transpiled expression. It basically changes lapply() to future.apply::future_lapply() and map() to furrr::future_map() on the fly, handles the options on how to parallelize in a unifying way, and sometimes does so automatically. This allows you to write parallel code without blurring the underlying logic of your code.
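To give a feel for the mechanism, here is a toy sketch of the
transpilation idea (not futurize’s actual implementation): capture the
unevaluated call, swap lapply() for future.apply::future_lapply(), and
evaluate the result. It assumes the future and future.apply packages are
installed.
# Toy illustration only; not how the futurize package is implemented.
library(future)
futurize_toy <- function(expr) {
  call <- substitute(expr)                    # capture, don't evaluate
  if (identical(call[[1]], quote(lapply))) {
    call[[1]] <- quote(future.apply::future_lapply)
  }
  eval(call, envir = parent.frame())          # run the transpiled call
}
plan(multisession)
y <- lapply(1:4, function(i) i^2) |> futurize_toy()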
Domain-specific skills
The futurize package includes support also for a growing set of domain-specific packages, including boot, caret, glmnet, lme4, mgcv, and tm. These packages offer their own built-in, often complex, parallelization arguments. futurize abstracts all of that away. For example, instead of having to specify arguments such as parallel = "snow", ncpus = 4, cl = cl, with cl <- parallel::makeCluster(4) when using boot(), you can just do:
# Bootstrap with 'boot'
b <- boot(data, statistic, R = 999) |> futurize()
# Cross-validation with 'caret'
m <- train(Species ~ ., data = iris, method = "rf") |> futurize()
Why I think you should use it
The futurize package follows the core design philosophy of the Futureverse: separate “what” to execute concurrently from “how” to parallelize.
Familiar code: You write standard R code. If you remove |> futurize(), it runs the same.
Familiar behavior: Standard output, messages, warnings, and errors propagate as expected and as-is.
Unified interface: Future options work the same for lapply(), map(), and foreach() and so on, e.g. futurize(stdout = FALSE).
Backend independence: Because it’s built on the future ecosystem, your code can parallelize on any of the supported future backends. It scales up on your notebook, a remote server, or a massive high-performance compute (HPC) cluster with a single change of settings, e.g. plan(future.mirai::mirai_multisession), plan(future.batchtools::batchtools_slurm), and even plan(future.p2p::cluster, cluster = "alice/friends").
Put another way: with futurize, you can forget about future.apply, furrr, and doFuture – those packages are now working behind the scenes for you, but you don’t really need to think about them.
I hope that futurize makes your R coding life easier by removing the technical details of parallel execution, allowing you to stay focused on the logic you want to achieve. I’d love to hear how you’ll be using futurize in your R code. For questions, feedback, and feature requests, please reach out on the Futureverse Discussions forum.
May the future be with you!
Henrik
Closing The Loop with Our 2025 Wrap-up
https://www.r-bloggers.com/2026/01/closing-the-loop-with-our-2025-wrap-up/
Thu, 22 Jan 2026 00:00:00 +0000 https://ropensci.org/blog/2026/01/22/yearinreview2025/
At the beginning of 2025, we outlined our goals for the year in our 2024 Highlights. As the year started, our work took place in a far more challenging global context than many of us anticipated. Across many countries, science and research faced funding cuts, layoffs, and attacks on diversity, equity, accessibility, and inclusion. These conditions reshaped timelines and capacities for institutions and for the people doing the work, but also reinforced why community-driven, open, and care-centered spaces matter.
This post looks back at what we said we would do and how that work played out in practice throughout 2025. We also share our plans for 2026.
Software peer review: steady growth and shared responsibility
Throughout 2025, software peer review remained a core activity at rOpenSci, with 24 submissions peer-reviewed and approved. We had a continuous flow of package submissions, approvals, and ongoing reviews, reflecting both the demand for high-quality research software and the commitment of our editors and reviewers.
New packages joined rOpenSci in the fields of environmental and climate data, geospatial analysis, linguistics, and statistics, among others. Existing packages received more than 7000 commits from 239 unique users, with 62 of them contributing for the first time to rOpenSci! Package maintainers released new versions, performed ongoing maintenance and responded to user inquiries and issues.
Building on last year’s commitments, we further improved the efficiency and usefulness of our software peer review system, and explored its potential as a model for other communities.
We enhanced the review process by improving dashboards, updating guides, and refining pkgcheck and its GitHub Action, making participation easier for authors and reviewers.
Leadership transitions within the program editorial team helped distribute responsibility and ensure continuity, while ongoing community participation demonstrated that peer review remains a collective effort.
Increasing submissions mean we hope to add more editors to our board in 2026. We will also expand our automation processes, and formulate policies on the use of AI/LLMs in developing and maintaining rOpenSci packages.
rOpenSci Software Peer Review Editorial Team
Strengthening shared infrastructure: R-Universe
A major milestone in the R-Universe trajectory was its designation as a Top-Level Project by the R Consortium in late 2024. This status provides guaranteed funding and institutional support for at least three more years and reflects R-Universe’s role as critical infrastructure in the R community. Alongside this, Jeroen Ooms, the R-Universe project lead, gained a seat on the consortium’s Infrastructure Steering Committee, reinforcing R-Universe’s influence on broader R ecosystem development.
Documentation has been a key focus area as R-Universe has matured. In early 2025, the R-Universe team launched a centralized documentation website hosted as a Quarto book, funded in part by Google Season of Docs. This resource consolidates dispersed tutorials, READMEs, and technical notes into a more navigable structure covering browsing, installation, publishing, and infrastructure topics. The goal was not only to improve clarity for new users but also to make contributions to the documentation itself easier for the community.
A refreshed web frontend has improved user experience and maintainability. This redesign streamlined the interface, improved performance, and made it easier for contributors to propose improvements, addressing long-standing usability challenges through the incremental additions of features over time. We are continuing to rewrite and consolidate other pieces of the infrastructure such as the build workflows to make it easier to understand the R-Universe build process, and enable the community to hack workflows and actions to contribute improvements.
Finally a key focus in 2025 that continues in 2026 is expanding collaboration with other organizations managing sizable R package ecosystems. A notable emerging initiative is R-Multiverse, which explores large-scale collaborative package repositories built on R-Universe’s infrastructure. R-Multiverse aims to support curated collections with developmental and production branches, customizable checks, and manual governance steps—features critical to institutional collaboration and stable release practices. In addition, we are helping the BioConductor project modernize their infrastructure and gradually offload some of their build maintenance to us. These illustrate R-Universe’s evolving role not just as a repository, but as an interactive ecosystem management tool.
Multilingual work as infrastructure, not an add-on
In 2024, we framed multilingualism as foundational to open science, not supplementary. We committed to expanding translations, improving workflows, and supporting communities working in languages other than English.
We translated 32,794 words into Portuguese for the Dev Guide and published 14 blog posts on our website, bringing the total to 16 blog posts in Portuguese. In Spanish, we have 62 blog posts and all main website pages are fully translated. Additionally, we host blog content in other languages, including 8 posts in French and 1 in Turkish.
Beyond translation itself, we continued to treat multilingual publishing as infrastructure: refining AI-assisted workflows through the babeldown package; facilitating multilingual Quarto projects through the now peer-reviewed babelquarto package; enabling our website to support multilingual content; documenting processes and sharing lessons so that other open science communities can adapt and reuse this work.
We also hosted community calls, co-working sessions, and mini-hackathons in Spanish and Portuguese, offering our content and programming to Spanish- and Lusophone-speaking community members. These events and resources reinforce rOpenSci’s commitment to multilingual participation and to meeting communities where they are.
In 2026, we will continue working on English and Spanish content generation and expand Portuguese content and processes, piloting a fully Portuguese software peer review and translating Champions Program Training material into Portuguese.
We will continue our collaborations with other organizations and groups that translate and localize R content and offer not only content but also other aspects of our programs in languages other than English.
rOpenSci Dev Guide Preface in three languages
The Champions Program with Latin American Focus
One of our clearest goals for 2025 was to run the first fully Spanish-language cohort of the rOpenSci Champions Program, with a focus on Latin America.
The 2025 cohort launched with an improved Spanish curriculum, Spanish-language community calls, outreach activities, and mentoring. To accomplish this, we translated all training materials and the infrastructure and artifacts necessary to run a successful cohort into Spanish. The program emphasized sustainable research software as a pillar of open science in the region, while strengthening peer networks among Champions, mentors, and alumni.
Alongside the new cohort, we also evaluated the second cohort to improve the program and better understand its impact. We conducted anonymous surveys and 1:1 interviews with Champions and mentors and used their feedback to improve the program.
The analysis of the survey and interview information shows high satisfaction among participants, both mentors and Champions. All participants agreed that the program prepared Champions to develop their own R packages, improve the quality of their packages, and participate in the software peer review. They also expressed a desire to remain connected with rOpenSci, including participating in the Champions Program again in the future as mentors.
All Champions interviewed reported positive professional outcomes after participating in the program, ranging from employment opportunities, internships, grants, and conference scholarships to presentations at international and local events, generating interest in their project and the program.
Participants also provided constructive feedback on challenges they experienced, including staying in touch with other participants, defining the scope of the project, and time zones. In response, we made adjustments to the program to try to address these challenges for the 2025-2026 cohort.
In 2026, we will have a second cohort in Spanish and will take steps toward longer-term sustainability for the program: testing new formats, strengthening partnerships, and beginning follow-up work with earlier cohorts to better understand the program’s long-term impact.
Countries of origin of applications to the Champions Program
Community participation and pathways into open source
Throughout 2025, our blog, newsletters, and social media outreach continued to amplify community voices. These highlighted not only software releases but also the people, practices, and contexts behind them, reinforcing our focus on community participation.
During 2025, we published 49 posts on the rOpenSci blog. These included 41 blog posts and 8 technotes, with 18 posts authored or co-authored by community members, reflecting the continued openness of the blog to contributions beyond the staff team and the great participation of our community. Also, 60 authors contributed to the blog, 39 of them writing for rOpenSci for the first time, an encouraging sign of ongoing renewal and engagement!
Our content was published in four languages: 13 posts were available in more than one language, and 16 posts were published in at least one non-English language, helping us reach a broader and more diverse audience.
We organized 3 community calls and 13 co-working sessions with 21 unique presenters and more than 200 attendees, with topics ranging from technical skills such as testing, version control, and package development to broader conversations about accessibility, contribution pathways, and getting to know other communities and organizations.
One of the goals for 2025 was to create more opportunities for people to engage with rOpenSci, especially first-time contributors.
In 2025, this took shape through mini-hackathons designed to lower barriers to participation. These events combined practical contribution opportunities with mentoring and social connection, reinforcing the idea that open source work is collaborative and learned through doing. We wrote a guide to share our lessons learned organizing this type of event. Hosting multiple events with this goal throughout the year helped emphasize that open source participation is not a one-off event, but an ongoing practice that grows with continued support.
We also attended and supported several other conferences and events throughout the year, including keynotes at useR!2025 and uRos2025, among several talks and workshops from our staff and community members. These events provided opportunities to connect with community members in person, share our work, and learn from others in the open science ecosystem.
In 2026, we plan to continue organizing community calls in multiple spoken languages, co-working sessions, and mini-hackathons. If you want to collaborate in any of these spaces, please get in touch with us. We also hope to see you in some of the R, open science, and open source conferences around the world: follow our events page and our newsletter to learn which ones we will attend!
We made progress on developing a prototype dashboard that aims to help organizations maintain complex systems of interdependent software components. Our initial prototype organizational dashboard provides insights into code contributors, their repositories, and maintenance status and needs.
We are in conversation with several groups about adopting and adapting the dashboard to highlight the importance of communities in creating and maintaining software.
rOpenSci Staff and Community members at international conferences
Strengthening ties with open science partners
In 2025, we continued to build and strengthen partnerships with organizations that share our commitment to open science, open source, and community-driven development.
Together with a coalition of open science organizations, we held a convening to map out gaps in ecosystem sustainability and to start a long-term process of developing shared support approaches. Finally, as mentioned above, we are working with BioConductor to make R-Universe a larger part of their core infrastructure.
Executive Teams and Community Managers from pyOpenSci, rOpenSci, OLS, The Carpentries and PreReview at the Open Science Communities Convening in September 2025
What we learned
Looking back over 2025, we focused on being true to our core community values and delivering on commitments made the year before. Where plans evolved, they did so in response to capacity, collaboration, and community feedback.
Closing the loop between intention and action is a practice of accountability and learning. It helps us see which activities and strategies complement each other over time, which structures enable others to lead, and where care and sustainability matter most.
As we look ahead, we do so grounded in what 2025 demonstrated: that open science is built through steady, collective work, with Diversity, Equity, Inclusion, and Accessibility at the center. We believe that sharing what we plan to do has a greater impact if we return to show what we actually did, all the while staying true to the principles that hold this community together.
ChatGPT’ed Monte Carlo exam
https://www.r-bloggers.com/2026/01/chatgpted-monte-carlo-exam/
Wed, 21 Jan 2026 23:26:54 +0000 http://xianblog.wordpress.com/?p=61810
[This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers.]
This semester I was teaching a graduate course on Monte Carlo methods at Paris Dauphine and I decided to experiment with how helpful ChatGPT would prove in writing the final exam. Given my earlier poor impressions, I did not have great expectations and ended up definitely impressed! In total it took me about as long as if I had written the exam by myself, since I went through many iterations, but the outcome was well-suited for my students (or at least for what I expected from my students). The starting point was providing ChatGPT with the articles of Giles on multi-level Monte Carlo and of Jacob et al. on unbiased MCMC, and the instruction to turn them into a two-hour exam. Iterations were necessary to break the questions into enough items and to reach the level of mathematical formalism I wanted, plus to add extra questions involving R coding. And given the booklet format of the exam, I had to work on the LaTeX formatting (if not on the solution sheet, which spotted a missing assumption in one of my questions). Still a positive experiment I am likely to repeat for the (few) remaining exams I will have to produce!
Correcting for multiplicity in the ‘emmeans’ package
https://www.r-bloggers.com/2026/01/correcting-for-multiplicity-in-the-emmeans-package/
Wed, 21 Jan 2026 00:00:00 +0000 https://www.statforbiology.com/2026/stat_mcp_multivariatet/
In my recent book (see below), on page 166 and earlier, I made the point that, with pairwise comparisons and, more generally, whenever simultaneous statistical tests are performed, it is necessary to provide P-values that account for the familywise error rate, i.e. the probability of committing at least one incorrect rejection within the whole family of simultaneous tests (i.e. adjusted P-values). In this respect, it may be useful to recall that, for a single non-significant test, the comparison-wise error rate E_c is the probability of a wrong rejection for that single test (based on a non-adjusted P-value), whereas the probability of at least one wrong rejection within a family of k comparisons is much higher.
With pairwise comparisons, a single test is usually based on the ratio between a difference and its standard error (a t-test), which is assumed to follow a univariate t-distribution when the null hypothesis is true. When several simultaneous t-tests are performed, the vector of all t-ratios can be assumed to follow a multivariate t-distribution under the hypothesis that the null is true for all simultaneous tests (Bretz et al., 2011). Therefore, adjusted P-values can be obtained by using the probability function of a multivariate t-distribution in place of the simple univariate t-distribution.
As an example, let us reconsider the ‘mixture’ data used in Chapter 9 of the main book. Three herbicide mixtures and an untreated control were tested for their weed-control ability against an important weed in tomato, namely Solanum nigrum. In the code below, we load the data and fit a one-way ANOVA model, using the weight of weed plants per pot as the response variable and the herbicide treatment as the explanatory factor. For the sake of simplicity, we omit the usual checks of the basic assumptions (see the main book). The ANOVA table shows that the treatment effect is significant and, therefore, we proceed to compare treatment means in a pairwise fashion. The P-values shown below do not account for the familywise error rate but only for the comparison-wise error rate; these P-values can be reproduced by using the probability function of a univariate Student’s t-distribution (pt() function in R).
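The original code chunk is not included in this excerpt; the following sketch reproduces the workflow described above, assuming the ‘mixture’ data are already loaded as a data frame with a treatment factor and a weight response (both column names are assumed):
# Sketch of the workflow described above; column names Treat and Weight and
# the `mixture` data frame are assumptions.
library(emmeans)
mod <- lm(Weight ~ Treat, data = mixture)
anova(mod)                                  # treatment effect is significant
groupMeans <- emmeans(mod, ~ Treat)         # estimated marginal means
contrast(groupMeans, method = "pairwise", adjust = "none")  # unadjusted P-values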
In order to obtain familywise error rates, we should switch from the univariate to the multivariate t-distribution. For example, let’s consider the first t-ratio in the previous Code Box (t = 1.461). We should ask ourselves: “what is the probability of obtaining a t-ratio as extreme as, or more extreme than, 1.461 from a multivariate t-distribution with six dimensions (i.e., the number of simultaneous tests)?”. In this calculation, we must also consider that the 6 tests are correlated, at least to some extent, because they share some common elements, for example, the same error term in the denominator. In the simplest case (homoscedasticity and balanced data), this correlation is equal to 0.5 for all pairwise comparisons.
In earlier times, when the computing power was limited, calculating probabilities from the multivariate t-distribution was a daunting task. However, for some specific cases (e.g., linear models with homoscedastic and balanced data), adjusted P-values could be obtained by exploiting the distribution of the Studentised Range (the so-called ‘tukey’ method), which is the default option in the contrast() function of the emmeans package, as shown in the following Code box.
tab <- contrast(groupMeans, method = "pairwise")
# tab <- contrast(groupMeans, method = "pairwise", adjust = "tukey") # same as above
tab
## contrast estimate SE df t.ratio p.value
## Metribuzin__348 - Mixture_378 4.05 2.77 12 1.461 0.4885
## Metribuzin__348 - Rimsulfuron_30 -7.68 2.77 12 -2.774 0.0698
## Metribuzin__348 - Unweeded -17.60 2.77 12 -6.352 0.0002
## Mixture_378 - Rimsulfuron_30 -11.73 2.77 12 -4.235 0.0055
## Mixture_378 - Unweeded -21.64 2.77 12 -7.813 <.0001
## Rimsulfuron_30 - Unweeded -9.91 2.77 12 -3.578 0.0173
##
## P value adjustment: tukey method for comparing a family of 4 estimates
# The P-value is obtained from the Studentised Range Distribution (two-tails test)
abst <- abs(as.data.frame(tab)$t.ratio)
ptukey(sqrt(2) * abst, 4, 12, lower.tail = FALSE)
## [1] 4.884620e-01 6.981178e-02 1.853807e-04 5.501451e-03 2.473776e-05
## [6] 1.725725e-02
This simple method yields exact familywise error rates with balanced data—which represent the vast majority of designed field experiments in agriculture—and performs reasonably well in the presence of small degrees of imbalance. Within the framework of traditional multiple-comparison testing procedures, the approach described above leads to the same results as Tukey’s HSD for balanced data and the Tukey–Kramer test for unbalanced data.
More recently, it has become possible to directly calculate probabilities from the multivariate t-distribution, which is particularly convenient because it provides a more general approach to obtaining familywise error rates. This distribution is implemented in the ‘mvtnorm’ package through the pmvt() function. To perform the calculation, we must specify, for each dimension, the interval over which the probability is to be computed (in this case, for the first t-ratio, the interval is ±1.461081), the number of degrees of freedom (12), and the correlation matrix of the linear combinations, which can be directly retrieved from the ‘emmGrid’ object. The code below illustrates these calculations. The quantity ‘plev’ represents the probability of sampling within the interval (i.e. none of the six null hypotheses is wrongly rejected), whereas the familywise error rate corresponds to the probability of sampling outside the interval (i.e. at least one null hypothesis is wrongly rejected), which is obtained by subtraction.
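A sketch of that calculation (the original chunk is not in this excerpt); retrieving the correlation matrix via vcov() on the contrast object is an assumption about the exact retrieval step:
# Sketch only; the vcov()-based retrieval of the correlation matrix is assumed.
library(mvtnorm)
corr_mat <- cov2cor(vcov(tab))   # correlation of the six contrasts (0.5 here)
tval <- 1.461081                 # first t-ratio
plev <- pmvt(lower = rep(-tval, 6), upper = rep(tval, 6),
             df = 12, corr = corr_mat)
1 - plev                         # familywise P-value, close to the 0.4885 above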
The above function employs numerical integration methods and is based on simulation; consequently, the results are not fully reproducible. However, it is easy to see that these results are asymptotically equivalent to those obtained with the Tukey adjustment method shown above. Owing to this intrinsic complexity, the use of the adjust = "mvt" argument is not recommended for pairwise comparisons in balanced experiments, whereas it may prove useful in other situations, for example in the presence of strongly unbalanced data.
Thanks for reading—and don’t forget to check out my new book below!
Andrea Onofri Department of Agricultural, Food and Environmental Sciences University of Perugia (Italy) Send comments to: andrea.onofri@unipg.it
References
Bretz, F., Hothorn, T., Westfall, P., 2011. Multiple comparisons using R. CRC Press, Boca Raton, FL.
Why Submit to AI in Production: Speaking as a Tool for Better Work
https://www.r-bloggers.com/2026/01/why-submit-to-ai-in-production-speaking-as-a-tool-for-better-work/
Tue, 20 Jan 2026 23:59:00 +0000 https://www.jumpingrivers.com/blog/why-submit-ai-in-production/
[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers.]
We’re accepting abstracts for AI in Production until 23rd January. The conference takes place on 4th–5th June 2026 in Newcastle, with talks on Friday 5th across two streams: one focused on engineering and production systems, the other on machine learning and model development.
We often hear: “My work isn’t ready to talk about yet” or “I’m not sure anyone would be interested.” We want to address that hesitation directly.
Speaking at a conference isn’t primarily about promoting yourself or your organisation.
It’s a practical tool that helps you do better work. Preparing and delivering a talk forces useful reflection, invites feedback from people facing similar challenges, and turns knowledge that lives only in your head into something your team can reuse.
If you’re wondering whether your work qualifies: internal systems count, work in progress counts, partial success counts.
When you sit down to explain a technical choice to an audience, you have to answer questions you might have glossed over at the time: Why did we build it this way? What constraints shaped our approach? What would we do differently now?
This isn’t about justifying your decisions to others. It’s about understanding them yourself. The process of turning a production system into a coherent narrative forces you to see patterns you were too close to notice while building it. You identify what worked, what didn’t, and why. That clarity is valuable whether or not you ever give the talk.
Many practitioners find that writing an abstract or outline reveals gaps in their thinking. A deployment strategy that seemed obvious in context becomes harder to explain once that context is stripped away. A monitoring approach that felt pragmatic reveals underlying assumptions. This friction is useful. It means you’re learning something about your own work.
Speaking Invites Useful Feedback
The audience at AI in Production will broadly fall across two streams: engineering (building, shipping, maintaining, and scaling systems) and machine learning (model development, evaluation, and applied ML).
Whether you’re working on infrastructure and deployment or on training pipelines and model behaviour, you’ll be in a room with people facing similar constraints: limited resources, shifting requirements, imperfect data, and operational pressures.
When you share what you’ve tried, you get feedback from people who understand the context. Someone has solved a similar problem differently. Someone has run into the same failure mode. Someone asks a question that makes you reconsider an assumption.
This kind of peer feedback is hard to get otherwise. Your team is too close to the work. Online discussions lack context. A conference talk puts your approach in front of people who can offer informed perspectives without having to understand your entire stack or organisational structure first.
Talks Help Share Responsibility and Knowledge
In many teams, knowledge about production systems sits with one or two people. They know why certain decisions were made, where the edge cases are, and how to interpret the monitoring dashboards. That concentration of knowledge creates risk.
Preparing a talk is a forcing function for documentation. To explain your system to strangers, you have to articulate what’s currently tacit. That articulation becomes something your team can use: onboarding material, decision records, runbooks.
Speaking also distributes responsibility. When you present work publicly, it stops being just yours. Your team shares ownership of the ideas. Others can critique, extend, or maintain them. This is particularly valuable for platform teams or infrastructure work, where the people who built something may not be the ones operating it six months later.
Turning Tacit Knowledge into Reusable Material
Much of what you know about your production systems isn’t written down. You understand the failure modes, the workarounds, and the operational quirks. You know which metrics matter and which are noise. You remember why you made certain tradeoffs.
A conference talk is an excuse to capture that knowledge. The slides become a reference. The abstract becomes a design document. The Q&A reveals what wasn’t clear and needs better documentation.
Even if the talk itself is ephemeral, the process of preparing it leaves artefacts. You’ve already done the hard work of running the system. Speaking about it turns that experience into something others can learn from, and you can build on.
Your Work Is Worth Sharing
If you’re maintaining AI systems in production, you’re solving problems worth talking about. Making models reliable under load, keeping training pipelines maintainable, monitoring behaviour when ground truth is delayed or absent, and managing technical debt while shipping features.
These are the problems practitioners face every day. Your approach won’t be perfect, and that’s the point. Talks about work in progress, about things that didn’t work, about compromises made under constraint are often more useful than polished success stories.
We’re looking for honest accounts of how people are actually building and operating AI systems. That might fit the engineering stream (deployment, infrastructure, monitoring, scaling) or the machine learning stream (training, evaluation, model behaviour, responsible data use). If you’re doing work in either area, you have something to contribute.
Submit an Abstract
The deadline is 23rd January. You’ll need a title and an abstract of up to 250 words. You don’t need a perfect story or a finished project. You need a problem you’ve worked on, some approaches you’ve tried, and some lessons you’ve learned.
Think about what would be useful for someone six months behind you on a similar path. Think about what you wish someone had told you before you started. Think about the conversation you’d want to have with peers who understand the constraints you’re working under.
If you’re not sure where to start, consider writing about one decision that shaped your system, one assumption that turned out to be wrong, or one constraint that changed your design. Good abstracts often start with a specific moment or choice rather than a broad overview.
Ready to submit? The deadline is 23rd January. Share one decision, one lesson, or one constraint from your production work: https://jumpingrivers.com/ai-production/
If you have questions about whether your work fits the conference, reach out at events@jumpingrivers.com. We’re here to help make this easier.
For updates and revisions to this article, see the original post
Bioinformatics Analysis on Posit Connect Cloud with freeCount
https://www.r-bloggers.com/2026/01/bioinformatics-analysis-on-posit-connect-cloud-with-freecount/
Tue, 20 Jan 2026 04:52:28 +0000 http://morphoscape.wordpress.com/?p=3181
[This article was first published on R – Myscape, and kindly contributed to R-bloggers.]
Overview
The easiest way to use the freeCount R Shiny applications online is through Posit Connect Cloud, which is an online platform that simplifies the deployment of data applications and documents.
freeCount
The freeCount analysis framework provides a modular set of tools and tutorials for a structured approach to biological count data analysis. Users are guided through common data assessment, processing and analysis approaches.
The different analysis tools currently available include:
Differential Expression (DE) Analysis – DA
Network Analysis – NA
Functional Analysis – FA
Set Operations – SO
Steps
The following steps will walk you through how to run the freeCount apps online using Posit Connect Cloud.
Select the app that you want to run and click the name or image to open it.
Wait for the project to deploy in your Posit Cloud workspace.
Done! Now you are able to perform the selected analysis.
Analysis Tutorials
The freeCount apps provide a set of common tools for analyzing biological data, including differential expression and network analysis. We have tutorials available to guide users through a structured analysis approach:
Functional analysis of DE results is useful for determining the functions of DE genes. Genes can have multiple functional annotations, so we need to determine which ones are important.
Weighted gene co-expression network analysis (WGCNA) is used to investigate the function of genes at the system-level. In a network analysis genes with similar patterns of expression are grouped together into modules.
Functional analysis of network results is useful for determining the functions of a set of interesting genes. These gene sets can be lists of genes produced from different analysis, including WGCNA.
Set operations using Venn diagrams are generally very useful for comparing lists of things. Set operations are also a good way to identify unique or shared genes across sets of analysis results.
Introducing distionary for Building and Probing Distributions
https://www.r-bloggers.com/2026/01/introducing-distionary-for-building-and-probing-distributions/
Tue, 20 Jan 2026 00:00:00 +0000 https://ropensci.org/blog/2026/01/20/introducing-distionary/
After passing through rOpenSci peer review, the distionary package is now newly available on CRAN.
It allows you to make probability distributions quickly – either from a few inputs or from its built-in library – and then probe them in detail.
These distributions form the building blocks that piece together advanced statistical models with the wider probaverse ecosystem, which is built to release modelers from low-level coding so production pipelines stay human-friendly.
Right now, the other probaverse packages are distplyr, allowing you to morph distributions into new forms, and famish, allowing you to tune distributions to data.
Developed with risk analysis use cases like climate and insurance in mind, the same tools translate smoothly to simulations, teaching, and other applied settings.
This post highlights the top three features of this new release of distionary.
Let’s start by loading the package.
library(distionary)
Feature 1: more than just Base R distributions
Of course, all the Base R distributions are available in distionary.
Here’s everyone’s favourite Normal distribution.
dst_norm(0, 1)
Normal distribution (continuous)
--Parameters--
mean sd
0 1
plot(dst_norm(0, 1))
And good old Poisson.
dst_pois(3)
Poisson distribution (discrete)
--Parameters--
lambda
3
plot(dst_pois(3))
But there are additional game-changing distributions included, too.
A Null distribution, which always evaluates to NA.
When you’re running an algorithm that encounters an issue, you can return a Null distribution instead of throwing an error.
Even downstream evaluation steps won’t error out because the code still sees a distribution rather than a bare NA or NULL.
# Make a Null distribution.
null <- dst_null()
# Null distributions always evaluate to NA.
eval_quantile(null, at = c(0.25, 0.5, 0.75))
[1] NA NA NA
mean(null)
[1] NA
Empirical distributions, where the data are the distribution.
These respect observed behaviour without forcing a specific shape, and are also commonly used as a benchmark for comparison against other models.
Here’s an example using the Ozone concentration from the airquality dataset that comes loaded with R.
# Empirical distribution of Ozone from the `airquality` dataset.
emp <- dst_empirical(airquality$Ozone, na_action_y = "drop")
# Inspect
print(emp, n = 5)
Finite distribution (discrete)
--Parameters--
# A tibble: 67 × 2
outcomes probs
<int> <dbl>
1 1 0.00862
2 4 0.00862
3 6 0.00862
4 7 0.0259
5 8 0.00862
# ℹ 62 more rows
Compare its cumulative distribution function (CDF) to that of a Gamma distribution fitted to the Ozone levels, borrowing the probaverse’s famish package for the fitting task.
# Fit a Gamma distribution to Ozone using the famish package.
library(famish)
gamma <- fit_dst_gamma(airquality$Ozone, na_action = "drop")
# Plot the cumulative distribution functions (CDFs) together.
plot(emp, "cdf", n = 1000, xlab = "Ozone Levels (ppb)")
plot(gamma, "cdf", add = TRUE, col = "red")
legend(
"bottomright",
legend = c("Empirical", "Fitted Gamma"),
col = c("black", "red"),
lty = 1
)
These textbook distributions become much more useful once they serve as building blocks for a larger system.
For example, they could form predictive distributions in a machine learning context, or be related to other variables.
This is what the probaverse seeks to make possible.
Feature 2: friendly towards tidy tabular workflows
First, load the tidyverse to activate tidy tabular workflows.
And yes, probaverse is named after the tidyverse because it aims to be a “tidyverse for probability”.
library(tidyverse)
You can safely ignore this next chunk unless you want to see how I’m wrangling some financial data for you.
# Wrangle the stocks data frame using tidyverse.
stocks <- as_tibble(EuStockMarkets) |>
  mutate(across(everything(), \(x) 100 * (1 - x / lag(x)))) |>
  drop_na()
The stocks data I’ve wrangled is a table of daily percent loss for four major European stock indices.
The dates don’t matter for this example, so they’ve been omitted.
First, let’s focus on the DAX stock index.
Fit an empirical distribution like last time (notice I’m using a data mask in dst_empirical() this time).
# Fit an empirical distribution to the DAX stock index.
dax <- dst_empirical(DAX, data = stocks)
# Inspect the CDF.
plot(dax, xlab = "Daily Loss (%)")
You can easily calculate some standard quantiles in tabular format so that the inputs are placed alongside the calculated outputs: just use the enframe_ prefix instead of eval_ as we did above with the Null distribution.
Or, more to the point here – and appealing to probaverse’s soft spot for risk-focused work – you can calculate return levels (also known as “Value at Risk” in financial applications) for specific return periods.
If you don’t know what these are, they are just fancy names for quantiles.
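The output table is not shown in this excerpt; a sketch of the call, using the enframe_ prefix mentioned above and assuming enframe_quantile() mirrors eval_quantile()’s arguments (quantile levels chosen purely for illustration):
# Standard quantiles / return levels in tabular form (sketch; output omitted
# because it depends on the data).
enframe_quantile(dax, at = c(0.25, 0.5, 0.75, 0.9, 0.99))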
The tabular output becomes even more powerful when inserted into a table of models, because it facilitates comparisons and trends.
To demonstrate, build a model for each stock. First, lengthen the data for this task.
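The reshaping chunk was not included in this excerpt; a sketch that produces the stocks2 object used below:
# Sketch: one row per stock per day, matching the columns used in the next step.
stocks2 <- stocks |>
  pivot_longer(everything(), names_to = "stock", values_to = "daily_loss_pct")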
Build a model for each stock using a group_by + summarise workflow from the tidyverse (please excuse the current need to wrap the distribution in list()). Notice that distributions become table entries, indicated here by their class <dst>.
# Create an Empirical distribution for each stock.
models <- stocks2 |>
group_by(stock) |>
summarise(model = list(dst_empirical(daily_loss_pct)))
# Inspect
models
# A tibble: 4 × 2
stock model
<chr> <list>
1 CAC <dst>
2 DAX <dst>
3 FTSE <dst>
4 SMI <dst>
Now you can use a tidyverse workflow to calculate tables of quantiles for each model, and expand them.
In fact, this workflow is common enough that I’m considering adding a dedicated verb for it.
The result is a tidy dataset that’s ready for most analyses.
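A sketch of that mutate–map–unnest workflow (the original chunk is not in this excerpt; enframe_quantile()’s exact output columns may differ):
# Quantiles / return levels per stock, expanded into a tidy table.
return_levels <- models |>
  mutate(quantiles = map(model, enframe_quantile, at = c(0.5, 0.9, 0.99))) |>
  unnest(quantiles)
return_levels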
For example, you can easily plot a comparison of the return levels of each stock.
I make these plots all the time to facilitate risk-informed decision-making.
Feature 3: create your own distributions
You can create your own distributions with distionary by specifying only a minimal set of properties; all other properties are derived automatically and can be retrieved when needed.
Let’s say you need an Inverse Gamma distribution but it’s not available in distionary.
Currently, distionary assumes you’ll at least provide the density and CDF; you could retrieve these from the extraDistr package (functions dinvgamma() and pinvgamma()).
Plug them into distionary’s distribution() function and enjoy access to a variety of properties you didn’t specify, like the mean, variance, skewness, and hazard function.
# Make an Inverse Gamma distribution (minimal example).
ig <- distribution(
density = function(x) extraDistr::dinvgamma(x, alpha = 5, beta = 20),
cdf = function(x) extraDistr::pinvgamma(x, alpha = 5, beta = 20),
.vtype = "continuous"
)
# Calculate anything.
mean(ig)
[1] 5
variance(ig)
[1] 8.333333
skewness(ig)
[1] 3.464085
plot(ig, "hazard", to = 20, n = 1000, xlab = "Outcome")
You might also consider giving the distribution a .name – it pays off when you’re juggling multiple models.
Adding .parameters provides additional specificity to the distribution .name, but is otherwise not yet used for functional purposes.
Here is a more complete implementation of the Inverse Gamma distribution, this time implemented as a function of the two parameters.
Notice I also check that the parameters are positive (cheers to the checkmate package).
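The fuller implementation was not included in this excerpt; a sketch, assuming extraDistr’s parameterisation, checkmate for the positivity checks, and that .parameters accepts a named list:
# Sketch of a parameterised Inverse Gamma constructor (the original code is
# not in this excerpt); the .parameters form is assumed to be a named list.
library(checkmate)
dst_invgamma <- function(alpha, beta) {
  assert_number(alpha, lower = 0)   # basic positivity checks (sketch)
  assert_number(beta, lower = 0)
  distribution(
    density = function(x) extraDistr::dinvgamma(x, alpha = alpha, beta = beta),
    cdf = function(x) extraDistr::pinvgamma(x, alpha = alpha, beta = beta),
    .vtype = "continuous",
    .name = "Inverse Gamma",
    .parameters = list(alpha = alpha, beta = beta)
  )
}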
By the way, this feature – being able to inspect other distribution properties even when they are not specified – is great for learning about probability.
That’s because you can see the many ways distributions can be represented, not just by the usual density or probability mass functions seen in textbooks.
This feature also allows for extensibility of the probaverse.
For example, the probaverse’s distplyr package creates mixture distributions, which do not have an explicit formula for the quantile function.
However, this is not problematic – the distribution can still be defined, and distionary will figure out what the quantiles are.
What’s to come?
Currently, the distionary package provides key functionality to define and evaluate distribution objects. Future goals include:
Broader coverage. Extending beyond univariate continuous distributions so mixed discrete/continuous and multivariate use cases feel first-class.
Risk-native metrics. Making cost functions, tail expectations, and other decision metrics evaluate more naturally.
Workflow ergonomics. Relaxing the requirement to supply density/CDF pairs and adding verbs that streamline common mutate–map–unnest patterns.
If this excites you, join the conversation by opening an issue or contributing.
Pharmaverse and Containers
https://www.r-bloggers.com/2026/01/pharmaverse-and-containers/
Sun, 18 Jan 2026 00:00:00 +0000 https://pharmaverse.github.io/blog/posts/2026-01-18_pharmaverse_containers/pharmaverse_and__containers.html
[This article was first published on pharmaverse blog, and kindly contributed to R-bloggers.]
Streamlining Our Pharmaverse Blog: Reducing Publishing Time with containers
As an active contributor to the pharmaverse blog, I’ve always appreciated the opportunity to share new insights and tools with our community. The pharmaverse blog has some interesting features for publishing. While the blog publishing process has been effective, I was wondering if there was a way to optimize our workflows. Currently, the CI/CD pipeline for publishing the blog typically took about 17 minutes to deploy a new post. Containers are always batted around as solutions. Unfortunately, I was unsure how to create a new container/image to meet my needs and relied on some straight install.packages() in the CI/CD pipeline. Luckily, I crossed paths with the fabulous Maciej Nasinski and we built a specific container for the publishing process for the pharmaverse blog, which allowed for a notable reduction in publishing time.
Below I will discuss how the pharmaverse container image has improved our blog’s publishing workflow, bringing our deployment time down to approximately 5 minutes. We are also interested in feedback on potential other uses of this container (like devcontainers) or building additional containers for certain purposes. For those interested, we would be happy to provide a tutorial on containers or get in touch if you have ideas or want to get involved!
The Previous Approach: Package Installation Overhead
Our prior CI/CD (GitHub Actions) workflow for building and deploying the pharmaverse blog, while comprehensive, included a time-consuming step. It used a straight “Install dependencies” step from the r-lib actions, which necessitated installing a range of pharmaverse-specific R packages during each run. As we added more pharmaverse packages to the blog, this became really cumbersome!
The relevant section of our old build-deploy job highlighted this:
build-deploy:
# ... other configurations ...
steps:
# ... checkout and setup Quarto ...
- name: Setup R
uses: r-lib/actions/setup-r@v2
with:
use-public-rspm: true
- name: Install dependencies
uses: r-lib/actions/setup-r-dependencies@v2
with:
packages: |
jsonlite
tidyverse
spelling
janitor
diffdf
admiral
admiralonco
# ... and many, many more packages ...
haven
# ... other steps like install tinytex and publish ...
This “Install dependencies” step, involving a substantial list of pharmaverse packages, was a primary contributor to the 17-minute execution time. Each workflow run involved downloading and configuring these packages, extending the overall deployment duration.
Adopting the pharmaverse container image: An Efficient Alternative
The solution to this challenge came with the introduction of the pharmaverse container image: ghcr.io/pharmaverse/docker_pharmaverse:4.5.1. This container image was specifically designed for pharmaceutical data analysis, featuring over 40 essential pharmaverse packages pre-installed. These packages cover various functionalities, including CDISC ADaM/SDTM, clinical trial reporting, and regulatory submissions. Built upon the rocker/tidyverse image and incorporating R 4.5.1, it provides a pre-configured environment.
By integrating this image into our CI/CD pipeline (GitHub Actions), we could bypass the extensive package installation phase.
Here’s an overview of our updated build-deploy job:
build-deploy:
needs: Update-post-dates
runs-on: ubuntu-latest
container:
image: "ghcr.io/pharmaverse/docker_pharmaverse:4.5.1"
permissions:
contents: write
steps:
- name: Check out repository
uses: actions/checkout@v4
- name: Configure Git safe directory
run: git config --global --add safe.directory /__w/blog/blog
- name: Set up Quarto
uses: quarto-dev/quarto-actions/setup@v2
with:
version: 1.9.12
# NOTE: Explicit R package installation is no longer required here.
- name: Install tinytex
run: quarto install tool tinytex
- name: Mark repo directory as safe
run: git config --global --add safe.directory /__w/blog/blog
- name: Publish
uses: quarto-dev/quarto-actions/publish@v2
with:
path: . # Path to your .qmd file
target: gh-pages # Target branch for GitHub Pages
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
The outcome of this change has been a consistent reduction in publishing time. Our blog now publishes in approximately 5 minutes. Yay!
Wider Applications of the Pharmaverse Container Image
While this update directly benefits our blog’s publishing, the pharmaverse container image offers broader utility. It is a valuable resource for:
Pharmaceutical data analysis: Supporting tasks related to ADaM, SDTM, and TLFs.
Clinical trial programming: Assisting with regulatory submissions.
Reproducible research: Providing a consistent and verifiable environment.
Training & education: Offering a ready-to-use setup for educational purposes.
CI/CD pipelines: Enhancing automated testing and reporting, as observed in our own workflow.
Collaborative development: Facilitating consistent environments across teams.
Example: Running Reproducible RStudio Locally
To spin up a local RStudio instance using the pharmaverse image, you can use a container “one-liner.” This command condenses the entire configuration – networking, storage, and background execution – into a single execution string. It bridges your local folder to the container, allowing you to edit files “on the fly” without installing R or packages on your actual machine.
Choose the command that matches your operating system:
1. Linux & Intel Macs (Standard)
For standard architecture, we set a simple password (rstudio) and mount the current directory.
2. Apple Silicon Macs (Emulation)
Note: We do not yet natively support ARM64 architecture (e.g., Apple Silicon). The command below uses emulation (--platform linux/amd64) to run the image. We also disable authentication because the slower emulation speed can sometimes cause password setup to time out.
--rm & -d: Runs the container in the background (detached) and automatically removes it when stopped, keeping your machine clean.
-v "$(pwd)":...: Takes your present working directory and mounts it inside the container. Any file you save in RStudio’s “project” folder is instantly saved to your local computer.
--platform linux/amd64 (Mac only): Forces your computer to emulate the Intel architecture required by the image.
-e DISABLE_AUTH=true (Mac only): Bypasses the login screen to ensure immediate access despite slower emulation speeds.
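Putting those flags together, here is a hedged sketch of the one-liners, written as R system() calls; the mount target /home/rstudio/project and the rocker-style PASSWORD/DISABLE_AUTH environment variables are assumptions, not the post's exact commands:
# A sketch only: launch the pharmaverse image locally from R.
image <- "ghcr.io/pharmaverse/docker_pharmaverse:4.5.1"
# Linux & Intel Macs: detached container, current directory mounted, password "rstudio"
system(sprintf(
'docker run --rm -d -p 8787:8787 -e PASSWORD=rstudio -v "%s":/home/rstudio/project %s',
getwd(), image
))
# Apple Silicon: add emulation and skip authentication
system(sprintf(
'docker run --rm -d --platform linux/amd64 -p 8787:8787 -e DISABLE_AUTH=true -v "%s":/home/rstudio/project %s',
getwd(), image
))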
Accessing RStudio
Once the command is running:
Open your browser to http://localhost:8787.
Linux/Intel: Log in with user rstudio and password rstudio.
Apple Silicon: You will be logged in automatically.
You will see your local files mapped to the project folder in the Files pane, ready for analysis.
Engaging with the Pharmaverse Community
The pharmaverse container image represents a collaborative effort within the life sciences industry to provide open-source tools. For those working with pharmaverse packages and R in development, research, or CI/CD contexts, exploring this image may offer practical advantages in efficiency and reproducibility.
@online{straub_and_maciej_nasinski2026,
author = {Straub, Ben and Nasinski, Maciej},
title = {Pharmaverse and {Containers}},
date = {2026-01-18},
url = {https://pharmaverse.github.io/blog/posts/2026-01-18_pharmaverse_containers/pharmaverse_and__containers.html},
langid = {en}
}
398381 Volleyball Analytics with R: The Complete Guide to Match Data, Sideout Efficiency, Serve Pressure, Heatmaps, and Predictive Models
https://www.r-bloggers.com/2026/01/volleyball-analytics-with-r-the-complete-guide-to-match-data-sideout-efficiency-serve-pressure-heatmaps-and-predictive-models/
Sat, 17 Jan 2026 18:38:12 +0000 https://rprogrammingbooks.com/?p=2407
Volleyball Analytics Volleyball Analytics with R: A Practical, End-to-End Playbook Build a full volleyball analytics workflow in R: data collection, cleaning, scouting reports, skill KPIs, rotation/lineup analysis, sideout & transition, serve/receive, visualization, dashboards, and predictive modeling. Table of Contents Why Volleyball Analytics (and Why R) Volleyball Data Model: Events, …
[This article was first published on Blog – R Programming Books, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Volleyball Analytics
Volleyball Analytics with R: A Practical, End-to-End Playbook
Build a full volleyball analytics workflow in R: data collection, cleaning, scouting reports, skill KPIs, rotation/lineup analysis, sideout & transition, serve/receive, visualization, dashboards, and predictive modeling.
Why Volleyball Analytics (and Why R)
Volleyball is a sequence of discrete events (serve, pass, set, attack, block, dig) organized into
rallies and phases (sideout vs. transition). This structure makes it ideal for:
event-based analytics, rotation analysis, scouting tendencies,
expected efficiency modeling, and win probability.
R excels at this because of tidy data workflows (dplyr/tidyr), great visualization (ggplot2),
modern modeling (tidymodels, brms), and easy reporting (Quarto/R Markdown).
If you want a repeatable volleyball analytics pipeline for your club or team, R is a perfect fit.
Keywords you should care about
Sideout % (SO%), Break Point % (BP%), Transition Efficiency
Serve Pressure, Passing Rating, First Ball Sideout
library(tidyverse)
library(lubridate)
event_schema <- tibble::tibble(
match_id = character(),
datetime = ymd_hms(character()),
set_no = integer(),
rally_id = integer(),
home_team = character(),
away_team = character(),
team = character(), # team performing the action
opponent = character(), # opponent of team
player = character(),
jersey = integer(),
skill = factor(levels = c("serve","pass","set","attack","block","dig","freeball")),
evaluation = character(), # e.g., "error","ace","perfect","positive","negative","kill","blocked","dig"
start_zone = integer(), # 1..6 (or 1..9 depending system)
end_zone = integer(),
rotation = integer(), # 1..6
phase = factor(levels = c("sideout","transition")), # derived later
score_team = integer(), # score for team at time of event
score_opp = integer(),
point_won_by = character() # which team won rally point
)
glimpse(event_schema)
You can extend this schema with positional labels (OH, MB, OPP, S, L),
contact order (1st/2nd/3rd), attack tempo, block touches, etc.
Data Sources: Manual Logs, Video Tags, DataVolley-Style Exports
Volleyball data typically arrives as: (1) manual spreadsheets, (2) video tagging exports, or (3) scouting software exports.
Regardless of the source, your R pipeline should standardize everything into a consistent event table and a shared metric vocabulary, for example:
metric_dictionary <- tribble(
~metric, ~definition,
"SO%", "Sideout percentage: points won when receiving serve / total receive opportunities",
"BP%", "Break point percentage: points won when serving / total serving opportunities",
"Kill%", "Kills / attack attempts",
"Eff%", "(Kills - Errors) / attempts",
"Ace%", "Aces / total serves",
"Err%", "Serve errors / total serves"
)
metric_dictionary
Import & Clean Volleyball Event Data
Most problems in volleyball analytics are data quality problems: inconsistent team names,
missing rally keys, duplicated rows, weird evaluation labels, or mixed zone definitions.
A common approach: identify which team served in the rally. If a team receives serve, that is a sideout opportunity.
If a team is serving, that is a break point opportunity. You can derive phase per team within each rally.
Volleyball KPIs are best computed from event tables with clear skill and evaluation codes.
Below is a practical KPI set that works for scouting and performance analysis.
Sideout, Break Point, Transition & Rally Phase Analytics
If you only measure one thing in volleyball, measure sideout efficiency.
Most matches are decided by who wins more sideout points and who generates more break points.
In R, you can compute SO% and BP% directly from rally winners and serving team.
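A minimal sketch of that computation (not the original code), assuming the event schema above; the serving and receiving teams are taken from each rally's serve event, and so_bp feeds the report chart later on:
library(dplyr)
# One row per rally: who served, who received, who won the point
rally_summary <- events %>%
group_by(match_id, set_no, rally_id) %>%
summarise(
serving_team = first(team[skill == "serve"]),
receiving_team = first(opponent[skill == "serve"]),
point_won_by = first(na.omit(point_won_by)),
.groups = "drop"
)
# SO% = receiving team wins the rally; BP% = serving team wins the rally
so_bp <- bind_rows(
rally_summary %>%
group_by(team = receiving_team) %>%
summarise(pct = mean(point_won_by == team), .groups = "drop") %>%
mutate(metric = "SO%"),
rally_summary %>%
group_by(team = serving_team) %>%
summarise(pct = mean(point_won_by == team), .groups = "drop") %>%
mutate(metric = "BP%")
)
so_bp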
R code: First-ball sideout (FBSO) using pass quality
A classic volleyball KPI: do we sideout on the first attack after serve receive? Add pass quality segmentation:
perfect/positive/negative passes and their first-ball sideout probability.
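A minimal sketch (not the original code), reusing rally_summary from the sketch above and the schema's evaluation labels; without contact order it approximates FBSO by whether the receiving team won the rally after that reception:
library(dplyr)
fbso <- events %>%
filter(skill == "pass") %>%
transmute(match_id, set_no, rally_id,
receiving_team = team,
pass_quality = evaluation) %>%
left_join(
rally_summary %>% select(match_id, set_no, rally_id, point_won_by),
by = c("match_id", "set_no", "rally_id")
) %>%
group_by(receiving_team, pass_quality) %>%
summarise(
opportunities = n(),
fbso_rate = mean(point_won_by == receiving_team),
.groups = "drop"
)
fbso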
Modern serve analytics combines zone targeting, pass degradation, and point outcomes.
Even if you don’t track ball coordinates, zones 1–6 (or 1–9) are enough for powerful insights.
R code: serve target heatmap by end_zone
library(ggplot2)
serve_zones <- events %>%
filter(skill == "serve") %>%
count(team, end_zone, name = "serves") %>%
group_by(team) %>%
mutate(pct = serves / sum(serves)) %>%
ungroup()
ggplot(serve_zones, aes(x = factor(end_zone), y = pct)) +
geom_col() +
facet_wrap(~ team) +
labs(
title = "Serve Target Distribution by Zone",
x = "End Zone (Serve Target)",
y = "Share of Serves"
)
R code: serve pressure proxy via opponent pass score
With this table, you can say: “Serving zone 5 creates low passes 38% of the time and increases break-point rate.”
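A minimal sketch of such a table (not the original code), joining each serve's target zone to the opponent's pass evaluation in the same rally, using only schema columns:
library(dplyr)
serve_pressure <- events %>%
filter(skill == "serve") %>%
select(match_id, set_no, rally_id, serving_team = team, end_zone) %>%
inner_join(
events %>%
filter(skill == "pass") %>%
select(match_id, set_no, rally_id, pass_quality = evaluation),
by = c("match_id", "set_no", "rally_id")
) %>%
count(serving_team, end_zone, pass_quality) %>%
group_by(serving_team, end_zone) %>%
mutate(share = n / sum(n)) %>%
ungroup()
serve_pressure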
Attack Shot Charts, Zones, Tendencies & Scouting
Attack analytics becomes powerful when you connect attack zone, target area, block context,
and outcome. Even simple zone models can guide scouting: “Their opposite hits sharp to zone 1 on bad passes.”
R code: attack tendency table by start_zone → end_zone
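A minimal sketch (not the original code), assuming the schema's start_zone and end_zone columns:
library(dplyr)
attack_tendencies <- events %>%
filter(skill == "attack") %>%
count(team, player, start_zone, end_zone, name = "attacks") %>%
group_by(team, player, start_zone) %>%
mutate(share = attacks / sum(attacks)) %>%
ungroup() %>%
arrange(team, player, desc(share))
attack_tendencies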
Once your event model is clean, you can move beyond descriptive KPIs into modeling:
expected sideout (xSO), expected point (xP), win probability,
and strategy simulation.
R code: expected sideout (logistic regression baseline)
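A minimal sketch (not the original code) that builds a rally-level modelling frame, rally_model_df, reused by the mixed-model and Bayesian examples further down, and fits a logistic-regression baseline. The numeric pass_score mapping (perfect = 3, positive = 2, negative = 1, error = 0) is an assumption:
library(dplyr)
rally_model_df <- events %>%
filter(skill == "serve") %>%
select(match_id, set_no, rally_id,
serving_team = team, receiving_team = opponent, serve_zone = end_zone) %>%
inner_join(
events %>%
filter(skill == "pass") %>%
transmute(match_id, set_no, rally_id,
pass_score = case_when(
evaluation == "perfect" ~ 3,
evaluation == "positive" ~ 2,
evaluation == "negative" ~ 1,
evaluation == "error" ~ 0,
TRUE ~ NA_real_
)),
by = c("match_id", "set_no", "rally_id")
) %>%
left_join(
rally_summary %>% select(match_id, set_no, rally_id, point_won_by),
by = c("match_id", "set_no", "rally_id")
) %>%
mutate(sideout_success = as.integer(point_won_by == receiving_team)) %>%
filter(!is.na(pass_score), !is.na(sideout_success))
# Baseline expected sideout (xSO) model
xso_fit <- glm(sideout_success ~ pass_score + factor(serve_zone),
data = rally_model_df, family = binomial())
summary(xso_fit)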
R code: simple set-level win probability from score differential
# If you have event-level score columns, you can build a win probability model.
# Here we illustrate a simple logistic model from score differential and set number.
wp_df <- events %>%
filter(!is.na(score_team), !is.na(score_opp)) %>%
mutate(score_diff = score_team - score_opp) %>%
group_by(match_id, set_no, rally_id) %>%
summarise(
team = first(team),
score_diff = first(score_diff),
point_won_by = first(na.omit(point_won_by)),
.groups = "drop"
) %>%
mutate(won_point = point_won_by == team)
wp_fit <- glm(won_point ~ score_diff + factor(set_no), data = wp_df, family = binomial())
wp_df <- wp_df %>%
mutate(win_prob_point = predict(wp_fit, type = "response"))
wp_fit %>% broom::tidy()
R code: Elo ratings for volleyball teams
# Minimal Elo example (team-level). You can replace with your season match table.
matches <- tibble(
match_id = c("m1","m2","m3"),
date = as.Date(c("2025-09-01","2025-09-05","2025-09-10")),
home = c("Team A","Team B","Team A"),
away = c("Team B","Team C","Team C"),
winner = c("Team A","Team C","Team A")
)
elo_update <- function(r_home, r_away, home_won, k = 20) {
p_home <- 1 / (1 + 10^((r_away - r_home)/400))
s_home <- ifelse(home_won, 1, 0)
r_home_new <- r_home + k * (s_home - p_home)
r_away_new <- r_away + k * ((1 - s_home) - (1 - p_home))
list(home = r_home_new, away = r_away_new, p_home = p_home)
}
teams <- sort(unique(c(matches$home, matches$away)))
ratings <- setNames(rep(1500, length(teams)), teams)
elo_log <- vector("list", nrow(matches))
for (i in seq_len(nrow(matches))) {
m <- matches[i,]
rH <- ratings[[m$home]]
rA <- ratings[[m$away]]
upd <- elo_update(rH, rA, home_won = (m$winner == m$home))
ratings[[m$home]] <- upd$home
ratings[[m$away]] <- upd$away
elo_log[[i]] <- tibble(match_id = m$match_id, p_home = upd$p_home,
home = m$home, away = m$away,
winner = m$winner,
r_home_pre = rH, r_away_pre = rA,
r_home_post = upd$home, r_away_post = upd$away)
}
bind_rows(elo_log) %>% arrange(match_id)
tibble(team = names(ratings), elo = as.numeric(ratings)) %>% arrange(desc(elo))
R code: Markov chain model for rally outcomes (conceptual starter)
A Markov model represents rally states like: Serve → Pass → Set → Attack → (Point/Continuation).
Below is a lightweight starting template to estimate transition probabilities from event sequences.
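A minimal sketch of estimating those transition probabilities (not the original template), assuming events are already ordered chronologically within each rally:
library(dplyr)
transitions <- events %>%
arrange(match_id, set_no, rally_id) %>%
group_by(match_id, set_no, rally_id) %>%
mutate(next_skill = lead(as.character(skill), default = "rally_end")) %>%
ungroup() %>%
count(skill, next_skill, name = "n") %>%
group_by(skill) %>%
mutate(prob = n / sum(n)) %>%
ungroup()
transitions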
If you want production-grade modeling in R, use tidymodels:
pipelines, cross-validation, recipes, metrics, and model tuning.
Here is an end-to-end example predicting sideout success using pass score + serve zone.
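A minimal sketch of such a workflow (not the original code), reusing rally_model_df from the xSO sketch above:
library(tidymodels)
xso_df <- rally_model_df %>% mutate(sideout_success = factor(sideout_success))
set.seed(2026)
rally_split <- initial_split(xso_df, strata = sideout_success)
rally_train <- training(rally_split)
rally_test <- testing(rally_split)
xso_recipe <- recipe(sideout_success ~ pass_score + serve_zone, data = rally_train) %>%
step_mutate(serve_zone = factor(serve_zone)) %>%
step_dummy(all_nominal_predictors())
xso_spec <- logistic_reg() %>% set_engine("glm")
xso_wf <- workflow() %>%
add_recipe(xso_recipe) %>%
add_model(xso_spec)
xso_folds <- vfold_cv(rally_train, v = 5, strata = sideout_success)
xso_res <- fit_resamples(xso_wf, resamples = xso_folds,
metrics = metric_set(roc_auc, accuracy))
collect_metrics(xso_res)
xso_final <- last_fit(xso_wf, rally_split)
collect_metrics(xso_final)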
R code: add player random effects with mixed models (glmm)
# For player/team variation, you can use lme4 (not tidymodels-native).
install.packages("lme4")
library(lme4)
# Example: include receiving_team as a random intercept
xso_glmm <- glmer(
sideout_success ~ pass_score + factor(serve_zone) + (1 | receiving_team),
data = rally_model_df,
family = binomial()
)
summary(xso_glmm)
Bayesian Volleyball Analytics in R
Bayesian models are ideal when you want uncertainty, shrinkage, and better inference with small samples.
In volleyball scouting, sample sizes can be tiny (a few matches), so Bayesian partial pooling is often a win.
R code: Bayesian xSO with brms
# Bayesian logistic regression with partial pooling by receiving team
install.packages("brms")
library(brms)
bayes_fit <- brm(
sideout_success ~ pass_score + factor(serve_zone) + (1 | receiving_team),
data = rally_model_df,
family = bernoulli(),
chains = 2, cores = 2, iter = 1500,
seed = 2026
)
summary(bayes_fit)
posterior_summary(bayes_fit)
With brms, you can compute posterior distributions of SO% by team, compare strategies, and avoid overreacting to noise.
Visualization: ggplot2 Templates for Volleyball
Volleyball visualizations should be coach-friendly, quick to read, and tied to decisions:
serve target, pass quality, rotation weaknesses, attack tendencies, and pressure points.
R code: SO% and BP% report chart
so_bp_wide <- so_bp %>%
select(team, metric, pct) %>%
pivot_wider(names_from = metric, values_from = pct)
so_bp_long <- so_bp %>%
ggplot(aes(x = team, y = pct, fill = metric)) +
geom_col(position = "dodge") +
coord_flip() +
labs(title = "Sideout % and Break Point % by Team", x = NULL, y = "Rate")
so_bp_long
R code: rotation heatmap (SO% by rotation)
rot_plot_df <- rotation_efficiency %>%
mutate(receive_rotation = factor(receive_rotation, levels = 1:6))
ggplot(rot_plot_df, aes(x = receive_rotation, y = receiving_team, fill = so_pct)) +
geom_tile() +
labs(title = "Rotation Sideout Heatmap", x = "Rotation (Receiving)", y = "Team")
A Shiny scouting app can deliver instant insights for coaches: opponent serve targets, rotation weaknesses, attacker tendencies,
and key matchups. Below is a compact Shiny template you can expand.
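A minimal sketch of such a template (not the original code), showing a single serve-target view built on the events table:
library(shiny)
library(dplyr)
library(ggplot2)
scout_app <- function(events) {
ui <- fluidPage(
titlePanel("Volleyball Scouting"),
sidebarLayout(
sidebarPanel(
selectInput("team", "Team", choices = sort(unique(events$team)))
),
mainPanel(plotOutput("serve_plot"))
)
)
server <- function(input, output, session) {
output$serve_plot <- renderPlot({
events %>%
filter(skill == "serve", team == input$team) %>%
count(end_zone) %>%
mutate(pct = n / sum(n)) %>%
ggplot(aes(x = factor(end_zone), y = pct)) +
geom_col() +
labs(title = paste("Serve targets:", input$team),
x = "End Zone", y = "Share of Serves")
})
}
shinyApp(ui, server)
}
# scout_app(events)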
One of the best uses of R in volleyball: automated weekly scouting reports. Generate:
HTML match report, PDF coaching packet, and tables/figures for staff.
R code: Quarto report skeleton
# Create a Quarto (.qmd) file like reports/match_report.qmd
# Then render in R:
# quarto::quarto_render("reports/match_report.qmd")
# Example render call:
quarto::quarto_render(
input = "reports/match_report.qmd",
execute_params = list(match_id = "match_001")
)
If you want a structured, practical resource that goes deeper into volleyball analytics workflows, R code patterns,
scouting/reporting, and modeling concepts, check out this book:
It’s a great companion if you’re building a complete R-based analytics stack for a club, federation, or collegiate program.
FAQ
What’s the best single metric in volleyball?
If you only track one KPI: Sideout %. It correlates strongly with winning because it reflects serve-receive
stability and first-ball offense conversion.
How do I handle different coding systems?
Create a mapping layer (like eval_map) and convert raw labels into a standardized internal vocabulary.
The rest of your pipeline should never depend on raw coding strings.
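A minimal sketch of such a mapping layer; the raw_events table, the raw_label column, and the raw symbols are made-up placeholders, not the post's actual eval_map:
library(dplyr)
eval_map <- tribble(
~skill, ~raw_label, ~evaluation,
"pass", "#", "perfect",
"pass", "+", "positive",
"pass", "-", "negative",
"attack", "#", "kill",
"attack", "=", "error",
"serve", "#", "ace",
"serve", "=", "error"
)
# Join the standardized vocabulary onto the raw export
events_std <- raw_events %>%
left_join(eval_map, by = c("skill", "raw_label"))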
Can I do volleyball analytics without coordinates?
Yes. Zone-based analytics (1–6 or 1–9) plus pass quality and outcome are enough for rotation analysis, serve targeting,
and basic predictive modeling.
What should I build first?
Start with: import + clean → SO% / BP% → pass + serve dashboards → rotation sideout → attack efficiency by pass quality.
Once those are stable, add modeling.
398365 Setting Up A Cluster of Tiny PCs For Parallel Computing – A Note To Myself
https://www.r-bloggers.com/2026/01/setting-up-a-cluster-of-tiny-pcs-for-parallel-computing-a-note-to-myself/
Fri, 16 Jan 2026 00:00:00 +0000 https://www.kenkoonwong.com/blog/parallel-computing/
Enjoyed learning the process of setting up a cluster of tiny PCs for parallel computing. A note to myself on installing Ubuntu, passwordless SSH, automating package installation across nodes, distributing R simulations, and comparing CV5 vs CV10 performance. Fun project!
[This article was first published on r on Everyday Is A School Day, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Enjoyed learning the process of setting up a cluster of tiny PCs for parallel computing. A note to myself on installing Ubuntu, passwordless SSH, automating package installation across nodes, distributing R simulations, and comparing CV5 vs CV10 performance. Fun project!
Motivations
Part of what I want to learn this year is getting a little more into parallel computing: how we can distribute simulation computations across different devices. Lately, we have more reasons to do this because quite a few of our simulations require long-running computation, and leaving my laptop running overnight or for several days is just not a good use of it. We have also tried cloud computing, but without knowing how those distributed cores are, well, distributed, it’s hard for me to conceptualize how these things are done and what else we could optimize. Hence, what better way than doing it on our own! Sit tight, this is going to be a bumpy one. Let’s go!
When starting Lenovo up, press F12 continuously until it shows an option to boot from USB. If F12 does not work, reboot and press F1 to BIOS. Go to Startup Tab, change CSM Support to Enabled. Then set Primary Boot Priority to USB by moving priority to first. Then F10 to save configuration and exit. It will then reboot to USB.
Make sure it’s connected to internet via LAN for smoother installation.
Follow the instructions to install Ubuntu, setting username, password etc. Then reboot.
Make sure to remove USB drive, if you didn’t it’ll remind you. Et voila!
The installation was very quick compared to the other OSes I’ve installed in the past, and very smooth as well. I thoroughly enjoyed setting these up.
Align and Fix IPs
For organizational purposes, make sure you go to your router settings and set your cluster computers to convenient IPs such as 192.168.1.101, 192.168.1.102, 192.168.1.103, etc. You may have to reboot the cluster computers after changing this on your router.
Passwordless SSH
Next, you want to set up passwordless SSH. This is crucial for R to work!
It will prompt you to enter your password; after that you won’t need a password to ssh in.
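The exact commands aren't shown here; a minimal sketch, run from R in the same system()/sprintf() style used later in the post, could be:
# A sketch only: create a key once, then copy it to every node so ssh stops prompting.
nodes <- c("username1@192.168.1.101", "username2@192.168.1.102", "username3@192.168.1.103")
# Skip ssh-keygen if you already have a key pair
system("ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519")
for (node in nodes) {
system(sprintf("ssh-copy-id %s", node)) # asks for the node's password one last time
}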
Passwordless Sudo
This is optional. But if you’re like me and don’t want to repeat lots of typing during installation, and want to see if you can use bash or R to install packages, you’ll need this.
It will prompt you to enter your password. You will have to do this for all your nodes.
Send Multiple Commands Via SSH
Install R
for host in username1@192.168.1.101 username2@192.168.1.102 username3@192.168.1.103; do
ssh -t $host 'sudo apt update && sudo apt install -y r-base r-base-dev'
done
This is basically installing R on all of our clusters one after another.
Create A Template R script For Simulation
Why do we do this? We want to take advantage of the multiple cores on each node, as opposed to treating every core as a {future} cluster worker, since the network overhead would add to the run time and make the optimization less efficient. Instead, we will send a script to each node so that it can fork its own cores to run the simulation. Also, if we specify packages in our script, we can automate the process of installing those packages on our nodes.
What we did above is basically a template script (saved as par_test_script.R), where we can edit which iteration each node should start and end at, along with instructions to save the results. This is also where we could put in a little more effort to notify ourselves when a task is completed (e.g., via email), and to estimate the ETA of the entire job, perhaps by benchmarking how long the first iteration takes and multiplying by the total iterations per node. Again, this could be sent via email, and perhaps only from the first node rather than all nodes, so we’re not bombarded with messages at the beginning and the end.
Install Packages On All Nodes
## List all of our nodes
my_clusters <- list(
c("username1@192.168.1.101"),
c("username2@192.168.1.102"),
c("username3@192.168.1.103"))
## Grab all of the packages needed on our script
packages <- gsub("library\\(([^)]+)\\)", "\\1", grep("^library", readLines("par_test_script.R"), value = TRUE))
## Create function to run sudo
remote_r_sudo <- function(host, r_code, intern = FALSE) {
escaped <- gsub('"', '\\\\"', r_code)
cmd <- sprintf("ssh %s 'sudo Rscript -e \"%s\"'", host, escaped)
system(cmd, intern = intern)
}
## Loop over to install
for (cluster_i in my_clusters) {
print(cluster_i)
for (package in packages) {
command <- sprintf('if (!require("%s")) install.packages("%s")', package, package)
remote_r_sudo(cluster_i, command)
}
}
Make sure your computer doesn’t go to sleep while this runs. If this is the first time your nodes are installing these extensive libraries, it will take a while. Another way to do this is to use future_lapply across all nodes, plus tmux for the installations, so that we don’t need our local workstation to stay on for the installation to continue. See below for how we used tmux as a set-and-forget method.
Upload Rscript to Nodes
Alright, now we have installed the appropriate packages above, let’s upload scripts to our nodes.
Distribute Work
num_list <- list()
clust_num <- 3
total_loop <- 1000
div_iter <- total_loop/clust_num
final_iter <- total_loop #only use this for custom e.g., if one node did not work and it's in charge of 300:500, we can put 500 for this and set first_iter as 300
first_iter <- 1
last_iter <- round(div_iter,0) + first_iter
for (i in 1:clust_num) {
if (i == clust_num) {
num_list[[i]] <- paste0(first_iter,":",final_iter)
next
}
num_list[[i]] <- paste0(first_iter,":",last_iter)
first_iter <- round(first_iter + div_iter, 0)
last_iter <- round(last_iter + div_iter, 0)
}
num_list
## [[1]]
## [1] "1:334"
##
## [[2]]
## [1] "334:667"
##
## [[3]]
## [1] "667:1000"
for (i in 1:length(my_clusters)) {
username <- sub("@.*","",my_clusters[[i]])
system(sprintf("sed 's/START:END/%s/g' par_test_script.R > par_test_script1.R & scp par_test_script1.R %s:/home/%s/par_test_script1.R",num_list[[i]],my_clusters[[i]],username))
}
We’ll iterate and insert the appropriate iterations for each node, save the result as par_test_script1.R, and then upload it to each node with the code above.
When we ran the above on the local computer and also in the terminal with multicore, the results were still the same! Woo hoo!
The interesting thing is I didn’t have to set future.seed = TRUE or future.seed = some_number for this. However, if we put a number on future.seed, it will return reproducible data! This is great; next time I’ll just use this seed and I won’t have to use set.seed(i).
Run Rscript
for (i in 1:length(my_clusters)) {
# set your tmux new session name, here we call it "test"
cluster_name <- "test"
# terminate any existing tmux with the existing name
system(sprintf("ssh %s 'tmux kill-session -t %s 2>/dev/null || true'", my_clusters[[i]], cluster_name))
# create new tmux session
system(sprintf("ssh %s 'tmux new-session -d -s %s'", my_clusters[[i]], cluster_name))
# run rscript in tmux
system(sprintf("ssh %s 'tmux send-keys -t %s "Rscript par_test_script1.R > result_%d.txt"' ENTER",
my_clusters[[i]], cluster_name, i))
}
The code above is quite self-explanatory. Once it has run, there we have it! The simulation should now be running in the background on each node. You can do a spot check to see if it’s actually running. Once it has finished, we’ll extract the data.
Extract Data
Since we have 10 combinations we want to assess, we will set nums as 1:10 and get our data! On your template script you can set however you want to save your data, and for extraction, just look for those and download them, read and merge! Or however you want to do it.
nums <- 1:10
df <- tibble()
for (num in nums) {
print(num)
for (i in 1:length(my_clusters)) {
response <- system(sprintf("scp %s:tmle_results/simulation_parameters%d.csv simulation_parameters%d.csv", my_clusters[[i]], num, num), intern = F)
if (response == 1) { next }
df_i <- read_csv(paste0("simulation_parameters",num,".csv"), show_col_types = F)
sl_i <- df_i |> filter(parameter == "SL_library") |> pull(value)
df <- rbind(df, df_i |> mutate(method = sl_i, num = num))
}
}
df_sim_param <- df
df <- tibble()
for (num in nums) {
for (i in 1:length(my_clusters)) {
response <- system(sprintf("scp %s:tmle_results/tmle_iterations%d.csv tmle_iterations%d.csv", my_clusters[[i]], num, num), intern = F)
if (response == 1) { print(paste0(my_clusters[[i]]," is missing num", num)) ; next }
df_i <- read_csv(paste0("tmle_iterations",num,".csv"), show_col_types = F) |>
mutate(num = num)
df <- rbind(df, df_i)
}
}
df_iter <- df
Take note that sometimes you may encounter issues. If for some reason a node is unable to complete its task, you can identify it and then redistribute those tasks across the entire computer cluster.
Compare Time
Let’s take a look at our compute time for 1 node, 3 nodes with 5-fold CV, and 3 nodes with 10-fold CV.
Looking at the times, we can definitely see the improvement from 1 node to 3 nodes. Take a look at our good old tuned xgboost and logistic regression: it previously took a quad-core 3.29 hours to complete, down to 1.8 hours. You’d imagine that if we use 3 PCs as a cluster, we would see an improvement to ~1.1 hours, but apparently not for xgboost. Will have to investigate this. But if we look at xgboost + logistic regression without tuning, we went from 0.47 hours to 0.17 hours, which makes sense! Very interesting. Now if we up our CV to 10-fold, we see that it took longer (which also makes sense), but it’s still better than using one quad-core machine. I’ve heard people say that if you increase your K-fold CV, you reduce bias but increase variance. Let’s see if that’s true in our case here.
Wow, not too shabby! Indeed, when we went from CV5 to CV10, we reduced bias and slightly increased variance! How about that. Everything except gam + lr, which makes sense because we don’t really tune them. That being said, I wonder what controls the knots for gam in SuperLearner under the hood. Will need to check that out. With this, it looks like tuned xgboost + lr might have the best numbers. Well, now we’ve seen bias and variance, what about coverage?
Wow, I was not expecting gam + lr to have so much coverage! But looking at the bias from the previous table, it’s actually quite horrible. So it seems like gam + lr is asymmetrical in its estimates, sometimes overestimating and sometimes underestimating, leading to a wider confidence interval and hence more coverage. That being said, it’s not a good estimator because of its bias. Tuned xgboost + glmnet seems to be the best bet here, with low bias, low variance, and decent coverage. Let’s visualize it!
Wow, look at gam + lr’s asymmetrical coverage! This shows that when we’re assessing a method, a point estimate of coverage is not adequate to judge its global usefulness. We can see that this method is very biased indeed, with asymmetrical tails. Since CV5 and CV10 do not differ significantly in coverage, we’ll skip that visualization.
Opportunities for improvement
There are plenty of opportunities to turn this personal project into a package that will help us:
Use parallel computing on local to run system (such as installation) since this takes a lot of time
Write function to let us know when tasks are completed
Write function to estimate time of completion
Write function to redistribute missing iterations
learn openMPI
make a package for the functions above so I can reuse in the future
Lessons Learnt:
Used sprintf a lot more in this learning experience, especially in combination with system()
Learnt that with future_lapply in multicore mode, future.seed = 100 (or whatever number) will help reproduce the same data
Made a few pipelines to install packages on multiple nodes
Learnt that set.seed in multicore works fine
Observed reduced bias with increased variance going from CV5 to CV10
Thu, 15 Jan 2026 00:00:00 +0000 https://pharmaverse.github.io/blog/posts/2026-01-15_admiral_14/admiral_1.4_release.html
admiral 1.4 is here! There’s nothing like a new {admiral} release to start the new year with a bang, and that’s exactly what we have for you with {admiral} 1.4! This release comes packed with few choice updates and some exciting new features…
[This article was first published on pharmaverse blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
admiral 1.4 is here!
There’s nothing like a new {admiral} release to start the new year with a bang, and that’s exactly what we have for you with {admiral} 1.4! This release comes packed with a few choice updates and some exciting new features (including our first AI integration!) that will undoubtedly improve your experience creating ADaMs in R. However, in line with our commitment to stability and maturity, we have once again reduced the number of breaking changes to an absolute minimum. Take a look at the rest of the blog post below to learn more about what to expect in {admiral} 1.4.
TLDR: What is in this release?
{admiral} has partnered with kapa.ai to introduce an AI-powered chatbot within every page of our websites.
We have added our first experimental functions: derive_var_nfrlt() and convert_xxtpt_to_hours(). These simplify time point and visit derivations.
There is a new ADAB template script and a corresponding new vignette. These will aid in creating Anti-drug Antibody Analysis Datasets.
Several existing parameter derivation functions have been updated for greater robustness.
All the {admiral} templates can now be viewed directly from our website within the new Explore ADaM Templates page.
{admiral} is now integrated with NCI-CTCAE version 6.0 lab grading criteria for both SI and US units, with all grading metadata now maintained in a more readable JSON format for improved consistency and maintainability.
derive_var_atoxgr_dir() received a small breaking change: the abnormal_indicator argument was replaced by low_indicator and high_indicator to support the updated CTCAE v6.0 criteria.
derive_var_merged_summary() has been renamed to derive_vars_merged_summary() to align with programming conventions.
Please peruse the Changelog to see the full set of updates in {admiral} 1.4.
Let’s take a few deep-dives to explore some of these updates in more detail.
Deep dive 1: kapa.ai integration
We are thrilled to announce that {admiral} has partnered with kapa.ai under their Open Source Program to introduce an AI-powered chatbot within every page of our website. This LLM chatbot is trained using the documentation pages of {admiral} and all our therapeutic area extension packages as well as the {pharmaversesdtm} and {pharmaverseadam} data packages and the ADaM IG. Indeed, look out for it to be added to every other package in the {admiral} ecosystem in their upcoming releases this month. To use the chatbot, simply click on the “Ask AI” button at the bottom right corner of any website page and start asking away – you may find that you get redirected to a User Guide, a function reference page or the website of another {admiral} package that you didn’t even know existed!
Deep dive 2: Experimental functions!
Experimental functions are a new class of functions, tagged with the “Experimental” badge. This is our way of adding targeted, new functionality that we can quickly improve/modify without breaking our commitment to stability. The badge will be removed once the dev team feels the function is stable, and no deprecation messages will be given to the user if breaking changes are implemented before removing the experimental badge. However, {admiral} will document the breaking change in the Changelog. Once the “Experimental” badge is removed we will proceed with the normal deprecation cycle if needed.
Our first experimental functions help derive variables for time-point analysis. One such variable is NFRLT (Nominal Relative Time from First Dose), which is used extensively in pharmacokinetic analysis. This is the namesake variable for derive_var_nfrlt(). This function can be used to directly derive NFRLT in the Pharmacokinetic Concentrations (PC) domain and in the Exposure (EX) domain. These domains are typically combined to form the Pharmacokinetic Concentrations Analysis Data (ADPC). Under the hood of this function is convert_xxtpt_to_hours(), which converts timepoint variables such as PCTPT to numeric hours using regular expressions. The hours computed from convert_xxtpt_to_hours() are combined with the days from a visit variable such as VISITDY to derive NFRLT. Note that the unit variable FRLTU can also be derived, and there are multiple output unit options (hours, days, weeks, minutes).
The functions can work with any time-point variables from other domains such as LB, VS or EG. There is also a treatment duration option for the case of infusions, and an option for time point ranges such as “4-8H AFTER END OF INFUSION”.
Deep dive 3: Anti-drug-antibody template and vignette
{admiral} 1.4 introduces a new ADaM template script to aid in creating Anti-drug Antibody Analysis Datasets (ADAB). This template is accompanied by a detailed vignette that walks you through the process of deriving key variables and implementing analysis strategies specific to anti-drug antibody data, including the use of one of our new experimental functions derive_var_nfrlt(). The template covers common scenarios encountered in immunogenicity assessments, providing a solid foundation for users to build upon and customize according to their study requirements.
As usual, to start using the template you can run:
use_ad_template("adab")
…or you can peruse the template directly from the website from our new Explore ADaM Templates page.
What’s coming in admiral 1.5?
{admiral} is very much community driven, so please continue reaching out through Slack or GitHub if you have ideas or requests for enhancements to our package. {admiral} 1.5 is due to release in June 2026, and some current ideas for new content include improved Estimands documentation and ways to enable positive censoring within TTE derivations.
@online{dickinson2026,
author = {Dickinson, Jeff and Straub, Ben and Mancini, Edoardo},
title = {Admiral 1.4 Release},
date = {2026-01-15},
url = {https://pharmaverse.github.io/blog/posts/2026-01-15_admiral_14/admiral_1.4_release.html},
langid = {en}
}
398317 LLMs can’t be trusted to do scientific coding accurately, but humans make mistakes too
https://www.r-bloggers.com/2026/01/llms-cant-be-trusted-to-do-scientific-coding-accurately-but-humans-make-mistakes-too/
Tue, 13 Jan 2026 13:00:00 +0000 https://www.seascapemodels.org/posts/2026-01-14-LLMs-cant-be-trusted-but-neither-can-humans/
I often hear the comment that LLMs/generative AI (large language models) can’t be trusted for research tasks. Image Google’s Nano Banana tasked with “Generate an image of a male African researcher holding a balloon that is pulling them up above…
[This article was first published on Seascapemodels, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
I often hear the comment that LLMs/generative AI (large language models) can’t be trusted for research tasks.
Image: Google’s Nano Banana tasked with “Generate an image of a male African researcher holding a balloon that is pulling them up above a tidal wave of AI generated slop that is full of errors. The balloon has a research paper inside of it. Generate the image in the style of a Simpsons cartoon.”
But this is the wrong way to think about LLMs. Humans also can’t be trusted to do scientific research accurately. They make mistakes. That’s why we have systems for review.
The more important question is: Are LLMs more accurate than humans at completing a given task?
I actually think LLMs might lead to better scientific coding and statistical analysis.
A common example of what LLMs get criticised for is writing code or performing statistical analyses. The LLM might hallucinate non-truths, or at least mislead you into thinking the analysis you have done is scientifically accurate.
The implication is that we should not be using them for particular tasks, like designing statistical models.
It’s right to be skeptical of AI-produced output. However, we also need to be skeptical of human-produced output. Humans make mistakes as well.
As scientists, peer review is baked into our culture. But code review is much rarer. We also don’t have many systematic reviews of scientific coding that quantify the rate of mistakes.
I suspect that mistakes in scientific coding are more common than we’d like to believe.
Now I haven’t set an LLM agent the task of doing the equivalent population models to see what its error rate is. However, my tests (which are under review) of agents at quite complicated stats and ecological modelling are showing 80-90% performance at accurately completing the tasks.
So the LLM agents are potentially doing better than the humans and making fewer mistakes.
Why I think LLMs might lead to better research is that they give us more time for code review.
As an ecological modeller I invest a ton of time into writing code, then checking that code works the way I want (and in a mathematically accurate way).
LLMs are now doing more of the code writing for me. Used effectively, this gives me more time to review the code for accuracy, as well as checking the code is an accurate representation of the scientific theory.
A human with an LLM partner could choose to: (1) produce crap work faster than pre-LLM, OR (2) produce higher quality work in a similar amount of time to what it took them pre-LLM.
I’m arguing that we should be aiming to produce the higher quality work. We can do this if we use LLMs to speed up code, then use the extra time for more quality assurance.
More generally, don’t get fooled by the argument that “genAI makes mistakes, so it can’t be trusted”.
It’s the wrong way to think about the problem, and I think it will lead to us being blindsided by the oncoming flood of research slop created with genAI.
A better way to think about it is: “genAI and humans both make mistakes, how can we design workflows so that their strengths complement each other and we produce higher quality work”.
This will give us outcomes that are of higher quality than the pre-LLM world, and hopefully will rise above the huge quantity of AI generated slop that is currently happening.
To leave a comment for the author, please follow the link and comment on their blog: Seascapemodels.
398303 GOLD/SILVER RATIO: GenAI with Quant Agents on Microsoft AI Foundry
https://www.r-bloggers.com/2026/01/gold-silver-ratio-genai-with-quant-agents-on-microsoft-ai-foundry/
Tue, 13 Jan 2026 11:44:07 +0000 http://datageeek.com/?p=11590
1. Introduction: The Strategic Edge of Agentic Finance In the contemporary landscape of quantitative finance, the bottleneck is no longer data availability, but the speed of insight generation. Leveraging the Microsoft AI Foundry ecosystem, we have moved beyond static scripting into the realm of Autonomous Financial Agents. This article explores how …
Continue reading: GOLD/SILVER RATIO: GenAI with Quant Agents on Microsoft AI Foundry
[This article was first published on DataGeeek, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
1. Introduction: The Strategic Edge of Agentic Finance
In the contemporary landscape of quantitative finance, the bottleneck is no longer data availability, but the speed of insight generation. Leveraging the Microsoft AI Foundry ecosystem, we have moved beyond static scripting into the realm of Autonomous Financial Agents. This article explores how a specialized agent can navigate precious metal volatility by analyzing the Gold/Silver ratio with high-performance precision.
2. Infrastructure: Model Deployment on Microsoft AI Foundry
The intelligence behind this analysis is not a local script but a deployed model instance on Microsoft AI Foundry. We utilize the GPT-4o model, deployed as a scalable web service within the Foundry environment.
Endpoint Security: By using the Azure OpenAI service within AI Foundry, we ensure that financial queries and data remain within a secure, enterprise-grade perimeter.
Agentic Logic: The “Agent” is more than just a model; it is a programmed entity with a specific System Prompt that defines its persona as a Quantitative Researcher. This allows the model to “reason” through the necessary steps: from library loading to data merging and final visualization.
3. The Technical Bridge: Python-R Integration
One of the most powerful features of our AI Foundry Agent is its multi-lingual capability. It bridges the gap between Python and R using the rpy2 library, creating a high-performance research pipeline.
The R Ecosystem in Play:
tidyquant & timetk: These packages are the engine for our time-series analysis. tidyquant handles the seamless fetching of GC=F and SI=F data, while timetk manages the complex task of plotting with built-in smoothing algorithms.
dplyr & lubridate: Essential for the “tidy” manipulation of data, allowing the Agent to perform inner_join operations and date-based filtering with surgical precision.
4. Methodology: Taming the Noise with Visual Precision
To extract actionable trends, the Agent is instructed to apply a LOESS smoothing algorithm. By strictly setting .line_size = 1.5 and .smooth_size = 1.5, we ensure the trendline is bold enough to be the primary focus for analysts, effectively “taming” the daily price volatility.
5. Conclusion: Scaling Quantitative Research
The synergy between Microsoft AI Foundry, deployed LLMs, and specialized R packages represents the future of financial research. We have replaced manual data wrangling with an autonomous, standardized agent that can be scaled across thousands of different asset pairs with a single command.
The ABI Connection (Bridging Python to R in VS Code)
For the script to run locally in VS Code, we must establish a robust Application Binary Interface (ABI) connection. This is handled by the rpy2 library, which serves as the translation layer between Python and the R interpreter.
Synchronization: The script uses a localconverter to transform Python data types into R objects in real-time.
Environment Sync: Before the Agent’s code is executed, the script automatically synchronizes the working directory (setwd) so that files generated by R (like the ratio_plot.png) are immediately accessible to the Python environment for rendering.
import os
# Force rpy2 to use ABI mode to avoid the Windows CFFI conflict
os.environ['RPY2_CFFI_MODE'] = 'ABI'
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
print("Interface initialized in ABI mode.")
The Integrated Agent Script:
import os
import httpx
from openai import AzureOpenAI
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
from IPython.display import Image, display
#Microsoft AI Foundry - Azure OpenAI Connection
client = AzureOpenAI(
api_version="2024-12-01-preview",
azure_endpoint="AZURE_OPENAI_ENDPOINT",
api_key="AZURE_OPENAI_KEY",
http_client=httpx.Client(verify=False, trust_env=False)
)
def run_updated_agent(user_request):
system_instructions = (
"You are a Quantitative Researcher. MANDATORY: All output, comments, and labels in English. "
"Strict Operational Guidelines:n"
"1. Libraries: library(tidyquant), library(timetk), library(lubridate), library(dplyr), library(ggplot2).n"
"2. Analysis: Fetch GC=F and SI=F for 3 years, merge via inner_join, and calculate 'ratio'.n"
"3. Visualization: Use timetk::plot_time_series with .interactive = FALSE and .title = "Gold/Silver Ratio".n"
"4. Precision: Set .line_size = 2 and ALWAYS set .smooth_size = 2 for the smoothing line.n"
"5. Set title font face and axis texts font face to 'bold', and size to 16 with theme() function.n"
"6. EXPORT: Save using 'ggsave("ratio_plot.png", width = 10, height = 6)'.n"
"7. Output ONLY raw R code."
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_instructions},
{"role": "user", "content": user_request}
]
)
# Cleaning any markdown or headers to get raw code
agent_code = response.choices[0].message.content.strip()
if agent_code.startswith("```"):
agent_code = "\n".join(agent_code.split("\n")[1:-1])
print("-" * 40)
print(agent_code)
print("-" * 40)
try:
with localconverter(robjects.default_converter + pandas2ri.converter):
# Synchronize working directory (forward slashes so R accepts the path)
cwd = os.getcwd().replace("\\", "/")
robjects.r(f'setwd("{cwd}")')
robjects.r(agent_code)
if os.path.exists("ratio_plot.png"):
display(Image(filename="ratio_plot.png"))
except Exception as e:
print(f"Agent Error: {e}")
# Execution
run_updated_agent("Plot the Gold/Silver ratio for the last 3 years with a smooth line.")
To leave a comment for the author, please follow the link and comment on their blog: DataGeeek.
398254 Predicting Best Picture at the 2026 Academy Awards
https://www.r-bloggers.com/2026/01/predicting-best-picture-at-the-2026-academy-awards/
Tue, 13 Jan 2026 03:33:43 +0000 http://www.r-bloggers.com/?guid=bb51a00e49490a8f96df1cb1ac58e6d0
I’m back with the Oscars Best Picture model, albeit a little late. I had a busy holiday season, but the story of December was surprising: The Secret Agent was the favorite, followed by One Battle After Another. This was largely due to The Secret Agent’s runtime, which is …
Continue reading: Predicting Best Picture at the 2026 Academy Awards
[This article was first published on Mark H. White II, PhD, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
January 12, 2026
I’m back with the Oscars Best Picture model, albeit a little late. I had a busy holiday season, but the story of December was surprising: The Secret Agent was the favorite, followed by One Battle After Another. This was largely due to The Secret Agent’s runtime, which is right in the sweet spot for Best Picture winners.
However, remember from the last two years that my model is assuming these movies have been nominated for Best Picture. The biggest barrier facing The Secret Agent is being nominated; it is a non-English language film. While the only non-English language film to win has been Parasite (2019), the nominations are rare enough that, given that the film has been nominated, having no English dialogue isn’t a barrier to winning. The DGA and PGA nominations make me think The Secret Agent won’t be nominated, however.
It also surprised me that One Battle After Another wasn’t favored more, given Paul Thomas Anderson is a generationally phenomenal writer-director, but none of his films have won Best Picture. Looking into the data, it looks like the “career award” is not much of a thing for Best Picture (like it seems to be for the acting and directing categories). Just the opposite: If a director has had a film nominated or won Best Picture before, it actually hurts their chances of winning in my model.
That was then, this is now, though. No more awards in my models will name nominees or winners before the Oscar nominations. Where do we stand going into the announcement?
One Battle After Another is the favorite, at about 15% chance of winning. Following it closely is The Secret Agent (10%), followed by Marty Supreme (9%), Hamnet (8%), Wicked: For Good (8%), and Frankenstein (8%).
You can read more about the details of the model in my posts from last year and the year before. The big change I’ve made this time is calibrating the probabilities so that the model isn’t too sure of itself. I will see you on the other side of the Oscar nominations.
To leave a comment for the author, please follow the link and comment on their blog: Mark H. White II, PhD.
398238 From scripts to pipelines in the age of LLMs
https://www.r-bloggers.com/2026/01/from-scripts-to-pipelines-in-the-age-of-llms/
Tue, 13 Jan 2026 00:00:00 +0000 https://b-rodrigues.github.io/posts/2026-01-13-data_science_llm_age.html
I was recently reading Davis Vaughan’s blog post Semi-automating 200 Pull Requests with Claude Code and it really resonated with me, as I’ve been using LLMs for tedious tasks like that for some time now. Davis’s key insight: structure = su…
[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
I was recently reading Davis Vaughan’s blog post Semi-automating 200 Pull Requests with Claude Code and it really resonated with me, as I’ve been using LLMs for tedious tasks like that for some time now. Davis’s key insight: structure = success. When you can scope a task tightly and provide clear context, LLMs become genuinely useful tools.
If you’ve been following my work, you know that reproducible pipelines have been my main focus for some time now. It’s the reason I wrote {rix} for reproducible R environments, {rixpress} for declarative pipelines, and even a Python port called ryxpress. I genuinely believe these tools make data science better: more reproducible, more debuggable, more shareable.
But I also know that getting people to adopt new tools is hard. Learning a new way of structuring your code takes time and effort, and most people are busy enough already. Here’s where LLMs enter the picture: they can help translate your existing scripts into this more structured format. You provide your monolithic script, explain what you want, and the LLM does the grunt work of restructuring it.
The typical way we write analytics scripts (long chains of %>% calls in R or method-chaining in Python) works fine for interactive exploration, but quickly turns into spaghetti that’s hard to modify, test, or debug. Take my old Luxembourg Airport analysis as an example: it works, but turning that kind of script into a proper pipeline with caching, explicit dependencies, and testability is tedious work.
But we’re in 2026 where LLMs now make this trivial.
From implicit to explicit: translating a script into a rixpress pipeline
Let me show you what I mean by translating that old Luxembourg Airport data cleaning code into a {rixpress} pipeline. The original script uses continuous %>% chaining, standard tidyverse style. The {rixpress} version makes every intermediate step explicit.
But here’s the thing: LLMs don’t know about {rixpress} out of the box. It’s a relatively new package, and it’s not in their training data (or if it is, it’s outdated). So I wrote another tool called pkgctx that extracts structured, LLM-ready API specifications from R or Python packages (and guess what, I did not write a single line of code of that tool, LLMs did).
The idea is simple: run pkgctx r rixpress > rixpress.ctx.yaml and you get a compact YAML file describing every function, its arguments, and its purpose. Feed that to an LLM, and suddenly it “knows” the package. All my packages now have such a file generated on each push thanks to GitHub Actions. Here’s how my conversation went:
Me: I want to use a package called rixpress, but I am not familiar with it. You’ll help me, and here is a file with a detailed description of the API, for context.
[uploaded rixpress.ctx.yaml]
LLM: Got it. I’ll use this API specification as the authoritative reference for rixpress and help you write correct, idiomatic pipelines with it.
A few orientation points to set the frame (no action required yet):
rixpress is declarative: you describe derivations (rxp_r(), rxp_py(), rxp_qmd(), …), then materialize them with rxp_populate() and rxp_make().
Data flows through the DAG via serialized artifacts (RDS / pickle / Julia Serialization), not in-memory objects.
Language boundaries are explicit (rxp_r2py(), rxp_py2r()).
Nix is the execution engine; default.nix (or equivalents) define the runtime environment.
When you’re ready, tell me what you want to do […]
Then I simply asked:
Me: Help me translate this R script into a rixpress pipeline: [pasted the old script]
And that’s how I got a working {rixpress} pipeline. The LLM did the tedious restructuring; I reviewed the output, made minor tweaks, and was done. The combination of pkgctx for context and a clear task (“translate this script”) made the LLM genuinely useful.
Now let’s look at what the translated pipeline looks like. First, let’s assume:
The data file avia_par_lu.tsv is in the project directory
Required R packages are available via default.nix (we’ll also use an LLM for this one)
The project has been initialized with rxp_init() (this sets up two skeleton files to get started quickly)
Now, this is a faithful “translation” of the script into a {rixpress} pipeline. However, the original data is not available anymore, and recent data sets have changed slightly, which means that this script would need further adaptation to the current data source. Otherwise, this would be it! You can view the updated script here (I have also removed all the recoding of factors, because there seems to be something wrong with how {rixpress} handles `, so writing this blog post actually helped me find something to fix!)
Generating the environment
I also used an LLM to generate the {rix} script that sets up the reproducible environment for this pipeline. I gave it the rix.pkgctx.yaml context file (generated with pkgctx r rix > rix.pkgctx.yaml, which is also available on the rix GitHub repo) and asked: “Using this knowledge, write me an R script that uses rix to set up the right default.nix for this pipeline.”
The LLM correctly identified the packages needed from the pipeline code:
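Roughly along these lines (the package list and the date below are illustrative placeholders, not the LLM’s verbatim output):
library(rix)
rix(
  date = "2026-01-10",   # placeholder; must be one of the snapshot dates {rix} supports
  r_pkgs = c("dplyr", "tidyr", "readr", "ggplot2"),   # illustrative package list
  ide = "none",
  project_path = ".",
  overwrite = TRUE
)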
There’s only one issue with that script: the selected date is not valid; it should instead be the 12th of January. But that’s actually my fault: the LLM had no way of knowing that. The only way it could have known is if I had told it to look at the csv file that lists all the valid dates in {rix}’s repository. But after changing the date, it becomes possible to run this script, then nix-build to build the environment and nix-shell to drop into it. From there, run your pipeline.
What we’ve done here is use LLMs at every step:
Gave context about rixpress (via pkgctx) and asked the LLM to translate my old script into a pipeline
Gave context about rix (via pkgctx) and asked the LLM to generate the environment setup
The pattern is always the same: context + scoped task = useful output.
Structure + context = outsourceable grunt work
The point I’m making here isn’t really about {rixpress} pipelines specifically. It’s about a broader principle that both Davis Vaughan and I have observed: LLMs are genuinely useful when you give them enough structure and context.
Structure + Context → Scoped Task → LLM can actually help
I’ve written before about how you can outsource grunt work to an LLM, but not expertise. The same applies here. I still had to know what data transformations I needed. I still had to review the output and make adjustments. But the tedious restructuring (turning a monolithic script into a declarative pipeline) is exactly the kind of work LLMs can handle if you set them up properly.
If you want LLMs to help with your data science work:
Give them context. Use tools like pkgctx to feed them API specifications. Paste your existing code. Show them examples.
Scope the task tightly. “Translate this script into a rixpress pipeline” is a well-defined task. “Make my code better” is not.
Review the output. LLMs do grunt work; you provide expertise.
If you’re not familiar with {rixpress}, check out my announcement post or the CRAN release post. And if you want to give LLMs context about R or Python packages, pkgctx is there to help. For those who want to dive deeper into Nix, {rix}, and {rixpress}, I’ve recently submitted a paper to the Journal of Statistical Software, which you can read here. For more examples of {rixpress} pipelines, check out the rixpress_demos repository.
LLMs aren’t going anywhere: the genie is out of the bottle. I still see plenty of people online claiming that LLMs aren’t useful, but I genuinely believe it comes down to one of two things:
They’re not providing enough context or scoping their tasks well enough.
They have a principled objection to LLMs, AI, and automation in general which, ok, whatever, but it’s not a technical argument about usefulness.
Some people might even say that to feel good about themselves: what I program is much too complex and important for mere LLMs to be able to help me. Ok perhaps, but not all of us are working for NASA or whatever. I’ll keep on outsourcing the tedious grunt work to LLMs.
398257 Open call for the rOpenSci Champions Program 2026!
https://www.r-bloggers.com/2026/01/open-call-for-the-ropensci-champions-program-2026/
Mon, 12 Jan 2026 00:00:00 +0000 https://ropensci.org/blog/2026/01/12/programchamps2026/
Read it in: Español. We are pleased to announce the opening of a new call for applications for the rOpenSci Champions Program in Spanish, which will begin in 2026. We will be accepting applications beginning in January 12, 2026 and until February 20, …
We are pleased to announce the opening of a new call for applications for the rOpenSci Champions Program in Spanish, which will begin in 2026. We will be accepting applications from January 12, 2026 until February 20, 2026, for both the Champion and Mentor roles.
As in the previous cohort, the 2026 program will be developed entirely in Spanish and will have a regional focus on Latin America with the objective of further strengthening the research and open science software in this region.
The Community Call on January 21 will feature the participation of Champions and Mentors from previous cohorts. They will share their experiences and answer questions about the program, and we invite you to join us! See the recording of last year’s event.
During February we will hold one application clinic. It is an open space where you will be able to receive help to complete the application form, resolve doubts and receive direct guidance from the program team.
What is the Champions Program?
This program seeks to identify, support and recognize people who are already leading, or who want to take a step further, in building open science and sustainable research software communities.
Throughout 12 months the selected individuals will participate in:
An initial cohort-based training of 13 2-hour workshops.
The development of a research software project: creation of a new R package, submission of an existing R package to rOpenSci peer review, or as part of a team of reviewers of other R packages.
1-on-1 mentoring and cohort events.
Leading community-building activities, outreach and communications, and exchanging experiences to strengthen regional networks.
The program also offers a stipend, recognizing the time and work of participants who complete the program, as well as a certificate of participation.
Who is it for?
Champions
Potential Champions are people who:
Are from or live and work in Latin America;
Use R and have previous experience in software development, reproducible analysis or open science;
Are interested in developing an open research software project and strengthening their local or regional communities;
Are able to communicate effectively in Spanish (oral and written);
If you have any questions, we invite you to participate in the Community Call in January, join us in the application clinic in February, or contact our Community Manager.
We look forward to your applications with great enthusiasm! We want to continue building community in Latin America together with you.
398223 Retrieval-Augmented Generation: Setting up a Knowledge Store in R
https://www.r-bloggers.com/2026/01/retrieval-augmented-generation-setting-up-a-knowledge-store-in-r/
Thu, 08 Jan 2026 23:59:00 +0000 https://www.jumpingrivers.com/blog/retrieval-augmented-generation-database-workflow-r/
Happy New Year from the team at Jumping Rivers! As we move through the midpoint of the 2020s, it’s a good time to reflect on the changes that we have seen so far in this decade. In the world of data science nothing has dominated headlines quite l…
[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Happy New Year from the team at Jumping Rivers!
As we move through the midpoint of the 2020s, it’s a good time to
reflect on the changes that we have seen so far in this decade. In the
world of data science nothing has dominated headlines quite like the
rapid growth and uptake of generative artificial intelligence (GenAI).
Large language models (LLMs) such as ChatGPT, Claude and Gemini have
incredible potential to streamline day-to-day tasks, whether that’s
processing vast amounts of information, providing a human-like chat
interface for customers or generating code. But they also come with
notable risks if not harnessed responsibly.
Anyone who has interacted with these models is likely to have come
across hallucination, where the model confidently presents false
information as though it is factually correct. This can happen for a
variety of reasons:
LLMs often have no access to real-time information: how would a model
that was trained last year know today’s date?
The training data may be missing domain-specific information: can we
really trust an off-the-shelf model to have a good understanding of
pharmaceuticals and medicinal drugs?
The model may be over-eager to come across as intelligent, so it
decides to provide a confident output rather than a more nuanced,
honest answer.
Often we need to give the model access to additional contextual
information before we can make it “production-ready”. We can achieve
this using a retrieval-augmented generation (RAG) workflow. In this
blog post we will explore the steps involved and set up an example RAG
workflow using free and open source packages in R.
What is RAG?
In a typical interaction with an LLM we have:
A user prompt: the text that is submitted by the user.
A response: the text that is returned by the LLM.
(optional) A system prompt: additional instructions for how the LLM
should respond (for example,
"You respond in approximately 10 words or less").
In a RAG workflow we provide access to an external knowledge store which
can include text-based documents and webpages. Additional contextual
info is then retrieved from the knowledge store (hence “retrieval”)
and added to the user prompt before it is sent. In doing so we can
expect to receive a higher quality output.
How does it work?
Before going further, we must first introduce the concept of
vectorisation.
Contrary to what you might believe, LLMs do not understand non-numerical
text! They are mathematical models, meaning they can only ingest and
output numerical vectors.
So how can a user interact with a model using plain English? The trick
is that mappings exist which are able to convert between numerical
vectors and text. These mappings are called “vector embeddings” and are
used to convert the user prompt into a vector representation before it
is passed to the LLM.
So, when setting up our RAG knowledge store, we have to store the
information using a compatible vector representation. With this in mind,
let’s introduce a typical RAG workflow:
Content: we decide which documents to include in the knowledge
store.
Extraction: we extract the text from these documents in Markdown
format.
Chunking: the Markdown content is split into contextual “chunks”
(for example, each section or subsection of a document might become
a chunk).
Vectorisation: the chunks are “vectorised” (i.e. we convert them
into a numerical vector representation).
Index: we create an index for our knowledge store which will be
used to retrieve relevant chunks of information.
Retrieval: we register the knowledge store with our model
interface. Now, when a user submits a prompt, it will be combined
with relevant chunks of information before it is ingested by the
model.
At the retrieval step, a matching algorithm is typically used so that
only highly relevant chunks are retrieved from the knowledge store. In
this way, we are able to keep the size of the user prompts (and any
incurred costs) to a minimum.
Setting up a RAG workflow in R
We will make use of two packages which are available to install via the
Comprehensive R Archive Network (CRAN).
Both are actively maintained by Posit
(formerly RStudio) and are free to install and use.
{ragnar}
The {ragnar} package provides functions
for extracting information from both text-based documents and webpages,
and provides vector embeddings that are compatible with popular LLM
providers including OpenAI and Google.
We will use {ragnar} to build our knowledge store.
{ellmer}
The {ellmer} package allows us to
interact with a variety of LLM APIs from R. A complete list of supported
model providers can be found in the package
documentation.
Note that, while {ellmer} is free to install and use, you will still
need to set up an API token with your preferred model provider before
you can interact with any models. We will use the free Google Gemini
tier for our example workflow. See the Gemini API
documentation for creating
an API key, and the {ellmer}
documentation
for authenticating with your API key from R.
Example RAG workflow
We begin by loading the {ragnar} package.
library("ragnar")
The URL provided below links to the title
page of the “Efficient R
Programming” textbook, written by Robin Lovelace and our very own Colin
Gillespie. We’re going to use a couple of chapters from the book to
construct a RAG knowledge store.
url = "https://csgillespie.github.io/efficientR/"
Let’s use {ragnar} to read the contents of this page into a Markdown
format.
md = read_as_markdown(url)
We could vectorise this information as it is, but first we should split
it up into contextual chunks.
chunks = markdown_chunk(md)
chunks
#> # @document@origin: https://csgillespie.github.io/efficientR/
#> # A tibble: 2 × 4
#> start end context text
#> * <int> <int> <chr> <chr>
#> 1 1 1572 "" "# Efficient R programmin…
#> 2 597 2223 "# Welcome to Efficient R Programming" "## Authors\n\n[Colin Gil…
The chunks are stored in a tibble format, with one row per chunk. The
text column stores the chunk text (in the interests of saving space we
have only included the start of each chunk in the printed output above).
The title page has been split into two chunks and we can see that there
is significant overlap (chunk 1 spans characters 1 to 1572 and chunk 2
spans characters 597 to 2223). Overlapping chunks are perfectly normal
and provide added context as to where each chunk sits relative to the
other chunks.
Note that you can visually inspect the chunks by running
ragnar_chunks_view(chunks).
It’s time to build our knowledge store with a vector embedding that is
appropriate for Google Gemini models.
# Initialise a knowledge store with the Google Gemini embedding
store = ragnar_store_create(
embed = embed_google_gemini()
)
# Insert the Markdown chunks
ragnar_store_insert(store, chunks)
The Markdown chunks are automatically converted into a vector
representation at the insertion step. It is important to use the
appropriate vector embedding when we create the store. A knowledge store
created using an OpenAI embedding will not be compatible with Google
Gemini models!
Before we can retrieve information from our store, we must create a
store index.
ragnar_store_build_index(store)
We can now test the retrieval capabilities of our knowledge store using
the ragnar_retrieve() function. For example, to retrieve any chunks
relevant to the text Who are the authors of “Efficient R
Programming”? we can run:
relevant_knowledge = ragnar_retrieve(
store,
text = "Who are the authors of "Efficient R Programming"?"
)
relevant_knowledge
#> # A tibble: 1 × 9
#> origin doc_id chunk_id start end cosine_distance bm25 context text
#> <chr> <int> <list> <int> <int> <list> <lis> <chr> <chr>
#> 1 https://csgi… 1 <int> 1 2223 <dbl [2]> <dbl> "" "# E…
Note that the `\"` escape sequences in `\"Efficient R Programming\"` have been used to print raw double quotes in the character string.
Without going into too much detail, the cosine_distance and bm25
columns in the returned tibble provide information relating to the
matching algorithm used to identify the chunks. The other columns relate
to the location and content of the chunks.
From the output tibble we see that the full content of the title page
(characters 1 to 2223) has been returned. This is because the original
two chunks both contained information about the authors.
Let’s add a more technical chapter from the textbook to the knowledge
store. The URL provided below links to Chapter 7 (“Efficient
Optimisation”).
Let’s add this to the knowledge store and rebuild the index.
url = "https://csgillespie.github.io/efficientR/performance.html"
# Extract Markdown content and split into chunks
chunks = url |>
read_as_markdown() |>
markdown_chunk()
# Add the chunks to the knowledge store
ragnar_store_insert(store, chunks)
# Rebuild the store index
ragnar_store_build_index(store)
Now that our knowledge store includes content from both the title page
and Chapter 7, let’s ask something more technical, like “What are some good practices for parallel computing in R?”.
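As before, a retrieval call along these lines surfaces the matching chunks:
relevant_knowledge = ragnar_retrieve(
  store,
  text = "What are some good practices for parallel computing in R?"
)
relevant_knowledge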
It makes sense that we have chunks from Section 7.5, which appears to be
highly relevant to the question. By including the title page and the
start of Chapter 7, the LLM will also have access to useful metadata
in case the user wants to find out where the model is getting its
information from.
Now that we have built and tested our retrieval tool, it’s time to
connect it up to a Gemini interface using {ellmer}. The code below will
create a chat object allowing us to send user prompts to Gemini.
chat = ellmer::chat_google_gemini(
system_prompt = "You answer in approximately 10 words or less."
)
A system prompt has been included here to ensure a succinct response
from the model API.
We can register this chat interface with our retrieval tool.
ragnar_register_tool_retrieve(chat, store)
To check if our RAG workflow has been set up correctly, let’s chat with
the model.
chat$chat("What are some good practices for parallel computing in R?")
#> Use the `parallel` package, ensure you stop clusters with `stopCluster()` (or
#> `on.exit()`), and utilize `parLapply()`, `parApply()`, or `parSapply()`.
The output looks plausible. Just to make sure, let’s check where the
model found out this information.
chat$chat("Where did you get that answer from?")
#> I retrieved the information from "Efficient R programming" by Colin Gillespie
#> and Robin Lovelace.
Success! The LLM has identified the name of the textbook and if we
wanted to we could even ask about the specific chapter. A user
interacting with our model interface could now search online for this
textbook to fact-check the responses.
In the example workflow above, we manually selected a couple of chapters
from the textbook to include in our knowledge store. It’s worth noting
that you can also use the ragnar_find_links(url) function to retrieve
a list of links from a given webpage.
Doing so for the title page will provide the links to all chapters.
You could then iterate through these links, extracting the contents from
each webpage and inserting these into your RAG knowledge store. Just
note, however, that including additional information in your store will
likely increase the amount of text being sent to the model, which could
raise costs. You should therefore think about what information is
actually relevant for your LLM application.
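A rough sketch of that loop, reusing the functions introduced above:
# Collect the links from the title page
links = ragnar_find_links("https://csgillespie.github.io/efficientR/")
# Extract, chunk and insert the content of each page
for (link in links) {
  chunks = link |>
    read_as_markdown() |>
    markdown_chunk()
  ragnar_store_insert(store, chunks)
}
# Rebuild the index once everything has been inserted
ragnar_store_build_index(store)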
Summary
In summary, we have introduced the concept of retrieval-augmented
generation for LLM-powered workflows and built an example workflow in R
using open source packages.
Before finishing, we are excited to announce that our new course
“LLM-Driven Applications with R & Python” has just been added to our
training portfolio. You can search for it
here.
If you’re interested in practical AI-driven workflows, we would love to
see you at our upcoming AI In Production
2026 conference which is
running from 4-5 June in Newcastle-Upon-Tyne. If you would like to
present a talk or workshop, please submit your abstracts before the
deadline on 23 January.
For updates and revisions to this article, see the original post
To leave a comment for the author, please follow the link and comment on their blog: The Jumping Rivers Blog.
398146 Survivor 49 is now in 📦{survivoR}
https://www.r-bloggers.com/2026/01/survivor-49-is-now-in-%f0%9f%93%a6survivor/
Thu, 08 Jan 2026 19:49:48 +0000 http://gradientdescending.com/?p=3486
Survivor 49 has wrapped up and has been added to the {survivoR} package. It is available on If you find […] The post Survivor 49 is now in 📦{survivoR} appeared first on Dan Oehm | Gradient Descending.
[This article was first published on T. Moudiki’s Webpage – R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
In this post, I present the novelties of the Python package rtopy, a package whose ultimate objective is to translate R to Python without much hassle. The intro is still available at https://thierrymoudiki.github.io/blog/2024/03/04/python/r/rtopyintro.
The novelties mainly concern the RBridge class and the call_r function. The RBridge class is more about persistence, while the call_r function is more about ease of use.
See for yourself in the following – hopefully comprehensive – examples (classification, regression, time series, hypothesis testing).
contents
Installation
RBridge class
call_r function
Advanced RBridge Usage Examples
%load_ext rpy2.ipython
The rpy2.ipython extension is already loaded. To reload it, use:
%reload_ext rpy2.ipython
%%R
install.packages("pak")
pak::pak(c("e1071", "forecast", "randomForest"))
library(jsonlite)
!pip install rtopy
"""
Advanced RBridge Usage Examples
================================
Demonstrates using R packages, statistical modeling, and data processing
through the Python-R bridge `rtopy`.
"""
import numpy as np
import pandas as pd
from rtopy import RBridge, call_r
# ============================================================================
# Example 1: Support Vector Machine with e1071
# ============================================================================
print("=" * 70)
print("Example 1: SVM Classification with e1071")
print("=" * 70)
# Generate training data
np.random.seed(42)
n_samples = 100
# Class 0: centered at (-1, -1)
X0 = np.random.randn(n_samples // 2, 2) * 0.5 + np.array([-1, -1])
# Class 1: centered at (1, 1)
X1 = np.random.randn(n_samples // 2, 2) * 0.5 + np.array([1, 1])
X_train = np.vstack([X0, X1])
y_train = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))
# Create R code for SVM training and prediction
svm_code = '''
library(e1071)
train_svm <- function(X, y, kernel_type = "radial") {
# Convert to data frame
df <- data.frame(
x1 = X[, 1],
x2 = X[, 2],
y = as.factor(y)
)
# Train SVM
model <- e1071::svm(y ~ x1 + x2, data = df, kernel = kernel_type, cost = 1)
# Make predictions on training data
predictions <- predict(model, df)
# Calculate accuracy
accuracy <- mean(predictions == df$y)
# Return results
list(
predictions = as.numeric(as.character(predictions)),
accuracy = accuracy,
n_support = model$tot.nSV
)
}
'''
rb = RBridge(verbose=True)
result = rb.call(
svm_code,
"train_svm",
return_type="dict",
X=X_train,
y=y_train,
kernel_type="radial"
)
print(f"Training Accuracy: {result['accuracy']:.2%}")
print(f"Number of Support Vectors: {result['n_support']}")
print(f"Sample Predictions: {result['predictions'][:10]}")
# ============================================================================
# Example 2: Time Series Analysis with forecast package
# ============================================================================
print("n" + "=" * 70)
print("Example 2: Time Series Forecasting with forecast")
print("=" * 70)
# Generate time series data
time_series = np.sin(np.linspace(0, 4*np.pi, 50)) + np.random.randn(50) * 0.1
ts_code = '''
library(forecast)
forecast_ts <- function(x, h = 10) {
# Convert to time series object
ts_data <- ts(x, frequency = 12)
# Fit ARIMA model
fit <- auto.arima(ts_data, seasonal = FALSE)
# Generate forecast
fc <- forecast(fit, h = h)
# Return results
list(
forecast_mean = as.numeric(fc$mean),
forecast_lower = as.numeric(fc$lower[, 2]), # 95% CI
forecast_upper = as.numeric(fc$upper[, 2]),
model_aic = fit$aic,
model_order = paste0("ARIMA(",
paste(arimaorder(fit), collapse = ","),
")")
)
}
'''
result = rb.call(
ts_code,
"forecast_ts",
return_type="dict",
x=time_series.tolist(),
h=10
)
print(f"Model: {result['model_order']}")
print(f"AIC: {result['model_aic']:.2f}")
print(f"5-step forecast: {np.array(result['forecast_mean'])[:5]}...")
# ============================================================================
# Example 3: Random Forest with randomForest package
# ============================================================================
print("n" + "=" * 70)
print("Example 3: Random Forest Regression")
print("=" * 70)
# Generate regression data
np.random.seed(123)
X = np.random.rand(200, 3) * 10
y = 2*X[:, 0] + 3*X[:, 1] - X[:, 2] + np.random.randn(200) * 2
rf_code = '''
library(randomForest)
train_rf <- function(X, y, ntree = 500) {
# Create data frame
df <- data.frame(
x1 = X[, 1],
x2 = X[, 2],
x3 = X[, 3],
y = y
)
# Train random forest
rf_model <- randomForest(y ~ ., data = df, ntree = ntree, importance = TRUE)
# Get predictions
predictions <- predict(rf_model, df)
# Calculate R-squared
r_squared <- 1 - sum((y - predictions)^2) / sum((y - mean(y))^2)
# Get feature importance
importance_scores <- importance(rf_model)[, 1] # %IncMSE
list(
r_squared = r_squared,
mse = rf_model$mse[ntree],
predictions = predictions,
importance = importance_scores
)
}
'''
result = rb.call(
rf_code,
"train_rf",
return_type="dict",
X=X,
y=y.tolist(),
ntree=500
)
print(f"R-squared: {result['r_squared']:.3f}")
print(f"MSE: {result['mse']:.3f}")
print(f"Feature Importance: {result['importance']}")
# ============================================================================
# Example 4: Statistical Tests with stats package
# ============================================================================
print("n" + "=" * 70)
print("Example 4: Statistical Hypothesis Testing")
print("=" * 70)
# Generate two samples
group1 = np.random.normal(5, 2, 50)
group2 = np.random.normal(6, 2, 50)
stats_code = '''
perform_tests <- function(group1, group2) {
# T-test
t_result <- t.test(group1, group2)
# Wilcoxon test (non-parametric alternative)
w_result <- wilcox.test(group1, group2)
# Kolmogorov-Smirnov test
ks_result <- ks.test(group1, group2)
list(
t_test = list(
statistic = t_result$statistic,
p_value = t_result$p.value,
conf_int = t_result$conf.int
),
wilcox_test = list(
statistic = w_result$statistic,
p_value = w_result$p.value
),
ks_test = list(
statistic = ks_result$statistic,
p_value = ks_result$p.value
),
summary_stats = list(
group1_mean = mean(group1),
group2_mean = mean(group2),
group1_sd = sd(group1),
group2_sd = sd(group2)
)
)
}
'''
result = rb.call(
stats_code,
"perform_tests",
return_type="dict",
group1=group1.tolist(),
group2=group2.tolist()
)
print(f"Group 1 Mean: {result['summary_stats']['group1_mean']:.2f} ± {result['summary_stats']['group1_sd']:.2f}")
print(f"Group 2 Mean: {result['summary_stats']['group2_mean']:.2f} ± {result['summary_stats']['group2_sd']:.2f}")
print(f"nT-test p-value: {result['t_test']['p_value']:.4f}")
print(f"Wilcoxon p-value: {result['wilcox_test']['p_value']:.4f}")
# ============================================================================
# Example 5: Data Transformation with dplyr
# ============================================================================
print("n" + "=" * 70)
print("Example 5: Data Wrangling with dplyr")
print("=" * 70)
# Create sample dataset
data = pd.DataFrame({
'id': range(1, 101),
'group': np.random.choice(['A', 'B', 'C'], 100),
'value': np.random.randn(100) * 10 + 50,
'score': np.random.randint(1, 101, 100)
})
dplyr_code = '''
library(dplyr)
process_data <- function(df) {
# Convert list columns to data frame
data <- as.data.frame(df)
# Perform dplyr operations
result <- data %>%
filter(score > 50) %>%
group_by(group) %>%
summarise(
n = n(),
mean_value = mean(value),
median_score = median(score),
sd_value = sd(value)
) %>%
arrange(desc(mean_value))
# Convert back to list format for JSON
as.list(result)
}
'''
result = rb.call(
dplyr_code,
"process_data",
return_type="pandas",
df=data
)
print("nGrouped Summary Statistics:")
print(result)
# ============================================================================
# Example 6: Clustering with cluster package
# ============================================================================
print("n" + "=" * 70)
print("Example 6: K-means and Hierarchical Clustering")
print("=" * 70)
# Generate clustered data
np.random.seed(42)
cluster_data = np.vstack([
np.random.randn(30, 2) * 0.5 + np.array([0, 0]),
np.random.randn(30, 2) * 0.5 + np.array([3, 3]),
np.random.randn(30, 2) * 0.5 + np.array([0, 3])
])
cluster_code = '''
library(cluster)
perform_clustering <- function(X, k = 3) {
# Convert to matrix
data_matrix <- as.matrix(X)
# K-means clustering
kmeans_result <- kmeans(data_matrix, centers = k, nstart = 25)
# Hierarchical clustering
dist_matrix <- dist(data_matrix)
hc <- hclust(dist_matrix, method = "ward.D2")
hc_clusters <- cutree(hc, k = k)
# Silhouette analysis for k-means
sil <- silhouette(kmeans_result$cluster, dist_matrix)
avg_silhouette <- mean(sil[, 3])
list(
kmeans_clusters = kmeans_result$cluster,
kmeans_centers = kmeans_result$centers,
kmeans_withinss = kmeans_result$tot.withinss,
hc_clusters = hc_clusters,
silhouette_score = avg_silhouette
)
}
'''
result = rb.call(
cluster_code,
"perform_clustering",
return_type="dict",
X=cluster_data,
k=3
)
print(f"K-means Within-cluster SS: {result['kmeans_withinss']:.2f}")
print(f"Average Silhouette Score: {result['silhouette_score']:.3f}")
print(f"nCluster Centers:n{np.array(result['kmeans_centers'])}")
print(f"nCluster sizes: {np.bincount(result['kmeans_clusters'])}")
print("n" + "=" * 70)
print("All examples completed successfully!")
print("=" * 70)
======================================================================
Example 1: SVM Classification with e1071
======================================================================
Training Accuracy: 100.00%
Number of Support Vectors: 9
Sample Predictions: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
======================================================================
Example 2: Time Series Forecasting with forecast
======================================================================
Model: ARIMA(3,1,0)
AIC: -10.21
5-step forecast: [0.29557391 0.4948255 0.64553023 0.80823028 0.93656539]...
======================================================================
Example 3: Random Forest Regression
======================================================================
R-squared: 0.972
MSE: 11.996
Feature Importance: [62.57255479535195, 86.55470841243113, 21.4933655703039]
======================================================================
Example 4: Statistical Hypothesis Testing
======================================================================
Group 1 Mean: 5.33 ± 2.06
Group 2 Mean: 5.37 ± 2.28
T-test p-value: 0.9381
Wilcoxon p-value: 0.8876
======================================================================
Example 5: Data Wrangling with dplyr
======================================================================
Grouped Summary Statistics:
group n mean_value median_score sd_value
0 C 23 49.711861 76 11.367167
1 A 14 49.219788 74 9.744709
2 B 23 47.459312 80 10.126835
======================================================================
Example 6: K-means and Hierarchical Clustering
======================================================================
K-means Within-cluster SS: 39.38
Average Silhouette Score: 0.713
Cluster Centers:
[[-0.03545142 3.12736567]
[ 2.9470395 3.04927708]
[-0.07207628 -0.0825784 ]]
Cluster sizes: [ 0 30 30 30]
======================================================================
All examples completed successfully!
======================================================================
import matplotlib.pyplot as plt
import seaborn as sns
# Set a style for better aesthetics
sns.set_style("whitegrid")
# Create a scatter plot of the clustered data
plt.figure(figsize=(10, 7))
sns.scatterplot(
x=cluster_data[:, 0],
y=cluster_data[:, 1],
hue=result['kmeans_clusters'],
palette='viridis',
s=100, # size of points
alpha=0.8, # transparency
legend='full'
)
# Plot the cluster centers
centers = np.array(result['kmeans_centers'])
plt.scatter(
centers[:, 0],
centers[:, 1],
marker='X',
s=200, # size of centers
color='red',
edgecolors='black',
label='Cluster Centers'
)
plt.title('K-means Clustering of Generated Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()
398159 Directional markers in R/leaflet
https://www.r-bloggers.com/2026/01/directional-markers-in-r-leaflet/
Wed, 07 Jan 2026 22:25:16 +0000 http://nsaunders.wordpress.com/?p=5702
So you have used the excellent exiftool to extract all of the GPS-related information from a directory of photos in JPG format and write to a CSV file: You’ve used R/leaflet to plot coordinates (latitude and longitude) before, but what about that tag named GPSImgDirection? It would be …
So you have used the excellent exiftool to extract all of the GPS-related information from a directory of photos in JPG format and write to a CSV file:
exiftool '-*GPS*' -ext jpg -csv . > outfile.csv
You’ve used R/leaflet to plot coordinates (latitude and longitude) before, but what about that tag named GPSImgDirection? It would be nice to have some kind of marker which indicates the direction in which you were facing when the photo was taken.
For me, a Google search provided hints but not one single, obvious straightforward solution to this problem (the generative AI effect? time will tell…), so here’s what I’ve put together from several sources, in particular this StackOverflow post.
The key points are:
use awesomeIcons() to create a directional icon which can be rotated
add the icons to your map using addAwesomeMarkers()
Here’s some example code which uses the Font Awesome icon long-arrow-up. Since “up” (north) corresponds to zero degrees, applying a rotation corresponding to GPSImgDirection should result in the correct orientation for the marker. The GPS-related tags in this case come from an iPhone 13.
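A minimal sketch of that code (not necessarily the original post’s exact version; the exiftool columns GPSLatitude, GPSLongitude and GPSImgDirection are assumed to already be numeric):
library(leaflet)
# Read the exiftool output; column names follow the exiftool tag names
photos <- read.csv("outfile.csv")
# A Font Awesome arrow icon, rotated clockwise from north by the image direction
arrow_icons <- awesomeIcons(
  icon = "long-arrow-up",
  library = "fa",
  markerColor = "blue",
  iconColor = "white",
  iconRotate = photos$GPSImgDirection
)
leaflet(photos) |>
  addTiles() |>
  addAwesomeMarkers(
    lng = ~GPSLongitude,
    lat = ~GPSLatitude,
    icon = arrow_icons
  )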
398155 Rugby Analytics with R: Complete Guide to Performance Analysis in Rugby Union and League
https://www.r-bloggers.com/2026/01/rugby-analytics-with-r-complete-guide-to-performance-analysis-in-rugby-union-and-league/
Wed, 07 Jan 2026 19:46:55 +0000 https://rprogrammingbooks.com/?p=2391
Rugby is a sport defined by collisions, structure, and constant tactical adaptation. Unlike many other invasion sports, rugby alternates between highly structured moments—scrums, lineouts, restarts—and extended passages of chaotic open play. Each phase generates rich performance data: tackles, rucks, carries, kicks, meters gained, penalties conceded, turnovers, and spatial …
[This article was first published on Blog – R Programming Books, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Rugby is a sport defined by collisions, structure, and constant tactical adaptation. Unlike many
other invasion sports, rugby alternates between highly structured moments—scrums, lineouts,
restarts—and extended passages of chaotic open play. Each phase generates rich performance data:
tackles, rucks, carries, kicks, meters gained, penalties conceded, turnovers, and spatial changes
in territory. Despite this richness, rugby analytics has historically lagged behind other sports,
especially in terms of open, reproducible analytical workflows.
This gap presents a clear opportunity. R provides a complete environment for rugby performance
analysis: data acquisition, cleaning, modeling, visualization, and automated reporting. For
analysts, sports scientists, and coaches, R enables evidence-based decision-making that goes far
beyond traditional statistics and subjective video review.
Why rugby analytics requires a different analytical mindset
Rugby is not a possession-by-possession sport in the same way as basketball, nor a continuous-flow
game like football. Possession can be short or long, territory often matters more than time on the
ball, and a single penalty can flip match momentum. Analytics must therefore respect rugby’s
unique structure.
Simple totals—tackles, carries, meters—are insufficient on their own. Analysts must consider game
state, field position, opposition quality, and player role. R makes it possible to incorporate
this context systematically and consistently across matches and seasons.
Data acquisition in rugby: scraping, APIs, and internal feeds
Public rugby data is fragmented and inconsistent. Analysts often combine multiple sources to build
a usable dataset. R is particularly well suited to this challenge because it supports web scraping,
API consumption, and database integration within a single workflow.
# Core libraries for rugby data acquisition
library(tidyverse)
library(rvest)
library(httr)
library(jsonlite)
# Example: pulling match data from an API
response <- GET("https://api.example.com/rugby/match/9876")
raw_json <- content(response, "text")
match_data <- fromJSON(raw_json)
Web scraping is often necessary when APIs are unavailable. This requires careful handling of HTML
structure, rate limits, and data validation to ensure accuracy and reproducibility.
# Scraping a match statistics table
page <- read_html("https://example-rugby-site.com/match/9876")
team_stats <- page %>%
html_node("table.match-stats") %>%
html_table()
team_stats
Data cleaning and validation: a critical but underestimated step
Rugby datasets are rarely analysis-ready. Player substitutions, injury replacements, and data
entry inconsistencies introduce errors that can distort results if left unchecked.
Validation logic should be embedded directly into the pipeline. This ensures that every new match
is processed consistently, reducing human error and analyst workload.
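As a minimal sketch (column names such as player_id, minutes and tackles are assumptions for illustration), validation can live in a small function that every new match file passes through:
library(dplyr)
# Basic validation checks embedded in the pipeline
validate_match <- function(events) {
  stopifnot(
    "missing player ids"     = !any(is.na(events$player_id)),
    "implausible minutes"    = all(between(events$minutes, 0, 100)),
    "negative tackle counts" = all(events$tackles >= 0)
  )
  events
}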
Transforming events into rugby-specific units of analysis
Raw events are only the starting point. Meaningful rugby analysis requires transforming events
into units such as phases, possessions, sets, and passages of play.
These structures allow analysts to study momentum, ruck efficiency, and attacking intent in a way
that aligns with how coaches understand the game.
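For example, with an assumed, simplified event log, consecutive events by the same team can be grouped into possessions:
library(dplyr)
# Toy event log; in practice this comes from the acquisition step above
events <- tibble(
  minute = 1:6,
  team   = c("A", "A", "B", "B", "B", "A")
)
# A new possession starts whenever the team in possession changes
possessions <- events %>%
  arrange(minute) %>%
  mutate(possession_id = cumsum(team != lag(team, default = first(team))) + 1)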
Advanced player performance analysis with R
Player evaluation in rugby must be contextual and role-specific. Front-row players, halves, and
outside backs contribute in fundamentally different ways.
Rate-based metrics reveal impact more effectively than totals, especially when comparing starters
to bench players or evaluating performance across different match contexts.
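A short sketch (with assumed columns) of converting totals into per-80-minute rates:
library(dplyr)
player_stats <- tibble(
  player  = c("Starting flanker", "Bench back row"),
  minutes = c(80, 25),
  tackles = c(14, 7),
  carries = c(9, 5)
)
player_rates <- player_stats %>%
  mutate(
    tackles_per_80 = tackles / minutes * 80,
    carries_per_80 = carries / minutes * 80
  )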
Defensive systems analysis: beyond individual tackles
Effective defense is systemic. Missed tackles often result from spacing errors, fatigue, or poor
decision-making rather than individual incompetence.
Defensive analytics should highlight structural weaknesses and workload imbalances,
not just individual error counts.
Territory, kicking strategy, and spatial dominance
Territory remains a core determinant of success in rugby. Teams that consistently win the
territorial battle reduce defensive workload and increase scoring opportunities.
Spatial analysis allows analysts to quantify whether a team’s kicking strategy aligns with its
stated game model and environmental constraints.
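As an illustration (column names and values are assumptions), territorial gain from kicks can be summarised per team:
library(dplyr)
# Field position before and after each kick, in metres from the kicking team's own try line
kicks <- tibble(
  team        = c("A", "A", "B", "B"),
  start_metre = c(30, 45, 25, 60),
  end_metre   = c(65, 80, 55, 85)
)
kick_territory <- kicks %>%
  mutate(metres_gained = end_metre - start_metre) %>%
  group_by(team) %>%
  summarise(
    kicks      = n(),
    mean_gain  = mean(metres_gained),
    total_gain = sum(metres_gained)
  )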
Win probability and decision modeling in rugby
Win probability models convert complex match states into intuitive probabilities. In rugby, these
models must account for score, time, territory, possession, and discipline risk.
# Building a basic win probability model
wp_data <- matches %>%
mutate(
score_diff = team_score - opponent_score,
time_remaining = 80 - minute
)
wp_model <- glm(
win ~ score_diff + time_remaining + territory,
data = wp_data,
family = binomial()
)
summary(wp_model)
Even simple models provide immediate value by framing tactical decisions—such as kicking for touch
versus taking the points—in probabilistic terms.
Automated reporting and reproducible workflows
The final step in rugby analytics is communication. R enables analysts to automate reporting,
ensuring consistency and freeing time for deeper insight generation.
# Creating a clean match summary table
summary_table <- team_stats_clean %>%
select(team, possession, territory, tackles, line_breaks, penalties_conceded)
knitr::kable(summary_table)
Automated reports ensure that analysis becomes part of the weekly rhythm rather than an optional
extra.
The strategic opportunity in rugby analytics with R
There is clear and growing interest in rugby analytics, but very little comprehensive, R-focused
content. Analysts, sports scientists, and coaches are actively searching for practical guidance.
A dedicated, end-to-end approach—covering data acquisition, performance metrics, modeling, and
reporting—fills a genuine gap and establishes authority in a niche with minimal competition.
A complete, practical guide for applying R to real-world rugby performance analysis, designed
for analysts, sports scientists, and coaches working in Rugby Union and Rugby League.
398134 R Studio or Positron? Time To Switch?
https://www.r-bloggers.com/2026/01/r-studio-or-positron-time-to-switch/
Wed, 07 Jan 2026 08:00:00 +0000 https://ozancanozdemir.github.io/r/R-Studio-Positron
I remember the day that I started to use R programming. I had a basic interface to write and execute the code. After that experience, R Studio emerged as a powerful IDE for R programming for me. It provided a user-friendly interface, integrated tools…
[This article was first published on Ozancan Ozdemir, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
I remember the day that I started to use R programming. I had a basic interface to write and execute the code. After that experience, R Studio emerged as a powerful IDE for R programming for me. It provided a user-friendly interface, integrated tools, and features that enhance productivity and streamline the coding process, and it was a huge shift in my R programming journey.
In July 2022, R Studio was rebranded to Posit. Apparently, a new era was about to start because the world’s needs were evolving, and R had a stronger companion in the Python programming language.
R Studio Interface (Source: biocorecrg.github.io)
To satisfy the needs of both R and Python users, Posit introduced a new product called Positron. It is a data science-oriented IDE that supports both R and Python programming languages, in contrast to R Studio. Of course, this emerging tool has tempted some R Studio users who are also using VSCode since it offers some advantages over R Studio.
The main difference between Positron and R Studio is their multi-language support. Positron allows users to work with both R and Python in a single environment, making it easier for data scientists who use both languages. Additionally, Positron offers better integration with Jupyter Notebooks, which are widely used in the data science community.
AI-based assistants are also integrated into Positron, providing users with suggestions and code completions based on their coding patterns. This feature can significantly enhance productivity and reduce the learning curve for new users.
If you are playing with data, it offers more flexibility and versatility compared to R Studio. You can examine not only the data frames in your environment, but also .csv and parquet files without importing them.
Another advantage of Positron is that it offers extensions that make the IDE more customizable and adaptable to different workflows. Users can install extensions to add new features, improve functionality, and tailor the environment to their specific needs.
Clashes between package versions and R versions can become annoying, as anyone who has encountered them in R Studio will know. With Positron, however, you can manage different R versions at the same time on the same machine without conflicts. This is particularly useful for users who work on multiple projects with different R version requirements.
Lastly, it is being improved continuously with frequent updates and new features being added regularly. This ensures that users have access to the latest tools and technologies in the data science field.
So the question is: Should we give up on using R Studio?
Actually, no. Because it is not going away, and it still provides some advantages over Positron.
R Studio still has strong features that tempt users to keep using it. You can use RMarkdown and Quarto to create dynamic documents, reports, and presentations that combine code, text, and visualizations. R Studio also has a robust ecosystem of packages and extensions that enhance its functionality and provide specialized tools for various data analysis tasks.
You can save and reload your workspace. Besides, you have several panels that help you manage your files, plots, packages, and help documents easily. You can track code you have written in the past and bring it back quickly without spending a long time searching. And you can import your datasets without typing code!
From a developer perspective, R Studio has specific tools that make developing packages and apps easier compared to Positron.
In conclusion, both Positron and R Studio have their own strengths and weaknesses. The choice between the two ultimately depends on the user’s specific needs and preferences. If you require multi-language support, better Jupyter integration, and AI-based assistance, Positron may be the better choice. However, if you prioritize RMarkdown, a robust package ecosystem, and workspace management, R Studio may be more suitable.
To leave a comment for the author, please follow the link and comment on their blog: Ozancan Ozdemir.
398136 So how much does OpenAI owe us?
https://www.r-bloggers.com/2026/01/so-how-much-does-openai-owe-us/
Wed, 07 Jan 2026 00:00:00 +0000 https://datascienceconfidential.github.io/economics/ai/llm/r/2026/01/07/so-how-much-does-openai-owe-us.html
Introduction: Copyright Law and Whatnot
I recently watched a clip from a debate between Timothy Nguyen of Google Deepmind and Danish author Janne Teller. The debate, entitled Technology and Freedom, took place at Hay-on-Wye in summer 2025. On the subj…
[This article was first published on datascienceconfidential – r, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Introduction: Copyright Law and Whatnot
I recently watched a clip from a debate between Timothy Nguyen of Google Deepmind and Danish author Janne Teller. The debate, entitled Technology and Freedom, took place at Hay-on-Wye in summer 2025. On the subject of copyright, Nguyen says:
The reason AI is so powerful is because it’s scraped all this data on the internet and of course that has all these issues in terms of copyright law and whatnot. But that’s also democratised knowledge and so there are ways in which it’s been good and bad. But now we have this very intelligent system which has all this knowledge from books, but then maybe there are going to be some authors who aren’t going to be very happy. So there are always going to be winners and losers.
Teller replies:
This is an undermining of any intellectual property rights we have developed up to now. Anything you have written on a Facebook post which is public will be considered by this Metaverse as something they can use to develop their AI and you might say OK, that’s a new form of sharing. Anything you contribute, everybody owns it. But then that speaks to nationalising all technology platforms. You want to have everything everyone else has created. But then we want to have your work also and have control over it.
The clip cuts off here and I haven’t seen a video of the full debate, so I don’t know how Nguyen replied. But I think Teller made a good point. It’s not just that LLMs have been trained (illegally) on masses of copyrighted material. They have also been trained on data from the internet, which is a public good, and perhaps the people who unwittingly created all the training data should be entitled to some sort of compensation. Even the slopigarchs themselves acknowledge this. For example, in 2017, Elon Musk said that the pace of change is:
a massive social challenge. And I think ultimately we will have to have some kind of universal basic income (UBI). I don’t think we’re going to have a choice.
At the moment we are facing two possible outcomes. Either AI progress grinds to a halt and the bubble bursts, or AI breakthroughs continue to happen at a rapid pace, replacing human jobs, and everyone ends up becoming unemployed until they can find other jobs to do. Every previous improvement in technology, no matter how disruptive, eventually ended up with people finding other things to do, so the economy will keep going somehow. But before we reach that point we may find ourselves facing serious social unrest. As Teller suggests, perhaps it is the AI companies themselves who should pay for this. After all, they did steal everyone else’s work to train their models. But if, in some grim future in which companies like OpenAI become profitable, we eventually get compensation, how much compensation should we get?
It seems like this question has no answer. But actually there’s a simple heuristic for evaluating the relative contributions of the model and the data, which I want to explain in this post. Not only is this heuristic relevant for musing about the future of AI, but it’s also surprisingly useful in everyday data science, too.
The Cover-Hart Theorem
Consider a classification problem in which the input is a data point $x$ contained in some metric space (i.e. a set equipped with a notion of distance) $(X, d)$, and the output is a classification into one of $M$ classes. The classifier is evaluated by the percentage of data points which it classifies correctly (the accuracy). If $A$ is the accuracy then $R = 1-A$ is called the error rate.
The Bayes Rate $R^\ast$ for the problem is defined to be the lowest possible error rate which any classifier could achieve. Why isn’t $R^\ast$ just 0%? That’s because the same point might appear in more than one class! See the example below.
Suppose a data set $\mathcal{X}$ is given. It consists of some points $x_i \in X$ and the corresponding classes $\theta_i$. We want to use the data set $\mathcal{X}$ to build a classifier.
The 1-Nearest Neighbour or 1-NN classifier is the classifier $C$ which simply assigns an unseen data point $x$ to the class of the closest point to $x$ in $\mathcal{X}$ (for simplicity, let’s assume that $\mathcal{X}$ doesn’t contain any duplicate points). That is, if $d(x, x_i) = \min_{y \in \mathcal{X}} d(x, y)$ then $C(x) := \theta_i$. Note that to define the 1-NN classifier, we need $X$ to be a metric space, or else there is no notion of the nearest neighbour.
The Cover-Hart Theorem states that the error rate $R$ of the 1-NN classifier satisfies $R^\ast \le R \le 2R^\ast$, asymptotically as the number of data points in $\mathcal{X}$ goes to $\infty$, and provided that the points in $\mathcal{X}$ are an iid sample from some distribution.
In other words, if you are given a data set and asked to build a predictive model, just doing the most naive thing possible and looking up the closest point in your data set to the point you want to classify already gets you halfway to the lowest possible error.
Example
Here is an example which I used to use when teaching this topic in university courses.
Let’s consider a single predictor $x$. There are two classes labelled $0$ and $1$. The distribution of $x$ for class $1$ is $N(1, 1)$ and the distribution of $x$ for class $0$ is $N(-1, 1)$. Suppose the population is equally distributed among the two classes.
The best possible classifier would classify a point $x$ into whichever class has the higher density for that particular value of $x$. The purple area represents the proportion of points which would be misclassified. Since 50% of the population is in each class, the purple area is equal to
bayes_rate <- 1-pnorm(1)
# 0.1586553
Now suppose we are supplied with a training dataset consisting of 50 points from each class
set.seed(100)
N <- 100
df_train <- data.frame(x = c(rnorm(N/2, 1, 1), rnorm(N/2, -1, 1)), y = rep(c(1, 0), each=N/2))
The following function classifies a point using the nearest neighbour, with the metric being $d(x, y) = \lvert x - y \rvert$.
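A minimal sketch of such a function (not necessarily the exact implementation from the original post), reusing df_train from above:
classify_1nn <- function(x, train = df_train) {
  # return the class of the training point closest to x
  train$y[which.min(abs(train$x - x))]
}
# classify a single new point
classify_1nn(0.5)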
If you run the whole script again with the same seed but with N=10000 points in the training data, you will even find that logistic regression gets an error rate which is lower than the Bayes rate! This happens because the training and test sets are finite samples from the actual data distribution, so there is some sampling error.
Practical Use
There are two ways to use this in practice. First, suppose that you are presented with a data set and build a quick and dirty classifier using 1-NN and achieve an accuracy of 80%. Then the error rate $R$ of the 1-NN classifier is 20% and the Cover-Hart Theorem tells you that the Bayes rate $R^\ast \ge R/2$, so the Bayes rate cannot be less than 10%, which means that you can’t expect to achieve an accuracy of better than 90% using some other algorithm. This might be a helpful guide to how much time you should spend trying to build a better classifier. In practice, the quick and dirty classifier you build will be something other than 1-NN[1], and it usually has better performance than 1-NN, so this can actually be a useful way to estimate the Bayes rate on a new data set.
Secondly, suppose that you are presented with a classification algorithm with an accuracy of 95%. Then you can estimate that the Bayes rate $R^\ast$ is at most 5%, because $R^\ast$ is the lowest possible error rate among all classifiers. This means that the error rate of a 1-NN classifier $R$ cannot be more than 10%. But that means that a 1-NN classifier would have given you at least 90% accuracy. Since the 1-NN classifier is just another name for “look at the data”, you could already achieve 90% accuracy by looking at the data alone without building your fancy model. In other words, the data is doing $90/95 \approx 94.7\%$ of the work![2]
Problems with the Cover-Hart Theorem
In practice, Cover-Hart should be used only as a heuristic and not as something which is expected to hold in all cases. This is because it makes very strong assumptions about the data.
It’s only true if you have infinitely many data points, so it will only be approximately true in any real-life situation. How close the Cover-Hart Theorem is to being true in any real-life situation might also depend very strongly on the metric being used.
More seriously, the data points need to be independent and identically distributed (iid). This is never true, despite the fact that textbooks and courses seem to suggest otherwise. In fact, I think it’s rare for training and test sets even to come from the same distribution.
For example, consider image classification. Cover-Hart suggests that you can classify any image correctly if you find the closest image, perhaps in the sense of Euclidean distance, in some sufficiently large reference data set. But clearly the reference data set would have to be massive, and the cost of searching for the closest image would probably be extremely high.
What does Cover-Hart say about AI?
The Cover-Hart Theorem, then, doesn’t suggest a sensible way to build an AI model. For example, suppose you want to generate the next word, given a string of text. A 1-NN classifier would be supplied with a corpus of data. It would need to search through this data and find the piece of text that was the closest match to the given string, and then extract the next word from that piece of text. For some kinds of text, like The capital of France is, this might work well, but clearly it’s not going to be a good approach in general.
This isn’t how Large Language Models work at all, so how is the Cover-Hart Theorem relevant to LLMs? Well, I think it could be used as a heuristic for measuring the relative contribution of the model and the data. For example, let’s suppose an LLM has an accuracy of $A$ on some benchmark. Then, as explained above, a 1-NN classifier could be expected to achieve an error rate of at most $2(1-A)$, and so you could estimate that the data by itself is contributing roughly
\[\frac{1 - 2(1-A)}{A} = \frac{2A - 1}{A}\]
of the overall performance. This could be taken as a measure of how much the data is “worth” versus the model.
For example, if an AI company achieves 80% on some benchmark, then the people who contributed the data in some sense deserve $(2(0.8)-1)/0.8 = 75\%$ of the credit.
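As a quick sanity check of that arithmetic in R:
# share of benchmark performance attributable to the data under this heuristic
data_share <- function(A) (2 * A - 1) / A
data_share(0.8)
# 0.75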
Who was right?
So was Teller correct? Do the people who generated the data deserve most of the profits (if there are any) from AI? Well, that depends on what you mean by “AI”.
In the case of LLMs, assuming that they really are able to replace people in the workplace, I think the Cover-Hart Theorem could provide a first step for deciding how to regulate or tax them. But the term “AI” encompasses a lot of different models, and some of those models don’t use training data at all. For example, AlphaZero reached grandmaster-level performance in chess and superhuman performance in Go by playing against itself. And this isn’t a new idea; in the 1990s TD-Gammon was already able to outperform humans in backgammon by taking a similar approach. Personally I find these kinds of algorithms even more impressive than LLMs, but that’s just my opinion.
One more thing. Suppose we did find ourselves in a world in which a government was choosing to tax AI companies based on the above formula. Then we could reach a bizarre scenario in which, in order to avoid tax, the AI companies would be competing to make the ratio $(2A-1)/A$ as small as possible. This would mean that, instead of boasting about the accuracy of their models on self-chosen benchmarks, these firms would find themselves in a paradoxical race to claim that their accuracy was as low as possible.
I think that would be hilarious.
[1] By the way, the 1-NN classifier is one of the very few classifiers which output just a class without any notion of the strength of class membership, so you can’t define an AUC for it. This is one of the classifiers which suffer from the so-called class imbalance problem, which they ask about in every data science interview. In practice, class imbalance is never really a problem, because nobody compares classifiers by using accuracy alone.
[2] Of course, this might not be the full story. For one thing, you will probably be interested in other metrics besides accuracy. For another, your algorithm might have other advantages over the 1-NN classifier, such as coming up with predictions more quickly.
To leave a comment for the author, please follow the link and comment on their blog: datascienceconfidential – r.
398167 rOpenSci Code of Conduct Annual Review
https://www.r-bloggers.com/2026/01/ropensci-code-of-conduct-annual-review-7/
Wed, 07 Jan 2026 00:00:00 +0000 https://ropensci.org/blog/2026/01/07/conduct2025/
Read it in: Español. rOpenSci’s activities and spaces are supported by a Code of Conduct that applies to all people participating in the rOpenSci community, including rOpenSci staff and leadership. It applies to all modes of interaction including GitH…
rOpenSci’s activities and spaces are supported by a Code of Conduct
that applies to all people participating in the rOpenSci community,
including rOpenSci staff and leadership.
It applies to all modes of interaction including GitHub project repositories,
the rOpenSci discussion forum, Slack, Community Calls, Co-working and social sessions, training and mentoring sessions,
and in person at rOpenSci-hosted events, including affiliated social gatherings.
Our Code of Conduct is developed and enforced by a committee including rOpenSci staff and an independent community member.
Here we report our annual review of rOpenSci’s Code of Conduct,
reporting process, and internal guidelines for handling reports and enforcement.
Updates
There are no changes to the Committee members for 2026.
The committee members are Natalia Morandeira (independent member),
Mark Padgham (rOpenSci Software Research Scientist) and
Yanina Bellini Saibene (rOpenSci Community Manager).
We are responsible for receiving, investigating, adjudicating, enforcing,
and responding to all reports of potential violations of our Code.
No changes have been made to the text of the Code in English, Spanish or Portuguese.
Because the text has remained unchanged, the version number is kept at 2.5, dated January 30, 2024.
Committee Members Activities
An important aspect of our work as Committee members is to make sure that the processes we follow are transparent, consistent, and fair. To support this work we’ve developed a set of templates that guide us through the different stages of incident response and reporting. We shared these templates openly and explained them in detail in the blog post “rOpenSci Code of Conduct Committee Templates”.
Yanina Bellini Saibene attended the BoF “Community safety and wellbeing: Can we do more than ‘good enough’?” at the Open Science CZI meeting to discuss Code of Conduct best practices with other organizations in the Open Science movement.
Contact Information
We welcome your feedback by email to conduct at ropensci.org,
and we thank you for continuing to work with us to ensure that rOpenSci remains a safe, enjoyable, friendly and enriching environment for everyone who participates.
Analysis and Long-term implications of {ellmer} for dynamic alt text generation
The key focus of this text is the use of a new R package, {ellmer}, to automatically generate alternative (alt) text for images in interactive applications built with {shiny}. This is noteworthy because alt text improves accessibility, particularly for screen reader users, and provides context if an image fails to load. The text also points out that the interactivity of apps and dashboards can overshadow accessibility requirements, a gap that automated alt text generation can help to close.
A well-implemented dynamic alt text workflow built on {ellmer} should be consistent, fault-tolerant and cheap to run: errors are handled gracefully so that accessibility degrades rather than the app breaking, which benefits the user experience, especially for visually impaired users. However, as with any AI tool, the article cautions that generated alt text should not replace human review and attention; it should supplement it.
Long term, the wider adoption of this dynamic alt text generation approach can vastly improve accessibility across web platforms. By leveraging the automation capabilities of AI and data science, developers can help to create a more inclusive digital space.
Possible Future Developments
Given the pace of progress in AI, it is likely that more sophisticated and reliable alt text generation tools will emerge, and that they will become available in more programming languages beyond R. Alt text generation could also be built into visual creation tools such as design apps, image editors and even cameras. Developers could likewise explore shared libraries for alt text generation that would make it easier for others to incorporate this accessibility feature into their own apps.
Additionally, as demand for digital accessibility grows, we can expect more applications like this that aim to bridge the gap between interactivity and accessibility across a wide range of interfaces.
Actionable Advice
Here are a few steps developers can take to improve accessibility in their apps:
Consider using tools like {ellmer} to generate alt text for images and graphs in your application (see the sketch after this list). It not only improves the accessibility of your apps, but also saves the time and effort of writing every alt text by hand.
Strike a balance between AI-generated alt text and manual review. Remember, the ability to provide context-rich descriptions may vary among different AI tools. Therefore, it’s always a good idea to check and correct the AI-generated alternative text when necessary.
Consider making alt text generation part of your development process. This approach would keep accessibility as a priority, while also making it a natural part of your workflow.
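As a rough illustration of these points, here is a minimal Shiny sketch. It assumes a hypothetical generate_alt_text() helper that sends a rendered plot to an LLM chat object (here called gemini) and returns a description; both names are assumptions for this sketch, not a documented API. The hand-written fallback string is used whenever the API call fails, so accessibility degrades gracefully rather than breaking the app.
library(shiny)
library(ggplot2)
ui <- fluidPage(
  plotOutput("scatter")
)
server <- function(input, output, session) {
  plot_obj <- reactive({
    ggplot(mtcars, aes(wt, mpg)) + geom_point()
  })
  # Try the LLM first; fall back to a hand-written description on any error
  alt_text <- reactive({
    tryCatch(
      generate_alt_text(plot_obj(), gemini),  # hypothetical helper + chat object
      error = function(e) "Scatter plot of car weight against miles per gallon."
    )
  })
  # renderPlot()'s alt argument accepts a reactive expression (Shiny >= 1.6.0),
  # so the alt text updates whenever the plot does
  output$scatter <- renderPlot(
    plot_obj(),
    alt = alt_text
  )
}
shinyApp(ui, server)
One possible refinement would be to cache the generated text for each plot, so that repeated re-renders of the same plot do not trigger repeated API calls.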