R-bloggers
https://www.r-bloggers.com
R news and tutorials contributed by hundreds of R bloggers
Thu, 22 Jan 2026 23:59:00 +0000
Using {ellmer} for Dynamic Alt Text Generation in {shiny} Apps https://www.r-bloggers.com/2026/01/using-ellmer-for-dynamic-alt-text-generation-in-shiny-apps/

Thu, 22 Jan 2026 23:59:00 +0000
https://www.jumpingrivers.com/blog/ellmer-dynamic-alt-text/

This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers.

Alt Text

First things first, if you haven’t heard of or used alt text before, it is a brief written description of an image that explains context and purpose. It is used to improve accessibility by allowing screen readers to describe images, or to provide context if an image fails to load. For writing good alt text, see this article by Harvard, but some good rules of thumb are:

  • Keep it concise and relevant to the context of why the image is being used.
  • Screen readers will already say “Image of …”, so we don’t need to include this unless the style is important (drawing, cartoon, etc.).

Alt Text within Apps and Dashboards

I don’t need to list all the positives of interactive apps and dashboards; one of the main ones is that they allow users to explore data in their own way. This is a great thing most of the time, but one pitfall that is often overlooked is that interactivity can overshadow accessibility, whether it’s a fancy widget that’s hard (or impossible) to use via keyboard or an interactive visualisation without meaningful alternative text.

In this post, we’ll look at a new approach to generating dynamic alt text for ggplot2 charts using {ellmer}, Posit’s new R package for querying large language models (LLMs) from R. If you are using Shiny for Python, then chatlas will be of interest to you.

Why Dynamic Alt Text Needs Care

Automatically generating alt text is appealing, but production Shiny apps have constraints:

  • Plots may re-render frequently
  • API calls can fail or be rate-limited
  • Accessibility should degrade gracefully, not break the app

A good implementation should be consistent, fault-tolerant, and cheap to run.

Using {ellmer} in a Shiny App

The first step is setting up a connection to your chosen LLM. I am using Google Gemini 2.5 Flash as there is a generous free tier, but other models and providers are available. In a Shiny app, this can be done outside the reactive context:

library(ellmer)
gemini <- chat_google_gemini()

## Using model = "gemini-2.5-flash".

Note: You should have a Google Gemini key saved in your .Renviron file as GEMINI_API_KEY; this way, the {ellmer} function will be able to find it. More information on generating a Gemini API key can be found in the Gemini docs.

Then we have the function for generating the alt text:

library(ggplot2)

generate_alt_text = function(ggplot_obj, model) {
  # Save the plot to a temporary PNG with a consistent size and resolution.
  temp <- tempfile(fileext = ".png")
  on.exit(unlink(temp))

  ggsave(
    temp,
    ggplot_obj,
    width = 6,
    height = 4,
    dpi = 150
  )

  # Ask the model to describe the image; if the call fails,
  # return a simple fallback so the app still has usable alt text.
  tryCatch(
    model$chat(
      "
Generate concise alt text for this plot image.
Describe the chart type, variables shown,
key patterns or trends, and value ranges where visible.
      ",
      content_image_file(temp)
    ),
    error = function(e) {
      "Data visualisation showing trends and comparisons."
    }
  )
}

The function has a few features that will keep the output more reliable:

  • Consistent image size and resolution – helps model reliability when reading axes and labels.

  • Explicit cleanup of temporary files – we don’t need to save the images once text is generated.

  • Error handling – if the model call fails, the app still returns usable alt text. We kept our fallback text simple for demonstration purposes, but you can attempt to add more detail.

  • External model initialisation – only created once and passed in, rather than re-created on every reactive update.

Examples

In this section we will just create a few example plots and then see what the LLM generates.

simple_plot = ggplot(iris) +
 aes(Sepal.Width, Sepal.Length) +
 geom_point()
simple_plot
Scatter plot of the Iris data.
simple_plot_alt = generate_alt_text(simple_plot, gemini)
paste("Alt text generated by AI: ", simple_plot_alt)

Alt text generated by AI:

Scatter plot showing Sepal.Length on the y-axis (ranging from approximately 4.5 to 8.0) versus Sepal.Width on the x-axis (ranging from approximately 2.0 to 4.5). The data points appear to form two distinct clusters: one with Sepal.Width between 2.0 and 3.0 and Sepal.Length between 5.0 and 8.0, and another with Sepal.Width between 3.0 and 4.5 and Sepal.Length between 4.5 and 6.5.

plot = ggplot(iris) +
 aes(Sepal.Width, Sepal.Length, colour = Species) +
 geom_point()
plot
Scatter plot of the Iris data coloured by species.
plot_alt =
 generate_alt_text(plot, gemini)
paste("Alt text generated by AI: ", plot_alt)

Alt text generated by AI:

Scatter plot showing Sepal.Length on the y-axis (range 4.5-8.0) versus Sepal.Width on the x-axis (range 2.0-4.5), with points colored by Species. Red points, labeled “setosa”, form a distinct cluster with higher Sepal.Width (3.0-4.5) and lower Sepal.Length (4.5-5.8). Blue points, “virginica”, tend to have higher Sepal.Length (5.5-8.0) and moderate Sepal.Width (2.5-3.8). Green points, “versicolor”, are in between, with moderate Sepal.Length (5.0-7.0) and Sepal.Width (2.0-3.5), overlapping with virginica.

complicated_plot = ggplot(iris) +
 aes(Sepal.Width, Sepal.Length, colour = Species) +
 geom_point() +
 geom_smooth(method = "lm")
complicated_plot
Scatter plot of the Iris data coloured by species with an overlaid line of best fit for each species.
complicated_plot_alt =
 generate_alt_text(complicated_plot, gemini)
paste("Alt text generated by AI: ", complicated_plot_alt)

Alt text generated by AI:

Scatter plot showing Sepal.Length on the y-axis (range 4.0-8.0) versus Sepal.Width on the x-axis (range 2.0-4.5). Points and linear regression lines are colored by Iris species. Red points, “setosa”, cluster with lower Sepal.Length (4.0-5.8) and higher Sepal.Width (2.8-4.4). Green points, “versicolor”, and blue points, “virginica”, largely overlap, showing higher Sepal.Length (5.0-8.0) and moderate Sepal.Width (2.0-3.8), with “virginica” generally having the longest sepals. All three species exhibit a positive linear correlation, indicated by their respective regression lines and shaded confidence intervals, where increasing sepal width corresponds to increasing sepal length.

As we can see, the alt text generated by LLMs can be very good and informative. One alternative that I want to point out is including a summary of the data behind the plot; this way screen reader users can still gain insight from the plot.

Using Dynamic Alt Text in Shiny

Once generated, the alt text can be supplied directly to the UI:

  • Via the alt argument of renderPlot()
  • Or injected into custom HTML for more complex layouts

Because the text is generated from the rendered plot, it stays in sync with user inputs and filters.
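
To make this concrete, here is a minimal sketch of how the pieces could be wired together. It reuses the generate_alt_text() helper defined above, and it assumes a recent {shiny} version in which renderPlot()'s alt argument accepts a reactive expression (worth checking against your installed version); the input and output IDs are made up for illustration.

library(shiny)
library(ggplot2)
library(ellmer)

gemini <- chat_google_gemini() # created once, outside the reactive context

ui <- fluidPage(
  checkboxInput("by_species", "Colour by species?", value = FALSE),
  plotOutput("iris_plot")
)

server <- function(input, output, session) {
  current_plot <- reactive({
    p <- ggplot(iris) +
      aes(Sepal.Width, Sepal.Length) +
      geom_point()
    if (input$by_species) {
      p <- p + aes(colour = Species)
    }
    p
  })

  output$iris_plot <- renderPlot(
    current_plot(),
    # generate_alt_text() is the helper defined earlier in this post;
    # the alt text is regenerated whenever the plot changes.
    alt = reactive(generate_alt_text(current_plot(), gemini))
  )
}

shinyApp(ui, server)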

Other Considerations

Some apps may be more complicated and/or have a high number of users. These types of apps need a bit more consideration and can benefit from features like:

  • Caching alt text for unchanged plots to reduce API usage (see the sketch after this list)
  • Prompt augmentation with known variable names or units
  • Manual overrides for critical visuals
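
Here is a rough sketch of the caching idea. The cache key scheme and the environment-based store are illustrative assumptions (a package like {memoise} or a persistent cache would be more robust in production); generate_alt_text() is the helper defined earlier.

# Simple in-memory cache for generated alt text, keyed by whatever
# uniquely identifies the current plot (e.g. the relevant input values).
alt_cache <- new.env(parent = emptyenv())

cached_alt_text <- function(key, ggplot_obj, model) {
  if (exists(key, envir = alt_cache, inherits = FALSE)) {
    return(get(key, envir = alt_cache, inherits = FALSE))
  }
  alt <- generate_alt_text(ggplot_obj, model)
  assign(key, alt, envir = alt_cache)
  alt
}

# e.g. in the server: cached_alt_text(paste0("by_species=", input$by_species),
#                                     current_plot(), gemini)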

Conclusion

AI-generated alt text works best as a supporting tool, not a replacement for accessibility review. I have also found it helpful to let users know that the alt text is AI generated so they know to take it with a pinch of salt.

Dynamic alt text is a small feature with a big impact on inclusion. By combining Shiny’s reactivity with consistent rendering, error handling, and modern LLMs, we can make interactive data apps more accessible by default whilst not increasing developer burden.


futurize: Parallelize Common Functions via a “Magic” Touch 🪄 https://www.r-bloggers.com/2026/01/futurize-parallelize-common-functions-via-a-magic-touch-%f0%9f%aa%84/

Thu, 22 Jan 2026 00:00:00 +0000
https://www.jottr.org/2026/01/22/futurize-0.1.0/

This article was first published on JottR on R, and kindly contributed to R-bloggers.

The 'futurize' hexlogo with a dark, starry background and a light blue border. The word 'FUTURIZE' appears in bold, orange gradient lettering across the center, with three diagonal orange bars above it. Below, the text reads 'MAGIC TOUCH PARALLELIZATION,' flanked by two small magic wands with sparkles, suggesting effortless parallel computing.

I am incredibly excited to announce the release of the futurize package. This launch marks a major milestone in the decade-long journey of the Futureverse project.

Since the inception of the future ecosystem, I (and others) have envisioned a tool that would make concurrent execution as simple as possible with minimal change to your existing code – no refactoring, no new function names to memorize – it should just work and work the same everywhere. I’m proud to say that with futurize this is now possible – take your map-reduce call of choice and pipe it into futurize(), e.g.

y <- lapply(x, fcn) |> futurize()

That’s it – a “magic”(*) touch by one function! Easy!

(*) Yeah, there’s no magic going on here – it’s just the beauty of R in action.

Unifying the ecosystem

Diagram illustrating how sequential R map-reduce code can be parallelized with |> futurize(). On the left, sequential functions such as lapply(…), purrr::map(…), foreach(…) %do%, plyr::llply(…), and others flow into a central box labeled |> futurize() with magic-wand icons, indicating automatic transformation. On the right, the transformed code fans out to multiple parallel workers (Worker 1, Worker 2, Worker 3, …), whose outputs are combined into a single ‘Results’ node.

One of the biggest hurdles in concurrent R programming has been the fragmentation of APIs and behavior. Packages such as future.apply, furrr, and doFuture have partly addressed this. While they have simplified things for developers and users, they all require us to use slightly different function names and different parallelization arguments for controlling standard output, messages, warnings, and random number generation (RNG). futurize() changes this by providing one unified interface for all of them. It currently supports base-R apply calls, purrr maps, foreach, and plyr-style calls, among others.

Here is how it looks in practice. Notice how the map-reduce logic (e.g. lapply()) is identical regardless of the style you prefer:

# Base R
ys <- lapply(xs, fcn) |> futurize()

# purrr
ys <- map(xs, fcn) |> futurize()
ys <- xs |> map(fcn) |> futurize()

# foreach
ys <- foreach(x = xs) %do% { fcn(x) } |> futurize()

The “magic” of one function

The futurize() function works as a transpiler. The term “transpilation” describes the process of transforming source code from one form into another, a.k.a. source-to-source translation. It captures the original expression without evaluating it, then converts it into the concurrent equivalent, and finally executes the transpiled expression. It basically changes lapply() to future.apply::future_lapply() and map() to furrr::future_map() on the fly, and it handles parallelization options in a unified way, sometimes automatically. This allows you to write parallel code without blurring the underlying logic of your code.
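
To illustrate the idea, here is a toy sketch of source-to-source translation (not futurize's actual implementation), showing how a function can capture its argument unevaluated, rewrite the call, and only then evaluate it. It assumes the future.apply package is installed.

# Toy transpiler: capture the call unevaluated, swap lapply() for
# future.apply::future_lapply(), then evaluate the rewritten call.
toy_futurize <- function(expr) {
  captured <- substitute(expr)                 # the unevaluated call
  if (identical(captured[[1]], quote(lapply))) {
    captured[[1]] <- quote(future.apply::future_lapply)
  }
  eval(captured, envir = parent.frame())       # run the rewritten call
}

# Both forms end up calling future_lapply(1:3, sqrt):
toy_futurize(lapply(1:3, sqrt))
lapply(1:3, sqrt) |> toy_futurize()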

Domain-specific skills

The futurize package also includes support for a growing set of domain-specific packages, including boot, caret, glmnet, lme4, mgcv, and tm. These packages offer their own built-in, often complex, parallelization arguments. futurize abstracts all of that away. For example, instead of having to specify arguments such as parallel = "snow", ncpus = 4, cl = cl, with cl <- parallel::makeCluster(4), when using boot(), you can just do:

# Bootstrap with 'boot'
b <- boot(data, statistic, R = 999) |> futurize()

# Cross-validation with 'caret'
m <- train(Species ~ ., data = iris, method = "rf") |> futurize()

Why I think you should use it

The futurize package follows the core design philosophy of the Futureverse: separate “what” to execute concurrently from “how” to parallelize.

  • Familiar code: You write standard R code. If you remove |> futurize(), it runs the same.
  • Familiar behavior: Standard output, messages, warnings, and errors propagate as expected and as-is.
  • Unified interface: Future options work the same for lapply(), map(), foreach(), and so on, e.g. futurize(stdout = FALSE).
  • Backend independence: Because it’s built on the future ecosystem, your code can parallelize on any of the supported future backends. It scales up on your notebook, a remote server, or a massive high-performance compute (HPC) cluster with a single change of settings, e.g. plan(future.mirai::mirai_multisession), plan(future.batchtools::batchtools_slurm), and even plan(future.p2p::cluster, cluster = "alice/friends").

Another way to put it, with futurize, you can forget about future.apply, furrr, and doFuture – those packages are now working behind the scenes for you, but you don’t really need to think about them.
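
For instance, here is a small sketch of that separation in practice; the map-reduce line stays the same, and only plan() decides where the work runs (it assumes the future and futurize packages are installed):

library(future)
library(futurize)

plan(multisession, workers = 2)            # "how": two background R sessions
ys <- lapply(1:4, \(x) x^2) |> futurize()  # "what": the code itself is unchanged
plan(sequential)                           # back to ordinary sequential execution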

Installation

You can install the package from CRAN:

install.packages("futurize")

Outro

I hope that futurize makes your R coding life easier by removing technical details of parallel execution, allowing you to stay focused on the logic you want to achieve. I’d love to hear how you’ll be using futurize in your R code. For questions, feedback, and feature requests, please reach out on the Futureverse Discussions forum.

May the future be with you!

Henrik


Closing The Loop with Our 2025 Wrap-up https://www.r-bloggers.com/2026/01/closing-the-loop-with-our-2025-wrap-up/

Thu, 22 Jan 2026 00:00:00 +0000
https://ropensci.org/blog/2026/01/22/yearinreview2025/

This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers.

At the beginning of 2025, we outlined our goals for the year in our 2024 Highlights post. As the year started, our work took place in a far more challenging global context than many of us anticipated. Across many countries, science and research faced funding cuts, layoffs, and attacks on diversity, equity, accessibility, and inclusion. These conditions reshaped timelines and capacities for institutions and for the people doing the work, but also reinforced why community-driven, open, and care-centered spaces matter.

This post looks back at what we said we would do and how that work played out in practice throughout 2025. We also share our plans for 2026.

Software peer review: steady growth and shared responsibility

Throughout 2025, software peer review remained a core activity at rOpenSci, with 24 submissions peer-reviewed and approved. We had a continuous flow of package submissions, approvals, and ongoing reviews, reflecting both the demand for high-quality research software and the commitment of our editors and reviewers.

New packages joined rOpenSci in the fields of environmental and climate data, geospatial analysis, linguistics, and statistics, among others. Existing packages received more than 7000 commits from 239 unique users, with 62 of them contributing for the first time to rOpenSci! Package maintainers released new versions, performed ongoing maintenance and responded to user inquiries and issues.

Building on last year’s commitments, we further improved the efficiency and usefulness of our software peer review system, and explored its potential as a model for other communities.

We enhanced the review process by improving dashboards, updating guides, and refining pkgcheck and its GitHub Action, making participation easier for authors and reviewers.

Leadership transitions within the program editorial team helped distribute responsibility and ensure continuity, while ongoing community participation demonstrated that peer review remains a collective effort.

Increasing submissions mean we hope to add more editors to our board in 2026. We will also expand our automation processes, and formulate policies on the use of AI/LLMs in developing and maintaining rOpenSci packages.

Pictures of 17 people that are part of the software peer review editorial team

rOpenSci Software Peer Review Editorial Team

Strengthening shared infrastructure: R-Universe

A major milestone in the R-Universe trajectory was its designation as a Top-Level Project by the R Consortium in late 2024. This status provides guaranteed funding and institutional support for at least three more years and reflects R-Universe’s role as critical infrastructure in the R community. Alongside this, Jeroen Ooms, the R-Universe project lead, gained a seat on the consortium’s Infrastructure Steering Committee, reinforcing R-Universe’s influence on broader R ecosystem development.

Documentation has been a key focus area as R-Universe has matured. In early 2025, the R-Universe team launched a centralized documentation website hosted as a Quarto book, funded in part by Google Season of Docs. This resource consolidates dispersed tutorials, READMEs, and technical notes into a more navigable structure covering browsing, installation, publishing, and infrastructure topics. The goal was not only to improve clarity for new users but also to make contributions to the documentation itself easier for the community.

A refreshed web frontend has improved user experience and maintainability. This redesign streamlined the interface, improved performance, and made it easier for contributors to propose improvements, addressing long-standing usability challenges through the incremental additions of features over time. We are continuing to rewrite and consolidate other pieces of the infrastructure such as the build workflows to make it easier to understand the R-Universe build process, and enable the community to hack workflows and actions to contribute improvements.

Finally, a key focus in 2025 that continues in 2026 is expanding collaboration with other organizations managing sizable R package ecosystems. A notable emerging initiative is R-Multiverse, which explores large-scale collaborative package repositories built on R-Universe’s infrastructure. R-Multiverse aims to support curated collections with developmental and production branches, customizable checks, and manual governance steps—features critical to institutional collaboration and stable release practices. In addition, we are helping the Bioconductor project modernize their infrastructure and gradually offload some of their build maintenance to us. These efforts illustrate R-Universe’s evolving role not just as a repository, but as an interactive ecosystem management tool.

Multilingual work as infrastructure, not an add-on

In 2024, we framed multilingualism as foundational to open science, not supplementary. We committed to expanding translations, improving workflows, and supporting communities working in languages other than English.

In 2025, following our 2024 commitment, we completed the Portuguese translations of the rOpenSci Dev Guide and continued Spanish revisions of key content, blogs, program materials, and training.

We translated 32,794 words into Portuguese for the Dev Guide and published 14 blog posts on our website, bringing the total to 16 blog posts in Portuguese. In Spanish, we have 62 blog posts and all main website pages are fully translated. Additionally, we host blog content in other languages, including 8 posts in French and 1 in Turkish.

Beyond translation itself, we continued to treat multilingual publishing as infrastructure: refining AI-assisted workflows through the babeldown package; facilitating multilingual Quarto projects through the now peer-reviewed babelquarto package; enabling our website to support multilingual content; documenting processes and sharing lessons so that other open science communities can adapt and reuse this work.

We also hosted community calls, co-working sessions, and mini-hackathons in Spanish and Portuguese, offering our content and programming to Spanish- and Portuguese-speaking community members.
These events and resources reinforce rOpenSci’s commitment to multilingual participation and to meeting communities where they are.

In 2026, we will continue working on English and Spanish content generation and expand Portuguese content and processes, piloting a fully Portuguese software peer review and translating Champions Program Training material into Portuguese.

We will continue our collaborations with other organizations and groups that translate and localize R content and offer not only content but also other aspects of our programs in languages other than English.

The preface of the Dev Guide in English, Portuguese and Spanish

rOpenSci Dev Guide Preface in three languages

The Champions Program with Latin American Focus

One of our clearest goals for 2025 was to run the first fully Spanish-language cohort of the rOpenSci Champions Program, with a focus on Latin America.

The 2025 cohort launched with an improved Spanish curriculum, Spanish-language community calls, outreach activities, and mentoring. To accomplish this, we translated all training materials and the infrastructure and artifacts necessary to run a successful cohort into Spanish. The program emphasized sustainable research software as a pillar of open science in the region, while strengthening peer networks among Champions, mentors, and alumni.

Alongside the new cohort, we also evaluated the second cohort to improve the program and better understand its impact. We conducted anonymous surveys and 1:1 interviews with Champions and mentors and used their feedback to improve the program.

The analysis of the survey and interview information shows high satisfaction among participants, both mentors and Champions. All participants agreed that the program prepared Champions to develop their own R packages, improve the quality of their packages, and participate in the software peer review. They also expressed a desire to remain connected with rOpenSci, including participating in the Champions Program again in the future as mentors.

All Champions interviewed reported positive professional outcomes after participating in the program, ranging from employment opportunities, internships, grants, and conference scholarships to presentations at international and local events, generating interest in their project and the program.

Participants also provided constructive feedback on challenges they experienced, including staying in touch with other participants, defining the scope of the project, and time zones. In response, we made adjustments to the program to try to address these challenges for the 2025-2026 cohort.

In 2026, we will have a second cohort in Spanish and will take steps toward longer-term sustainability for the program: testing new formats, strengthening partnerships, and beginning follow-up work with earlier cohorts to better understand the program’s long-term impact.

World map showing applications levels by country, with a color gradient from dark purple (1) to yellow (30). North America, parts of South America, and Australia show higher values in yellow-green, while parts of Asia, and Africa appear in darker purple tones indicating lower values.

Countries of origin of applications to the Champions Program

Community participation and pathways into open source

Throughout 2025, our blog, newsletters, and social media outreach continued to amplify community voices. These highlighted not only software releases but also the people, practices, and contexts behind them, reinforcing our focus on community participation.

During 2025, we published 49 posts on the rOpenSci blog. These included 41 blog posts and 8 technotes, with 18 posts authored or co-authored by community members, reflecting the continued openness of the blog to contributions beyond the staff team and the great participation of our community. Also, 60 authors contributed to the blog, 39 of them writing for rOpenSci for the first time, an encouraging sign of ongoing renewal and engagement!

Our content was published in four languages: 13 posts were available in more than one language, and 16 posts were published in at least one non-English language, helping us reach a broader and more diverse audience.

We organized 3 community calls and 13 co-working sessions with 21 unique presenters and more than 200 attendees, with topics ranging from technical skills such as testing, version control, and package development to broader conversations about accessibility, contribution pathways, and getting to know other communities and organizations.

One of the goals for 2025 was to create more opportunities for people to engage with rOpenSci, especially first-time contributors.

In 2025, this took shape through mini-hackathons designed to lower barriers to participation. These events combined practical contribution opportunities with mentoring and social connection, reinforcing the idea that open source work is collaborative and learned through doing. We wrote a guide to share our lessons learned organizing this type of event. Hosting multiple events with this goal throughout the year helped emphasize that open source participation is not a one-off event, but an ongoing practice that grows with continued support.

We also attended and supported several other conferences and events throughout the year, including keynotes at useR!2025 and uRos2025, among several talk and workshops from our staff and community members. These events provided opportunities to connect with community members in person, share our work, and learn from others in the open science ecosystem.

In 2026, we plan to continue organizing community calls in multiple spoken languages, co-working sessions, and mini-hackathons. If you want to collaborate in any of these spaces, please get in touch with us. We also hope to see you in some of the R, open science, and open source conferences around the world: follow our events page and our newsletter to learn which ones we will attend!

We made progress on developing a prototype dashboard that aims to help organizations maintain complex systems of interdependent software components. Our initial prototype organizational dashboard provides insights into code contributors, their repositories, and maintenance status and needs. We are in conversation with several groups about adopting and adapting the dashboard to highlight the importance of communities in creating and maintaining software.

rOpenSci Staff and Community members presenting at several international conferences

rOpenSci Staff and Community members at international conferences

Strengthening ties with open science partners

In 2025, we continued to build and strengthen partnerships with organizations that share our commitment to open science, open source, and community-driven development.

We used our co-working sessions to connect rOpenSci members with like-minded communities, holding joint events with the Data Science Learning Community and The Carpentries. We also organized a discussion and networking event for and with current and former U.S. federal government data scientists with OpenScapes.

rOpenSci staff served on the R Consortium Infrastructure Steering Committee, R-Ladies global leadership, R-Multiverse administration, the CSIDNet Collaborative Committee, and The Carpentries Board of Directors, as well as the useR! 2025, useR! 2026, and LatinR organizing committees.

Together with a coalition of open science organizations, we held a convening to map out gaps in ecosystem sustainability and to start a long-term process of developing shared support approaches. Finally, as mentioned above, we are working with Bioconductor to make R-Universe a larger part of their core infrastructure.

pyOpenSci, rOpenSci, OLS, The Carpentries and PreReview Executive Teams and Community Managers

Executive Teams and Community Managers from pyOpenSci, rOpenSci, OLS, The Carpentries and PreReview at the Open Science Communities Convening in September 2025

What we learned

Looking back over 2025, we focused on being true to our core community values and delivering on commitments made the year before. Where plans evolved, they did so in response to capacity, collaboration, and community feedback.

Closing the loop between intention and action is a practice of accountability and learning. It helps us see which activities and strategies complement each other over time, which structures enable others to lead, and where care and sustainability matter most.

As we look ahead, we do so grounded in what 2025 demonstrated: that open science is built through steady, collective work, with Diversity, Equity, Inclusion, and Accessibility at the center. We believe that sharing what we plan to do has a greater impact if we return to show what we actually did, all the while staying true to the principles that hold this community together.


ChatGPT’ed Monte Carlo exam https://www.r-bloggers.com/2026/01/chatgpted-monte-carlo-exam/

Wed, 21 Jan 2026 23:26:54 +0000
http://xianblog.wordpress.com/?p=61810

This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers.

This semester I was teaching a graduate course on Monte Carlo methods at Paris Dauphine and I decided to experiment with how helpful ChatGPT would prove in writing the final exam. Given my earlier poor impressions, I did not have great expectations and ended up definitely impressed! In total it took me about as long as if I had written the exam by myself, since I went through many iterations, but the outcome was well-suited for my students (or at least for what I expected from my students). The starting point was providing ChatGPT with the articles of Giles on multi-level Monte Carlo and of Jacob et al. on unbiased MCMC, and the instruction to turn them into a two-hour exam. Iterations were necessary to break the questions into enough items and to reach the level of mathematical formalism I wanted. Plus add extra questions with R coding. And given the booklet format of the exam, I had to work on the LaTeX formatting (if not on the solution sheet, which spotted a missing assumption in one of my questions). Still a positive experiment I am likely to repeat for the (few) remaining exams I will have to produce!


Correcting for multiplicity in the ’emmeans’ package https://www.r-bloggers.com/2026/01/correcting-for-multiplicity-in-the-emmeans-package/

Wed, 21 Jan 2026 00:00:00 +0000
https://www.statforbiology.com/2026/stat_mcp_multivariatet/

This article was first published on R on Fixing the bridge between biologists and statisticians, and kindly contributed to R-bloggers.

In my recent book (see below), on page 166 and earlier, I made the point that, with pairwise comparisons and, more generally, whenever simultaneous statistical tests are performed, it is necessary to provide P-values that account for the familywise error rate, i.e. the probability of committing at least one incorrect rejection within the whole family of simultaneous tests (i.e. adjusted P-values). In this respect, it may be useful to recall that, for a single non-significant test, the comparison-wise error rate E_c is the probability of a wrong rejection for that single test (based on a non-adjusted P-value), whereas the probability of at least one wrong rejection within a family of k comparisons is much higher.
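
To make “much higher” concrete, here is a quick back-of-the-envelope calculation (added for illustration, and assuming independent tests at a comparison-wise rate of 0.05) showing how fast the familywise error rate grows with the number of tests:

# Familywise error rate for k independent tests, each run at alpha = 0.05:
# P(at least one false rejection) = 1 - (1 - 0.05)^k
k <- c(1, 3, 6, 10)
round(1 - (1 - 0.05)^k, 3)
## [1] 0.050 0.143 0.265 0.401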

With pairwise comparisons, a single test is usually based on the ratio between a difference and its standard error (a t-test), which is assumed to follow a univariate t-distribution when the null hypothesis is true. When several simultaneous t-tests are performed, the vector of all t-ratios can be assumed to follow a multivariate t-distribution under the hypothesis that the null is true for all simultaneous tests (Bretz et al., 2011). Therefore, adjusted P-values can be obtained by using the probability function of a multivariate t-distribution in place of the simple univariate t-distribution.

As an example, let us reconsider the ‘mixture’ data used in Chapter 9 of the main book. Three herbicide mixtures and an untreated control were tested for their weed-control ability against an important weed in tomato, namely Solanum nigrum. In the code below, we load the data and fit a one-way ANOVA model, using the weight of weed plants per pot as the response variable and the herbicide treatment as the explanatory factor. For the sake of simplicity, we omit the usual checks of the basic assumptions (see the main book). The ANOVA table shows that the treatment effect is significant and, therefore, we proceed to compare treatment means in a pairwise fashion. The P-values shown below do not account for the familywise error rate but only for the comparison-wise error rate; these P-values can be reproduced by using the probability function of a univariate Student’s t-distribution (pt() function in R).

library(statforbiology)
library(emmeans)
library(multcomp)
dataset <- getAgroData("mixture")
dataset$Treat <- factor(dataset$Treat)
model <- lm(Weight ~ Treat, data = dataset)
anova(model)
## Analysis of Variance Table
##
## Response: Weight
##           Df  Sum Sq Mean Sq F value    Pr(>F)
## Treat      3 1089.53  363.18  23.663 2.509e-05 ***
## Residuals 12  184.18   15.35
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
groupMeans <- emmeans(model, ~Treat)
tab <- contrast(groupMeans, method = "pairwise", adjust = "none")
tab
##  contrast                         estimate   SE df t.ratio p.value
##  Metribuzin__348 - Mixture_378        4.05 2.77 12   1.461  0.1697
##  Metribuzin__348 - Rimsulfuron_30    -7.68 2.77 12  -2.774  0.0168
##  Metribuzin__348 - Unweeded         -17.60 2.77 12  -6.352  <.0001
##  Mixture_378 - Rimsulfuron_30       -11.73 2.77 12  -4.235  0.0012
##  Mixture_378 - Unweeded             -21.64 2.77 12  -7.813  <.0001
##  Rimsulfuron_30 - Unweeded           -9.91 2.77 12  -3.578  0.0038
#
# The P-value is obtained from the univariate t distribution (two-tails test)
abst <- abs(as.data.frame(tab)$t.ratio)
2 * pt(abst, 12, lower.tail = FALSE)
## [1] 1.696785e-01 1.683167e-02 3.651239e-05 1.157189e-03 4.782986e-06
## [6] 3.794451e-03

In order to obtain familywise error rates, we should switch from the univariate to the multivariate t-distribution. For example, let’s consider the first t-ratio in the previous Code Box (t = 1.461). We should ask ourselves: “what is the probability of obtaining a t-ratio as extreme as, or more extreme than, 1.461 from a multivariate t-distribution with six dimensions (i.e., the number of simultaneous tests)?”. In this calculation, we must also consider that the 6 tests are correlated, at least to some extent, because they share some common elements, for example, the same error term in the denominator. In the simplest case (homoscedasticity and balanced data), this correlation is equal to 0.5 for all pairwise comparisons.

In earlier times, when the computing power was limited, calculating probabilities from the multivariate t-distribution was a daunting task. However, for some specific cases (e.g., linear models with homoscedastic and balanced data), adjusted P-values could be obtained by exploiting the distribution of the Studentised Range (the so-called ‘tukey’ method), which is the default option in the contrast() function of the emmeans package, as shown in the following Code box.

tab <- contrast(groupMeans, method = "pairwise")
# tab <- contrast(groupMeans, method = "pairwise", adjust = "tukey") # same as above
tab
##  contrast                         estimate   SE df t.ratio p.value
##  Metribuzin__348 - Mixture_378        4.05 2.77 12   1.461  0.4885
##  Metribuzin__348 - Rimsulfuron_30    -7.68 2.77 12  -2.774  0.0698
##  Metribuzin__348 - Unweeded         -17.60 2.77 12  -6.352  0.0002
##  Mixture_378 - Rimsulfuron_30       -11.73 2.77 12  -4.235  0.0055
##  Mixture_378 - Unweeded             -21.64 2.77 12  -7.813  <.0001
##  Rimsulfuron_30 - Unweeded           -9.91 2.77 12  -3.578  0.0173
##
## P value adjustment: tukey method for comparing a family of 4 estimates
# The P-value is obtained from the Studentised Range Distribution (two-tails test)
abst <- abs(as.data.frame(tab)$t.ratio)
ptukey(sqrt(2) * abst, 4, 12, lower.tail = FALSE)
## [1] 4.884620e-01 6.981178e-02 1.853807e-04 5.501451e-03 2.473776e-05
## [6] 1.725725e-02

This simple method yields exact familywise error rates with balanced data—which represent the vast majority of designed field experiments in agriculture—and performs reasonably well in the presence of small degrees of imbalance. Within the framework of traditional multiple-comparison testing procedures, the approach described above leads to the same results as Tukey’s HSD for balanced data and the Tukey–Kramer test for unbalanced data.
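
As a quick cross-check (not part of the original analysis), base R’s TukeyHSD() applied to the equivalent aov() fit should reproduce the adjusted P-values obtained above for these balanced data:

# Tukey's HSD via base R, using the same model formula and data as above.
TukeyHSD(aov(Weight ~ Treat, data = dataset))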

More recently, it has become possible to directly calculate probabilities from the multivariate t-distribution, which is particularly convenient because it provides a more general approach to obtaining familywise error rates. This distribution is implemented in the ‘mvtnorm’ package through the pmvt() function. To perform the calculation, we must specify, for each dimension, the interval over which the probability is to be computed (in this case, for the first t-ratio, the interval is ±1.461081), the number of degrees of freedom (12), and the correlation matrix of the linear combinations, which can be directly retrieved from the ‘emmGrid’ object. The code below illustrates these calculations. The quantity ‘plev’ represents the probability of sampling within the interval (i.e. none of the six null hypotheses is wrongly rejected), whereas the familywise error rate corresponds to the probability of sampling outside the interval (i.e. at least one null hypothesis is wrongly rejected), which is obtained by subtraction.

library(mvtnorm)
t1 <- abs(as.data.frame(tab)$t.ratio)[1]
ncontr <- 6
corMat <- cov2cor(vcov(tab))
plev <- pmvt(lower = rep(-t1, ncontr), upper=rep(t1, ncontr), df = 12,
     corr = corMat)[1]
1 - plev
## [1] 0.4883843

In R, such an approach can be obtained by using the adjust = "mvt" argument.

tab <- contrast(groupMeans, method = "pairwise", adjust = "mvt")
tab
##  contrast                         estimate   SE df t.ratio p.value
##  Metribuzin__348 - Mixture_378        4.05 2.77 12   1.461  0.4885
##  Metribuzin__348 - Rimsulfuron_30    -7.68 2.77 12  -2.774  0.0698
##  Metribuzin__348 - Unweeded         -17.60 2.77 12  -6.352  0.0002
##  Mixture_378 - Rimsulfuron_30       -11.73 2.77 12  -4.235  0.0054
##  Mixture_378 - Unweeded             -21.64 2.77 12  -7.813  <.0001
##  Rimsulfuron_30 - Unweeded           -9.91 2.77 12  -3.578  0.0172
##
## P value adjustment: mvt method for 6 tests

The above function employs numerical integration methods and is based on simulation; consequently, the results are not fully reproducible. However, it is easy to see that these results are asymptotically equivalent to those obtained with the Tukey adjustment method shown above. Owing to this intrinsic complexity, the use of the adjust = "mvt" argument is not recommended for pairwise comparisons in balanced experiments, whereas it may prove useful in other situations, for example in the presence of strongly unbalanced data.
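
If exact repeatability matters, one possible workaround (under the assumption that the underlying mvtnorm routines draw from R’s random-number stream) is to fix the seed immediately before the call:

# Fixing the seed just before the "mvt" adjustment should make the
# simulated adjusted P-values repeatable across runs.
set.seed(1234)
contrast(groupMeans, method = "pairwise", adjust = "mvt")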

Thanks for reading—and don’t forget to check out my new book below!

Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
Send comments to: andrea.onofri@unipg.it

Book cover

References

  1. Bretz, F., Hothorn, T., Westfall, P., 2011. Multiple comparisons using R. CRC Press, Boca Raton, FL.


Why Submit to AI in Production: Speaking as a Tool for Better Work https://www.r-bloggers.com/2026/01/why-submit-to-ai-in-production-speaking-as-a-tool-for-better-work/

Tue, 20 Jan 2026 23:59:00 +0000
https://www.jumpingrivers.com/blog/why-submit-ai-in-production/

This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers.

We’re accepting abstracts for AI in Production until 23rd January. The conference takes place on 4th–5th June 2026 in Newcastle, with talks on Friday 5th across two streams: one focused on engineering and production systems, the other on machine learning and model development.

We often hear: “My work isn’t ready to talk about yet” or “I’m not sure anyone would be interested.” We want to address that hesitation directly.

Speaking at a conference isn’t primarily about promoting yourself or your organisation.

It’s a practical tool that helps you do better work. Preparing and delivering a talk forces useful reflection, invites feedback from people facing similar challenges, and turns knowledge that lives only in your head into something your team can reuse.

If you’re wondering whether your work qualifies: internal systems count, work in progress counts, partial success counts.

Submit your abstract by 23rd January on the AI in Production website.

Preparing a Talk Clarifies Your Decisions

When you sit down to explain a technical choice to an audience, you have to answer questions you might have glossed over at the time: Why did we build it this way? What constraints shaped our approach? What would we do differently now?

This isn’t about justifying your decisions to others. It’s about understanding them yourself. The process of turning a production system into a coherent narrative forces you to see patterns you were too close to notice while building it. You identify what worked, what didn’t, and why. That clarity is valuable whether or not you ever give the talk.

Many practitioners find that writing an abstract or outline reveals gaps in their thinking. A deployment strategy that seemed obvious in context becomes harder to explain once that context is stripped away. A monitoring approach that felt pragmatic reveals underlying assumptions. This friction is useful. It means you’re learning something about your own work.

Speaking Invites Useful Feedback

The audience at AI in Production will broadly fall across two streams: engineering (building, shipping, maintaining, and scaling systems) and machine learning (model development, evaluation, and applied ML).

Whether you’re working on infrastructure and deployment or on training pipelines and model behaviour, you’ll be in a room with people facing similar constraints: limited resources, shifting requirements, imperfect data, and operational pressures.

When you share what you’ve tried, you get feedback from people who understand the context. Someone has solved a similar problem differently. Someone has run into the same failure mode. Someone asks a question that makes you reconsider an assumption.

This kind of peer feedback is hard to get otherwise. Your team is too close to the work. Online discussions lack context. A conference talk puts your approach in front of people who can offer informed perspectives without having to understand your entire stack or organisational structure first.

Talks Help Share Responsibility and Knowledge

In many teams, knowledge about production systems sits with one or two people. They know why certain decisions were made, where the edge cases are, and how to interpret the monitoring dashboards. That concentration of knowledge creates risk.

Preparing a talk is a forcing function for documentation. To explain your system to strangers, you have to articulate what’s currently tacit. That articulation becomes something your team can use: onboarding material, decision records, runbooks.

Speaking also distributes responsibility. When you present work publicly, it stops being just yours. Your team shares ownership of the ideas. Others can critique, extend, or maintain them. This is particularly valuable for platform teams or infrastructure work, where the people who built something may not be the ones operating it six months later.

Turning Tacit Knowledge into Reusable Material

Much of what you know about your production systems isn’t written down. You understand the failure modes, the workarounds, and the operational quirks. You know which metrics matter and which are noise. You remember why you made certain tradeoffs.

A conference talk is an excuse to capture that knowledge. The slides become a reference. The abstract becomes a design document. The Q&A reveals what wasn’t clear and needs better documentation.

Even if the talk itself is ephemeral, the process of preparing it leaves artefacts. You’ve already done the hard work of running the system. Speaking about it turns that experience into something others can learn from, and you can build on.

Your Work Is Worth Sharing

If you’re maintaining AI systems in production, you’re solving problems worth talking about: making models reliable under load, keeping training pipelines maintainable, monitoring behaviour when ground truth is delayed or absent, and managing technical debt while shipping features.

These are the problems practitioners face every day. Your approach won’t be perfect, and that’s the point. Talks about work in progress, about things that didn’t work, about compromises made under constraint are often more useful than polished success stories.

We’re looking for honest accounts of how people are actually building and operating AI systems. That might fit the engineering stream (deployment, infrastructure, monitoring, scaling) or the machine learning stream (training, evaluation, model behaviour, responsible data use). If you’re doing work in either area, you have something to contribute.

Submit an Abstract

The deadline is 23rd January. You’ll need a title and an abstract of up to 250 words. You don’t need a perfect story or a finished project. You need a problem you’ve worked on, some approaches you’ve tried, and some lessons you’ve learned.

Think about what would be useful for someone six months behind you on a similar path. Think about what you wish someone had told you before you started. Think about the conversation you’d want to have with peers who understand the constraints you’re working under.

If you’re not sure where to start, consider writing about one decision that shaped your system, one assumption that turned out to be wrong, or one constraint that changed your design. Good abstracts often start with a specific moment or choice rather than a broad overview.

Ready to submit? The deadline is 23rd January. Share one decision, one lesson, or one constraint from your production work:
https://jumpingrivers.com/ai-production/

If you have questions about whether your work fits the conference, reach out at events@jumpingrivers.com. We’re here to help make this easier.



Bioinformatics Analysis on Posit Connect Cloud with freeCount https://www.r-bloggers.com/2026/01/bioinformatics-analysis-on-posit-connect-cloud-with-freecount/

Tue, 20 Jan 2026 04:52:28 +0000
http://morphoscape.wordpress.com/?p=3181

This article was first published on R – Myscape, and kindly contributed to R-bloggers.

Overview

The easiest way to use the freeCount R Shiny applications online is through Posit Connect Cloud, which is an online platform that simplifies the deployment of data applications and documents.

freeCount

The freeCount analysis framework provides a modular set of tools and tutorials for a structured approach to biological count data analysis. Users are guided through common data assessment, processing and analysis approaches.

The analysis tools currently available include differential expression and network analysis, among others.

Steps

The following steps will walk you through how to run the freeCount apps online using Posit Connect Cloud.

  1. Navigate to https://connect.posit.cloud/elizabethbrooks?search=freeCount
  2. Select the app that you want to run and click its name or image to open it.
  3. Wait for the project to deploy in your Posit Connect Cloud workspace and for the app to launch.
  4. Done! Now you are able to perform the selected analysis.


Analysis Tutorials

The freeCount apps provide a set of common tools for analyzing biological data, including differential expression and network analysis. We have tutorials available to guide users through a structured analysis approach:



Introducing distionary for Building and Probing Distributions https://www.r-bloggers.com/2026/01/introducing-distionary-for-building-and-probing-distributions/

Tue, 20 Jan 2026 00:00:00 +0000
https://ropensci.org/blog/2026/01/20/introducing-distionary/

This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers.

After passing through rOpenSci peer review, the distionary package is now newly available on CRAN. It allows you to make probability distributions quickly – either from a few inputs or from its built-in library – and then probe them in detail.

These distributions form the building blocks that piece together advanced statistical models with the wider probaverse ecosystem, which is built to release modelers from low-level coding so production pipelines stay human-friendly. Right now, the other probaverse packages are distplyr, allowing you to morph distributions into new forms, and famish, allowing you to tune distributions to data. Developed with risk analysis use cases like climate and insurance in mind, the same tools translate smoothly to simulations, teaching, and other applied settings.

This post highlights the top 3 features of this youngest version of distionary. Let’s start by loading the package.

library(distionary)

Feature 1: more than just Base R distributions

Of course, all the Base R distributions are available in distionary. Here’s everyone’s favourite Normal distribution.

dst_norm(0, 1)

Normal distribution (continuous)
--Parameters--
mean sd
0 1

plot(dst_norm(0, 1))
Normal distribution density.

And good old Poisson.

dst_pois(3)

Poisson distribution (discrete)
--Parameters--
lambda
3

plot(dst_pois(3))
Poisson distribution probability mass function.

But there are additional game-changing distributions included, too.

A Null distribution, which always evaluates to NA. When you’re running an algorithm that encounters an issue, you can return a Null distribution instead of throwing an error. Even downstream evaluation steps won’t error out because the code still sees a distribution rather than a bare NA or NULL.

# Make a Null distribution.
null <- dst_null()
# Null distributions always evaluate to NA.
eval_quantile(null, at = c(0.25, 0.5, 0.75))

[1] NA NA NA

mean(null)

[1] NA

Empirical distributions, where the data are the distribution. These respect observed behaviour without forcing a specific shape, and are also commonly used as a benchmark for comparison against other models. Here’s an example using the Ozone concentration from the airquality dataset that comes loaded with R.

# Empirical distribution of Ozone from the `airquality` dataset.
emp <- dst_empirical(airquality$Ozone, na_action_y = "drop")
# Inspect
print(emp, n = 5)

Finite distribution (discrete)
--Parameters--
# A tibble: 67 × 2
outcomes probs
<int> <dbl>
1 1 0.00862
2 4 0.00862
3 6 0.00862
4 7 0.0259
5 8 0.00862
# ℹ 62 more rows

Compare its cumulative distribution function (CDF) to that of a Gamma distribution fitted to the Ozone levels, borrowing the probaverse’s famish package for the fitting task.

# Fit a Gamma distribution to Ozone using the famish package.
library(famish)
gamma <- fit_dst_gamma(airquality$Ozone, na_action = "drop")

# Plot the cumulative distribution functions (CDFs) together.
plot(emp, "cdf", n = 1000, xlab = "Ozone Levels (ppb)")
plot(gamma, "cdf", add = TRUE, col = "red")
legend(
 "bottomright",
 legend = c("Empirical", "Fitted Gamma"),
 col = c("black", "red"),
 lty = 1
)
Comparison of Empirical CDF and fitted Gamma CDF for Ozone levels.

These textbook distributions become much more useful once they become building blocks for building up a system. For example, they could form predictive distributions in a machine learning context, or be related to other variables. This is what the probaverse seeks to make possible.

Feature 2: friendly towards tidy tabular workflows

First, load the tidyverse to activate tidy tabular workflows. And yes, probaverse is named after the tidyverse because it aims to be a “tidyverse for probability”.

library(tidyverse)

You can safely ignore this next chunk unless you want to see how I’m wrangling some financial data for you.

# Wrangle the stocks data frame using tidyverse.
stocks <- as_tibble(EuStockMarkets) |>
 mutate(across(everything(), \(x) 100 * (1 - x / lag(x)))) |>
 drop_na()

The stocks data I’ve wrangled is a table of daily percent loss for four major European stock indices. The dates don’t matter for this example, so they’ve been omitted.

stocks

# A tibble: 1,859 × 4
DAX SMI CAC FTSE
<dbl> <dbl> <dbl> <dbl>
1 0.928 -0.620 1.26 -0.679
2 0.441 0.586 1.86 0.488
3 -0.904 -0.328 0.576 -0.907
4 0.178 -0.148 -0.878 -0.579
5 0.467 0.889 0.511 0.720
6 -1.25 -0.676 -1.18 -0.855
7 -0.578 -1.23 -1.32 -0.824
8 0.287 0.358 0.193 -0.0837
9 -0.637 -1.11 -0.0171 0.522
10 -0.118 -0.437 -0.314 -1.41
# ℹ 1,849 more rows

First, let’s focus on the DAX stock index. Fit an empirical distribution like last time (notice I’m using a data mask1 in dst_empirical() this time).

# Fit an empirical distribution to the DAX stock index.
dax <- dst_empirical(DAX, data = stocks)
# Inspect the CDF.
plot(dax, xlab = "Daily Loss (%)")
Empirical CDF of DAX stock index daily losses.

You can easily calculate some standard quantiles in tabular format so that the inputs are placed alongside the calculated outputs: just use the enframe_ prefix instead of eval_ as we did above with the Null distribution.

enframe_quantile(dax, at = c(0.25, 0.5, 0.75), arg_name = "prob")

# A tibble: 3 × 2
prob quantile
<dbl> <dbl>
1 0.25 -0.638
2 0.5 -0.0473
3 0.75 0.468

Or, more to the point here – and appealing to probaverse’s soft spot for risk-focused work – you can calculate return levels (also known as “Value at Risk” in financial applications) for specific return periods. If you don’t know what these are, they are just fancy names for quantiles.

return_periods <- c(5, 50, 100, 200, 500)
enframe_return(
 dax,
 at = return_periods,
 arg_name = "return_period",
 fn_prefix = "daily_loss_pct"
)

# A tibble: 5 × 2
return_period daily_loss_pct
<dbl> <dbl>
1 5 0.621
2 50 2.17
3 100 2.75
4 200 3.08
5 500 3.71
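
If distionary follows the usual exceedance convention, the return level for period T is simply the (1 - 1/T) quantile, so you can sanity-check the table above against eval_quantile(). This is an illustrative check, assuming that convention holds:

# Return levels should line up with the corresponding upper quantiles.
eval_quantile(dax, at = 1 - 1 / return_periods)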

The tabular output becomes even more powerful when inserted into a table of models, because it facilitates comparisons and trends. To demonstrate, build a model for each stock. First, lengthen the data for this task.

# Lengthen the data using tidyverse.
stocks2 <- pivot_longer(
 stocks,
 everything(),
 names_to = "stock",
 values_to = "daily_loss_pct"
)
# Inspect
stocks2

# A tibble: 7,436 × 2
stock daily_loss_pct
<chr> <dbl>
1 DAX 0.928
2 SMI -0.620
3 CAC 1.26
4 FTSE -0.679
5 DAX 0.441
6 SMI 0.586
7 CAC 1.86
8 FTSE 0.488
9 DAX -0.904
10 SMI -0.328
# ℹ 7,426 more rows

Build a model for each stock using a group_by + summarise workflow from the tidyverse (please excuse the current need to wrap the distribution in list()). Notice that distributions become table entries, indicated here by their class <dst>.

# Create an Empirical distribution for each stock.
models <- stocks2 |>
 group_by(stock) |>
 summarise(model = list(dst_empirical(daily_loss_pct)))
# Inspect
models

# A tibble: 4 × 2
stock model
<chr> <list>
1 CAC <dst>
2 DAX <dst>
3 FTSE <dst>
4 SMI <dst>

Now you can use a tidyverse workflow to calculate tables of quantiles for each model, and expand them. In fact, this workflow is common enough that I’m considering adding a dedicated verb for it.

return_levels <- models |>
 mutate(
 df = map(
 model,
 enframe_return,
 at = return_periods,
 arg_name = "return_period",
 fn_prefix = "daily_loss_pct"
 )
 ) |>
 unnest(df) |>
 select(!model)
# Inspect
print(return_levels, n = Inf)

# A tibble: 20 × 3
stock return_period daily_loss_pct
<chr> <dbl> <dbl>
1 CAC 5 0.757
2 CAC 50 2.37
3 CAC 100 2.78
4 CAC 200 3.41
5 CAC 500 3.97
6 DAX 5 0.621
7 DAX 50 2.17
8 DAX 100 2.75
9 DAX 200 3.08
10 DAX 500 3.71
11 FTSE 5 0.542
12 FTSE 50 1.58
13 FTSE 100 2.05
14 FTSE 200 2.31
15 FTSE 500 2.87
16 SMI 5 0.552
17 SMI 50 2.03
18 SMI 100 2.52
19 SMI 200 2.91
20 SMI 500 3.55

The result is a tidy dataset that’s ready for most analyses. For example, you can easily plot a comparison of the return levels of each stock. I make these plots all the time to facilitate risk-informed decision-making.

return_levels |>
 mutate(stock = fct_reorder2(stock, return_period, daily_loss_pct)) |>
 ggplot(aes(return_period, daily_loss_pct, colour = stock)) +
 geom_point() +
 geom_line() +
 theme_bw() +
 scale_x_log10(
 "Return Period (days)",
 minor_breaks = c(1:10, 1:10 * 10, 1:10 * 100)
 ) +
 scale_y_continuous("Daily Loss", label = scales::label_number(suffix = "%")) +
 annotation_logticks(side = "b") +
 scale_colour_discrete("Stock Index")
Return Level Plot for Daily Loss Percentages of Stock Indices.

Feature 3: make the distribution you need

You can create your own distributions with distionary by specifying only a minimal set of properties; all other properties are derived automatically and can be retrieved when needed.

Let’s say you need an Inverse Gamma distribution but it’s not available in distionary. Currently, distionary assumes you’ll at least provide the density and CDF; you could retrieve these from the extraDistr package (functions dinvgamma() and pinvgamma()). Plug them into distionary’s distribution() function and enjoy access to a variety of properties you didn’t specify, like the mean, variance, skewness, and hazard function.

# Make an Inverse Gamma distribution (minimal example).
ig <- distribution(
 density = function(x) extraDistr::dinvgamma(x, alpha = 5, beta = 20),
 cdf = function(x) extraDistr::pinvgamma(x, alpha = 5, beta = 20),
 .vtype = "continuous",
)
# Calculate anything.
mean(ig)

[1] 5

variance(ig)

[1] 8.333333

skewness(ig)

[1] 3.464085

plot(ig, "hazard", to = 20, n = 1000, xlab = "Outcome")
Hazard function of an Inverse Gamma distribution.

You might also consider giving the distribution a .name – it pays off when you’re juggling multiple models. Adding .parameters provides additional specificity alongside the .name, but the parameters are not otherwise used for functional purposes yet.

Here is a more complete implementation of the Inverse Gamma distribution, this time implemented as a function of the two parameters. Notice I also check that the parameters are positive (cheers to the checkmate package).

dst_invgamma <- function(alpha, beta) {
 checkmate::assert_number(alpha, lower = 0)
 checkmate::assert_number(beta, lower = 0)
 distribution(
 density = \(x) extraDistr::dinvgamma(x, alpha = alpha, beta = beta),
 cdf = \(x) extraDistr::pinvgamma(x, alpha = alpha, beta = beta),
 quantile = \(p) extraDistr::qinvgamma(p, alpha = alpha, beta = beta),
 random = \(n) extraDistr::rinvgamma(n, alpha = alpha, beta = beta),
 .name = "Inverse Gamma",
 .parameters = list(alpha = alpha, beta = beta),
 .vtype = "continuous"
 )
}

Now we can make that same Inverse Gamma distribution as before:

ig2 <- dst_invgamma(5, 20)
# Inspect
ig2

Inverse Gamma distribution (continuous)
--Parameters--
alpha beta
5 20

By the way, this feature – being able to inspect other distribution properties even when they are not specified – is great for learning about probability. That’s because you can see the many ways distributions can be represented, not just by the usual density or probability mass functions seen in textbooks.

This feature also allows for extensibility of the probaverse. For example, the probaverse’s distplyr package creates mixture distributions, which do not have an explicit formula for the quantile function. However, this is not problematic – the distribution can still be defined, and distionary will figure out what the quantiles are.
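
As a quick illustration, the ig distribution defined earlier was never given a quantile function, yet its quantiles can still be evaluated; presumably distionary derives them from the CDF you supplied:

# Quantiles of the Inverse Gamma distribution, even though none were specified.
eval_quantile(ig, at = c(0.25, 0.5, 0.75))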

What’s to come?

Currently, the distionary package provides key functionality to define and evaluate distribution objects. Future goals include:

If this excites you, join the conversation by opening an issue or contributing.

Special thanks to the rOpenSci reviewers Katrina Brock and Christophe Dutang for insightful comments that improved this package. Also thanks to BGC Engineering Inc., the R Consortium, and the European Space Agency together with the Politecnico di Milano for supporting this project.


  1. Meaning I’m referring directly to the column ‘DAX’ without stocks$ as in our above examples. ↩

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci – open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: Introducing distionary for Building and Probing Distributions



Pharmaverse and Containers https://www.r-bloggers.com/2026/01/pharmaverse-and-containers/

Sun, 18 Jan 2026 00:00:00 +0000
https://pharmaverse.github.io/blog/posts/2026-01-18_pharmaverse_containers/pharmaverse_and__containers.html

Streamlining Our Pharmaverse Blog: Reducing Publishing Time with containers
As an active contributor to the pharmaverse blog, I’ve always appreciated the opportunity to share new insights and tools with our community. The pharmaverse blog has…

Continue reading: Pharmaverse and Containers
[This article was first published on pharmaverse blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Streamlining Our Pharmaverse Blog: Reducing Publishing Time with containers

As an active contributor to the pharmaverse blog, I’ve always appreciated the opportunity to share new insights and tools with our community. The pharmaverse blog has some interesting features for publishing, and while the publishing process has been effective, I wondered if there was a way to optimize our workflows. Until recently, the CI/CD pipeline for publishing the blog typically took about 17 minutes to deploy a new post. Containers are often suggested as a solution, but I was unsure how to create a new container/image to meet my needs and relied on plain install.packages() calls in the CI/CD pipeline. Luckily, I crossed paths with the fabulous Maciej Nasinski, and together we built a container specifically for the pharmaverse blog’s publishing process, which allowed for a notable reduction in publishing time.

Below I will discuss how the pharmaverse container image has improved our blog’s publishing workflow, bringing our deployment time down to approximately 5 minutes. We are also interested in feedback on potential other uses of this container (like devcontainers) or building additional containers for certain purposes. For those interested, we would be happy to provide a tutorial on containers or get in touch if you have ideas or want to get involved!

The Previous Approach: Package Installation overhead

Our prior CI/CD (GitHub Actions) workflow for building and deploying the pharmaverse blog, while comprehensive, included a time-consuming step. It used a straightforward “Install dependencies” step from the r-lib actions, which necessitated installing a range of pharmaverse-specific R packages during each run. As we added more pharmaverse packages to the blog, this became really cumbersome!

The relevant section of our old build-deploy job highlighted this:

  build-deploy:
    # ... other configurations ...
    steps:
      # ... checkout and setup Quarto ...
      - name: Setup R
        uses: r-lib/actions/setup-r@v2
        with:
          use-public-rspm: true

      - name: Install dependencies
        uses: r-lib/actions/setup-r-dependencies@v2
        with:
          packages: |
            jsonlite
            tidyverse
            spelling
            janitor
            diffdf
            admiral
            admiralonco
            # ... and many, many more packages ...
            haven
      # ... other steps like install tinytex and publish ...

This “Install dependencies” step, involving a substantial list of pharmaverse packages, was a primary contributor to the 17-minute execution time. Each workflow run involved downloading and configuring these packages, extending the overall deployment duration.

Adopting the pharmaverse container image: An Efficient Alternative

The solution to this challenge came with the introduction of the pharmaverse container image: ghcr.io/pharmaverse/docker_pharmaverse:4.5.1. This container image was specifically designed for pharmaceutical data analysis, featuring over 40 essential pharmaverse packages pre-installed. These packages cover various functionalities, including CDISC ADaM/SDTM, clinical trial reporting, and regulatory submissions. Built upon the rocker/tidyverse image and incorporating R 4.5.1, it provides a pre-configured environment.
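
If you want to confirm what the image ships with before wiring it into a pipeline, one option is to start R inside the container (for rocker-based images, appending R to docker run drops you into an R session) and check a few packages. The names below are just a handful from the old workflow’s install list; the full set is much larger:

# Run inside the container, e.g.:
#   docker run --rm -it ghcr.io/pharmaverse/docker_pharmaverse:4.5.1 R
installed <- rownames(installed.packages())
c("admiral", "admiralonco", "tidyverse", "haven") %in% installed
R.version.string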

By integrating this image into our CI/CD pipeline (GitHub Actions), we could bypass the extensive package installation phase.

Here’s an overview of our updated build-deploy job:

  build-deploy:
    needs: Update-post-dates
    runs-on: ubuntu-latest
    container:
      image: "ghcr.io/pharmaverse/docker_pharmaverse:4.5.1"
    permissions:
      contents: write
    steps:
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Configure Git safe directory
        run: git config --global --add safe.directory /__w/blog/blog

      - name: Set up Quarto
        uses: quarto-dev/quarto-actions/setup@v2
        with:
            version: 1.9.12

      # NOTE: Explicit R package installation is no longer required here.

      - name: Install tinytex
        run: quarto install tool tinytex

      - name: Mark repo directory as safe
        run: git config --global --add safe.directory /__w/blog/blog

      - name: Publish
        uses: quarto-dev/quarto-actions/publish@v2
        with:
          path: . # Path to your .qmd file
          target: gh-pages  # Target branch for GitHub Pages
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

The outcome of this change has been a consistent reduction in publishing time. Our blog now publishes in approximately 5 minutes. Yay!

Wider Applications of the Pharmaverse Container Image

While this update directly benefits our blog’s publishing, the pharmaverse container image offers broader utility. It is a valuable resource for:

  • Pharmaceutical data analysis: Supporting tasks related to ADaM, SDTM, and TLFs.
  • Clinical trial programming: Assisting with regulatory submissions.
  • Reproducible research: Providing a consistent and verifiable environment.
  • Training & education: Offering a ready-to-use setup for educational purposes.
  • CI/CD pipelines: Enhancing automated testing and reporting, as observed in our own workflow.
  • Collaborative development: Facilitating consistent environments across teams.

Example: Running Reproducible RStudio Locally

To spin up a local RStudio instance using the pharmaverse image, you can use a container “one-liner.” This command condenses the entire configuration – networking, storage, and background execution – into a single execution string. It bridges your local folder to the container, allowing you to edit files “on the fly” without installing R or packages on your actual machine.

Choose the command that matches your operating system:

1. Linux & Intel Macs (Standard)

For standard architecture, we set a simple password (rstudio) and mount the current directory.

docker run -d --name my_pharma_rstudio --rm -p 8787:8787 -e PASSWORD=rstudio -v "$(pwd)":/home/rstudio/project ghcr.io/pharmaverse/docker_pharmaverse:4.5.1

2. Apple Silicon (M1/M2/M3)

Note: We do not yet natively support ARM64 architecture (e.g., Apple Silicon). The command below uses emulation (--platform linux/amd64) to run the image. We also disable authentication because the slower emulation speed can sometimes cause password setup to time out.

docker run -d --name my_pharma_rstudio --rm --platform linux/amd64 -p 8787:8787 -e DISABLE_AUTH=true -v "$(pwd)":/home/rstudio/project ghcr.io/pharmaverse/docker_pharmaverse:4.5.1

What does this command do?

  • --rm & -d: Runs the container in the background (detached) and automatically removes it when stopped, keeping your machine clean.
  • -v "$(pwd)":...: Takes your Present Working Directory and mounts it inside the container. Any file you save in RStudio’s “project” folder is instantly saved to your local computer.
  • --platform linux/amd64 (Mac only): Forces your computer to emulate the Intel architecture required by the image.
  • -e DISABLE_AUTH=true (Mac only): Bypasses the login screen to ensure immediate access despite slower emulation speeds.

Accessing RStudio

Once the command is running:

  1. Open your browser to http://localhost:8787.
  2. Linux/Intel: Log in with user rstudio and password rstudio.
  3. Apple Silicon: You will be logged in automatically.

You will see your local files mapped to the project folder in the Files pane, ready for analysis.

Engaging with the Pharmaverse Community

The pharmaverse container image represents a collaborative effort within the life sciences industry to provide open-source tools. For those working with pharmaverse packages and R in development, research, or CI/CD contexts, exploring this image may offer practical advantages in efficiency and reproducibility.

Explore the pharmaverse container image repository: pharmaverse/docker_pharmaverse Discover more about Pharmaverse: pharmaverse.org

We acknowledge the contributions of the pharmaverse community and the Rocker Project for their support in developing these resources.

Last updated

2026-01-18 22:21:19.079894

Citation

BibTeX citation:
@online{straub_and_maciej_nasinski2026,
  author = {Straub, Ben and Nasinski, Maciej},
  title = {Pharmaverse and {Containers}},
  date = {2026-01-18},
  url = {https://pharmaverse.github.io/blog/posts/2026-01-18_pharmaverse_containers/pharmaverse_and__containers.html},
  langid = {en}
}
For attribution, please cite this work as:
Straub, Ben, and Maciej Nasinski. 2026. “Pharmaverse and Containers.” January 18, 2026. https://pharmaverse.github.io/blog/posts/2026-01-18_pharmaverse_containers/pharmaverse_and__containers.html.
To leave a comment for the author, please follow the link and comment on their blog: pharmaverse blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: Pharmaverse and Containers



Volleyball Analytics with R: The Complete Guide to Match Data, Sideout Efficiency, Serve Pressure, Heatmaps, and Predictive Models https://www.r-bloggers.com/2026/01/volleyball-analytics-with-r-the-complete-guide-to-match-data-sideout-efficiency-serve-pressure-heatmaps-and-predictive-models/

Sat, 17 Jan 2026 18:38:12 +0000
https://rprogrammingbooks.com/?p=2407

Volleyball Analytics Volleyball Analytics with R: A Practical, End-to-End Playbook Build a full volleyball analytics workflow in R: data collection, cleaning, scouting reports, skill KPIs, rotation/lineup analysis, sideout & transition, serve/receive, visualization, dashboards, and predictive modeling. Table of Contents Why Volleyball Analytics (and Why R) Volleyball Data Model: Events, …

Continue reading: Volleyball Analytics with R: The Complete Guide to Match Data, Sideout Efficiency, Serve Pressure, Heatmaps, and Predictive Models
[This article was first published on Blog – R Programming Books, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Volleyball Analytics

Volleyball Analytics with R: A Practical, End-to-End Playbook

Build a full volleyball analytics workflow in R: data collection, cleaning, scouting reports, skill KPIs, rotation/lineup analysis, sideout & transition, serve/receive, visualization, dashboards, and predictive modeling.

Why Volleyball Analytics (and Why R)

Volleyball is a sequence of discrete events (serve, pass, set, attack, block, dig) organized into rallies and phases (sideout vs. transition). This structure makes it ideal for: event-based analytics, rotation analysis, scouting tendencies, expected efficiency modeling, and win probability.

R excels at this because of tidy data workflows (dplyr/tidyr), great visualization (ggplot2), modern modeling (tidymodels, brms), and easy reporting (Quarto/R Markdown). If you want a repeatable volleyball analytics pipeline for your club or team, R is a perfect fit.

Keywords you should care about

  • Sideout % (SO%), Break Point % (BP%), Transition Efficiency
  • Serve Pressure, Passing Rating, First Ball Sideout
  • Attack Efficiency (kills – errors)/attempts, Kill Rate
  • Rotation Efficiency, Lineup Net Rating, Setter Distribution
  • Expected Sideout, Expected Point, Win Probability
  • Scouting Tendencies, Shot Charts, Serve Target Heatmaps

Volleyball Data Model: Events, Rally, Set, Match

A practical volleyball dataset in R usually includes one row per contact or one row per event. The minimum columns for serious analytics:

  • match_id, set_no, rally_id, point_won_by
  • team, player, skill (serve, pass, set, attack, block, dig)
  • evaluation (e.g., error, poor, ok, good, perfect, kill, continuation)
  • start_zone, end_zone (serve zones, attack zones)
  • rotation, server, receive_formation
  • score_home, score_away, home_team, away_team

R code: create a minimal event schema

library(tidyverse)
library(lubridate)

event_schema <- tibble::tibble(
  match_id = character(),
  datetime = ymd_hms(character()),
  set_no = integer(),
  rally_id = integer(),
  home_team = character(),
  away_team = character(),
  team = character(),        # team performing the action
  opponent = character(),    # opponent of team
  player = character(),
  jersey = integer(),
  skill = factor(levels = c("serve","pass","set","attack","block","dig","freeball")),
  evaluation = character(),  # e.g., "error","ace","perfect","positive","negative","kill","blocked","dig"
  start_zone = integer(),    # 1..6 (or 1..9 depending system)
  end_zone = integer(),
  rotation = integer(),      # 1..6
  phase = factor(levels = c("sideout","transition")),  # derived later
  score_team = integer(),    # score for team at time of event
  score_opp  = integer(),
  point_won_by = character()  # which team won rally point
)

glimpse(event_schema)

You can extend this schema with positional labels (OH, MB, OPP, S, L), contact order (1st/2nd/3rd), attack tempo, block touches, etc.
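
For example, here is a hedged sketch of such an extension (the extra column names are illustrative, not a fixed standard):

event_schema_ext <- event_schema %>%
  mutate(
    position = factor(character(), levels = c("OH","MB","OPP","S","L")),  # positional label
    contact_order = integer(),    # 1st/2nd/3rd team contact
    attack_tempo = character(),   # e.g. "quick", "high"
    block_touches = integer()
  )

glimpse(event_schema_ext)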

Data Sources: Manual Logs, Video Tags, DataVolley-Style Exports

Volleyball data typically arrives as: (1) manual spreadsheets, (2) video tagging exports, or (3) scouting software exports. Regardless of source, your R pipeline should:

  1. Import raw data
  2. Normalize team/player names
  3. Create rally keys (match_id/set_no/rally_id)
  4. Derive phases (sideout vs. transition)
  5. Compute KPIs and reporting tables

R code: robust import helpers

library(readr)
library(janitor)

read_events_csv <- function(path) {
  readr::read_csv(path, show_col_types = FALSE) %>%
    janitor::clean_names() %>%
    mutate(
      set_no = as.integer(set_no),
      rally_id = as.integer(rally_id),
      start_zone = as.integer(start_zone),
      end_zone = as.integer(end_zone),
      rotation = as.integer(rotation)
    )
}

normalize_names <- function(df) {
  df %>%
    mutate(
      team = str_squish(str_to_title(team)),
      opponent = str_squish(str_to_title(opponent)),
      player = str_squish(str_to_title(player)),
      evaluation = str_squish(str_to_lower(evaluation)),
      skill = factor(str_to_lower(skill),
                    levels = c("serve","pass","set","attack","block","dig","freeball"))
    )
}

Tip for SEO + practice: call your columns and metrics consistently across posts: SO%, BP%, ACE%, ERR%, Kill%, Eff%, Pos%.

R Project Setup & Reproducibility

Serious volleyball analytics needs reproducibility: same input data, same R version, same packages, same outputs. Use an R project + renv + Quarto.

R code: create a project scaffold

# Run once inside your project
install.packages(c("renv","quarto","tidyverse","lubridate","janitor","gt","patchwork","tidymodels"))

renv::init()

# Recommended folder structure
dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
dir.create("data/processed", recursive = TRUE, showWarnings = FALSE)
dir.create("R", showWarnings = FALSE)
dir.create("reports", showWarnings = FALSE)
dir.create("figures", showWarnings = FALSE)

R code: create a simple metric dictionary

metric_dictionary <- tribble(
  ~metric, ~definition,
  "SO%", "Sideout percentage: points won when receiving serve / total receive opportunities",
  "BP%", "Break point percentage: points won when serving / total serving opportunities",
  "Kill%", "Kills / attack attempts",
  "Eff%", "(Kills - Errors) / attempts",
  "Ace%", "Aces / total serves",
  "Err%", "Serve errors / total serves"
)

metric_dictionary

Import & Clean Volleyball Event Data

Most problems in volleyball analytics are data quality problems: inconsistent team names, missing rally keys, duplicated rows, weird evaluation labels, or mixed zone definitions.

R code: import + normalize + validate

events_raw <- read_events_csv("data/raw/events.csv")
events <- events_raw %>% normalize_names()

# Basic validation
stopifnot(all(c("match_id","set_no","rally_id","team","skill","evaluation") %in% names(events)))

# Remove obvious duplicates (same match/set/rally/team/player/skill)
events <- events %>%
  distinct(match_id, set_no, rally_id, team, player, skill, evaluation, .keep_all = TRUE)

# Ensure opponent field exists
events <- events %>%
  mutate(opponent = if_else(is.na(opponent) | opponent == "",
                            NA_character_, opponent))

# Quick data quality report
quality_report <- list(
  n_rows = nrow(events),
  n_matches = n_distinct(events$match_id),
  missing_player = mean(is.na(events$player) | events$player == ""),
  missing_zone = mean(is.na(events$start_zone)),
  skill_counts = events %>% count(skill, sort = TRUE)
)

quality_report

R code: derive rally winner and rally phase

A common approach: identify which team served in the rally. If a team receives serve, that is a sideout opportunity. If a team is serving, that is a break point opportunity. You can derive phase per team within each rally.

derive_rally_context <- function(df) {
  df %>%
    group_by(match_id, set_no, rally_id) %>%
    mutate(
      serving_team = team[which(skill == "serve")[1]],
      receiving_team = setdiff(unique(team), serving_team)[1],
      phase = case_when(
        team == receiving_team ~ "sideout",
        team == serving_team   ~ "transition",
        TRUE ~ NA_character_
      ) %>% factor(levels = c("sideout","transition"))
    ) %>%
    ungroup()
}

events <- derive_rally_context(events)

Core Volleyball KPIs (Serve, Pass, Attack, Block, Dig)

Volleyball KPIs are best computed from event tables with clear skill and evaluation codes. Below is a practical KPI set that works for scouting and performance analysis.

R code: define standard evaluation mappings

# Customize to your coding system.
eval_map <- list(
  serve = list(
    ace = c("ace"),
    error = c("error","serve_error"),
    in_play = c("in_play","good","ok","positive","negative")
  ),
  pass = list(
    perfect = c("perfect","3"),
    positive = c("positive","2","good"),
    negative = c("negative","1","poor"),
    error = c("error","0")
  ),
  attack = list(
    kill = c("kill"),
    error = c("error","attack_error"),
    blocked = c("blocked"),
    in_play = c("in_play","continuation","covered")
  )
)

is_eval <- function(x, values) tolower(x) %in% tolower(values)

R code: serve metrics (Ace%, Error%, Pressure proxy)

serve_metrics <- events %>%
  filter(skill == "serve") %>%
  mutate(
    is_ace = is_eval(evaluation, eval_map$serve$ace),
    is_error = is_eval(evaluation, eval_map$serve$error)
  ) %>%
  group_by(match_id, team) %>%
  summarise(
    serves = n(),
    aces = sum(is_ace),
    errors = sum(is_error),
    ace_pct = aces / serves,
    err_pct = errors / serves,
    .groups = "drop"
  )

serve_metrics

R code: passing metrics (Perfect%, Positive%, Passing Efficiency)

pass_metrics <- events %>%
  filter(skill == "pass") %>%
  mutate(
    perfect = is_eval(evaluation, eval_map$pass$perfect),
    positive = is_eval(evaluation, eval_map$pass$positive),
    negative = is_eval(evaluation, eval_map$pass$negative),
    error = is_eval(evaluation, eval_map$pass$error),
    # A common numeric scale (0..3)
    pass_score = case_when(
      perfect ~ 3,
      positive ~ 2,
      negative ~ 1,
      error ~ 0,
      TRUE ~ NA_real_
    )
  ) %>%
  group_by(match_id, team, player) %>%
  summarise(
    passes = n(),
    perfect_pct = mean(perfect, na.rm = TRUE),
    positive_pct = mean(positive, na.rm = TRUE),
    error_pct = mean(error, na.rm = TRUE),
    avg_pass = mean(pass_score, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_pass), desc(passes))

pass_metrics %>% slice_head(n = 20)

R code: attack metrics (Kill%, Error%, Blocked%, Efficiency)

attack_metrics <- events %>%
  filter(skill == "attack") %>%
  mutate(
    kill = is_eval(evaluation, eval_map$attack$kill),
    error = is_eval(evaluation, eval_map$attack$error),
    blocked = is_eval(evaluation, eval_map$attack$blocked)
  ) %>%
  group_by(match_id, team, player) %>%
  summarise(
    attempts = n(),
    kills = sum(kill),
    errors = sum(error),
    blocks = sum(blocked),
    kill_pct = kills / attempts,
    error_pct = errors / attempts,
    blocked_pct = blocks / attempts,
    eff = (kills - errors) / attempts,
    .groups = "drop"
  ) %>%
  arrange(desc(eff), desc(attempts))

attack_metrics %>% slice_head(n = 20)

R code: blocking & digging (simple event-based)

defense_metrics <- events %>%
  filter(skill %in% c("block","dig")) %>%
  mutate(
    point = evaluation %in% c("stuff","kill_block","point"),
    error = evaluation %in% c("error","net","out")
  ) %>%
  group_by(match_id, team, player, skill) %>%
  summarise(
    actions = n(),
    points = sum(point),
    errors = sum(error),
    point_rate = points / actions,
    .groups = "drop"
  )

defense_metrics

Sideout, Break Point, Transition & Rally Phase Analytics

If you only measure one thing in volleyball, measure sideout efficiency. Most matches are decided by who wins more sideout points and who generates more break points. In R, you can compute SO% and BP% directly from rally winners and serving team.

R code: compute SO% and BP% per team

# Derive the serving team, receiving team, and rally winner by looking at the teams in each rally
rallies <- events %>%
  group_by(match_id, set_no, rally_id) %>%
  summarise(
    teams_in_rally = list(unique(team)),
    serving_team = team[which(skill == "serve")[1]],
    point_won_by = first(na.omit(point_won_by)),
    .groups = "drop"
  ) %>%
  mutate(
    receiving_team = map2_chr(teams_in_rally, serving_team, ~ setdiff(.x, .y)[1]),
    sideout_success = point_won_by == receiving_team,
    break_point_success = point_won_by == serving_team
  )

so_bp <- rallies %>%
  pivot_longer(cols = c(serving_team, receiving_team),
               names_to = "role", values_to = "team") %>%
  group_by(match_id, team, role) %>%
  summarise(
    opps = n(),
    points = sum(if_else(role == "receiving_team", sideout_success, break_point_success)),
    pct = points / opps,
    .groups = "drop"
  ) %>%
  mutate(metric = if_else(role == "receiving_team", "SO%", "BP%")) %>%
  select(match_id, team, metric, opps, points, pct)

so_bp

R code: First-ball sideout (FBSO) using pass quality

A classic volleyball KPI: do we sideout on the first attack after serve receive? Add pass quality segmentation: perfect/positive/negative passes and their first-ball sideout probability.

first_ball_sideout <- function(df) {
  # Identify: for each rally receiving team, find the first pass and first attack.
  df %>%
    group_by(match_id, set_no, rally_id) %>%
    mutate(
      serving_team = team[which(skill == "serve")[1]],
      receiving_team = setdiff(unique(team), serving_team)[1]
    ) %>%
    ungroup() %>%
    group_by(match_id, set_no, rally_id, receiving_team) %>%
    summarise(
      pass_eval = evaluation[which(skill == "pass" & team == receiving_team)[1]],
      first_attack_eval = evaluation[which(skill == "attack" & team == receiving_team)[1]],
      point_won_by = first(na.omit(point_won_by)),
      fbso = point_won_by == receiving_team & first_attack_eval %in% c("kill"),
      .groups = "drop"
    )
}

fbso <- first_ball_sideout(events) %>%
  mutate(
    pass_bucket = case_when(
      tolower(pass_eval) %in% eval_map$pass$perfect ~ "perfect",
      tolower(pass_eval) %in% eval_map$pass$positive ~ "positive",
      tolower(pass_eval) %in% eval_map$pass$negative ~ "negative",
      tolower(pass_eval) %in% eval_map$pass$error ~ "error",
      TRUE ~ "unknown"
    )
  ) %>%
  group_by(match_id, receiving_team, pass_bucket) %>%
  summarise(
    opps = n(),
    fbso_points = sum(fbso, na.rm = TRUE),
    fbso_pct = fbso_points / opps,
    .groups = "drop"
  ) %>%
  arrange(desc(fbso_pct))

fbso

Rotation, Lineup, Setter Distribution & Matchups

Rotation analysis is where volleyball analytics becomes coaching gold. Questions you can answer with R:

  • Which rotations are most efficient in sideout and transition?
  • Which lineups generate the best net rating (points won minus points lost)?
  • Does the setter distribution change under pressure or after poor passes?
  • Which matchup patterns appear vs. specific blockers or defenders?

R code: rotation efficiency

rotation_efficiency <- events %>%
  group_by(match_id, set_no, rally_id) %>%
  summarise(
    serving_team = team[which(skill == "serve")[1]],
    point_won_by = first(na.omit(point_won_by)),
    # rotation of the receiving team at first pass (common reference)
    receiving_team = setdiff(unique(team), serving_team)[1],
    receive_rotation = rotation[which(skill == "pass" & team == receiving_team)[1]],
    .groups = "drop"
  ) %>%
  group_by(match_id, receiving_team, receive_rotation) %>%
  summarise(
    opps = n(),
    so_points = sum(point_won_by == receiving_team, na.rm = TRUE),
    so_pct = so_points / opps,
    .groups = "drop"
  ) %>%
  arrange(desc(so_pct))

rotation_efficiency

R code: setter distribution by pass quality and score pressure

# We assume "set" rows include target_zone or target_player info; if not, join from your tagging.
# This example uses end_zone as a proxy for set location (e.g., 4/2/3/back).
setter_distribution <- events %>%
  group_by(match_id, set_no, rally_id) %>%
  mutate(
    serving_team = team[which(skill == "serve")[1]],
    receiving_team = setdiff(unique(team), serving_team)[1],
    receive_pass_score = case_when(
      skill == "pass" & team == receiving_team & tolower(evaluation) %in% eval_map$pass$perfect ~ 3,
      skill == "pass" & team == receiving_team & tolower(evaluation) %in% eval_map$pass$positive ~ 2,
      skill == "pass" & team == receiving_team & tolower(evaluation) %in% eval_map$pass$negative ~ 1,
      skill == "pass" & team == receiving_team & tolower(evaluation) %in% eval_map$pass$error ~ 0,
      TRUE ~ NA_real_
    )
  ) %>%
  ungroup() %>%
  group_by(match_id, set_no, rally_id) %>%
  summarise(
    team = first(receiving_team),
    pass_score = first(na.omit(receive_pass_score)),
    set_zone = end_zone[which(skill == "set" & team == first(receiving_team))[1]],
    score_diff = (first(na.omit(score_team)) - first(na.omit(score_opp))),
    pressure = abs(score_diff) <= 2,  # "close score" proxy
    .groups = "drop"
  ) %>%
  filter(!is.na(set_zone), !is.na(pass_score)) %>%
  mutate(pass_bucket = factor(pass_score, levels = c(0,1,2,3),
                              labels = c("error","negative","positive","perfect")))

setter_distribution_summary <- setter_distribution %>%
  group_by(team, pass_bucket, pressure, set_zone) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(team, pass_bucket, pressure) %>%
  mutate(pct = n / sum(n)) %>%
  arrange(team, pass_bucket, pressure, desc(pct))

setter_distribution_summary

This is the foundation for scouting reports: “On perfect passes in close score, they set Zone 4 ~52%.”
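
A small helper can turn that summary into one-line statements of exactly this form. This is a sketch built on the setter_distribution_summary table above; the sentence template is just an example:

setter_distribution_summary %>%
  filter(pass_bucket == "perfect", pressure) %>%
  group_by(team) %>%
  slice_max(order_by = pct, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  transmute(
    statement = sprintf(
      "On perfect passes in close score, %s sets zone %s ~%.0f%%",
      team, set_zone, 100 * pct
    )
  )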

Serve & Serve-Receive Analytics (Zones, Heatmaps, Pressure)

Modern serve analytics combines zone targeting, pass degradation, and point outcomes. Even if you don’t track ball coordinates, zones 1–6 (or 1–9) are enough for powerful insights.

R code: serve target heatmap by end_zone

library(ggplot2)

serve_zones <- events %>%
  filter(skill == "serve") %>%
  count(team, end_zone, name = "serves") %>%
  group_by(team) %>%
  mutate(pct = serves / sum(serves)) %>%
  ungroup()

ggplot(serve_zones, aes(x = factor(end_zone), y = pct)) +
  geom_col() +
  facet_wrap(~ team) +
  labs(
    title = "Serve Target Distribution by Zone",
    x = "End Zone (Serve Target)",
    y = "Share of Serves"
  )

R code: serve pressure proxy via opponent pass score

serve_pressure <- events %>%
  group_by(match_id, set_no, rally_id) %>%
  summarise(
    serving_team = team[which(skill == "serve")[1]],
    receiving_team = setdiff(unique(team), serving_team)[1],
    serve_end_zone = end_zone[which(skill == "serve")[1]],
    pass_eval = evaluation[which(skill == "pass" & team == receiving_team)[1]],
    point_won_by = first(na.omit(point_won_by)),
    .groups = "drop"
  ) %>%
  mutate(
    pass_score = case_when(
      tolower(pass_eval) %in% eval_map$pass$perfect ~ 3,
      tolower(pass_eval) %in% eval_map$pass$positive ~ 2,
      tolower(pass_eval) %in% eval_map$pass$negative ~ 1,
      tolower(pass_eval) %in% eval_map$pass$error ~ 0,
      TRUE ~ NA_real_
    ),
    pressure = pass_score <= 1,
    ace = FALSE # if you track aces at serve level, set it here
  )

serve_pressure_summary <- serve_pressure %>%
  group_by(serving_team, serve_end_zone) %>%
  summarise(
    serves = n(),
    avg_opp_pass = mean(pass_score, na.rm = TRUE),
    pressure_rate = mean(pressure, na.rm = TRUE),
    bp_rate = mean(point_won_by == serving_team, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(bp_rate))

serve_pressure_summary

With this table, you can say: “Serving zone 5 creates low passes 38% of the time and increases break-point rate.”

Attack Shot Charts, Zones, Tendencies & Scouting

Attack analytics becomes powerful when you connect attack zone, target area, block context, and outcome. Even simple zone models can guide scouting: “Their opposite hits sharp to zone 1 on bad passes.”

R code: attack tendency table by start_zone → end_zone

attack_tendencies <- events %>%
  filter(skill == "attack") %>%
  count(team, player, start_zone, end_zone, name = "attempts") %>%
  group_by(team, player) %>%
  mutate(pct = attempts / sum(attempts)) %>%
  ungroup() %>%
  arrange(team, player, desc(pct))

attack_tendencies %>% slice_head(n = 30)

R code: attack efficiency by zone and pass bucket

attack_with_pass <- events %>%
  group_by(match_id, set_no, rally_id) %>%
  mutate(
    serving_team = team[which(skill == "serve")[1]],
    receiving_team = setdiff(unique(team), serving_team)[1],
    pass_eval = evaluation[which(skill == "pass" & team == receiving_team)[1]]
  ) %>%
  ungroup() %>%
  filter(skill == "attack", team == receiving_team) %>%
  mutate(
    pass_bucket = case_when(
      tolower(pass_eval) %in% eval_map$pass$perfect ~ "perfect",
      tolower(pass_eval) %in% eval_map$pass$positive ~ "positive",
      tolower(pass_eval) %in% eval_map$pass$negative ~ "negative",
      tolower(pass_eval) %in% eval_map$pass$error ~ "error",
      TRUE ~ "unknown"
    ),
    kill = tolower(evaluation) %in% eval_map$attack$kill,
    error = tolower(evaluation) %in% eval_map$attack$error
  ) %>%
  group_by(team, player, start_zone, pass_bucket) %>%
  summarise(
    attempts = n(),
    kill_pct = mean(kill, na.rm = TRUE),
    eff = (sum(kill) - sum(error)) / attempts,
    .groups = "drop"
  ) %>%
  arrange(desc(eff))

attack_with_pass

R code: simple shot chart plot (end_zone)

shot_chart <- events %>%
  filter(skill == "attack") %>%
  mutate(
    outcome = case_when(
      tolower(evaluation) %in% eval_map$attack$kill ~ "kill",
      tolower(evaluation) %in% eval_map$attack$error ~ "error",
      tolower(evaluation) %in% eval_map$attack$blocked ~ "blocked",
      TRUE ~ "in_play"
    )
  )

ggplot(shot_chart, aes(x = factor(end_zone), fill = outcome)) +
  geom_bar(position = "fill") +
  facet_wrap(~ player) +
  labs(
    title = "Attack Outcome Mix by Target Zone (End Zone)",
    x = "Target Zone",
    y = "Share"
  )

Modeling: Expected Sideout, Win Probability, Elo, Markov Chains

Once your event model is clean, you can move beyond descriptive KPIs into modeling: expected sideout (xSO), expected point (xP), win probability, and strategy simulation.

R code: expected sideout (logistic regression baseline)

library(broom)

# Create a rally-level modeling table
rally_model_df <- events %>%
  group_by(match_id, set_no, rally_id) %>%
  summarise(
    serving_team = team[which(skill == "serve")[1]],
    receiving_team = setdiff(unique(team), serving_team)[1],
    pass_eval = evaluation[which(skill == "pass" & team == receiving_team)[1]],
    pass_score = case_when(
      tolower(pass_eval) %in% eval_map$pass$perfect ~ 3,
      tolower(pass_eval) %in% eval_map$pass$positive ~ 2,
      tolower(pass_eval) %in% eval_map$pass$negative ~ 1,
      tolower(pass_eval) %in% eval_map$pass$error ~ 0,
      TRUE ~ NA_real_
    ),
    serve_zone = end_zone[which(skill == "serve")[1]],
    point_won_by = first(na.omit(point_won_by)),
    .groups = "drop"
  ) %>%
  filter(!is.na(pass_score), !is.na(serve_zone)) %>%
  mutate(
    sideout_success = point_won_by == receiving_team
  )

# Baseline xSO model
xso_fit <- glm(
  sideout_success ~ pass_score + factor(serve_zone),
  data = rally_model_df,
  family = binomial()
)

tidy(xso_fit)
summary(xso_fit)

rally_model_df <- rally_model_df %>%
  mutate(xSO = predict(xso_fit, type = "response"))

rally_model_df %>%
  group_by(receiving_team) %>%
  summarise(
    actual_SO = mean(sideout_success),
    expected_SO = mean(xSO),
    delta = actual_SO - expected_SO,
    .groups = "drop"
  ) %>%
  arrange(desc(delta))

R code: simple set-level win probability from score differential

# If you have event-level score columns, you can build a win probability model.
# Here we illustrate a simple logistic model from score differential and set number.

wp_df <- events %>%
  filter(!is.na(score_team), !is.na(score_opp)) %>%
  mutate(score_diff = score_team - score_opp) %>%
  group_by(match_id, set_no, rally_id) %>%
  summarise(
    team = first(team),
    score_diff = first(score_diff),
    point_won_by = first(na.omit(point_won_by)),
    .groups = "drop"
  ) %>%
  mutate(won_point = point_won_by == team)

wp_fit <- glm(won_point ~ score_diff + factor(set_no), data = wp_df, family = binomial())
wp_df <- wp_df %>%
  mutate(win_prob_point = predict(wp_fit, type = "response"))

wp_fit %>% broom::tidy()

R code: Elo ratings for volleyball teams

# Minimal Elo example (team-level). You can replace with your season match table.
matches <- tibble(
  match_id = c("m1","m2","m3"),
  date = as.Date(c("2025-09-01","2025-09-05","2025-09-10")),
  home = c("Team A","Team B","Team A"),
  away = c("Team B","Team C","Team C"),
  winner = c("Team A","Team C","Team A")
)

elo_update <- function(r_home, r_away, home_won, k = 20) {
  p_home <- 1 / (1 + 10^((r_away - r_home)/400))
  s_home <- ifelse(home_won, 1, 0)
  r_home_new <- r_home + k * (s_home - p_home)
  r_away_new <- r_away + k * ((1 - s_home) - (1 - p_home))
  list(home = r_home_new, away = r_away_new, p_home = p_home)
}

teams <- sort(unique(c(matches$home, matches$away)))
ratings <- setNames(rep(1500, length(teams)), teams)

elo_log <- vector("list", nrow(matches))

for (i in seq_len(nrow(matches))) {
  m <- matches[i,]
  rH <- ratings[[m$home]]
  rA <- ratings[[m$away]]
  upd <- elo_update(rH, rA, home_won = (m$winner == m$home))
  ratings[[m$home]] <- upd$home
  ratings[[m$away]] <- upd$away
  elo_log[[i]] <- tibble(match_id = m$match_id, p_home = upd$p_home,
                         home = m$home, away = m$away,
                         winner = m$winner,
                         r_home_pre = rH, r_away_pre = rA,
                         r_home_post = upd$home, r_away_post = upd$away)
}

bind_rows(elo_log) %>% arrange(match_id)
tibble(team = names(ratings), elo = as.numeric(ratings)) %>% arrange(desc(elo))

R code: Markov chain model for rally outcomes (conceptual starter)

A Markov model represents rally states like: Serve → Pass → Set → Attack → (Point/Continuation). Below is a lightweight starting template to estimate transition probabilities from event sequences.

library(stringr)

# Build simple sequences per rally: skill chain for receiving team until point ends
rally_sequences <- events %>%
  arrange(match_id, set_no, rally_id) %>%
  group_by(match_id, set_no, rally_id) %>%
  summarise(
    serving_team = team[which(skill == "serve")[1]],
    receiving_team = setdiff(unique(team), serving_team)[1],
    seq = paste(skill, collapse = "-"),
    point_won_by = first(na.omit(point_won_by)),
    .groups = "drop"
  )

# Count bigrams (transitions) from sequences
extract_bigrams <- function(seq_str) {
  tokens <- str_split(seq_str, "-", simplify = TRUE)
  tokens <- tokens[tokens != ""]
  if (length(tokens) < 2) return(tibble(from = character(), to = character()))
  tibble(from = tokens[-length(tokens)], to = tokens[-1])
}

transitions <- rally_sequences %>%
  mutate(bigrams = map(seq, extract_bigrams)) %>%
  select(match_id, bigrams) %>%
  unnest(bigrams) %>%
  count(from, to, name = "n") %>%
  group_by(from) %>%
  mutate(p = n / sum(n)) %>%
  ungroup() %>%
  arrange(from, desc(p))

transitions
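
To treat these estimates as a proper Markov chain, you usually want them as a row-stochastic matrix (rows are "from" states, columns are "to" states). A minimal reshaping sketch:

transition_matrix <- transitions %>%
  select(from, to, p) %>%
  pivot_wider(names_from = to, values_from = p, values_fill = 0) %>%
  as.data.frame() %>%
  column_to_rownames("from") %>%
  as.matrix()

transition_matrix
rowSums(transition_matrix)  # each row should sum to 1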

Predictive Modeling with tidymodels

If you want production-grade modeling in R, use tidymodels: pipelines, cross-validation, recipes, metrics, and model tuning. Here is an end-to-end example predicting sideout success using pass score + serve zone.

R code: tidymodels xSO pipeline

library(tidymodels)

df <- rally_model_df %>%
  mutate(
    # Classification in tidymodels needs a factor outcome;
    # putting TRUE first makes it the "event" level for roc_auc()/accuracy().
    sideout_success = factor(sideout_success, levels = c(TRUE, FALSE)),
    serve_zone = factor(serve_zone),
    receiving_team = factor(receiving_team)
  )

set.seed(2026)
split <- initial_split(df, prop = 0.8, strata = sideout_success)
train <- training(split)
test  <- testing(split)

rec <- recipe(sideout_success ~ pass_score + serve_zone, data = train) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

model <- logistic_reg() %>%
  set_engine("glm")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(model)

fit <- wf %>% fit(data = train)

pred <- predict(fit, test, type = "prob") %>%
  bind_cols(test %>% select(sideout_success))

roc_auc(pred, truth = sideout_success, .pred_TRUE)
accuracy(predict(fit, test) %>% bind_cols(test), truth = sideout_success, estimate = .pred_class)

R code: add player random effects with mixed models (glmm)

# For player/team variation, you can use lme4 (not tidymodels-native).
install.packages("lme4")
library(lme4)

# Example: include receiving_team as a random intercept
xso_glmm <- glmer(
  sideout_success ~ pass_score + factor(serve_zone) + (1 | receiving_team),
  data = rally_model_df,
  family = binomial()
)

summary(xso_glmm)

Bayesian Volleyball Analytics in R

Bayesian models are ideal when you want uncertainty, shrinkage, and better inference with small samples. In volleyball scouting, sample sizes can be tiny (a few matches), so Bayesian partial pooling is often a win.

R code: Bayesian xSO with brms

# Bayesian logistic regression with partial pooling by receiving team
install.packages("brms")
library(brms)

bayes_fit <- brm(
  sideout_success ~ pass_score + factor(serve_zone) + (1 | receiving_team),
  data = rally_model_df,
  family = bernoulli(),
  chains = 2, cores = 2, iter = 1500,
  seed = 2026
)

summary(bayes_fit)
posterior_summary(bayes_fit)

With brms, you can compute posterior distributions of SO% by team, compare strategies, and avoid overreacting to noise.
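
For example, posterior_epred() gives posterior draws of sideout probability for any scenario you care about. The sketch below scores every team on a perfect pass with serve zone 1; the scenario is hypothetical, so pick a zone that actually occurs in your data:

# Reference scenario: perfect pass (3), serve zone 1.
newdata <- rally_model_df %>%
  distinct(receiving_team) %>%
  mutate(pass_score = 3, serve_zone = 1)

post <- posterior_epred(bayes_fit, newdata = newdata)

tibble(
  receiving_team = newdata$receiving_team,
  xso_mean = colMeans(post),
  xso_lo = apply(post, 2, quantile, probs = 0.05),
  xso_hi = apply(post, 2, quantile, probs = 0.95)
)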

Visualization: ggplot2 Templates for Volleyball

Volleyball visualizations should be coach-friendly, quick to read, and tied to decisions: serve target, pass quality, rotation weaknesses, attack tendencies, and pressure points.

R code: SO% and BP% report chart

so_bp_wide <- so_bp %>%
  select(team, metric, pct) %>%
  pivot_wider(names_from = metric, values_from = pct)

so_bp_long <- so_bp %>%
  ggplot(aes(x = team, y = pct, fill = metric)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(title = "Sideout % and Break Point % by Team", x = NULL, y = "Rate")

so_bp_long

R code: rotation heatmap (SO% by rotation)

rot_plot_df <- rotation_efficiency %>%
  mutate(receive_rotation = factor(receive_rotation, levels = 1:6))

ggplot(rot_plot_df, aes(x = receive_rotation, y = receiving_team, fill = so_pct)) +
  geom_tile() +
  labs(title = "Rotation Sideout Heatmap", x = "Rotation (Receiving)", y = "Team")

R code: fast HTML tables with gt

library(gt)

attack_metrics %>%
  filter(attempts >= 10) %>%
  arrange(desc(eff)) %>%
  gt() %>%
  fmt_percent(columns = c(kill_pct, error_pct, blocked_pct), decimals = 1) %>%
  fmt_number(columns = eff, decimals = 3) %>%
  tab_header(title = "Attack Leaderboard (Min 10 Attempts)")

Dashboards: Shiny Scouting Reports

A Shiny scouting app can deliver instant insights for coaches: opponent serve targets, rotation weaknesses, attacker tendencies, and key matchups. Below is a compact Shiny template you can expand.

R code: minimal Shiny dashboard for team scouting

install.packages(c("shiny","bslib"))
library(shiny)
library(bslib)
library(tidyverse)

# Assume you already computed:
# - serve_pressure_summary
# - rotation_efficiency
# - attack_tendencies

ui <- page_sidebar(
  title = "Volleyball Analytics Dashboard (R + Shiny)",
  sidebar = sidebar(
    selectInput("team", "Select Team", choices = sort(unique(serve_pressure_summary$serving_team))),
    hr(),
    helpText("Key views: serve targets, rotation sideout, attack tendencies.")
  ),
  layout_columns(
    card(
      card_header("Serve Targets by Zone"),
      plotOutput("servePlot", height = 260)
    ),
    card(
      card_header("Rotation Sideout %"),
      plotOutput("rotPlot", height = 260)
    ),
    card(
      card_header("Top Attack Tendencies"),
      tableOutput("attackTable")
    )
  )
)

server <- function(input, output, session) {

  output$servePlot <- renderPlot({
    df <- serve_pressure_summary %>% filter(serving_team == input$team)
    ggplot(df, aes(x = factor(serve_end_zone), y = bp_rate)) +
      geom_col() +
      labs(x = "Serve End Zone", y = "Break Point Rate", title = paste("Serve Effectiveness -", input$team))
  })

  output$rotPlot <- renderPlot({
    df <- rotation_efficiency %>% filter(receiving_team == input$team) %>%
      mutate(receive_rotation = factor(receive_rotation, levels = 1:6))
    ggplot(df, aes(x = receive_rotation, y = so_pct)) +
      geom_col() +
      labs(x = "Rotation", y = "Sideout %", title = paste("Rotation Sideout -", input$team))
  })

  output$attackTable <- renderTable({
    attack_tendencies %>%
      filter(team == input$team) %>%
      group_by(player) %>%
      slice_max(order_by = pct, n = 5) %>%
      ungroup() %>%
      arrange(desc(pct)) %>%
      mutate(pct = round(pct * 100, 1))
  })
}

shinyApp(ui, server)

Automation: Reports to HTML/PDF + CI

One of the best uses of R in volleyball: automated weekly scouting reports. Generate: HTML match report, PDF coaching packet, and tables/figures for staff.

R code: Quarto report skeleton

# Create a Quarto (.qmd) file like reports/match_report.qmd
# Then render in R:
# quarto::quarto_render("reports/match_report.qmd")

# Example render call:
quarto::quarto_render(
  input = "reports/match_report.qmd",
  execute_params = list(match_id = "match_001")
)

Example Quarto front matter (paste into .qmd)

---
title: "Match Report"
format:
  html:
    toc: true
    code-fold: show
execute:
  echo: true
  warning: false
  message: false
params:
  match_id: "match_001"
---
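
Inside the .qmd body, an R chunk can then pick up the parameter to subset the data. A minimal sketch, assuming the report loads the same events table built earlier:

match_events <- events %>%
  filter(match_id == params$match_id)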

Best Practices + Common Pitfalls

  • Define evaluation codes once and reuse them everywhere (serve/pass/attack mappings).
  • Keep raw data immutable in data/raw; write cleaned data to data/processed.
  • Separate scouting vs. performance analysis: scouting focuses on tendencies; performance focuses on efficiency.
  • Beware small samples (one match). Use Bayesian shrinkage or confidence intervals.
  • Rotation context matters: opponent rotations, server strength, and pass quality heavily confound results.
  • Don’t overfit: models should generalize across matches and opponents.
  • Make outputs coach-readable: simple tables, clear charts, and “so what?” conclusions.

R code: quick bootstrap CI for SO%

set.seed(2026)

bootstrap_ci <- function(x, B = 2000, conf = 0.95) {
  n <- length(x)
  boots <- replicate(B, mean(sample(x, n, replace = TRUE)))
  alpha <- (1 - conf) / 2
  quantile(boots, probs = c(alpha, 1 - alpha), na.rm = TRUE)
}

so_ci <- rallies %>%
  mutate(sideout_success = point_won_by == receiving_team) %>%
  group_by(receiving_team) %>%
  summarise(
    so = mean(sideout_success),
    ci_low = bootstrap_ci(sideout_success)[1],
    ci_high = bootstrap_ci(sideout_success)[2],
    n = n(),
    .groups = "drop"
  )

so_ci

FAQ

What’s the best single metric in volleyball?

If you only track one KPI: Sideout %. It correlates strongly with winning because it reflects serve-receive stability and first-ball offense conversion.

How do I handle different coding systems?

Create a mapping layer (like eval_map) and convert raw labels into a standardized internal vocabulary. The rest of your pipeline should never depend on raw coding strings.
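For example, a minimal sketch of such a mapping layer — the raw codes, the eval_map values, and the touches_raw / evaluation_code names are illustrative, not tied to any specific coding system:

eval_map <- c(
  "#" = "perfect",
  "+" = "positive",
  "!" = "neutral",
  "-" = "poor",
  "=" = "error"
)

# Assumes a touches_raw data frame with an evaluation_code column
touches_clean <- touches_raw %>%
  mutate(evaluation = unname(eval_map[evaluation_code]))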

Can I do volleyball analytics without coordinates?

Yes. Zone-based analytics (1–6 or 1–9) plus pass quality and outcome are enough for rotation analysis, serve targeting, and basic predictive modeling.

What should I build first?

Start with: import + clean → SO% / BP% → pass + serve dashboards → rotation sideout → attack efficiency by pass quality. Once those are stable, add modeling.
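As a starting point, here is a minimal sketch of the first KPIs, assuming the rallies data frame used earlier (with point_won_by and receiving_team) also carries a serving_team column:

library(dplyr)

# Sideout % by receiving team
so_tbl <- rallies %>%
  group_by(receiving_team) %>%
  summarise(so_pct = mean(point_won_by == receiving_team), n = n(), .groups = "drop")

# Break point % by serving team
bp_tbl <- rallies %>%
  group_by(serving_team) %>%
  summarise(bp_pct = mean(point_won_by == serving_team), n = n(), .groups = "drop")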

Tags: volleyball analytics with R, R volleyball stats, sideout percentage, rotation analysis, serve receive, scouting report, tidymodels, ggplot2, Shiny dashboard

The post Volleyball Analytics with R: The Complete Guide to Match Data, Sideout Efficiency, Serve Pressure, Heatmaps, and Predictive Models appeared first on R Programming Books.

To leave a comment for the author, please follow the link and comment on their blog: Blog – R Programming Books.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: Volleyball Analytics with R: The Complete Guide to Match Data, Sideout Efficiency, Serve Pressure, Heatmaps, and Predictive Models]]>

398365


Setting Up A Cluster of Tiny PCs For Parallel Computing – A Note To Myself https://www.r-bloggers.com/2026/01/setting-up-a-cluster-of-tiny-pcs-for-parallel-computing-a-note-to-myself/

Fri, 16 Jan 2026 00:00:00 +0000
https://www.kenkoonwong.com/blog/parallel-computing/

Enjoyed learning the process of setting up a cluster of tiny PCs for parallel computing. A note to myself on installing Ubuntu, passwordless SSH, automating package installation across nodes, distributing R simulations, and comparing CV5 vs CV10 performance. Fun project!

Motivations

Part of something I want to learn this year …

Continue reading: Setting Up A Cluster of Tiny PCs For Parallel Computing – A Note To Myself]]>
[social4i size=”small” align=”align-left”] –>
[This article was first published on r on Everyday Is A School Day, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Enjoyed learning the process of setting up a cluster of tiny PCs for parallel computing. A note to myself on installing Ubuntu, passwordless SSH, automating package installation across nodes, distributing R simulations, and comparing CV5 vs CV10 performance. Fun project!

Motivations

Part of what I want to learn this year is getting a little more into parallel computing: how we can distribute simulation computations across different devices. Lately, we have more reasons to do this because quite a few of our simulations require long-running computation, and leaving my laptop running overnight or for several days is just not a good use of it. We have also tried cloud computing, but without knowing how those distributed cores are, well, distributed, it's hard for me to conceptualize how the work is done and what else we could optimize. Hence, what better way than doing it on our own! Sit tight, this is going to be a bumpy one. Let's go!

Objectives

Which PCs to Get?


Preferably something functional and cheap! Something like a used Lenovo M715q Tiny PCs or something similar.

Install Ubuntu


  1. Download Ubuntu Server
  2. Create a bootable USB using balenaEtcher
  3. When starting Lenovo up, press F12 continuously until it shows an option to boot from USB. If F12 does not work, reboot and press F1 to BIOS. Go to Startup Tab, change CSM Support to Enabled. Then set Primary Boot Priority to USB by moving priority to first. Then F10 to save configuration and exit. It will then reboot to USB.
  4. Make sure it’s connected to internet via LAN for smoother installation.
  5. Follow the instructions to install Ubuntu, setting username, password etc. Then reboot.
  6. Make sure to remove USB drive, if you didn’t it’ll remind you. Et voila!

The installations were very quick compared to other OSes I've installed in the past, and very smooth as well. I thoroughly enjoyed setting these up.

Align and Fix IPs

For organizational purposes, go to your router settings and assign your cluster machines convenient IPs such as 192.168.1.101, 192.168.1.102, 192.168.1.103, etc. You may have to reboot the machines after changing this on your router.

Passwordless SSH

Next, you want to set up passwordless SSH. This is crucial for R to work!

1. Create a key

ssh-keygen -t ed25519

2. Send Copy of Key To Your Node

ssh-copy-id -i .ssh/my_key.pub username1@192.168.1.101

It will prompt you to enter your password; after that you won't need a password to SSH in.

Passwordless Sudo

This is optional. But if you’re like me, don’t want to repeat lots of typing on installation, and see if you can use bash or R to install packages, you’d need this.

ssh -t username2@192.168.1.102 'echo "$(whoami) ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/$(whoami)'

It will prompt you to enter your password. You will have to do this for all your nodes.

Send Multiple Commands Via SSH

Install R

for host in username1@192.168.1.101 username2@192.168.1.102 username3@192.168.1.103; do
  ssh -t $host 'sudo apt update && sudo apt install -y r-base r-base-dev'
done

This is basically installing R on all of our clusters one after another.

Create A Template R script For Simulation

Why do we do this? We want to take advantage of the multiple cores of each node, as opposed to using a cluster plan in future, because the network overhead may add to the run time and make the optimization less efficient. Instead, we will send a script to each node so that it can fork its own cores to run the simulation. Also, if we specify the packages in our script, we can automate the process of installing those packages on our nodes.

library(future)
library(future.apply)
library(dplyr)
library(SuperLearner)
library(ranger)
library(xgboost)
library(glmnet)

plan(multicore, workers = 4)

set.seed(1)

n <- 10000
W1 <- rnorm(n)
W2 <- rnorm(n)
W3 <- rbinom(n, 1, 0.5)
W4 <- rnorm(n)

# TRUE propensity score model
A <- rbinom(n, 1, plogis(-0.5 + 0.8*W1 + 0.5*W2^2 + 0.3*W3 - 0.4*W1*W2 + 0.2*W4))

# TRUE outcome model
Y <- rbinom(n, 1, plogis(-1 + 0.2*A + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2))

# Calculate TRUE ATE
logit_Y1 <- -1 + 0.2 + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2
logit_Y0 <- -1 + 0 + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2

Y1_true <- plogis(logit_Y1)
Y0_true <- plogis(logit_Y0)
true_ATE <- mean(Y1_true - Y0_true)

df <- tibble(W1 = W1, W2 = W2, W3 = W3, W4 = W4, A = A, Y = Y)

tune <- list(
  ntrees = c(500,1000),
  max_depth = c(5,7),
  shrinkage = c(0.001,0.01)
)

tune2 <- list(
  ntrees = c(250, 500, 1000),
  max_depth = c(3,5,7,9),
  shrinkage = c(0.001,0.005,0.01)
)

learners <- create.Learner("SL.xgboost", tune = tune, detailed_names = TRUE, name_prefix = "xgb")
learners2 <- create.Learner("SL.xgboost", tune = tune2, detailed_names = TRUE, name_prefix = "xgb")

# Super Learner library
SL_library <- list(
  c("SL.xgboost", "SL.ranger", "SL.glm", "SL.mean"),
  c("SL.xgboost","SL.ranger"),
  c("SL.xgboost","SL.glm"),
  list("SL.ranger", c("SL.xgboost", "screen.glmnet")),
  c("SL.glmnet","SL.glm"),
  c("SL.ranger","SL.glm"),
  c(learners$names, "SL.glm"),
  c(learners$names, "SL.glmnet"),
  c("SL.gam","SL.glm"),
  c(learners2$names, "SL.glm"))

# sample
allnum <- START:END
n_sample <- length(allnum)
n_i <- 6000

# Function to run one TMLE iteration
run_tmle_iteration <- function(seed_val, df, n_i, SL_library) {
  set.seed(seed_val)
  data <- slice_sample(df, n = n_i, replace = T) |> select(Y, A, W1:W4)

  # Prepare data
  X_outcome <- data |> select(A, W1:W4) |> as.data.frame()
  X_treatment <- data |> select(W1:W4) |> as.data.frame()
  Y_vec <- data$Y
  A_vec <- data$A

  # Outcome model
  SL_outcome <- SuperLearner(
    Y = Y_vec,
    X = X_outcome,
    family = binomial(),
    SL.library = SL_library,
    cvControl = list(V = 5)
  )

  # Initial predictions
  outcome <- predict(SL_outcome, newdata = X_outcome)$pred

  # Predict under treatment A=1
  X_outcome_1 <- X_outcome |> mutate(A=1)
  outcome_1 <- predict(SL_outcome, newdata = X_outcome_1)$pred

  # Predict under treatment A=0
  X_outcome_0 <- X_outcome |> mutate(A=0)
  outcome_0 <- predict(SL_outcome, newdata = X_outcome_0)$pred

  # Bound outcome predictions to avoid qlogis issues
  outcome <- pmax(pmin(outcome, 0.9999), 0.0001)
  outcome_1 <- pmax(pmin(outcome_1, 0.9999), 0.0001)
  outcome_0 <- pmax(pmin(outcome_0, 0.9999), 0.0001)

  # Treatment model
  SL_treatment <- SuperLearner(
    Y = A_vec,
    X = X_treatment,
    family = binomial(),
    SL.library = SL_library,
    cvControl = list(V = 5)
  )

  # Propensity scores
  ps <- predict(SL_treatment, newdata = X_treatment)$pred

  # Truncate propensity scores
  ps_final <- pmax(pmin(ps, 0.95), 0.05)

  # Calculate clever covariates
  a_1 <- 1/ps_final
  a_0 <- -1/(1 - ps_final)
  clever_covariate <- ifelse(A_vec == 1, 1/ps_final, -1/(1 - ps_final))

  epsilon_model <- glm(Y_vec ~ -1 + offset(qlogis(outcome)) + clever_covariate,
                       family = "binomial")
  epsilon <- coef(epsilon_model)

  updated_outcome_1 <- plogis(qlogis(outcome_1) + epsilon * a_1)
  updated_outcome_0 <- plogis(qlogis(outcome_0) + epsilon * a_0)

  # Calc ATE
  ate <- mean(updated_outcome_1 - updated_outcome_0)

  # Calc SE
  updated_outcome <- ifelse(A_vec == 1, updated_outcome_1, updated_outcome_0)
  se <- sqrt(var((Y_vec - updated_outcome) * clever_covariate +
                   updated_outcome_1 - updated_outcome_0 - ate) / n_i)

  return(list(ate = ate, se = se))
}

# Run iterations in parallel
for (num in 1:length(SL_library)) {
  if (num %in% c(1:9)) { next }
  cat(num)
  cat("TMLE iterations in parallel with 4 workers (multicore)...n")
  start_time <- Sys.time()

  results_list <- future_lapply(START:END, function(i) {
    result <- run_tmle_iteration(i, df, n_i, SL_library[[num]])
    if (i %% 100 == 0) cat("Completed iteration:", i, "\n")
    return(result)
  }, future.seed = TRUE)

  end_time <- Sys.time()
  run_time <- end_time - start_time

  # Extract results
  predicted_ate <- sapply(results_list, function(x) x$ate)
  pred_se <- sapply(results_list, function(x) x$se)

  # Results
  results <- tibble(
    iteration = START:END,
    ate = predicted_ate,
    se = pred_se,
    ci_lower = ate - 1.96 * se,
    ci_upper = ate + 1.96 * se,
    covers_truth = true_ATE >= ci_lower & true_ATE <= ci_upper
  )

  # Summary stats
  summary_stats <- tibble(
    metric = c("true_ATE", "mean_estimated_ATE", "median_estimated_ATE",
               "sd_estimates", "mean_SE", "coverage_probability", "bias"),
    value = c(
      true_ATE,
      mean(predicted_ate),
      median(predicted_ate),
      sd(predicted_ate),
      mean(pred_se),
      mean(results$covers_truth),
      mean(predicted_ate) - true_ATE
    )
  )

  # Create output directory if it doesn't exist
  if (!dir.exists("tmle_results")) {
    dir.create("tmle_results")
  }

  # Save detailed results (all iterations)
  write.csv(results, paste0("tmle_results/tmle_iterations",num,".csv"), row.names = FALSE)

  # Save summary statistics
  write.csv(summary_stats, paste0("tmle_results/tmle_summary",num,".csv"), row.names = FALSE)

  # Save simulation parameters
  sim_params <- tibble(
    parameter = c("n_population", "n_sample_iterations", "n_bootstrap_size",
                  "SL_library", "n_workers", "runtime_seconds"),
    value = c(n, n_sample, n_i,
              paste(SL_library[[num]], collapse = ", "),
              4, as.numeric(run_time, units = "secs"))
  )
  write.csv(sim_params, paste0("tmle_results/simulation_parameters",num,".csv"), row.names = FALSE)

  # Save as RData for easy loading
  save(results, summary_stats, sim_params, true_ATE, file = paste0("tmle_results/tmle_results",num,".RData"))

}

What we did above is basically a template script (we are saving this as par_test_script.R), one where we can edit which iteration each node should start and end with, along with instructions to save the results. This is where we could put a little more effort into notifying ourselves when a task is completed (e.g., via email). It would also be nice to know the ETA of the entire task, perhaps by benchmarking how long the first iteration took to complete and multiplying by the total iterations per node. Again, this could be sent via email, and maybe only from the first node rather than all nodes, so we're not bombarded with messages at the beginning and the end. 🤣
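A rough sketch of the ETA idea, reusing the objects from the template above and assuming the 4 multicore workers scale roughly linearly (the email notification is left to whichever mail package you prefer):

# Time the first iteration, then extrapolate to this node's full workload
iters <- START:END   # placeholder, replaced per node by sed just like in the template
t0 <- Sys.time()
first_result <- run_tmle_iteration(iters[1], df, n_i, SL_library[[1]])
mins_per_iter <- as.numeric(difftime(Sys.time(), t0, units = "mins"))
eta_mins <- mins_per_iter * length(iters) / 4   # 4 multicore workers per node
message("Estimated time for this node: ~", round(eta_mins), " minutes")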

Install Packages On All Nodes

## List all of our nodes
my_clusters <- list(
  c("username1@192.168.1.101"),
  c("username2@192.168.1.102"),
  c("username3@192.168.1.103"))


## Grab all of the packages needed on our script
packages <- gsub("library\\(([^)]+)\\)", "\\1", grep("^library", readLines("par_test_script.R"), value = TRUE))

## Create function to run sudo
remote_r_sudo <- function(host, r_code, intern = FALSE) {
  # Escape double quotes so they survive the remote shell's quoting
  escaped <- gsub('"', '\\"', r_code, fixed = TRUE)
  cmd <- sprintf("ssh %s 'sudo Rscript -e \"%s\"'", host, escaped)
  system(cmd, intern = intern)
}

## Loop over to install
for (cluster_i in my_clusters) {
  print(cluster_i)
  for (package in packages) {
  command <- sprintf('if (!require("%s")) install.packages("%s")', package, package)
  remote_r_sudo(cluster_i, command)
  }
}

Make sure your computer doesn't go to sleep while this runs. If this is the first time your nodes are installing these extensive libraries, it will take a while. Another way to do this is to use future_lapply across all nodes, or tmux for the installations, so that we don't need our local workstation to stay on for the installation to continue. See below for how we used tmux as a set-and-forget method.
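A minimal sketch of the future_lapply alternative, assuming a common username across all nodes (swap in your own) and installing into each node's user library rather than the sudo/site-library approach above; it reuses the packages vector extracted earlier:

library(future)
library(future.apply)

nodes <- c("192.168.1.101", "192.168.1.102", "192.168.1.103")
cl <- parallelly::makeClusterPSOCK(nodes, user = "username1")
plan(cluster, workers = cl)

## With 3 elements and 3 workers, each node gets one chunk and installs its own missing packages
invisible(future_lapply(seq_along(nodes), function(i) {
  lib <- Sys.getenv("R_LIBS_USER")
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  .libPaths(c(lib, .libPaths()))
  missing <- setdiff(packages, rownames(installed.packages()))
  if (length(missing) > 0) {
    install.packages(missing, lib = lib, repos = "https://cloud.r-project.org")
  }
  TRUE
}))

parallel::stopCluster(cl)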

Upload Rscript to Nodes

Alright, now that we have installed the appropriate packages, let's upload the scripts to our nodes.

Distribute Work

num_list <- list()
clust_num <- 3
total_loop <- 1000
div_iter <- total_loop/clust_num
final_iter <- total_loop #only use this for custom e.g., if one node did not work and it's in charge of 300:500, we can put 500 for this and set first_iter as 300
first_iter <- 1
last_iter <- round(div_iter,0) + first_iter

for (i in 1:clust_num) {
  if (i == clust_num) {
    num_list[[i]] <- paste0(first_iter,":",final_iter)
    next
  }
  num_list[[i]] <- paste0(first_iter,":",last_iter)
  first_iter <- round(first_iter + div_iter, 0)
  last_iter <- round(last_iter + div_iter, 0)
}

num_list

## [[1]]
## [1] "1:334"
##
## [[2]]
## [1] "334:667"
##
## [[3]]
## [1] "667:1000"

for (i in 1:length(my_clusters)) {
  username <- sub("@.*","",my_clusters[[i]])
  system(sprintf("sed 's/START:END/%s/g' par_test_script.R > par_test_script1.R & scp par_test_script1.R %s:/home/%s/par_test_script1.R",num_list[[i]],my_clusters[[i]],username))
}

We iterate, insert the appropriate iteration range for each node, save it to par_test_script1.R, and then upload it to each node with the code above.

Check set.seed in multicore

sample_df <- function(seed, df, n = 6000) {
  set.seed(seed)
  df_sample <- slice_sample(n = n, .data = df)
  return(df_sample)
}

future_lapply(100, function(x) sample_df(seed=x,df=df))

When we ran the above on the local computer and also in the terminal with multicore, the result was still the same! Woo hoo!


The interesting thing is that I didn't have to set future.seed = TRUE or future.seed = some_number for this. However, if we put a number in future.seed, it returns reproducible data! This is great; next time I'll just use this seed and I won't have to use set.seed(i). 🙌
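A quick sketch of that check (the seed value is arbitrary; plan(multicore) needs a non-Windows terminal, as above):

library(future)
library(future.apply)

plan(multicore, workers = 2)

r1 <- future_lapply(1:3, function(i) rnorm(1), future.seed = 100)
r2 <- future_lapply(1:3, function(i) rnorm(1), future.seed = 100)
identical(r1, r2)  # TRUE, regardless of how the work is split across cores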

Run Rscript

for (i in 1:length(my_clusters)) {
  # set your tmux new session name, here we call it "test"
  cluster_name <- "test"

  # terminate any existing tmux with the existing name
  system(sprintf("ssh %s 'tmux kill-session -t %s 2>/dev/null || true'", my_clusters[[i]], cluster_name))

  # create new tmux session
  system(sprintf("ssh %s 'tmux new-session -d -s %s'", my_clusters[[i]], cluster_name))

  # run rscript in tmux
  system(sprintf("ssh %s 'tmux send-keys -t %s "Rscript par_test_script1.R > result_%d.txt"' ENTER",
                 my_clusters[[i]], cluster_name, i))
}

The code above is quite self-explanatory. Once it has run, there we have it: the script should be running in the background on each node! 🙌 You can do a spot check to see if it's actually running. Once completed, we'll extract the data.

Extract Data

Since we have 10 combinations to assess, we set nums to 1:10 and fetch our data. In the template script you can decide how to save your results; for extraction, just look for those files, download them, then read and merge (or handle them however you prefer).

nums <- 1:10
df <- tibble()

for (num in nums) {
  print(num)
  for (i in 1:length(my_clusters)) {
    response <- system(sprintf("scp %s:tmle_results/simulation_parameters%d.csv simulation_parameters%d.csv", my_clusters[[i]], num, num), intern = F)
    if (response == 1) { next }
    df_i <- read_csv(paste0("simulation_parameters", num, ".csv"), show_col_types = F)
    sl_i <- df_i |> filter(parameter == "SL_library") |> pull(value)
    df <- rbind(df, df_i |> mutate(method = sl_i, num = num))
  }
}

df_sim_param <- df

df <- tibble()

for (num in nums) {
  for (i in 1:length(my_clusters)) {
    response <- system(sprintf("scp %s:tmle_results/tmle_iterations%d.csv tmle_iterations%d.csv", my_clusters[[i]], num, num), intern = F)
    if (response == 1) { print(paste0(my_clusters[[i]], " is missing num", num)); next }
    df_i <- read_csv(paste0("tmle_iterations", num, ".csv"), show_col_types = F) |>
      mutate(num = num)
    df <- rbind(df, df_i)
  }
}

df_iter <- df

Take note that you may sometimes encounter issues: if for some reason a node is unable to complete its task, you can identify it and then redistribute those tasks across the entire cluster.
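A minimal sketch of spotting the gaps, assuming df_iter from above and that iterations 1:1000 were expected for a given SL library combination:

expected <- 1:1000
done <- df_iter %>% filter(num == 1) %>% pull(iteration)   # check one combination
missing_iters <- setdiff(expected, done)

if (length(missing_iters) > 0) {
  # Re-split the missing iterations evenly across the nodes
  redo <- split(missing_iters,
                cut(seq_along(missing_iters), length(my_clusters), labels = FALSE))
  redo
}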

Compare Time

Let's take a look at our compute time for 1 cluster, 3 clusters with 5-fold CV, and 3 clusters with 10-fold CV.

method hour_1clus_cv5 hour_3clus_cv5 hour_3clus_cv10
SL.xgboost, SL.ranger, SL.glm, SL.mean 4.02 1.4126466 2.5179200
SL.xgboost, SL.ranger 4.00 1.4136567 2.5108584
SL.xgboost, SL.glm 0.47 0.1680019 0.3034212
SL.ranger, c(“SL.xgboost”, “screen.glmnet”) 4.23 1.4960542 2.5165429
SL.glmnet, SL.glm NA 0.1074466 0.1995869
SL.ranger, SL.glm NA 1.2544446 2.2254909
xgb_500_5_0.001, xgb_1000_5_0.001, xgb_500_7_0.001, xgb_1000_7_0.001, xgb_500_5_0.01, xgb_1000_5_0.01, xgb_500_7_0.01, xgb_1000_7_0.01, SL.glm 3.29 1.8059939 3.3030737
xgb_500_5_0.001, xgb_1000_5_0.001, xgb_500_7_0.001, xgb_1000_7_0.001, xgb_500_5_0.01, xgb_1000_5_0.01, xgb_500_7_0.01, xgb_1000_7_0.01, SL.glmnet NA 1.8956873 3.4821903
SL.gam, SL.glm NA 0.1094693 0.2072266
xgb_250_3_0.001, xgb_500_3_0.001, xgb_1000_3_0.001, xgb_250_5_0.001, xgb_500_5_0.001, xgb_1000_5_0.001, xgb_250_7_0.001, xgb_500_7_0.001, xgb_1000_7_0.001, xgb_250_9_0.001, xgb_500_9_0.001, xgb_1000_9_0.001, xgb_250_3_0.005, xgb_500_3_0.005, xgb_1000_3_0.005, xgb_250_5_0.005, xgb_500_5_0.005, xgb_1000_5_0.005, xgb_250_7_0.005, xgb_500_7_0.005, xgb_1000_7_0.005, xgb_250_9_0.005, xgb_500_9_0.005, xgb_1000_9_0.005, xgb_250_3_0.01, xgb_500_3_0.01, xgb_1000_3_0.01, xgb_250_5_0.01, xgb_500_5_0.01, xgb_1000_5_0.01, xgb_250_7_0.01, xgb_500_7_0.01, xgb_1000_7_0.01, xgb_250_9_0.01, xgb_500_9_0.01, xgb_1000_9_0.01, SL.glm NA NA 4.6127172

Looking at the times, we can definitely see the improvement from 1 cluster to 3 clusters. Take a look at our good old tuned xgboost plus logistic regression: it previously took 3.29 hours to complete on a single quad-core, down to 1.8 hours. You'd imagine that using 3 PCs as a cluster we would see an improvement to ~1.1 hours, but apparently not for xgboost; I will have to investigate this. If we look at xgboost + logistic regression without tuning, though, we went from 0.47 hours to 0.17 hours, which makes sense! Very interesting. Now if we up our CV to 10-fold, it took longer (makes sense), but it was still faster than using a single quad-core. I've heard people say that if you increase your K-fold CV, you reduce your bias but increase variance. Let's see if that's true in our case here.

method bias_3clus_cv5 bias_3clus_cv10 variance_3clus_cv5 variance_3clus_cv10
SL.xgboost, SL.ranger, SL.glm, SL.mean -0.0007695 -0.0007257 0.0001866 0.0001940
SL.xgboost, SL.ranger -0.0007677 -0.0007257 0.0001866 0.0001940
SL.xgboost, SL.glm -0.0010481 0.0001018 0.0001586 0.0001617
SL.ranger, c(“SL.xgboost”, “screen.glmnet”) -0.0008349 -0.0007257 0.0001868 0.0001940
SL.glmnet, SL.glm -0.0449075 -0.0449065 0.0001502 0.0001503
SL.ranger, SL.glm -0.0007695 -0.0007257 0.0001866 0.0001940
xgb_500_5_0.001, xgb_1000_5_0.001, xgb_500_7_0.001, xgb_1000_7_0.001, xgb_500_5_0.01, xgb_1000_5_0.01, xgb_500_7_0.01, xgb_1000_7_0.01, SL.glm 0.0006449 0.0010681 0.0001491 0.0001504
xgb_500_5_0.001, xgb_1000_5_0.001, xgb_500_7_0.001, xgb_1000_7_0.001, xgb_500_5_0.01, xgb_1000_5_0.01, xgb_500_7_0.01, xgb_1000_7_0.01, SL.glmnet 0.0005986 0.0010492 0.0001502 0.0001511
SL.gam, SL.glm -0.0062967 -0.0062967 0.0001537 0.0001537
xgb_250_3_0.001, xgb_500_3_0.001, xgb_1000_3_0.001, xgb_250_5_0.001, xgb_500_5_0.001, xgb_1000_5_0.001, xgb_250_7_0.001, xgb_500_7_0.001, xgb_1000_7_0.001, xgb_250_9_0.001, xgb_500_9_0.001, xgb_1000_9_0.001, xgb_250_3_0.005, xgb_500_3_0.005, xgb_1000_3_0.005, xgb_250_5_0.005, xgb_500_5_0.005, xgb_1000_5_0.005, xgb_250_7_0.005, xgb_500_7_0.005, xgb_1000_7_0.005, xgb_250_9_0.005, xgb_500_9_0.005, xgb_1000_9_0.005, xgb_250_3_0.01, xgb_500_3_0.01, xgb_1000_3_0.01, xgb_250_5_0.01, xgb_500_5_0.01, xgb_1000_5_0.01, xgb_250_7_0.01, xgb_500_7_0.01, xgb_1000_7_0.01, xgb_250_9_0.01, xgb_500_9_0.01, xgb_1000_9_0.01, SL.glm NA 0.0013250 NA 0.0001528

Wow, not too shabby! Indeed, when we went from CV5 to CV10, we reduced bias and slightly increased variance! How about that. Everything except gam + lr, which makes sense because we don't really tune it. That being said, I wonder what's under the hood that controls the knots for gam in SuperLearner; I will need to check that out. With this, it looks like tuned xgboost + lr might have the best numbers. Well, now we've seen bias and variance, what about coverage?

method coverage_3clus_cv5 coverage_3clus_cv10
SL.xgboost, SL.ranger, SL.glm, SL.mean 0.536 0.517
SL.xgboost, SL.ranger 0.536 0.517
SL.xgboost, SL.glm 0.811 0.799
SL.ranger, c(“SL.xgboost”, “screen.glmnet”) 0.539 0.517
SL.glmnet, SL.glm 0.051 0.052
SL.ranger, SL.glm 0.536 0.517
xgb_500_5_0.001, xgb_1000_5_0.001, xgb_500_7_0.001, xgb_1000_7_0.001, xgb_500_5_0.01, xgb_1000_5_0.01, xgb_500_7_0.01, xgb_1000_7_0.01, SL.glm 0.882 0.878
xgb_500_5_0.001, xgb_1000_5_0.001, xgb_500_7_0.001, xgb_1000_7_0.001, xgb_500_5_0.01, xgb_1000_5_0.01, xgb_500_7_0.01, xgb_1000_7_0.01, SL.glmnet 0.881 0.876
SL.gam, SL.glm 0.926 0.926
xgb_250_3_0.001, xgb_500_3_0.001, xgb_1000_3_0.001, xgb_250_5_0.001, xgb_500_5_0.001, xgb_1000_5_0.001, xgb_250_7_0.001, xgb_500_7_0.001, xgb_1000_7_0.001, xgb_250_9_0.001, xgb_500_9_0.001, xgb_1000_9_0.001, xgb_250_3_0.005, xgb_500_3_0.005, xgb_1000_3_0.005, xgb_250_5_0.005, xgb_500_5_0.005, xgb_1000_5_0.005, xgb_250_7_0.005, xgb_500_7_0.005, xgb_1000_7_0.005, xgb_250_9_0.005, xgb_500_9_0.005, xgb_1000_9_0.005, xgb_250_3_0.01, xgb_500_3_0.01, xgb_1000_3_0.01, xgb_250_5_0.01, xgb_500_5_0.01, xgb_1000_5_0.01, xgb_250_7_0.01, xgb_500_7_0.01, xgb_1000_7_0.01, xgb_250_9_0.01, xgb_500_9_0.01, xgb_1000_9_0.01, SL.glm NA 0.844
I was not expecting gam + lr to have so much coverage! But looking at the bias in the previous table, it's actually quite horrible. So it seems gam + lr is asymmetrical in its estimates, sometimes overestimating, sometimes underestimating, leading to a wider confidence interval and hence more coverage. That being said, it's not a good estimator because of its bias. Tuned xgboost + glmnet seems to be the best bet here, with low bias, low variance and decent coverage. Let's visualize it!

5-fold CV

library(tidyverse)

num_df <- sim_param_cv5_clus5 |>
  select(num, method)

df_coverage <- df_iter_cv5_clus3 |>
  group_by(num) |>
  arrange(ate) |>
  mutate(iter = row_number()) |>
  mutate(cover = case_when(
    covers_truth == F & ate < true_ATE ~ "right_missed",
    covers_truth == F & ate > true_ATE ~ "left_missed",
    covers_truth == T ~ "covered"
  )) |>
  select(num, cover) |>
  group_by(num, cover) |>
  tally() |>
  ungroup(cover) |>
  mutate(prop = n*100/sum(n)) |>
  pivot_wider(id_cols = num, names_from = "cover", values_from = "prop") |>
  mutate(text = paste0("right missed: ",right_missed,"% covered: ",covered,"% left missed: ",left_missed,"%")) |>
  select(num, text)

method <- tibble(
  num = c(1:9),
  method = c("xgb + rf + lr + mean","xgb + rf","xgb + lr","rf + (xgb + preprocess w glmnet)","glmnet + lr","rf + lr","tuned xgb + lr","tuned xgb + glmnet","gam + lr")
)

plot <- df_iter_cv5_clus3 |>
  group_by(num) |>
  arrange(ate) |>
  mutate(iter = row_number()) |>
  mutate(cover = case_when(
    covers_truth == F & ate < true_ATE ~ "right_missed",
    covers_truth == F & ate > true_ATE ~ "left_missed",
    covers_truth == T ~ "covered"
  )) |>
  ggplot(aes(x=iter,y=ate,color=cover)) +
  geom_point(alpha=0.2) +
  geom_errorbar(aes(x=iter,ymin=ci_lower,ymax=ci_upper), alpha=0.2) +
  geom_hline(aes(yintercept=0.0373518), color = "blue") +
  geom_text(data = df_coverage,
            aes(x = 500, label = text),
            y = -0.05,
            inherit.aes = FALSE,
            size = 3,
            hjust = 0.5) +
  scale_color_manual(values = c("covered" = "#619CFF",
                                  "left_missed" = "#F8766D",
                                  "right_missed" = "#00BA38")) +
  theme_bw() +
  facet_wrap(.~num, ncol = 1,labeller = as_labeller(setNames(method$method, method$num))) +
  theme(legend.position = "bottom")

lr: logistic regression, xgb: xgboost, rf: random forest, gam: generalized additive model.

Wow, look at gam + lr's asymmetrical coverage! This shows that when we're assessing a method, a point estimate of coverage is not adequate to judge its overall usefulness. We can see that this method is very biased indeed, with asymmetrical tails. Since CV5 and CV10 do not differ significantly in coverage, we'll skip that visualization.

Opportunities for improvement

Lessons Learnt:

If you like this article:

To leave a comment for the author, please follow the link and comment on their blog: r on Everyday Is A School Day.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: Setting Up A Cluster of Tiny PCs For Parallel Computing – A Note To Myself]]>

398351


admiral 1.4 release https://www.r-bloggers.com/2026/01/admiral-1-4-release/

Thu, 15 Jan 2026 00:00:00 +0000
https://pharmaverse.github.io/blog/posts/2026-01-15_admiral_14/admiral_1.4_release.html

admiral 1.4 is here!
There’s nothing like a new {admiral} release to start the new year with a bang, and that’s exactly what we have for you with {admiral} 1.4! This release comes packed with few choice updates and some exciting new features…

Continue reading: admiral 1.4 release]]>
[social4i size=”small” align=”align-left”] –>
[This article was first published on pharmaverse blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

admiral 1.4 is here!

There’s nothing like a new {admiral} release to start the new year with a bang, and that’s exactly what we have for you with {admiral} 1.4! This release comes packed with a few choice updates and some exciting new features (including our first AI integration!) that will undoubtedly improve your experience creating ADaMs in R. However, in line with our commitment to stability and maturity, we have once again reduced the number of breaking changes to an absolute minimum. Take a look at the rest of the blog post below to learn more about what to expect in {admiral} 1.4.

TLDR: What is in this release?

  • {admiral} has partnered with kapa.ai to introduce an AI-powered chatbot within every page of our websites.
  • We have added our first experimental functions: derive_var_nfrlt() and convert_xxtpt_to_hours(). These simplify time point and visit derivations.
  • There is a new ADAB template script and a corresponding new vignette. These will aid in creating Anti-drug Antibody Analysis Datasets.
  • Several existing parameter derivation functions have been updated for greater robustness.
  • All the {admiral} templates can now be viewed directly from our website within the new Explore ADaM Templates.
  • {admiral} is now integrated with NCI-CTCAE version 6.0 lab grading criteria for both SI and US units, with all grading metadata now maintained in a more readable JSON format for improved consistency and maintainability.
  • derive_var_atoxgr_dir() received a small breaking change: the abnormal_indicator argument was replaced by low_indicator and high_indicator to support the updated CTCAE v6.0 criteria.
  • derive_var_merged_summary() has been renamed to derive_vars_merged_summary() to align with programming conventions.

Please peruse the Changelog to see the full set of updates in {admiral} 1.4.

Let’s take a few deep-dives to explore some of these updates in more detail.

Deep dive 1: kapa.ai integration

We are thrilled to announce that {admiral} has partnered with kapa.ai under their Open Source Program to introduce an AI-powered chatbot within every page of our website. This LLM chatbot is trained using the documentation pages of {admiral} and all our therapeutic area extension packages as well as the {pharmaversesdtm} and {pharmaverseadam} data packages and the ADaM IG. Indeed, look out for it to be added to every other package in the {admiral} ecosystem in their upcoming releases this month. To use the chatbot, simply click on the “Ask AI” button at the bottom right corner of any website page and start asking away – you may find that you get redirected to a User Guide, a function reference page or the website of another {admiral} package that you didn’t even know existed!

Deep dive 2: Experimental functions!

Experimental functions are a new class of functions, tagged with the “Experimental” badge. This is our way of adding targeted, new functionality that we can quickly improve/modify without breaking our commitment to stability. The badge will be removed once the dev team feels the function is stable, and no deprecation messages will be given to the user if breaking changes are implemented before removing the experimental badge. However, {admiral} will document the breaking change in the Changelog. Once the “Experimental” badge is removed we will proceed with the normal deprecation cycle if needed.

Our first experimental functions help derive variables for time-point analysis. One such variable is NFRLT (Nominal Relative Time from First Dose), which is used extensively in pharmacokinetic analysis. This is the namesake variable for derive_var_nfrlt(). This function can be used to directly derive NFRLT in the Pharmacokinetic Concentrations (PC) domain and in the Exposure (EX) domain. These domains are typically combined to form the Pharmacokinetic Concentrations Analysis Data (ADPC). Under the hood of this function is convert_xxtpt_to_hours(), which converts timepoint variables such as PCTPT to numeric hours using regular expressions. The hours computed from convert_xxtpt_to_hours() are combined with the days from a visit variable such as VISITDY to derive NFRLT. Note that the unit variable FRLTU can also be derived, and there are multiple output unit options (hours, days, weeks, minutes).

install.packages("admiral", repos = "https://cloud.r-project.org")
Installing package into '/usr/local/lib/R/site-library'
(as 'lib' is unspecified)
also installing the dependency 'admiraldev'
library(admiral)
library(tibble)

adpc_nfrlt <- tribble(
  ~USUBJID, ~VISITDY, ~PCTPT,           ~PCSTRESN,
  "001",    1,        "Pre-dose",       0.0,
  "001",    1,        "30M Post-dose",  5.2,
  "001",    1,        "1H Post-dose",   8.4,
  "001",    1,        "2H Post-dose",   12.1,
  "001",    1,        "4H Post-dose",   10.3,
  "001",    1,        "8H Post-dose",   6.8,
  "001",    1,        "24H Post-dose",  2.1
)

adpc_nfrlt %>%
  derive_var_nfrlt(
    new_var = NFRLT,
    new_var_unit = FRLTU,
    out_unit = "HOURS",
    tpt_var = PCTPT,
    visit_day = VISITDY
  )
# A tibble: 7 × 6
  USUBJID VISITDY PCTPT         PCSTRESN NFRLT FRLTU
  <chr>     <dbl> <chr>            <dbl> <dbl> <chr>
1 001           1 Pre-dose           0     0   HOURS
2 001           1 30M Post-dose      5.2   0.5 HOURS
3 001           1 1H Post-dose       8.4   1   HOURS
4 001           1 2H Post-dose      12.1   2   HOURS
5 001           1 4H Post-dose      10.3   4   HOURS
6 001           1 8H Post-dose       6.8   8   HOURS
7 001           1 24H Post-dose      2.1  24   HOURS

The functions can work with any time-point variables from other domains such as LB, VS or EG. There is also a treatment duration option in the case of infusions, and an option for time point ranges such as “4-8H AFTER END OF INFUSION”.
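As a hedged sketch, the same arguments shown above could be reused on an LB-style input (the dataset and its column names are assumed for illustration):

lb_tpt <- tibble::tribble(
  ~USUBJID, ~VISITDY, ~LBTPT,
  "001",    1,        "Pre-dose",
  "001",    1,        "2H Post-dose",
  "001",    8,        "Pre-dose"
)

lb_tpt %>%
  derive_var_nfrlt(
    new_var = NFRLT,
    new_var_unit = FRLTU,
    out_unit = "HOURS",
    tpt_var = LBTPT,
    visit_day = VISITDY
  )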

Deep dive 3: Anti-drug-antibody template and vignette

{admiral} 1.4 introduces a new ADaM template script to aid in creating Anti-drug Antibody Analysis Datasets (ADAB). This template is accompanied by a detailed vignette that walks you through the process of deriving key variables and implementing analysis strategies specific to anti-drug antibody data, including the use of one of our new experimental functions derive_var_nfrlt(). The template covers common scenarios encountered in immunogenicity assessments, providing a solid foundation for users to build upon and customize according to their study requirements.

As usual, to start using the template you can run:

use_ad_template("adab")

…or you can peruse the template directly from the website from our new Explore ADaM Templates page.

What’s coming in admiral 1.5?

{admiral} is very much community driven, so please continue reaching out through Slack or GitHub if you have ideas or requests for enhancements to our package. {admiral} 1.5 is due to release in June 2026, and some current ideas for new content include improved Estimands documentation and ways to enable positive censoring within TTE derivations.

Last updated

2026-01-15 21:08:50.200778

Details

Reuse

Citation

BibTeX citation:
@online{dickinson2026,
  author = {Dickinson, Jeff and Straub, Ben and Mancini, Edoardo},
  title = {Admiral 1.4 Release},
  date = {2026-01-15},
  url = {https://pharmaverse.github.io/blog/posts/2026-01-15_admiral_14/admiral_1.4_release.html},
  langid = {en}
}
For attribution, please cite this work as:
Dickinson, Jeff, Ben Straub, and Edoardo Mancini. 2026. “Admiral 1.4 Release.” January 15, 2026. https://pharmaverse.github.io/blog/posts/2026-01-15_admiral_14/admiral_1.4_release.html.
To leave a comment for the author, please follow the link and comment on their blog: pharmaverse blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: admiral 1.4 release]]>

398317


LLMs can’t be trusted to do scientific coding accurately, but humans make mistakes too https://www.r-bloggers.com/2026/01/llms-cant-be-trusted-to-do-scientific-coding-accurately-but-humans-make-mistakes-too/

Tue, 13 Jan 2026 13:00:00 +0000
https://www.seascapemodels.org/posts/2026-01-14-LLMs-cant-be-trusted-but-neither-can-humans/

I often hear the comment that LLMs/generative AI (large language models) can’t be trusted for research tasks.
Image Google’s Nano Banana tasked with “Generate an image of a male African researcher holding a balloon that is pulling them up above…

Continue reading: LLMs can’t be trusted to do scientific coding accurately, but humans make mistakes too]]>
[social4i size=”small” align=”align-left”] –>
[This article was first published on Seascapemodels, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

I often hear the comment that LLMs/generative AI (large language models) can’t be trusted for research tasks.

Image Google’s Nano Banana tasked with “Generate an image of a male African researcher holding a balloon that is pulling them up above a tidal wave of AI generated slop that is full of errors. The balloon has a research paper inside of it. Generate the image in the style of a Simpsons cartoon.”

But this is the wrong way to think about LLMs. Humans also can’t be trusted to do scientific research accurately. They make mistakes. That’s why we have systems for review.

The more important question is: Are LLMs more accurate than humans at completing a given task?

I actually think LLMs might lead to better scientific coding and statistical analysis.

A common example of what LLMs get criticised for is writing code or performing statistical analyses. The LLM might hallucinate non-truths, or at least mislead you into thinking the analysis you have done is scientifically accurate.

The implication is that we should not be using them for particular tasks, like designing statistical models.

It's right to be skeptical of AI-produced output. However, we also need to be skeptical of human-produced output. Humans make mistakes as well.

As scientists, peer review is baked into our culture. But code review is much rarer. We also don't have many systematic reviews of scientific coding that have quantified the rate of mistakes.

I suspect that mistakes in scientific coding are more common than we’d like to believe.

In one (rare) example, researchers reviewed population modelling analyses and found mathematical errors were common. One type of error occurred in 62% of studies!

Now I haven’t set an LLM agent the task of doing the equivalent population models to see what its error rate is. However, my tests (which are under review) of agents at quite complicated stats and ecological modelling are showing 80-90% performance at accurately completing the tasks.

So the LLM agents are potentially doing better than the humans and making fewer mistakes.

Why I think LLMs might lead to better research is that they give us more time for code review.

As an ecological modeller I invest a ton of time into writing code, then checking that code works the way I want (and in a mathematically accurate way).

LLMs are now doing more of the code writing for me. Used effectively, this gives me more time to review the code for accuracy, as well as checking the code is an accurate representation of the scientific theory.

A human with an LLM partner could choose to: (1) produce crap work faster than pre-LLM, OR (2) produce higher quality work in a similar amount of time to what it took them pre-LLM.

I’m arguing that we should be aiming to produce the higher quality work. We can do this if we use LLMs to speed up code, then use the extra time for more quality assurance.

More generally, don’t get fooled by the argument that “genAI makes mistakes, so it can’t be trusted”.

It's the wrong way to think about the problem, and I think it will lead us to be blindsided by the oncoming flood of research slop created with genAI.

A better way to think about it is: “genAI and humans both make mistakes, how can we design workflows so that their strengths complement each other and we produce higher quality work”.

This will give us outcomes that are of higher quality than the pre-LLM world, and hopefully will rise above the huge quantity of AI generated slop that is currently happening.

To leave a comment for the author, please follow the link and comment on their blog: Seascapemodels.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: LLMs can’t be trusted to do scientific coding accurately, but humans make mistakes too]]>

398303


GOLD/SILVER RATIO: GenAI with Quant Agents on Microsoft AI Foundry https://www.r-bloggers.com/2026/01/gold-silver-ratio-genai-with-quant-agents-on-microsoft-ai-foundry/

Tue, 13 Jan 2026 11:44:07 +0000
http://datageeek.com/?p=11590

1. Introduction: The Strategic Edge of Agentic Finance In the contemporary landscape of quantitative finance, the bottleneck is no longer data availability, but the speed of insight generation. Leveraging the Microsoft AI Foundry ecosystem, we have moved beyond static scripting into the realm of Autonomous Financial Agents. This article explores how …

Continue reading: GOLD/SILVER RATIO: GenAI with Quant Agents on Microsoft AI Foundry]]>
[social4i size=”small” align=”align-left”] –>
[This article was first published on DataGeeek, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

1. Introduction: The Strategic Edge of Agentic Finance

In the contemporary landscape of quantitative finance, the bottleneck is no longer data availability, but the speed of insight generation. Leveraging the Microsoft AI Foundry ecosystem, we have moved beyond static scripting into the realm of Autonomous Financial Agents. This article explores how a specialized agent can navigate precious metal volatility by analyzing the Gold/Silver ratio with high-performance precision.

2. Infrastructure: Model Deployment on Microsoft AI Foundry

The intelligence behind this analysis is not a local script but a deployed model instance on Microsoft AI Foundry. We utilize the GPT-4o model, deployed as a scalable web service within the Foundry environment.

3. The Technical Bridge: Python-R Integration

One of the most powerful features of our AI Foundry Agent is its multi-lingual capability. It bridges the gap between Python and R using the rpy2 library, creating a high-performance research pipeline.

The R Ecosystem in Play:

4. Methodology: Taming the Noise with Visual Precision

To extract actionable trends, the Agent is instructed to apply a LOESS smoothing algorithm. By strictly setting .line_size = 1.5 and .smooth_size = 1.5, we ensure the trendline is bold enough to be the primary focus for analysts, effectively “taming” the daily price volatility.
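For reference, here is a hedged R sketch of the kind of plot the agent is asked to produce (the ticker symbols, column names, and date window are assumptions; the LOESS trend comes from plot_time_series()'s built-in smoother):

library(tidyquant)
library(timetk)
library(dplyr)

# Fetch roughly three years of gold and silver futures and compute their ratio
gold   <- tq_get("GC=F", from = Sys.Date() - 3 * 365)
silver <- tq_get("SI=F", from = Sys.Date() - 3 * 365)

ratio_tbl <- inner_join(
  gold   %>% select(date, gold = close),
  silver %>% select(date, silver = close),
  by = "date"
) %>%
  mutate(ratio = gold / silver)

ratio_tbl %>%
  plot_time_series(
    date, ratio,
    .interactive = FALSE,
    .line_size   = 1.5,
    .smooth_size = 1.5,
    .title       = "Gold/Silver Ratio"
  )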

5. Conclusion: Scaling Quantitative Research

The synergy between Microsoft AI Foundry, deployed LLMs, and specialized R packages represents the future of financial research. We have replaced manual data wrangling with an autonomous, standardized agent that can be scaled across thousands of different asset pairs with a single command.

The ABI Connection (Bridging Python to R in VS Code)

For the script to run locally in VS Code, we must establish a robust Application Binary Interface (ABI) connection. This is handled by the rpy2 library, which serves as the translation layer between Python and the R interpreter.

import os

# Force rpy2 to use ABI mode to avoid the Windows CFFI conflict
os.environ['RPY2_CFFI_MODE'] = 'ABI'

import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
print("Interface initialized in ABI mode.")

The Integrated Agent Script:

import os
import httpx
from openai import AzureOpenAI
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
from IPython.display import Image, display

#Microsoft AI Foundry - Azure OpenAI Connection
client = AzureOpenAI(
    api_version="2024-12-01-preview",
    azure_endpoint="AZURE_OPENAI_ENDPOINT",
    api_key="AZURE_OPENAI_KEY",
    http_client=httpx.Client(verify=False, trust_env=False)
)

def run_updated_agent(user_request):
    system_instructions = (
        "You are a Quantitative Researcher. MANDATORY: All output, comments, and labels in English. "
        "Strict Operational Guidelines:\n"
        "1. Libraries: library(tidyquant), library(timetk), library(lubridate), library(dplyr), library(ggplot2).\n"
        "2. Analysis: Fetch GC=F and SI=F for 3 years, merge via inner_join, and calculate 'ratio'.\n"
        "3. Visualization: Use timetk::plot_time_series with .interactive = FALSE and .title = \"Gold/Silver Ratio\".\n"
        "4. Precision: Set .line_size = 2 and ALWAYS set .smooth_size = 2 for the smoothing line.\n"
        "5. Set title font face and axis texts font face to 'bold', and size to 16 with theme() function.\n"
        "6. EXPORT: Save using 'ggsave(\"ratio_plot.png\", width = 10, height = 6)'.\n"
        "7. Output ONLY raw R code."
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_instructions},
            {"role": "user", "content": user_request}
        ]
    )

    # Cleaning any markdown or headers to get raw code
    agent_code = response.choices[0].message.content.strip()
    if agent_code.startswith("```"):
        agent_code = "n".join(agent_code.split("n")[1:-1])

    print("-" * 40)
    print(agent_code)
    print("-" * 40)

    try:
        with localconverter(robjects.default_converter + pandas2ri.converter):
            # Synchronize working directory
            workdir = os.getcwd().replace("\\", "/")
            robjects.r(f'setwd("{workdir}")')
            robjects.r(agent_code)

            if os.path.exists("ratio_plot.png"):
                display(Image(filename="ratio_plot.png"))
    except Exception as e:
        print(f"Agent Error: {e}")

# Execution
run_updated_agent("Plot the Gold/Silver ratio for the last 3 years with a smooth line.")
To leave a comment for the author, please follow the link and comment on their blog: DataGeeek.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: GOLD/SILVER RATIO: GenAI with Quant Agents on Microsoft AI Foundry]]>



398254


Predicting Best Picture at the 2026 Academy Awards https://www.r-bloggers.com/2026/01/predicting-best-picture-at-the-2026-academy-awards/

Tue, 13 Jan 2026 03:33:43 +0000
http://www.r-bloggers.com/?guid=bb51a00e49490a8f96df1cb1ac58e6d0

I’m back with the Oscars Best Picture model, albeit a little late. I had a
busy holiday season, but the story of December was surprising: The Secret
Agent was the favorite, followed by One Battle After Another. This was
largely due to The Secret Agent’s runtime, which is …

Continue reading: Predicting Best Picture at the 2026 Academy Awards]]>
[social4i size=”small” align=”align-left”] –>
[This article was first published on Mark H. White II, PhD, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

January 12, 2026

I’m back with the Oscars Best Picture model, albeit a little late. I had a busy holiday season, but the story of December was surprising: The Secret Agent was the favorite, followed by One Battle After Another. This was largely due to The Secret Agent’s runtime, which is right in the sweet spot for Best Picture winners.

However, remember from the last two years that my model is assuming these movies have been nominated for Best Picture. The biggest barrier facing The Secret Agent is being nominated; it is a non-English language film. While the only non-English language film to win has been Parasite (2019), the nominations are rare enough that, given that the film has been nominated, having no English dialogue isn’t a barrier to winning. The DGA and PGA nominations make me think The Secret Agent won’t be nominated, however.

It also surprised me that One Battle After Another wasn’t favored more, given Paul Thomas Anderson is a generationally phenomenal writer-director, but none of his films have won Best Picture. Looking into the data, it looks like the “career award” is not much of a thing for Best Picture (like it seems to be for the acting and directing categories). Just the opposite: If a director has had a film nominated or won Best Picture before, it actually hurts their chances of winning in my model.

That was then, this is now, though. No more awards in my models will name nominees or winners before the Oscar nominations. Where do we stand going into the announcement?

One Battle After Another is the favorite, at about 15% chance of winning. Following it closely is The Secret Agent (10%), followed by Marty Supreme (9%), Hamnet (8%), Wicked: For Good (8%), and Frankenstein (8%).

You can read more about the details of the model in my post from last year and the year before. The big change I've made here is calibrating the probabilities so that the model isn't too sure of itself. I will see you on the other side of the Oscar nominations.
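For the curious, a minimal sketch of one common way to calibrate probabilities (Platt scaling) — this illustrates the general idea only, not the author's exact method, and the past_preds / this_year data frames and their columns are assumptions:

# past_preds: held-out rows with p_hat (raw model probability) and won (1 if the film won)
calib_fit <- glm(won ~ qlogis(p_hat), family = binomial(), data = past_preds)

# this_year: current nominees with their raw p_hat; shrink toward better-calibrated values
this_year$p_calibrated <- predict(calib_fit,
                                  newdata = this_year["p_hat"],
                                  type = "response")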

 
To leave a comment for the author, please follow the link and comment on their blog: Mark H. White II, PhD.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: Predicting Best Picture at the 2026 Academy Awards]]>

398238


From scripts to pipelines in the age of LLMs https://www.r-bloggers.com/2026/01/from-scripts-to-pipelines-in-the-age-of-llms/

Tue, 13 Jan 2026 00:00:00 +0000
https://b-rodrigues.github.io/posts/2026-01-13-data_science_llm_age.html

I was recently reading Davis Vaughan’s blog post Semi-automating 200 Pull Requests with Claude Code and it really resonated with me, as I’ve been using LLMs for tedious tasks like that for some time now. Davis’s key insight: structure = su…

Continue reading: From scripts to pipelines in the age of LLMs]]>
[social4i size=”small” align=”align-left”] –>
[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

I was recently reading Davis Vaughan’s blog post Semi-automating 200 Pull Requests with Claude Code and it really resonated with me, as I’ve been using LLMs for tedious tasks like that for some time now. Davis’s key insight: structure = success. When you can scope a task tightly and provide clear context, LLMs become genuinely useful tools.

If you’ve been following my work, you know that reproducible pipelines have been my main focus for some time now. It’s the reason I wrote {rix} for reproducible R environments, {rixpress} for declarative pipelines, and even a Python port called ryxpress. I genuinely believe these tools make data science better: more reproducible, more debuggable, more shareable.

But I also know that getting people to adopt new tools is hard. Learning a new way of structuring your code takes time and effort, and most people are busy enough already. Here’s where LLMs enter the picture: they can help translate your existing scripts into this more structured format. You provide your monolithic script, explain what you want, and the LLM does the grunt work of restructuring it.

The typical way we write analytics scripts (long chains of %>% calls in R or method-chaining in Python) works fine for interactive exploration, but quickly turns into spaghetti that’s hard to modify, test, or debug. Take my old Luxembourg Airport analysis as an example: it works, but turning that kind of script into a proper pipeline with caching, explicit dependencies, and testability is tedious work.

But we’re in 2026 where LLMs now make this trivial.

From implicit to explicit: translating a script into a rixpress pipeline

Let me show you what I mean by translating that old Luxembourg Airport data cleaning code into a {rixpress} pipeline. The original script uses continuous %>% chaining, standard tidyverse style. The {rixpress} version makes every intermediate step explicit.

But here’s the thing: LLMs don’t know about {rixpress} out of the box. It’s a relatively new package, and it’s not in their training data (or if it is, it’s outdated). So I wrote another tool called pkgctx that extracts structured, LLM-ready API specifications from R or Python packages (and guess what, I did not write a single line of code of that tool, LLMs did).

The idea is simple: run pkgctx r rixpress > rixpress.ctx.yaml and you get a compact YAML file describing every function, its arguments, and its purpose. Feed that to an LLM, and suddenly it “knows” the package. All my packages now have such a file generated on each push thanks to GitHub Actions. Here’s how my conversation went:

Me: I want to use a package called rixpress, but I am not familiar with it. You’ll help me, and here is a file with a detailed description of the API, for context.

[uploaded rixpress.ctx.yaml]

LLM: Got it. I’ll use this API specification as the authoritative reference for rixpress and help you write correct, idiomatic pipelines with it.

A few orientation points to set the frame (no action required yet):

  • rixpress is declarative: you describe derivations (rxp_r(), rxp_py(), rxp_qmd(), …), then materialize them with rxp_populate() and rxp_make().
  • Data flows through the DAG via serialized artifacts (RDS / pickle / Julia Serialization), not in-memory objects.
  • Language boundaries are explicit (rxp_r2py(), rxp_py2r()).
  • Nix is the execution engine; default.nix (or equivalents) define the runtime environment.

When you’re ready, tell me what you want to do […]

Then I simply asked:

Me: Help me translate this R script into a rixpress pipeline: [pasted the old script]

And that’s how I got a working {rixpress} pipeline. The LLM did the tedious restructuring; I reviewed the output, made minor tweaks, and was done. The combination of pkgctx for context and a clear task (“translate this script”) made the LLM genuinely useful.

Now let’s look at what the translated pipeline looks like. First, let’s assume:

  • The data file avia_par_lu.tsv is in the project directory
  • Required R packages are available via default.nix (we’ll also use an LLM for this one)
  • The project has been initialized with rxp_init() (this sets up two skeleton files to get started quickly)
The full rixpress pipeline:
library(rixpress)

# Step 0: Load the data
avia <- rxp_r_file(
  name = avia,
  path = "avia_par_lu.tsv",
  read_function = readr::read_tsv
)

# Step 1: Select and reshape (wide → long)
avia_long <- rxp_r(
  name = avia_long,
  expr =
    avia %>%
      select("unit,tra_meas,airp_prtime", contains("20")) %>%
      gather(date, passengers, -`unit,tra_meas,airp_prtime`)
)

# Step 2: Split composite key column
avia_split <- rxp_r(
  name = avia_split,
  expr =
    avia_long %>%
      separate(
        col = `unit,tra_meas,airp_prtime`,
        into = c("unit", "tra_meas", "air_prtime"),
        sep = ","
      )
)

# Step 3: Recode transport measure
avia_recode_tra_meas <- rxp_r(
  name = avia_recode_tra_meas,
  expr =
    avia_split %>%
      mutate(
        tra_meas = fct_recode(
          tra_meas,
          `Passengers on board` = "PAS_BRD",
          `Passengers on board (arrivals)` = "PAS_BRD_ARR",
          `Passengers on board (departures)` = "PAS_BRD_DEP",
          `Passengers carried` = "PAS_CRD",
          `Passengers carried (arrival)` = "PAS_CRD_ARR",
          `Passengers carried (departures)` = "PAS_CRD_DEP",
          `Passengers seats available` = "ST_PAS",
          `Passengers seats available (arrivals)` = "ST_PAS_ARR",
          `Passengers seats available (departures)` = "ST_PAS_DEP",
          `Commercial passenger air flights` = "CAF_PAS",
          `Commercial passenger air flights (arrivals)` = "CAF_PAS_ARR",
          `Commercial passenger air flights (departures)` = "CAF_PAS_DEP"
        )
      )
)

# Step 4: Recode unit
avia_recode_unit <- rxp_r(
  name = avia_recode_unit,
  expr =
    avia_recode_tra_meas %>%
      mutate(
        unit = fct_recode(
          unit,
          Passenger = "PAS",
          Flight = "FLIGHT",
          `Seats and berths` = "SEAT"
        )
      )
)

# Step 5: Recode destination
avia_recode_destination <- rxp_r(
  name = avia_recode_destination,
  expr =
    avia_recode_unit %>%
      mutate(
        destination = fct_recode(
          `air_prtime`,
          `WIEN-SCHWECHAT` = "LU_ELLX_AT_LOWW",
          `BRUSSELS` = "LU_ELLX_BE_EBBR",
          `GENEVA` = "LU_ELLX_CH_LSGG",
          `ZURICH` = "LU_ELLX_CH_LSZH",
          `FRANKFURT/MAIN` = "LU_ELLX_DE_EDDF",
          `HAMBURG` = "LU_ELLX_DE_EDDH",
          `BERLIN-TEMPELHOF` = "LU_ELLX_DE_EDDI",
          `MUENCHEN` = "LU_ELLX_DE_EDDM",
          `SAARBRUECKEN` = "LU_ELLX_DE_EDDR",
          `BERLIN-TEGEL` = "LU_ELLX_DE_EDDT",
          `KOBENHAVN/KASTRUP` = "LU_ELLX_DK_EKCH",
          `HURGHADA / INTL` = "LU_ELLX_EG_HEGN",
          `IRAKLION/NIKOS KAZANTZAKIS` = "LU_ELLX_EL_LGIR",
          `FUERTEVENTURA` = "LU_ELLX_ES_GCFV",
          `GRAN CANARIA` = "LU_ELLX_ES_GCLP",
          `LANZAROTE` = "LU_ELLX_ES_GCRR",
          `TENERIFE SUR/REINA SOFIA` = "LU_ELLX_ES_GCTS",
          `BARCELONA/EL PRAT` = "LU_ELLX_ES_LEBL",
          `ADOLFO SUAREZ MADRID-BARAJAS` = "LU_ELLX_ES_LEMD",
          `MALAGA/COSTA DEL SOL` = "LU_ELLX_ES_LEMG",
          `PALMA DE MALLORCA` = "LU_ELLX_ES_LEPA",
          `SYSTEM - PARIS` = "LU_ELLX_FR_LF90",
          `NICE-COTE D'AZUR` = "LU_ELLX_FR_LFMN",
          `PARIS-CHARLES DE GAULLE` = "LU_ELLX_FR_LFPG",
          `STRASBOURG-ENTZHEIM` = "LU_ELLX_FR_LFST",
          `KEFLAVIK` = "LU_ELLX_IS_BIKF",
          `MILANO/MALPENSA` = "LU_ELLX_IT_LIMC",
          `BERGAMO/ORIO AL SERIO` = "LU_ELLX_IT_LIME",
          `ROMA/FIUMICINO` = "LU_ELLX_IT_LIRF",
          `AGADIR/AL MASSIRA` = "LU_ELLX_MA_GMAD",
          `AMSTERDAM/SCHIPHOL` = "LU_ELLX_NL_EHAM",
          `WARSZAWA/CHOPINA` = "LU_ELLX_PL_EPWA",
          `PORTO` = "LU_ELLX_PT_LPPR",
          `LISBOA` = "LU_ELLX_PT_LPPT",
          `STOCKHOLM/ARLANDA` = "LU_ELLX_SE_ESSA",
          `MONASTIR/HABIB BOURGUIBA` = "LU_ELLX_TN_DTMB",
          `ENFIDHA-HAMMAMET INTERNATIONAL` = "LU_ELLX_TN_DTNH",
          `ENFIDHA ZINE EL ABIDINE BEN ALI` = "LU_ELLX_TN_DTNZ",
          `DJERBA/ZARZIS` = "LU_ELLX_TN_DTTJ",
          `ANTALYA (MIL-CIV)` = "LU_ELLX_TR_LTAI",
          `ISTANBUL/ATATURK` = "LU_ELLX_TR_LTBA",
          `SYSTEM - LONDON` = "LU_ELLX_UK_EG90",
          `MANCHESTER` = "LU_ELLX_UK_EGCC",
          `LONDON GATWICK` = "LU_ELLX_UK_EGKK",
          `LONDON/CITY` = "LU_ELLX_UK_EGLC",
          `LONDON HEATHROW` = "LU_ELLX_UK_EGLL",
          `LONDON STANSTED` = "LU_ELLX_UK_EGSS",
          `NEWARK LIBERTY INTERNATIONAL, NJ.` = "LU_ELLX_US_KEWR",
          `O.R TAMBO INTERNATIONAL` = "LU_ELLX_ZA_FAJS"
        )
      )
)

# Step 6: Final cleaned dataset
avia_clean <- rxp_r(
  name = avia_clean,
  expr =
    avia_recode_destination %>%
      mutate(passengers = as.numeric(passengers)) %>%
      select(unit, tra_meas, destination, date, passengers)
)

# Step 7: Quarterly arrivals
avia_clean_quarterly <- rxp_r(
  name = avia_clean_quarterly,
  expr =
    avia_clean %>%
      filter(
        tra_meas == "Passengers on board (arrivals)",
        !is.na(passengers),
        str_detect(date, "Q")
      ) %>%
      mutate(date = yq(date))
)

# Step 8: Monthly arrivals
avia_clean_monthly <- rxp_r(
  name = avia_clean_monthly,
  expr =
    avia_clean %>%
      filter(
        tra_meas == "Passengers on board (arrivals)",
        !is.na(passengers),
        str_detect(date, "M")
      ) %>%
      mutate(date = ymd(paste0(date, "01"))) %>%
      select(destination, date, passengers)
)

# Populate and build the pipeline
rxp_populate(
  list(
    avia,
    avia_long,
    avia_split,
    avia_recode_tra_meas,
    avia_recode_unit,
    avia_recode_destination,
    avia_clean,
    avia_clean_quarterly,
    avia_clean_monthly
  )
)

rxp_make()

Now this is a faithful “translation” of the script into a {rixpress} pipeline. However, the original data is no longer available and recent data sets have changed slightly, which means the script would need further adaptation to the current data source. Otherwise, this would be it! You can view the updated script here (I have also removed all the recoding of factors, because there seems to be something wrong with how {rixpress} handles backticks, so writing this blog post actually helped me find something to fix!).

Generating the environment

I also used an LLM to generate the {rix} script that sets up the reproducible environment for this pipeline. I gave it the rix.pkgctx.yaml context file (generated with pkgctx r rix > rix.pkgctx.yaml, which is also available on the rix GitHub repo) and asked: “Using this knowledge, write me an R script that uses rix to set up the right default.nix for this pipeline.”

The LLM correctly identified the packages needed from the pipeline code:

  • readr (for read_tsv)
  • dplyr (for select, filter, mutate, %>%)
  • tidyr (for gather, separate)
  • forcats (for fct_recode)
  • lubridate (for yq, ymd)
  • stringr (for str_detect)
  • rixpress (for the pipeline itself)

And produced this script:

library(rix)

rix(
  date = "2026-01-10",
  r_pkgs = c(
    "readr",
    "dplyr",
    "tidyr",
    "forcats",
    "lubridate",
    "stringr",
    "rixpress"
  ),
  ide = "none",
  project_path = ".",
  overwrite = TRUE
)

There’s only one issue with that script: the selected date is not valid; it should instead be the 12th of January. But that’s actually my fault: the LLM had no way of knowing that. The only way it could have known is if I had told it to look at the CSV file that lists all the valid dates in {rix}’s repository. After changing the date, you can run this script, then nix-build to build the environment and nix-shell to drop into it. From there, run your pipeline.
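For reference, here is the same call with a valid date swapped in (only the date argument changes; everything else is the LLM’s output shown above):

library(rix)

rix(
  date = "2026-01-12",  # a valid snapshot date, as noted above
  r_pkgs = c(
    "readr",
    "dplyr",
    "tidyr",
    "forcats",
    "lubridate",
    "stringr",
    "rixpress"
  ),
  ide = "none",
  project_path = ".",
  overwrite = TRUE
)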

What we’ve done here is use LLMs at every step:

  1. Gave context about rixpress (via pkgctx) and asked the LLM to translate my old script into a pipeline
  2. Gave context about rix (via pkgctx) and asked the LLM to generate the environment setup

The pattern is always the same: context + scoped task = useful output.

Structure + context = outsourceable grunt work

The point I’m making here isn’t really about {rixpress} pipelines specifically. It’s about a broader principle that both Davis Vaughan and I have observed: LLMs are genuinely useful when you give them enough structure and context.

Davis pre-cloned repositories, pre-generated .Rprofile files, and pre-created task lists so Claude could focus on the actual fixes rather than git management. I used pkgctx to give the LLM a complete API specification and provided a clear starting point (my old script). In both cases, the formula is the same:

Structure + Context → Scoped Task → LLM can actually help

I’ve written before about how you can outsource grunt work to an LLM, but not expertise. The same applies here. I still had to know what data transformations I needed. I still had to review the output and make adjustments. But the tedious restructuring (turning a monolithic script into a declarative pipeline) is exactly the kind of work LLMs can handle if you set them up properly.

If you want LLMs to help with your data science work:

  1. Give them context. Use tools like pkgctx to feed them API specifications. Paste your existing code. Show them examples.
  2. Scope the task tightly. “Translate this script into a rixpress pipeline” is a well-defined task. “Make my code better” is not.
  3. Review the output. LLMs do grunt work; you provide expertise.

If you’re not familiar with {rixpress}, check out my announcement post or the CRAN release post. And if you want to give LLMs context about R or Python packages, pkgctx is there to help. For those who want to dive deeper into Nix, {rix}, and {rixpress}, I’ve recently submitted a paper to the Journal of Statistical Software, which you can read here. For more examples of {rixpress} pipelines, check out the rixpress_demos repository.

LLMs aren’t going anywhere: the genie is out of the bottle. I still see plenty of people online claiming that LLMs aren’t useful, but I genuinely believe it comes down to one of two things:

  • They’re not providing enough context or scoping their tasks well enough.
  • They have a principled objection to LLMs, AI, and automation in general which, ok, whatever, but it’s not a technical argument about usefulness.

Some people might even say that to feel good about themselves: “what I program is much too complex and important for mere LLMs to be able to help me”. Ok, perhaps, but not all of us are working for NASA or whatever. I’ll keep on outsourcing the tedious grunt work to LLMs.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: From scripts to pipelines in the age of LLMs]]>

398257


Open call for the rOpenSci Champions Program 2026! https://www.r-bloggers.com/2026/01/open-call-for-the-ropensci-champions-program-2026/

Mon, 12 Jan 2026 00:00:00 +0000
https://ropensci.org/blog/2026/01/12/programchamps2026/

Read it in: Español. We are pleased to announce the opening of a new call for applications for the rOpenSci Champions Program in Spanish, which will begin in 2026. We will be accepting applications beginning in January 12, 2026 and until February 20, …

Continue reading: Open call for the rOpenSci Champions Program 2026!]]>
[This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Read it in: Español.

We are pleased to announce the opening of a new call for applications for the rOpenSci Champions Program in Spanish, which will begin in 2026. We will be accepting applications from January 12, 2026 until February 20, 2026, for both the Champion and Mentor roles.

As in the previous cohort, the 2026 program will be run entirely in Spanish and will have a regional focus on Latin America, with the objective of further strengthening research and open science software in the region.

Key dates of the 2026 call

The Community Call on January 21 will feature the participation of Champions and Mentors from previous cohorts. They will share their experiences and answer questions about the program, and we invite you to join us! See the recording of last year’s event.

During February we will hold an application clinic: an open space where you can get help completing the application form, have your questions answered, and receive direct guidance from the program team.

What is the Champions Program?

This program seeks to identify, support and recognize people who are already leading, or who want to take a step further, in building open science and sustainable research software communities.

Over 12 months, the selected individuals will participate in:

The program also offers a stipend, to recognize the time and work of participants who complete the program, as well as a certificate of participation.

Who is it for?

Champions

Potential Champions are people who:

Mentors

Potential Mentors are people who:

Why participate?

Being part of the rOpenSci Champions Program means:

How to apply?

Applications will be made through online forms:

These forms are also available on the program website.

The forms must be completed in Spanish. On the program’s website you will also find more details regarding the requirements for Champions and Mentors, as well as answers to frequently asked questions.

Learn more about our work

On our Champions Program website you can find the detailed schedule of activities and complete information about the program, as well as projects carried out by previous cohorts and other activities carried out by participants.

If you have any questions, we invite you to participate in the Community Call in January, join us in the application clinic in February, or contact our Community Manager.

We look forward to your applications with great enthusiasm! We want to continue building community in Latin America together with you.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci – open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: Open call for the rOpenSci Champions Program 2026!]]>

398223


Retrieval-Augmented Generation: Setting up a Knowledge Store in R https://www.r-bloggers.com/2026/01/retrieval-augmented-generation-setting-up-a-knowledge-store-in-r/

Thu, 08 Jan 2026 23:59:00 +0000
https://www.jumpingrivers.com/blog/retrieval-augmented-generation-database-workflow-r/

Happy New Year from the team at Jumping Rivers!
As we move through the midpoint of the 2020s, it’s a good time to
reflect on the changes that we have seen so far in this decade. In the
world of data science nothing has dominated headlines quite l…

Continue reading: Retrieval-Augmented Generation: Setting up a Knowledge Store in R]]>
[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Happy New Year from the team at Jumping Rivers!

As we move through the midpoint of the 2020s, it’s a good time to reflect on the changes that we have seen so far in this decade. In the world of data science nothing has dominated headlines quite like the rapid growth and uptake of generative artificial intelligence (GenAI).

Large language models (LLMs) such as ChatGPT, Claude and Gemini have incredible potential to streamline day-to-day tasks, whether that’s processing vast amounts of information, providing a human-like chat interface for customers or generating code. But they also come with notable risks if not harnessed responsibly.

Anyone that has interacted with these models is likely to have come across hallucination, where the model confidently presents false information as though it is factually correct. This can happen for a variety of reasons:

Often we need to give the model access to additional contextual information before we can make it “production-ready”. We can achieve this using a retrieval-augmented generation (RAG) workflow. In this blog post we will explore the steps involved and set up an example RAG workflow using free and open source packages in R.

What is RAG?

In a typical interaction with an LLM we have:

In a RAG workflow we provide access to an external knowledge store which can include text-based documents and webpages. Additional contextual info is then retrieved from the knowledge store (hence “retrieval”) and added to the user prompt before it is sent. In doing so we can expect to receive a higher quality output.
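Conceptually, the augmentation step boils down to pasting the retrieved text into the prompt before it is sent. A toy sketch (in practice the packages introduced below handle this for us):

# Toy illustration only: prepend retrieved context to the user prompt
user_prompt <- "Who are the authors of the textbook?"
retrieved_chunks <- c(
  "First relevant chunk of text from the knowledge store ...",
  "Second relevant chunk ..."
)

augmented_prompt <- paste0(
  "Use the following context to answer the question.\n\n",
  "Context:\n", paste(retrieved_chunks, collapse = "\n\n"),
  "\n\nQuestion: ", user_prompt
)

cat(augmented_prompt)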

How does it work?

Before going further, we must first introduce the concept of vectorisation.

Contrary to what you might believe, LLMs do not understand non-numerical text! They are mathematical models, meaning they can only ingest and output numerical vectors.

So how can a user interact with a model using plain English? The trick is that mappings exist which are able to convert between numerical vectors and text. These mappings are called “vector embeddings” and are used to convert the user prompt into a vector representation before it is passed to the LLM.
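As a toy illustration (a real embedding model returns vectors with hundreds or thousands of dimensions; the numbers below are made up):

# Toy illustration: an embedding maps each piece of text to a numeric vector
texts <- c("Parallel computing in R", "Authors of the textbook")

toy_embeddings <- rbind(
  c(0.12, -0.80, 0.33, 0.51),
  c(-0.42, 0.10, 0.95, -0.27)
)
rownames(toy_embeddings) <- texts

toy_embeddings  # one row (one vector) per text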

So, when setting up our RAG knowledge store, we have to store the information using a compatible vector representation. With this in mind, let’s introduce a typical RAG workflow:

  1. Content: we decide which documents to include in the knowledge store.
  2. Extraction: we extract the text from these documents in Markdown format.
  3. Chunking: the Markdown content is split into contextual “chunks” (for example, each section or subsection of a document might become a chunk).
  4. Vectorisation: the chunks are “vectorised” (i.e. we convert them into a numerical vector representation).
  5. Index: we create an index for our knowledge store which will be used to retrieve relevant chunks of information.
  6. Retrieval: we register the knowledge store with our model interface. Now, when a user submits a prompt, it will be combined with relevant chunks of information before it is ingested by the model.

At the retrieval step, a matching algorithm is typically used so that only highly relevant chunks are retrieved from the knowledge store. In this way, we are able to keep the size of the user prompts (and any incurred costs) to a minimum.
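To illustrate the idea (this is not {ragnar}’s internal code), chunks can be ranked by cosine distance between the prompt embedding and each chunk embedding, keeping only the closest matches:

# Toy sketch of the matching step: rank chunks by cosine distance to the prompt
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

prompt_embedding <- c(0.10, -0.75, 0.30, 0.55)  # made-up numbers
chunk_embeddings <- list(
  chunk_1 = c(0.12, -0.80, 0.33, 0.51),
  chunk_2 = c(-0.42, 0.10, 0.95, -0.27)
)

distances <- sapply(chunk_embeddings, cosine_distance, b = prompt_embedding)
sort(distances)  # the smallest distance corresponds to the most relevant chunk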

Setting up a RAG workflow in R

We will make use of two packages which are available to install via the Comprehensive R Archive Network (CRAN). Both are actively maintained by Posit (formerly RStudio) and are free to install and use.

{ragnar}

The {ragnar} package provides functions for extracting information from both text-based documents and webpages, and provides vector embeddings that are compatible with popular LLM providers including OpenAI and Google.

We will use {ragnar} to build our knowledge store.

{ellmer}

The {ellmer} package allows us to interact with a variety of LLM APIs from R. A complete list of supported model providers can be found in the package documentation.

Note that, while {ellmer} is free to install and use, you will still need to set up an API token with your preferred model provider before you can interact with any models. We will use the free Google Gemini tier for our example workflow. See the Gemini API documentation for creating an API key, and the {ellmer} documentation for authenticating with your API key from R.

Example RAG workflow

We begin by loading the {ragnar} package.

library("ragnar")

The URL provided below links to the title page of the “Efficient R Programming” textbook, written by Robin Lovelace and our very own Colin Gillespie. We’re going to use a couple of chapters from the book to construct a RAG knowledge store.

url = "https://csgillespie.github.io/efficientR/"

Let’s use {ragnar} to read the contents of this page into a Markdown format.

md = read_as_markdown(url)

We could vectorise this information as it is, but first we should split it up into contextual chunks.

chunks = markdown_chunk(md)
chunks
#> # @document@origin: https://csgillespie.github.io/efficientR/
#> # A tibble: 2 × 4
#> start end context text
#> * <int> <int> <chr> <chr>
#> 1 1 1572 "" "# Efficient R programmin…
#> 2 597 2223 "# Welcome to Efficient R Programming" "## Authors\n\n[Colin Gil…

The chunks are stored in a tibble format, with one row per chunk. The text column stores the chunk text (in the interests of saving space we have only included the start of each chunk in the printed output above).

The title page has been split into two chunks and we can see that there is significant overlap (chunk 1 spans characters 1 to 1572 and chunk 2 spans characters 597 to 2223). Overlapping chunks are perfectly normal and provide added context as to where each chunk sits relative to the other chunks.

Note that you can visually inspect the chunks by running ragnar_chunks_view(chunks).

It’s time to build our knowledge store with a vector embedding that is appropriate for Google Gemini models.

# Initialise a knowledge store with the Google Gemini embedding
store = ragnar_store_create(
 embed = embed_google_gemini()
)

# Insert the Markdown chunks
ragnar_store_insert(store, chunks)

The Markdown chunks are automatically converted into a vector representation at the insertion step. It is important to use the appropriate vector embedding when we create the store. A knowledge store created using an OpenAI embedding will not be compatible with Google Gemini models!

Before we can retrieve information from our store, we must create a store index.

ragnar_store_build_index(store)

We can now test the retrieval capabilities of our knowledge store using the ragnar_retrieve() function. For example, to retrieve any chunks relevant to the text Who are the authors of “Efficient R Programming”? we can run:

relevant_knowledge = ragnar_retrieve(
 store,
 text = "Who are the authors of "Efficient R Programming"?"
)
relevant_knowledge
#> # A tibble: 1 × 9
#> origin doc_id chunk_id start end cosine_distance bm25 context text
#> <chr> <int> <list> <int> <int> <list> <lis> <chr> <chr>
#> 1 https://csgi… 1 <int> 1 2223 <dbl [2]> <dbl> "" "# E…

Note that the backslash-escaped quotes (\") around "Efficient R Programming" have been used to print raw double quotes inside the character string.

Without going into too much detail, the cosine_distance and bm25 columns in the returned tibble provide information relating to the matching algorithm used to identify the chunks. The other columns relate to the location and content of the chunks.

From the output tibble we see that the full content of the title page (characters 1 to 2223) has been returned. This is because the original two chunks both contained information about the authors.

Let’s add a more technical chapter from the textbook to the knowledge store. The URL provided below links to Chapter 7 (“Efficient Optimisation”). Let’s add this to the knowledge store and rebuild the index.

url = "https://csgillespie.github.io/efficientR/performance.html"

# Extract Markdown content and split into chunks
chunks = url |>
 read_as_markdown() |>
 markdown_chunk()

# Add the chunks to the knowledge store
ragnar_store_insert(store, chunks)

# Rebuild the store index
ragnar_store_build_index(store)

Now that our knowledge store includes content from both the title page and Chapter 7, let’s ask something more technical, like What are some good practices for parallel computing in R?.

relevant_knowledge = ragnar_retrieve(
 store,
 text = "What are some good practices for parallel computing in R?"
)
relevant_knowledge
#> # A tibble: 4 × 9
#> origin doc_id chunk_id start end cosine_distance bm25 context text
#> <chr> <int> <list> <int> <int> <list> <lis> <chr> <chr>
#> 1 https://csgi… 1 <int> 1 2223 <dbl [2]> <dbl> "" "# E…
#> 2 https://csgi… 2 <int> 1 1536 <dbl [1]> <dbl> "" "# 7…
#> 3 https://csgi… 2 <int> 22541 23995 <dbl [1]> <dbl> "# 7 E… "## …
#> 4 https://csgi… 2 <int> 23996 26449 <dbl [2]> <dbl> "# 7 E… "The…

Four chunks have been returned:

It makes sense that we have chunks from Section 7.5, which appears to be highly relevant to the question. By including the title page and the start of Chapter 7, the LLM will also have access to useful metadata in case the user wants to find out where the model is getting its information from.

Now that we have built and tested our retrieval tool, it’s time to connect it up to a Gemini interface using {ellmer}. The code below will create a chat object allowing us to send user prompts to Gemini.

chat = ellmer::chat_google_gemini(
 system_prompt = "You answer in approximately 10 words or less."
)

A system prompt has been included here to ensure a succinct response from the model API.

We can register this chat interface with our retrieval tool.

ragnar_register_tool_retrieve(chat, store)

To check if our RAG workflow has been set up correctly, let’s chat with the model.

chat$chat("What are some good practices for parallel computing in R?")
#> Use the `parallel` package, ensure you stop clusters with `stopCluster()` (or
#> `on.exit()`), and utilize `parLapply()`, `parApply()`, or `parSapply()`.

The output looks plausible. Just to make sure, let’s check where the model found out this information.

chat$chat("Where did you get that answer from?")
#> I retrieved the information from "Efficient R programming" by Colin Gillespie
#> and Robin Lovelace.

Success! The LLM has identified the name of the textbook and if we wanted to we could even ask about the specific chapter. A user interacting with our model interface could now search online for this textbook to fact-check the responses.

In the example workflow above, we manually selected a couple of chapters from the textbook to include in our knowledge store. It’s worth noting that you can also use the ragnar_find_links(url) function to retrieve a list of links from a given webpage.

Doing so for the title page will provide the links to all chapters.

ragnar_find_links("https://csgillespie.github.io/efficientR/")
#> [1] "https://csgillespie.github.io/efficientR/"
#> [2] "https://csgillespie.github.io/efficientR/building-the-book-from-source.html"
#> [3] "https://csgillespie.github.io/efficientR/collaboration.html"
#> [4] "https://csgillespie.github.io/efficientR/data-carpentry.html"
#> [5] "https://csgillespie.github.io/efficientR/hardware.html"
#> [6] "https://csgillespie.github.io/efficientR/index.html"
#> [7] "https://csgillespie.github.io/efficientR/input-output.html"
#> [8] "https://csgillespie.github.io/efficientR/introduction.html"
#> [9] "https://csgillespie.github.io/efficientR/learning.html"
#> [10] "https://csgillespie.github.io/efficientR/performance.html"
#> [11] "https://csgillespie.github.io/efficientR/preface.html"
#> [12] "https://csgillespie.github.io/efficientR/programming.html"
#> [13] "https://csgillespie.github.io/efficientR/references.html"
#> [14] "https://csgillespie.github.io/efficientR/set-up.html"
#> [15] "https://csgillespie.github.io/efficientR/workflow.html"

You could then iterate through these links, extracting the contents from each webpage and inserting these into your RAG knowledge store. Just note, however, that including additional information in your store will likely increase the amount of text being sent to the model, which could raise costs. You should therefore think about what information is actually relevant for your LLM application.
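A sketch of what that loop could look like, reusing the functions from above (which links you keep is up to you):

# Build up the store from every page linked from the title page
links <- ragnar_find_links("https://csgillespie.github.io/efficientR/")

for (link in links) {
  chunks <- link |>
    read_as_markdown() |>
    markdown_chunk()
  ragnar_store_insert(store, chunks)
}

# Rebuild the index once, after all chunks have been inserted
ragnar_store_build_index(store)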

Summary

In summary, we have introduced the concept of retrieval-augmented generation for LLM-powered workflows and built an example workflow in R using open source packages.

Before finishing, we are excited to announce that our new course “LLM-Driven Applications with R & Python” has just been added to our training portfolio. You can search for it here.

If you’re interested in practical AI-driven workflows, we would love to see you at our upcoming AI In Production 2026 conference which is running from 4-5 June in Newcastle-Upon-Tyne. If you would like to present a talk or workshop, please submit your abstracts before the deadline on 23 January.

For updates and revisions to this article, see the original post

To leave a comment for the author, please follow the link and comment on their blog: The Jumping Rivers Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: Retrieval-Augmented Generation: Setting up a Knowledge Store in R]]>

398146


Survivor 49 is now in 📦{survivoR} https://www.r-bloggers.com/2026/01/survivor-49-is-now-in-%f0%9f%93%a6survivor/

Thu, 08 Jan 2026 19:49:48 +0000
http://gradientdescending.com/?p=3486

Survivor 49 has wrapped up and has been added to the {survivoR} package. It is available on If you find […]
The post Survivor 49 is now in 📦{survivoR} appeared first on Dan Oehm | Gradient Descending.

Continue reading: Survivor 49 is now in 📦{survivoR}]]>
[This article was first published on R Archives – Dan Oehm | Gradient Descending, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
To leave a comment for the author, please follow the link and comment on their blog: R Archives – Dan Oehm | Gradient Descending.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: Survivor 49 is now in 📦{survivoR}]]>

398181


rtopy: an R to Python bridge — novelties https://www.r-bloggers.com/2026/01/rtopy-an-r-to-python-bridge-novelties/

Thu, 08 Jan 2026 00:00:00 +0000
https://thierrymoudiki.github.io//blog/2026/01/08/r/python/rtopy

rtopy: an R to Python bridge — novelties

Continue reading: rtopy: an R to Python bridge — novelties]]>
[This article was first published on T. Moudiki’s Webpage – R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

In this post, I present the novelties of the Python package rtopy, a package allowing (or whose ultimate objective is) to translate R to Python without much hassle. The intro is still available at https://thierrymoudiki.github.io/blog/2024/03/04/python/r/rtopyintro.

The novelties mainly concern the RBridge class and the call_r function. The RBridge class is more about persistency, while the call_r function is more about ease of use.

See for yourself in the following – hopefully comprehensive – examples (classification, regression, time series, hypothesis testing).

contents

  1. Installation
  2. RBridge class
  3. call_r function
  4. Advanced RBridge Usage Examples

%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

%%R

install.packages("pak")
pak::pak(c("e1071", "forecast", "randomForest"))

library(jsonlite)

!pip install rtopy

"""
Advanced RBridge Usage Examples
================================

Demonstrates using R packages, statistical modeling, and data processing
through the Python-R bridge `rtopy`.
"""

import numpy as np
import pandas as pd
from rtopy import RBridge, call_r


# ============================================================================
# Example 1: Support Vector Machine with e1071
# ============================================================================
print("=" * 70)
print("Example 1: SVM Classification with e1071")
print("=" * 70)

# Generate training data
np.random.seed(42)
n_samples = 100

# Class 0: centered at (-1, -1)
X0 = np.random.randn(n_samples // 2, 2) * 0.5 + np.array([-1, -1])
# Class 1: centered at (1, 1)
X1 = np.random.randn(n_samples // 2, 2) * 0.5 + np.array([1, 1])

X_train = np.vstack([X0, X1])
y_train = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))

# Create R code for SVM training and prediction
svm_code = '''
library(e1071)

train_svm <- function(X, y, kernel_type = "radial") {
    # Convert to data frame
    df <- data.frame(
        x1 = X[, 1],
        x2 = X[, 2],
        y = as.factor(y)
    )

    # Train SVM
    model <- e1071::svm(y ~ x1 + x2, data = df, kernel = kernel_type, cost = 1)

    # Make predictions on training data
    predictions <- predict(model, df)

    # Calculate accuracy
    accuracy <- mean(predictions == df$y)

    # Return results
    list(
        predictions = as.numeric(as.character(predictions)),
        accuracy = accuracy,
        n_support = model$tot.nSV
    )
}
'''

rb = RBridge(verbose=True)
result = rb.call(
    svm_code,
    "train_svm",
    return_type="dict",
    X=X_train,
    y=y_train,
    kernel_type="radial"
)

print(f"Training Accuracy: {result['accuracy']:.2%}")
print(f"Number of Support Vectors: {result['n_support']}")
print(f"Sample Predictions: {result['predictions'][:10]}")


# ============================================================================
# Example 2: Time Series Analysis with forecast package
# ============================================================================
print("n" + "=" * 70)
print("Example 2: Time Series Forecasting with forecast")
print("=" * 70)

# Generate time series data
time_series = np.sin(np.linspace(0, 4*np.pi, 50)) + np.random.randn(50) * 0.1

ts_code = '''
library(forecast)

forecast_ts <- function(x, h = 10) {
    # Convert to time series object
    ts_data <- ts(x, frequency = 12)

    # Fit ARIMA model
    fit <- auto.arima(ts_data, seasonal = FALSE)

    # Generate forecast
    fc <- forecast(fit, h = h)

    # Return results
    list(
        forecast_mean = as.numeric(fc$mean),
        forecast_lower = as.numeric(fc$lower[, 2]),  # 95% CI
        forecast_upper = as.numeric(fc$upper[, 2]),
        model_aic = fit$aic,
        model_order = paste0("ARIMA(",
                            paste(arimaorder(fit), collapse = ","),
                            ")")
    )
}
'''

result = rb.call(
    ts_code,
    "forecast_ts",
    return_type="dict",
    x=time_series.tolist(),
    h=10
)

print(f"Model: {result['model_order']}")
print(f"AIC: {result['model_aic']:.2f}")
print(f"5-step forecast: {np.array(result['forecast_mean'])[:5]}...")


# ============================================================================
# Example 3: Random Forest with randomForest package
# ============================================================================
print("n" + "=" * 70)
print("Example 3: Random Forest Regression")
print("=" * 70)

# Generate regression data
np.random.seed(123)
X = np.random.rand(200, 3) * 10
y = 2*X[:, 0] + 3*X[:, 1] - X[:, 2] + np.random.randn(200) * 2

rf_code = '''
library(randomForest)

train_rf <- function(X, y, ntree = 500) {
    # Create data frame
    df <- data.frame(
        x1 = X[, 1],
        x2 = X[, 2],
        x3 = X[, 3],
        y = y
    )

    # Train random forest
    rf_model <- randomForest(y ~ ., data = df, ntree = ntree, importance = TRUE)

    # Get predictions
    predictions <- predict(rf_model, df)

    # Calculate R-squared
    r_squared <- 1 - sum((y - predictions)^2) / sum((y - mean(y))^2)

    # Get feature importance
    importance_scores <- importance(rf_model)[, 1]  # %IncMSE

    list(
        r_squared = r_squared,
        mse = rf_model$mse[ntree],
        predictions = predictions,
        importance = importance_scores
    )
}
'''

result = rb.call(
    rf_code,
    "train_rf",
    return_type="dict",
    X=X,
    y=y.tolist(),
    ntree=500
)

print(f"R-squared: {result['r_squared']:.3f}")
print(f"MSE: {result['mse']:.3f}")
print(f"Feature Importance: {result['importance']}")


# ============================================================================
# Example 4: Statistical Tests with stats package
# ============================================================================
print("n" + "=" * 70)
print("Example 4: Statistical Hypothesis Testing")
print("=" * 70)

# Generate two samples
group1 = np.random.normal(5, 2, 50)
group2 = np.random.normal(6, 2, 50)

stats_code = '''
perform_tests <- function(group1, group2) {
    # T-test
    t_result <- t.test(group1, group2)

    # Wilcoxon test (non-parametric alternative)
    w_result <- wilcox.test(group1, group2)

    # Kolmogorov-Smirnov test
    ks_result <- ks.test(group1, group2)

    list(
        t_test = list(
            statistic = t_result$statistic,
            p_value = t_result$p.value,
            conf_int = t_result$conf.int
        ),
        wilcox_test = list(
            statistic = w_result$statistic,
            p_value = w_result$p.value
        ),
        ks_test = list(
            statistic = ks_result$statistic,
            p_value = ks_result$p.value
        ),
        summary_stats = list(
            group1_mean = mean(group1),
            group2_mean = mean(group2),
            group1_sd = sd(group1),
            group2_sd = sd(group2)
        )
    )
}
'''

result = rb.call(
    stats_code,
    "perform_tests",
    return_type="dict",
    group1=group1.tolist(),
    group2=group2.tolist()
)

print(f"Group 1 Mean: {result['summary_stats']['group1_mean']:.2f} ± {result['summary_stats']['group1_sd']:.2f}")
print(f"Group 2 Mean: {result['summary_stats']['group2_mean']:.2f} ± {result['summary_stats']['group2_sd']:.2f}")
print(f"nT-test p-value: {result['t_test']['p_value']:.4f}")
print(f"Wilcoxon p-value: {result['wilcox_test']['p_value']:.4f}")


# ============================================================================
# Example 5: Data Transformation with dplyr
# ============================================================================
print("n" + "=" * 70)
print("Example 5: Data Wrangling with dplyr")
print("=" * 70)

# Create sample dataset
data = pd.DataFrame({
    'id': range(1, 101),
    'group': np.random.choice(['A', 'B', 'C'], 100),
    'value': np.random.randn(100) * 10 + 50,
    'score': np.random.randint(1, 101, 100)
})

dplyr_code = '''
library(dplyr)

process_data <- function(df) {
    # Convert list columns to data frame
    data <- as.data.frame(df)

    # Perform dplyr operations
    result <- data %>%
        filter(score > 50) %>%
        group_by(group) %>%
        summarise(
            n = n(),
            mean_value = mean(value),
            median_score = median(score),
            sd_value = sd(value)
        ) %>%
        arrange(desc(mean_value))

    # Convert back to list format for JSON
    as.list(result)
}
'''

result = rb.call(
    dplyr_code,
    "process_data",
    return_type="pandas",
    df=data
)

print("nGrouped Summary Statistics:")
print(result)


# ============================================================================
# Example 6: Clustering with cluster package
# ============================================================================
print("n" + "=" * 70)
print("Example 6: K-means and Hierarchical Clustering")
print("=" * 70)

# Generate clustered data
np.random.seed(42)
cluster_data = np.vstack([
    np.random.randn(30, 2) * 0.5 + np.array([0, 0]),
    np.random.randn(30, 2) * 0.5 + np.array([3, 3]),
    np.random.randn(30, 2) * 0.5 + np.array([0, 3])
])

cluster_code = '''
library(cluster)

perform_clustering <- function(X, k = 3) {
    # Convert to matrix
    data_matrix <- as.matrix(X)

    # K-means clustering
    kmeans_result <- kmeans(data_matrix, centers = k, nstart = 25)

    # Hierarchical clustering
    dist_matrix <- dist(data_matrix)
    hc <- hclust(dist_matrix, method = "ward.D2")
    hc_clusters <- cutree(hc, k = k)

    # Silhouette analysis for k-means
    sil <- silhouette(kmeans_result$cluster, dist_matrix)
    avg_silhouette <- mean(sil[, 3])

    list(
        kmeans_clusters = kmeans_result$cluster,
        kmeans_centers = kmeans_result$centers,
        kmeans_withinss = kmeans_result$tot.withinss,
        hc_clusters = hc_clusters,
        silhouette_score = avg_silhouette
    )
}
'''

result = rb.call(
    cluster_code,
    "perform_clustering",
    return_type="dict",
    X=cluster_data,
    k=3
)

print(f"K-means Within-cluster SS: {result['kmeans_withinss']:.2f}")
print(f"Average Silhouette Score: {result['silhouette_score']:.3f}")
print(f"nCluster Centers:n{np.array(result['kmeans_centers'])}")
print(f"nCluster sizes: {np.bincount(result['kmeans_clusters'])}")


print("n" + "=" * 70)
print("All examples completed successfully!")
print("=" * 70)

======================================================================
Example 1: SVM Classification with e1071
======================================================================
Training Accuracy: 100.00%
Number of Support Vectors: 9
Sample Predictions: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

======================================================================
Example 2: Time Series Forecasting with forecast
======================================================================
Model: ARIMA(3,1,0)
AIC: -10.21
5-step forecast: [0.29557391 0.4948255  0.64553023 0.80823028 0.93656539]...

======================================================================
Example 3: Random Forest Regression
======================================================================
R-squared: 0.972
MSE: 11.996
Feature Importance: [62.57255479535195, 86.55470841243113, 21.4933655703039]

======================================================================
Example 4: Statistical Hypothesis Testing
======================================================================
Group 1 Mean: 5.33 ± 2.06
Group 2 Mean: 5.37 ± 2.28

T-test p-value: 0.9381
Wilcoxon p-value: 0.8876

======================================================================
Example 5: Data Wrangling with dplyr
======================================================================

Grouped Summary Statistics:
  group   n  mean_value  median_score   sd_value
0     C  23   49.711861            76  11.367167
1     A  14   49.219788            74   9.744709
2     B  23   47.459312            80  10.126835

======================================================================
Example 6: K-means and Hierarchical Clustering
======================================================================
K-means Within-cluster SS: 39.38
Average Silhouette Score: 0.713

Cluster Centers:
[[-0.03545142  3.12736567]
 [ 2.9470395   3.04927708]
 [-0.07207628 -0.0825784 ]]

Cluster sizes: [ 0 30 30 30]

======================================================================
All examples completed successfully!
======================================================================

import matplotlib.pyplot as plt
import seaborn as sns

# Set a style for better aesthetics
sns.set_style("whitegrid")

# Create a scatter plot of the clustered data
plt.figure(figsize=(10, 7))
sns.scatterplot(
    x=cluster_data[:, 0],
    y=cluster_data[:, 1],
    hue=result['kmeans_clusters'],
    palette='viridis',
    s=100, # size of points
    alpha=0.8, # transparency
    legend='full'
)

# Plot the cluster centers
centers = np.array(result['kmeans_centers'])
plt.scatter(
    centers[:, 0],
    centers[:, 1],
    marker='X',
    s=200, # size of centers
    color='red',
    edgecolors='black',
    label='Cluster Centers'
)

plt.title('K-means Clustering of Generated Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()

[Figure: K-means clustering of the generated data, with cluster centers marked]

from rtopy import RBridge, call_r

# ============================================================================
# Optional: SVM Classification (High vs Low Price)
# ============================================================================
print("n" + "=" * 70)
print("Optional: SVM Classification on Boston")
print("=" * 70)

svm_boston_class_code = '''
library(MASS)
library(e1071)

train_boston_svm_class <- function(kernel_type = "radial", cost = 1) {

    data(Boston)

    # Binary target: expensive vs cheap housing
    Boston$high_medv <- as.factor(ifelse(Boston$medv >
                                         median(Boston$medv), 1, 0))

    model <- svm(
        high_medv ~ . - medv,
        data = Boston,
        kernel = kernel_type,
        cost = cost,
        scale = TRUE
    )

    preds <- predict(model, Boston)

    accuracy <- mean(preds == Boston$high_medv)

    list(
        accuracy = accuracy,
        n_support = model$tot.nSV,
        confusion = table(
            predicted = preds,
            actual = Boston$high_medv
        )
    )
}
'''

result = rb.call(
    svm_boston_class_code,
    "train_boston_svm_class",
    return_type="dict",
    kernel_type="radial",
    cost=1
)

print(f"Classification Accuracy: {result['accuracy']:.2%}")
print(f"Number of Support Vectors: {result['n_support']}")
print("Confusion Matrix:")
print(result["confusion"])


======================================================================
Optional: SVM Classification on Boston
======================================================================
Classification Accuracy: 90.51%
Number of Support Vectors: 209
Confusion Matrix:
[[237, 29], [19, 221]]

To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki’s Webpage – R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: rtopy: an R to Python bridge — novelties]]>

398159


Directional markers in R/leaflet https://www.r-bloggers.com/2026/01/directional-markers-in-r-leaflet/

Wed, 07 Jan 2026 22:25:16 +0000
http://nsaunders.wordpress.com/?p=5702

So you have used the excellent exiftool to extract all of the GPS-related information from a directory of photos in JPG format and write to a CSV file: You’ve used R/leaflet to plot coordinates (latitude and longitude) before, but what about that tag named GPSImgDirection? It would be …

Continue reading: Directional markers in R/leaflet]]>
[This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

So you have used the excellent exiftool to extract all of the GPS-related information from a directory of photos in JPG format and write to a CSV file:

exiftool '-*GPS*' -ext jpg -csv . > outfile.csv

You’ve used R/leaflet to plot coordinates (latitude and longitude) before, but what about that tag named GPSImgDirection? It would be nice to have some kind of marker which indicates the direction in which you were facing when the photo was taken.

For me, a Google search provided hints but no single, obvious, straightforward solution to this problem (the generative AI effect? time will tell…), so here’s what I’ve put together from several sources, in particular this StackOverflow post.

The key points are:

Here’s some example code which uses the Font Awesome icon long-arrow-up. Since “up” (north) corresponds to zero degrees, applying a rotation corresponding to GPSImgDirection should result in the correct orientation for the marker. The GPS-related tags in this case come from an iPhone 13.

library(readr)
library(dplyr)    # needed for mutate() and select()
library(stringr)  # needed for str_replace()
library(leaflet)
library(sp)

outfile <- read_csv("outfile.csv")

# create dataset
# ugly but it works
dataset <- outfile %>%
  mutate(GPSLatitude = str_replace(GPSLatitude, " deg", "d"),
         GPSLatitude = GPSLatitude %>%
         char2dms() %>%
           as.numeric(),
         GPSLongitude = str_replace(GPSLongitude, " deg", "d"),
         GPSLongitude = GPSLongitude %>%
           char2dms() %>%
           as.numeric(),
         GPSHPositioningError = str_replace(GPSHPositioningError, " m", ""),
         GPSHPositioningError = GPSHPositioningError %>%
           as.numeric()) %>%
  select(latitude = GPSLatitude,
         longitude = GPSLongitude,
         GPSTimeStamp,
         GPSImgDirection,
         GPSHPositioningError)

# create the marker icons
icons <- awesomeIcons(iconRotate = dataset$GPSImgDirection,
                      icon = "long-arrow-up",
                      library = "fa",
                      markerColor = "white",
                      squareMarker = TRUE)

# create map
# can filter on positioning error if desired
leaflet(data = dataset) %>%
  addProviderTiles(provider = providers$CartoDB.Positron) %>%
  addAwesomeMarkers(icon = icons, label = ~GPSTimeStamp)

Here’s a screenshot of the resulting interactive map.

To leave a comment for the author, please follow the link and comment on their blog: R – What You're Doing Is Rather Desperate.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: Directional markers in R/leaflet]]>


398155


Rugby Analytics with R: Complete Guide to Performance Analysis in Rugby Union and League https://www.r-bloggers.com/2026/01/rugby-analytics-with-r-complete-guide-to-performance-analysis-in-rugby-union-and-league/

Wed, 07 Jan 2026 19:46:55 +0000
https://rprogrammingbooks.com/?p=2391

Rugby is a sport defined by collisions, structure, and constant tactical adaptation. Unlike many other invasion sports, rugby alternates between highly structured moments—scrums, lineouts, restarts—and extended passages of chaotic open play. Each phase generates rich performance data: tackles, rucks, carries, kicks, meters gained, penalties conceded, turnovers, and spatial …

Continue reading: Rugby Analytics with R: Complete Guide to Performance Analysis in Rugby Union and League]]>
[This article was first published on Blog – R Programming Books, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Rugby is a sport defined by collisions, structure, and constant tactical adaptation. Unlike many other invasion sports, rugby alternates between highly structured moments—scrums, lineouts, restarts—and extended passages of chaotic open play. Each phase generates rich performance data: tackles, rucks, carries, kicks, meters gained, penalties conceded, turnovers, and spatial changes in territory. Despite this richness, rugby analytics has historically lagged behind other sports, especially in terms of open, reproducible analytical workflows.

This gap presents a clear opportunity. R provides a complete environment for rugby performance analysis: data acquisition, cleaning, modeling, visualization, and automated reporting. For analysts, sports scientists, and coaches, R enables evidence-based decision-making that goes far beyond traditional statistics and subjective video review.

Why rugby analytics requires a different analytical mindset

Rugby is not a possession-by-possession sport in the same way as basketball, nor a continuous-flow game like football. Possession can be short or long, territory often matters more than time on the ball, and a single penalty can flip match momentum. Analytics must therefore respect rugby’s unique structure.

Simple totals—tackles, carries, meters—are insufficient on their own. Analysts must consider game state, field position, opposition quality, and player role. R makes it possible to incorporate this context systematically and consistently across matches and seasons.

Data acquisition in rugby: scraping, APIs, and internal feeds

Public rugby data is fragmented and inconsistent. Analysts often combine multiple sources to build a usable dataset. R is particularly well suited to this challenge because it supports web scraping, API consumption, and database integration within a single workflow.

# Core libraries for rugby data acquisition
library(tidyverse)
library(rvest)
library(httr)
library(jsonlite)

# Example: pulling match data from an API
response <- GET("https://api.example.com/rugby/match/9876")
raw_json <- content(response, "text")
match_data <- fromJSON(raw_json)

Web scraping is often necessary when APIs are unavailable. This requires careful handling of HTML structure, rate limits, and data validation to ensure accuracy and reproducibility.

# Scraping a match statistics table
page <- read_html("https://example-rugby-site.com/match/9876")

team_stats <- page %>%
  html_node("table.match-stats") %>%
  html_table()

team_stats

Data cleaning and validation: a critical but underestimated step

Rugby datasets are rarely analysis-ready. Player substitutions, injury replacements, and data entry inconsistencies introduce errors that can distort results if left unchecked.

# Standardizing and validating team statistics
team_stats_clean <- team_stats %>%
  janitor::clean_names() %>%
  mutate(across(where(is.character), str_trim)) %>%
  mutate(
    possession = as.numeric(possession),
    territory = as.numeric(territory)
  )

# Basic validation check
stopifnot(all(team_stats_clean$possession >= 0 & team_stats_clean$possession <= 100))

Validation logic should be embedded directly into the pipeline. This ensures that every new match is processed consistently, reducing human error and analyst workload.

Transforming events into rugby-specific units of analysis

Raw events are only the starting point. Meaningful rugby analysis requires transforming events into units such as phases, possessions, sets, and passages of play.

# Creating phase identifiers from ruck events
events <- events %>%
  arrange(match_id, event_time) %>%
  mutate(
    phase_id = cumsum(event_type == "ruck")
  )

# Summarising phase-level performance
phase_summary <- events %>%
  group_by(match_id, team, phase_id) %>%
  summarise(
    duration = max(event_time) - min(event_time),
    carries = sum(event_type == "carry"),
    meters = sum(meters_gained, na.rm = TRUE),
    turnovers = sum(event_type == "turnover"),
    .groups = "drop"
  )

These structures allow analysts to study momentum, ruck efficiency, and attacking intent in a way that aligns with how coaches understand the game.

Advanced player performance analysis with R

Player evaluation in rugby must be contextual and role-specific. Front-row players, halves, and outside backs contribute in fundamentally different ways.

# Player-level performance profile
player_profile <- events %>%
  group_by(player_id, player_name, position) %>%
  summarise(
    minutes_played = max(event_time) / 60,
    tackles = sum(event_type == "tackle"),
    missed_tackles = sum(event_type == "missed_tackle"),
    carries = sum(event_type == "carry"),
    meters = sum(meters_gained, na.rm = TRUE),
    offloads = sum(event_type == "offload"),
    penalties_conceded = sum(event_type == "penalty_conceded"),
    .groups = "drop"
  ) %>%
  mutate(
    tackles_per_min = tackles / minutes_played,
    meters_per_carry = meters / carries
  )

Rate-based metrics reveal impact more effectively than totals, especially when comparing starters to bench players or evaluating performance across different match contexts.

Defensive systems analysis: beyond individual tackles

Effective defense is systemic. Missed tackles often result from spacing errors, fatigue, or poor decision-making rather than individual incompetence.

# Defensive performance by field channel
defense_analysis <- events %>%
  filter(event_type %in% c("tackle", "missed_tackle")) %>%
  group_by(team, field_channel) %>%
  summarise(
    tackles = sum(event_type == "tackle"),
    misses = sum(event_type == "missed_tackle"),
    success_rate = tackles / (tackles + misses),
    .groups = "drop"
  )

Defensive analytics should highlight structural weaknesses and workload imbalances, not just individual error counts.

Territory, kicking strategy, and spatial dominance

Territory remains a core determinant of success in rugby. Teams that consistently win the territorial battle reduce defensive workload and increase scoring opportunities.

# Kicking distance and efficiency
kicks <- events %>%
  filter(event_type == "kick") %>%
  mutate(kick_distance = end_x - start_x)

kicking_summary <- kicks %>%
  group_by(team, kick_type) %>%
  summarise(
    avg_distance = mean(kick_distance, na.rm = TRUE),
    kicks = n(),
    .groups = "drop"
  )

Spatial analysis allows analysts to quantify whether a team’s kicking strategy aligns with its stated game model and environmental constraints.

Win probability and decision modeling in rugby

Win probability models convert complex match states into intuitive probabilities. In rugby, these models must account for score, time, territory, possession, and discipline risk.

# Building a basic win probability model
wp_data <- matches %>%
  mutate(
    score_diff = team_score - opponent_score,
    time_remaining = 80 - minute
  )

wp_model <- glm(
  win ~ score_diff + time_remaining + territory,
  data = wp_data,
  family = binomial()
)

summary(wp_model)

Even simple models provide immediate value by framing tactical decisions—such as kicking for touch versus taking the points—in probabilistic terms.
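
For example, the fitted model can be queried for a hypothetical game state (the numbers below are illustrative, and territory is assumed to be a percentage):

# Win probability when trailing by 2 points with 10 minutes left and 55% territory
game_state <- data.frame(score_diff = -2, time_remaining = 10, territory = 55)
predict(wp_model, newdata = game_state, type = "response")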

Automated reporting and reproducible workflows

The final step in rugby analytics is communication. R enables analysts to automate reporting, ensuring consistency and freeing time for deeper insight generation.

# Creating a clean match summary table
summary_table <- team_stats_clean %>%
  select(team, possession, territory, tackles, line_breaks, penalties_conceded)

knitr::kable(summary_table)

Automated reports ensure that analysis becomes part of the weekly rhythm rather than an optional extra.
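
A common way to automate this (a sketch; the template file and its match_id parameter are assumptions, not shown here) is a parameterised R Markdown or Quarto report rendered once per match:

# Render a parameterised match report (match_report.Rmd is a hypothetical template)
rmarkdown::render(
  "match_report.Rmd",
  params = list(match_id = 9876),
  output_file = "match_9876_report.html"
)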

The strategic opportunity in rugby analytics with R

There is clear and growing interest in rugby analytics, but very little comprehensive, R-focused content. Analysts, sports scientists, and coaches are actively searching for practical guidance.

A dedicated, end-to-end approach—covering data acquisition, performance metrics, modeling, and reporting—fills a genuine gap and establishes authority in a niche with minimal competition.

My book:

Rugby Analytics with R: Performance Analysis for Rugby Union & League

A complete, practical guide for applying R to real-world rugby performance analysis, designed for analysts, sports scientists, and coaches working in Rugby Union and Rugby League.

The post Rugby Analytics with R: Complete Guide to Performance Analysis in Rugby Union and League appeared first on R Programming Books.

To leave a comment for the author, please follow the link and comment on their blog: Blog – R Programming Books.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: Rugby Analytics with R: Complete Guide to Performance Analysis in Rugby Union and League]]>

398134


R Studio or Positron? Time To Switch? https://www.r-bloggers.com/2026/01/r-studio-or-positron-time-to-switch/

Wed, 07 Jan 2026 08:00:00 +0000
https://ozancanozdemir.github.io/r/R-Studio-Positron

I remember the day that I started to use R programming. I had a basic interface to write and execute the code. After that experience, R Studio emerged as a powerful IDE for R programming for me. It provided a user-friendly interface, integrated tools…

Continue reading: R Studio or Positron? Time To Switch?]]>
[This article was first published on Ozancan Ozdemir, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.


I remember the day I started using R. At first I had only a basic interface to write and execute code. After that experience, R Studio emerged as a powerful IDE for R programming for me. It provided a user-friendly interface, integrated tools, and features that enhance productivity and streamline the coding process, and it marked a huge shift in my R programming journey.

In July 2022, the company behind R Studio announced its rebranding to Posit. Apparently, a new era was about to start because the world’s needs were evolving, and R had a stronger companion in the Python programming language.

[Figure: R Studio Interface (Source: biocorecrg.github.io)]

To satisfy the needs of both R and Python users, Posit introduced a new product called Positron. It is a data-science-oriented IDE that supports both R and Python, in contrast to R Studio. Naturally, this emerging tool has tempted some R Studio users who also use VSCode, since it offers some advantages over R Studio.

[Figure: Positron Interface (Source: https://positron.posit.co/)]

The main difference between Positron and R Studio is their multi-language support. Positron allows users to work with both R and Python in a single environment, making it easier for data scientists who use both languages. Additionally, Positron offers better integration with Jupyter Notebooks, which are widely used in the data science community.

AI-based assistants are also integrated into Positron, providing users with suggestions and code completions based on their coding patterns. This feature can significantly enhance productivity and reduce the learning curve for new users.

When you are exploring data, Positron offers more flexibility and versatility than R Studio. You can examine not only the data frames in your environment, but also .csv and parquet files, without importing them first.

Another advantage of Positron is its support for extensions, which make the IDE more customizable and adaptable to different workflows. Users can install extensions to add new features, improve functionality, and tailor the environment to their specific needs.

Conflicts between package versions and R versions can become annoying if you have run into them in R Studio. With Positron, you can manage different R versions on the same machine at the same time without conflicts. This is particularly useful for users who work on multiple projects with different R version requirements.

Lastly, Positron is improving continuously, with frequent updates and new features added regularly, so users always have access to the latest tools and technologies in the data science field.

So the question is: Should we give up on using R Studio?

Actually, no. R Studio is not going away, and it still provides some advantages over Positron.

R Studio still has strong features that tempt users to keep using it. You can use RMarkdown and Quarto to create dynamic documents, reports, and presentations that combine code, text, and visualizations. R Studio also has a robust ecosystem of packages and extensions that enhance its functionality and provide specialized tools for various data analysis tasks.

You can save and reload your workspace, and several panes help you manage your files, plots, packages, and help documents easily. You can search the code you have written in the past and bring it back without spending much time. And you can import your datasets without typing any code!

From a developer's perspective, R Studio has specific tools that make developing packages and apps easier than in Positron.

In conclusion, both Positron and R Studio have their own strengths and weaknesses. The choice between the two ultimately depends on the user’s specific needs and preferences. If you require multi-language support, better Jupyter integration, and AI-based assistance, Positron may be the better choice. However, if you prioritize RMarkdown, a robust package ecosystem, and workspace management, R Studio may be more suitable.

To leave a comment for the author, please follow the link and comment on their blog: Ozancan Ozdemir.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: R Studio or Positron? Time To Switch?]]>

398136


So how much does OpenAI owe us? https://www.r-bloggers.com/2026/01/so-how-much-does-openai-owe-us/

Wed, 07 Jan 2026 00:00:00 +0000
https://datascienceconfidential.github.io/economics/ai/llm/r/2026/01/07/so-how-much-does-openai-owe-us.html

Introduction: Copyright Law and Whatnot

I recently watched a clip from a debate between Timothy Nguyen of Google Deepmind and Danish author Janne Teller. The debate, entitled Technology and Freedom, took place at Hay-on-Wye in summer 2025. On the subj…

Continue reading: So how much does OpenAI owe us?]]>
[This article was first published on datascienceconfidential – r, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Introduction: Copyright Law and Whatnot

I recently watched a clip from a debate between Timothy Nguyen of Google Deepmind and Danish author Janne Teller. The debate, entitled Technology and Freedom, took place at Hay-on-Wye in summer 2025. On the subject of copyright, Nguyen says:

The reason AI is so powerful is because it’s scraped all this data on the internet and of course that has all these issues in terms of copyright law and whatnot. But that’s also democratised knowledge and so there are ways in which it’s been good and bad. But now we have this very intelligent system which has all this knowledge from books, but then maybe there are going to be some authors who aren’t going to be very happy. So there are always going to be winners and losers.

Teller replies:

This is an undermining of any intellectual property rights we have developed up to now. Anything you have written on a Facebook post which is public will be considered by this Metaverse as something they can use to develop their AI and you might say OK, that’s a new form of sharing. Anything you contribute, everybody owns it. But then that speaks to nationalising all technology platforms. You want to have everything everyone else has created. But then we want to have your work also and have control over it.

The clip cuts off here and I haven’t seen a video of the full debate, so I don’t know how Nguyen replied. But I think Teller made a good point. It’s not just that LLMs have been trained (illegally) on masses of copyrighted material. They have also been trained on data from the internet, which is a public good, and perhaps the people who unwittingly created all the training data should be entitled to some sort of compensation. Even the slopigarchs themselves acknowledge this. For example, in 2017, Elon Musk said that the pace of change is:

a massive social challenge. And I think ultimately we will have to have some kind of universal basic income (UBI). I don’t think we’re going to have a choice.

At the moment we are facing two possible outcomes. Either AI progress grinds to a halt and the bubble bursts, or AI breakthroughs continue to happen at a rapid pace, replacing human jobs, and everyone ends up becoming unemployed until they can find other jobs to do. Every previous improvement in technology, no matter how disruptive, eventually ended up with people finding other things to do, so the economy will keep going somehow. But before we reach that point we may find ourselves facing serious social unrest. As Teller suggests, perhaps it is the AI companies themselves who should pay for this. After all, they did steal everyone else’s work to train their models. But if, in some grim future in which companies like OpenAI become profitable, we eventually get compensation, how much compensation should we get?

It seems like this question has no answer. But actually there’s a simple heuristic for evaluating the relative contributions of the model and the data, which I want to explain in this post. Not only is this heuristic relevant for musing about the future of AI, but it’s also surprisingly useful in everyday data science, too.

The Cover-Hart Theorem

Consider a classification problem in which the input is a data point $x$ contained in some metric space (i.e. a set equipped with a notion of distance) $(X, d)$, and the output is a classification into one of $M$ classes. The classifier is evaluated by the percentage of data points which it classifies correctly (the accuracy). If $A$ is the accuracy then $R = 1-A$ is called the error rate.

The Bayes rate $R^\ast$ for the problem is defined to be the lowest possible error rate which any classifier could achieve. Why isn’t $R^\ast$ just 0? That’s because the same point might appear in more than one class! See the example below.

Suppose a data set $\mathcal{X}$ is given. It consists of some points $x_i \in X$ and the corresponding classes $\theta_i$. We want to use the data set $\mathcal{X}$ to build a classifier.

The 1-Nearest Neighbour or 1-NN classifier is the classifier $C$ which simply assigns an unseen data point $x$ to the class of the closest point to $x$ in $\mathcal{X}$ (for simplicity, let’s assume that $\mathcal{X}$ doesn’t contain any duplicate points). That is, if $d(x, x_i) = \min_{y \in \mathcal{X}} d(x, y)$ then $C(x) := \theta_i$. Note that to define the 1-NN classifier, we need $X$ to be a metric space, or else there is no notion of the nearest neighbour.

The theorem which Cover and Hart proved in 1967 is that the error rate $R$ of the 1-NN classifier satisfies

$$R^\ast \le R \le 2R^\ast$$

asymptotically as the number of data points in $\mathcal{X}$ goes to $\infty$, and provided that the points in $\mathcal{X}$ are an iid sample from some distribution.

In other words, if you are given a data set and asked to build a predictive model, just doing the most naive thing possible and looking up the closest point in your data set to the point you want to classify already gets you halfway to the lowest possible error.

Example

Here is an example which I used to use when teaching this topic in university courses.

Let’s consider a single predictor $x$. There are two classes labelled $0$ and $1$. The distribution of $x$ for class $1$ is $N(1, 1)$ and the distribution of $x$ for class $0$ is $N(-1, 1)$. Suppose the population is equally distributed among the two classes.

The best possible classifier would classify a point $x$ into whichever class has the higher density for that particular value of $x$. The overlap between the two densities (shown as a purple area in the original post’s plot) represents the proportion of points which would be misclassified. Since 50% of the population is in each class, this area is equal to

bayes_rate <- 1-pnorm(1)
# 0.1586553

Now suppose we are supplied with a training dataset consisting of 50 points from each class

set.seed(100)

N <- 100
df_train <- data.frame(x = c(rnorm(N/2, 1, 1), rnorm(N/2, -1, 1)), y = rep(c(1, 0), each=N/2))

The following function classifies a point using the nearest neighbour with the metric being $d(x, y) = \lvert x - y \rvert$.

classify_point <- function(x, df){
  df$y[which.min(abs(x-df$x))]
}

To see whether the Cover-Hart Theorem works in this example, let’s create a test data set of 1000 new points.

M <- 1000
df_test <- data.frame(x = c(rnorm(M, 1, 1), rnorm(M, -1, 1)), y = c(rep(1, M), rep(0, M)))

The error rate of the 1-NN classifier on this data set can be calculated as follows

pred <- sapply(as.list(df_test$x), function(x) classify_point(x, df_train))
1 - sum(pred == df_test$y)/length(pred)
# 0.216

As expected, $0.216$ lies between the Bayes rate and twice the Bayes rate.

Of course, many other classifiers will perform better. For example, logistic regression already gets very close to the Bayes rate on this problem.

model <- glm(y~x, data=df_train, family="binomial")
pred_logistic <- round(predict(model, df_test, type="response"))
1 - sum(pred_logistic == df_test$y)/length(pred_logistic)
# 0.16

If you run the whole script again with the same seed but with N=10000 points in the training data, you will even find that logistic regression gets an error rate which is lower than the Bayes rate! This happens because the training and test sets are finite samples from the actual data distribution, so there is some sampling error.

Practical Use

There are two ways to use this in practice. First, suppose that you are presented with a data set and build a quick and dirty classifier using 1-NN and achieve an accuracy of 80%. Then the error rate $R$ of the 1-NN classifier is 20% and the Cover-Hart Theorem tells you that the Bayes rate $R^\ast \ge R/2$, so the Bayes rate cannot be less than 10%, which means that you can’t expect to achieve an accuracy of better than 90% using some other algorithm. This might be a helpful guide to how much time you should spend trying to build a better classifier. In practice, the quick and dirty classifier you build will be something other than 1-NN[1], and it usually has better performance than 1-NN, so this can actually be a useful way to estimate the Bayes rate on a new data set.
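
That arithmetic is short enough to keep as a snippet (using the hypothetical 80% accuracy from above):

nn_accuracy <- 0.80
nn_error <- 1 - nn_accuracy            # R = 0.20
bayes_lower_bound <- nn_error / 2      # Cover-Hart: R* >= R/2, so R* >= 0.10
1 - bayes_lower_bound                  # best achievable accuracy is at most 0.90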

Secondly, suppose that you are presented with a classification algorithm with an accuracy of 95%. Then you can estimate that the Bayes rate $R^\ast$ is at most 5%, because $R^\ast$ is the lowest possible error rate among all classifiers. This means that the error rate of a 1-NN classifier $R$ cannot be more than 10%. But that means that a 1-NN classifier would have given you at least 90% accuracy. Since the 1-NN classifier is just another name for “look at the data”, you could already achieve 90% accuracy by looking at the data alone without building your fancy model. In other words, the data is doing $90/95 \approx 94.7\%$ of the work![2]

Problems with the Cover-Hart Theorem

In practice, Cover-Hart should be used only as a heuristic and not as something which is expected to hold in all cases. This is because it makes very strong assumptions about the data.

For example, consider image classification. Cover-Hart suggests that you can classify any image correctly if you find the closest image, perhaps in the sense of Euclidean distance, in some sufficiently large reference data set. But clearly the reference data set would have to be massive, and the cost of searching for the closest image would probably be extremely high.

What does Cover-Hart say about AI?

The Cover-Hart Theorem, then, doesn’t suggest a sensible way to build an AI model. For example, suppose you want to generate the next word, given a string of text. A 1-NN classifier would be supplied with a corpus of data. It would need to search through this data and find the piece of text that was the closest match to the given string, and then extract the next word from that piece of text. For some kinds of text, like “The capital of France is”, this might work well, but clearly it’s not going to be a good approach in general.

This isn’t how Large Language Models work at all, so how is the Cover-Hart Theorem relevant to LLMs? Well, I think it could be used as a heuristic for measuring the relative contribution of the model and the data. For example, let’s suppose an LLM has an accuracy of $A$ (expressed as a proportion) on some benchmark. Then, as explained above, a 1-NN classifier could be expected to achieve an error rate of at most $2(1-A)$, and so you could estimate that the data by itself is contributing roughly

$$\frac{1-2(1-A)}{A} = \frac{2A-1}{A}$$

of the overall performance. This could be taken as a measure of how much the data is “worth” versus the model.

For example, if an AI company achieves 80% on some benchmark, then the people who contributed the data in some sense deserve $(2(0.8)-1)/0.8 = 75\%$ of the credit.
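
Wrapped up as a tiny helper (the function name is mine, not from the post), this reproduces the figure above:

# Heuristic share of performance attributable to the data alone
data_credit <- function(accuracy) (2 * accuracy - 1) / accuracy
data_credit(0.80)
# 0.75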

Who was right?

So was Teller correct? Do the people who generated the data deserve most of the profits (if there are any) from AI? Well, that depends on what you mean by “AI”.

In the case of LLMs, assuming that they really are able to replace people in the workplace, I think the Cover-Hart Theorem could provide a first step for deciding how to regulate or tax. But the term “AI” encompasses a lot of different models, and some of those models don’t use training data at all. For example, AlphaZero reached grandmaster-level performance in chess and superhuman performance in go by playing against itself. And this isn’t a new idea; in the 1990s TD-Gammon was already able to outperform humans in backgammon by taking a similar approach. Personally I find these kinds of algorithms even more impressive than LLMs, but that’s just my opinion.

One more thing. Suppose we did find ourselves in a world in which a government was choosing to tax AI companies based on the above formula. Then we could reach a bizarre scenario in which, in order to avoid tax, the AI companies would be competing to make the ratio $(2A-1)/A$ as small as possible. This would mean that, instead of boasting about the accuracy of their models on self-chosen benchmarks, these firms would find themselves in a paradoxical race to claim that their accuracy was as low as possible.

I think that would be hilarious.


[1] By the way, the 1-NN classifier is one of the very few classifiers which outputs just a class without any notion of the strength of class membership, so you can’t define an AUC for it. This is one of the classifiers which suffers from the so-called class imbalance problem, which they ask about in every data science interview. In practice, class imbalance is never really a problem because nobody compares classifiers by using accuracy alone.

[2] Of course, this might not be the full story. For one thing, you will probably be interested in other metrics besides accuracy. For another, your algorithm might have other advantages over the 1-NN classifier, such as coming up with predictions more quickly.

To leave a comment for the author, please follow the link and comment on their blog: datascienceconfidential – r.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: So how much does OpenAI owe us?]]>

398167


rOpenSci Code of Conduct Annual Review https://www.r-bloggers.com/2026/01/ropensci-code-of-conduct-annual-review-7/

Wed, 07 Jan 2026 00:00:00 +0000
https://ropensci.org/blog/2026/01/07/conduct2025/

Read it in: Español. rOpenSci’s activities and spaces are supported by a Code of Conduct
that applies to all people participating in the rOpenSci community,
including rOpenSci staff and leadership.
It applies to all modes of interaction including GitH…

Continue reading: rOpenSci Code of Conduct Annual Review]]>
[This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Read it in: Español.

rOpenSci’s activities and spaces are supported by a Code of Conduct that applies to all people participating in the rOpenSci community, including rOpenSci staff and leadership. It applies to all modes of interaction including GitHub project repositories, the rOpenSci discussion forum, Slack, Community Calls, Co-working and social sessions, training and mentoring sessions, and in person at rOpenSci-hosted events, including affiliated social gatherings. Our Code of Conduct is developed and enforced by a committee including rOpenSci staff and an independent community member.

Here we report our annual review of rOpenSci’s Code of Conduct, reporting process, and internal guidelines for handling reports and enforcement.

Updates

  1. No changes of Committee members for 2026. The committee members are Natalia Morandeira (independent member), Mark Padgham (rOpenSci Software Research Scientist) and Yanina Bellini Saibene (rOpenSci Community Manager). We are responsible for receiving, investigating, adjudicating, enforcing, and responding to all reports of potential violations of our Code.

  2. No changes have been made to the text of the Code in English, Spanish or Portuguese.

  3. Because the text has remained unchanged, the version number is kept at 2.5, dated January 30, 2024.

Committee Members Activities

An important aspect of our work as Committee members is to make sure that the processes we follow are transparent, consistent, and fair. To support this work we’ve developed a set of templates that guide us through different stages of incident response and reporting. We share these templates openly and explain them in detail in the blog post “rOpenSci Code of Conduct Committee Templates”.

Yanina Bellini Saibene attended the BoF “Community safety and wellbeing: Can we do more than ‘good enough’?” at the Open Science CZI meeting to discuss Code of Conduct best practices with other organizations in the Open Science movement.

Contact Information

We welcome your feedback by email to conduct at ropensci.org, and we thank you for continuing to work with us to ensure that rOpenSci remains a safe, enjoyable, friendly and enriching environment for everyone who participates.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci – open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.
Continue reading: rOpenSci Code of Conduct Annual Review]]>

398110



Analysis and Long-term implications of {ellmer} for dynamic alt text generation

The key focus of this text is using a new R package called {ellmer} to automatically generate alternative (alt) text for images in interactive applications built with {shiny}. This is noteworthy because alt text improves accessibility, particularly for screen reader users or when an image fails to load. The text further explains that the interactivity of apps and dashboards can overshadow accessibility requirements, creating a gap that {ellmer} can help fill.

Dynamic alt text generation using {ellmer} features consistency, fault-tolerance, and low cost. It also handles errors gracefully and contributes positively to the user experience, especially for visually impaired users. However, as with any AI tool, the article cautions that alt text generation should not replace human review and attention; it should just supplement it.

Long term, the wider adoption of this dynamic alt text generation approach can vastly improve accessibility across web platforms. By leveraging the automation capabilities of AI and data science, developers can help to create a more inclusive digital space.

Possible Future Developments

Given the pace of AI development, it’s likely that advances in AI will produce more sophisticated and reliable alt text generation tools in the future. These tools could soon be available in more programming languages beyond R. It’s also conceivable that alt text generation will be built into visual creation tools such as design apps, image editors and even cameras. Moreover, developers could explore a universal library for alt text prediction, which would make it easier for others to incorporate this accessibility feature into their own apps.

Additionally, as the demand for digital accessibility grows, we can expect more applications like this to arise in the future that aim to bridge the gap between interactivity and accessibility in various datasets and interfaces.

Actionable Advice

Here are a few steps developers can take to improve accessibility in their apps:

  1. Consider using tools like {ellmer} to generate alt text for images and graphs in your application. It not only improves the accessibility of your apps but also saves time and effort in manually generating alt texts.
  2. Strike a balance between AI-generated alt text and manual review. Remember, the ability to provide context-rich descriptions may vary among different AI tools. Therefore, it’s always a good idea to check and correct the AI-generated alternative text when necessary.
  3. Consider making alt text generation part of your development process. This approach would keep accessibility as a priority, while also making it a natural part of your workflow.

Read the original article
