Top 40 New CRAN Packages in February 2025

[This article was first published on R Works, and kindly contributed to R-bloggers.]



In February, one hundred fifty-nine new packages made it to CRAN. Here are my Top 40 picks in eighteen categories: Artificial Intelligence, Computational Methods, Data, Ecology, Economics, Genomics, Health Sciences, Mathematics, Machine Learning, Medicine, Music, Pharma, Psychology, Statistics, Time Series, Utilities, Visualization, and Weather.

Artificial Intelligence

chores v0.1.0: Provides a collection of ergonomic large language model assistants designed to help you complete repetitive, hard-to-automate tasks quickly. After selecting some code, press the keyboard shortcut you’ve chosen to trigger the package app, select an assistant, and watch your chore be carried out. Users can create custom helpers just by writing some instructions in a markdown file. There are three vignettes: Getting started, Custom helpers, and Gallery.

Working with chores gif

gander v0.1.0: Provides a Copilot completion experience that knows how to talk to the objects in your R environment. ellmer chats are integrated directly into your RStudio and Positron sessions, automatically incorporating relevant context from surrounding lines of code and your global environment. See the vignette to get started.

GIF of gander use

GitAI v0.1.0: Provides functions to scan multiple Git repositories, pull content from specified files, and process it with LLMs. You can summarize the content, extract information and data, or find answers to your questions about the repositories. The output can be stored in a vector database and used for semantic search or as a part of a RAG (Retrieval Augmented Generation) prompt. See the vignette.

Example GIF

Computational Methods

nlpembeds v1.0.0: Provides efficient methods to compute co-occurrence matrices, pointwise mutual information (PMI), and singular value decomposition (SVD), especially useful when working with huge databases in biomedical and clinical settings. Functions can be called on SQL databases, enabling the computation of co-occurrence matrices of tens of gigabytes of data, representing millions of patients over tens of years. See Hong (2021) for background and the vignette for examples.
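To make the pipeline concrete, here is a minimal base R sketch of the same co-occurrence, PMI, and SVD idea on a toy corpus. It is a conceptual illustration only, not the nlpembeds API (the package works against SQL databases and out-of-memory data); all object names are invented.

```r
# Toy illustration: co-occurrence counts -> PMI -> SVD embeddings (base R only)
docs <- list(c("chest", "pain", "ecg"),
             c("chest", "pain", "troponin"),
             c("ecg", "troponin", "mi"))
vocab <- sort(unique(unlist(docs)))

# Document-term counts, then within-document word-word co-occurrences
dtm  <- t(sapply(docs, function(d) table(factor(d, levels = vocab))))
cooc <- crossprod(dtm)
diag(cooc) <- 0

# Pointwise mutual information: log p(i, j) / (p(i) * p(j))
total <- sum(cooc)
p_i   <- rowSums(cooc) / total
pmi   <- log((cooc / total) / outer(p_i, p_i))
pmi[!is.finite(pmi)] <- 0          # unseen pairs get PMI 0

# Low-dimensional embeddings from a truncated SVD of the PMI matrix
sv <- svd(pmi, nu = 2, nv = 2)
embeddings <- sv$u %*% diag(sv$d[1:2])
rownames(embeddings) <- vocab
embeddings
```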

NLPwavelet v1.0: Provides functions for Bayesian wavelet analysis using individual non-local priors as described in Sanyal & Ferreira (2017) and non-local prior mixtures as described in Sanyal (2025). See README to get started.

Wavelet plot of posterior

pnd v0.0.9: Provides functions to compute numerical derivatives including gradients, Jacobians, and Hessians through finite-difference approximations with parallel capabilities and optimal step-size selection to improve accuracy. Advanced features include computing derivatives of arbitrary order. There are three vignettes on the topics: Compatibility with numDeriv, Parallel numerical derivatives, and Step-size selection.
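The building block that pnd parallelizes and tunes is the ordinary finite-difference approximation. Below is a minimal base R sketch of a central-difference gradient; this shows the generic technique, not the pnd interface (see the package's numDeriv-compatibility vignette for its actual functions).

```r
# Central-difference gradient with a fixed step size h; pnd automates the
# step-size selection and evaluates the function calls in parallel.
num_grad <- function(f, x, h = 1e-6) {
  vapply(seq_along(x), function(i) {
    e <- replace(numeric(length(x)), i, h)
    (f(x + e) - f(x - e)) / (2 * h)
  }, numeric(1))
}

f  <- function(x) sum(sin(x))
x0 <- c(0.5, 1.0, 2.0)
rbind(numeric_gradient = num_grad(f, x0), analytic = cos(x0))
```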

rmcmc v0.1.1: Provides functions to simulate Markov chains using the proposal from Livingstone and Zanella (2022) to compute MCMC estimates of expectations with respect to a target distribution on a real-valued vector space. The package also provides implementations of alternative proposal distributions, such as (Gaussian) random walk and Langevin proposals. Optionally, BridgeStan’s R interface can be used to specify the target distribution. There is an Introduction to the Barker proposal and a vignette on Adjusting the noise distribution.

Plot of mcmc chain adapting to Langevin proposal

sgdGMF v1.0: Implements a framework to estimate high-dimensional, generalized matrix factorization models using penalized maximum likelihood under a dispersion exponential family specification, including the stochastic gradient descent algorithm with a block-wise mini-batch strategy and an efficient adaptive learning rate schedule to stabilize convergence. All the theoretical details can be found in Castiglione et al. (2024). Also included are the alternated iterative re-weighted least squares and the quasi-Newton method with diagonal approximation of the Fisher information matrix discussed in Kidzinski et al. (2022). There are four vignettes, including introduction and residuals.

Plots of Deviance and Pearson residuals

Data

acledR v0.1.0: Provides tools for working with data from ACLED (Armed Conflict Location and Event Data). Functions include simplified access to ACLED’s API, methods for keeping local versions of ACLED data up-to-date, and functions for common ACLED data transformations. See the vignette to get started.

Horsekicks v1.0.2: Provides extensions to the classical dataset Death by the kick of a horse in the Prussian Army, first used by Ladislaus von Bortkiewicz in his treatise on the Poisson distribution, Das Gesetz der kleinen Zahlen. Also included are deaths by falling from a horse and by drowning. See the vignette.
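As a reminder of why this dataset is famous, the corps-year death counts are matched almost perfectly by a Poisson distribution. The sketch below is base R and hard-codes the frequencies as they are usually quoted (109, 65, 22, 3, 1 corps-years with 0 to 4 deaths); it does not use the package's own data objects.

```r
# Classic horse-kick deaths: 200 corps-years with 0..4 deaths each
deaths <- 0:4
freq   <- c(109, 65, 22, 3, 1)              # frequencies as usually quoted

lambda   <- sum(deaths * freq) / sum(freq)  # Poisson MLE, about 0.61 deaths per corps-year
expected <- sum(freq) * dpois(deaths, lambda)

round(data.frame(deaths, observed = freq, expected), 1)
```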

Dangers to the Prussian Cavalry

OhdsiReportGenerator v1.0.1: Extracts results from the Observational Health Data Sciences and Informatics result database and generates Quarto reports and presentations. See the package guide.

wbwdi v1.0.0: Provides functions to access and analyze the World Bank’s World Development Indicators (WDI) using the corresponding API. WDI provides more than 24,000 country or region-level indicators for various contexts. See the vignette.

wbwdi data model

Ecology

rangr v1.0.6: Implements a mechanistic virtual species simulator that integrates population dynamics and dispersal to study the effects of environmental change on population growth and range shifts. Look here for background and see the vignette to get started.

Abundance plots

Economics

godley v0.2.2: Provides tools to define, simulate, and validate stock-flow consistent (SFC) macroeconomic models by specifying governing systems of equations. Users can analyze how macroeconomic structures affect key variables, perform sensitivity analyses, introduce policy shocks, and visualize resulting economic scenarios. See Godley and Lavoie (2007), Kinsella and O’Shea (2010) for background and the vignette to get started.

Example model structure

Genomics

gimap v1.0.3: Helps to calculate genetic interactions in CRISPR targets by taking data from paired CRISPR screens that have been pre-processed to count tables of paired gRNA reads. The output is a set of genetic interaction scores: the distance between the observed CRISPR score and the expected CRISPR score. See Berger et al. (2021) for background and the vignettes Quick Start, Timepoint Experiment, and Treatment Experiment.

MIC v1.0.2: Provides functions to analyze, plot, and tabulate antimicrobial minimum inhibitory concentration (MIC) data and predict MIC values from whole genome sequence data stored in the Pathosystems Resource Integration Center (2013) database or locally. See README for examples.

Results of MIC test

Health Sciences

matriz v1.0.1: Implements a workflow that provides tools to create, update, and fill literature matrices commonly used in research, specifically epidemiology and health sciences research. See README to get started.

Mathematics

flint v0.0.3: Provides an interface to FLINT, a C library for number theory that extends GNU MPFR and GNU MP with support for arithmetic in standard rings (the integers, the integers modulo n, and the rational, p-adic, real, and complex numbers) as well as vectors, matrices, polynomials, and power series over rings, and that implements midpoint-radius interval arithmetic in the real and complex numbers. See Johansson (2017) for information on computation in arbitrary precision with rigorous propagation of errors, and see the NIST Digital Library of Mathematical Functions for information on additional capabilities. Look here to get started.

Machine Learning

tall v0.1.1: Implements a general-purpose tool for analyzing textual data as a shiny application with features that include a comprehensive workflow, data cleaning, preprocessing, statistical analysis, and visualization. See the vignette.

Automatic Lemmatization and PoS-Tagging through LLM

Medicine

BayesERtools v0.2.1: Provides tools that facilitate exposure-response analysis using Bayesian methods. These include a streamlined workflow for fitting types of models that are commonly used in exposure-response analysis – linear and Emax for continuous endpoints, logistic linear and logistic Emax for binary endpoints, as well as performing simulation and visualization. Look here to learn more about the workflow, and see the vignette for an overview.

Chart of supported model types


PatientLevelPrediction v6.4.0: Implements a framework to create patient-level prediction models using the Observational Medical Outcomes Partnership Common Data Model. Given a cohort of interest and an outcome of interest, the package can use data in the Common Data Model to build a large set of features, which can then be used to fit a predictive model with a number of machine learning algorithms. This is further described in Reps et al. (2017). There are fourteen vignettes, including Building Patient Level Prediction Models and Best Practices.

Schematic of the prediction problem

SimTOST v1.0.2: Implements a Monte Carlo simulation approach to estimating sample sizes, power, and type I error rates for bio-equivalence trials that are based on the Two One-Sided Tests (TOST) procedure. Users can model complex trial scenarios, including parallel and crossover designs, intra-subject variability, and different equivalence margins. See Schuirmann (1987), Mielke et al. (2018), and Shieh (2022) for background. There are seven vignettes including Introduction and Bioequivalence Tests for Parallel Trial Designs: 2 Arms, 1 Endpoint.
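The TOST procedure itself is compact enough to sketch in base R: declare equivalence only when both one-sided tests reject, which amounts to the confidence interval for the difference lying entirely inside the equivalence margins. The settings below (margins, means, sample size) are invented for illustration and are not SimTOST defaults; SimTOST's contribution is repeating this kind of simulated trial many times to estimate power and type I error for more complex designs.

```r
set.seed(42)
# One simulated parallel-design trial on a log-scale endpoint, margins +/- 0.22
test_arm <- rnorm(30, mean = 0.05, sd = 0.25)
ref_arm  <- rnorm(30, mean = 0.00, sd = 0.25)
margin   <- 0.22

# Two one-sided tests at alpha = 0.05
lower <- t.test(test_arm, ref_arm, mu = -margin, alternative = "greater")
upper <- t.test(test_arm, ref_arm, mu =  margin, alternative = "less")

p_tost <- max(lower$p.value, upper$p.value)   # TOST p-value
c(p_tost = p_tost, equivalence_declared = p_tost < 0.05)
```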

Power estimation plots

Music

musicXML v1.0.1: Implements tools to facilitate data sonification and create files to share music notation in the musicXML format. Several classes are defined for basic musical objects such as note pitch, note duration, note, measure, and score. Sonification functions map data into musical attributes such as pitch, loudness, or duration. See the blog and Renard and Le Bescond (2022) for examples and the vignette to get started.

WaggaWagga time series is mapped to pitch.

Pharma

emcAdr v1.2: Provides computational methods for detecting adverse high-order drug interactions from individual case safety reports using statistical techniques, allowing the exploration of higher-order interactions among drug cocktails. See the vignette.

Plots of estimated and true distributions

SynergyLMM v1.0.1: Implements a framework for evaluating drug combination effects in preclinical in vivo studies, which provides functions to analyze longitudinal tumor growth experiments using linear mixed-effects models, perform time-dependent analyses of synergy and antagonism, evaluate model diagnostics and performance, and assess both post-hoc and a priori statistical power. See Demidenko & Miller (2019) for the calculation of drug combination synergy and Pinheiro and Bates (2000) and Gałecki & Burzykowski (2013) for information on linear mixed-effects models. The vignette offers a tutorial.
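For readers new to the modeling backbone, these analyses rest on longitudinal linear mixed-effects models of (log) tumor volume with treatment-specific growth rates. Below is a hedged sketch of that general model using nlme on invented data; SynergyLMM supplies its own fitting, synergy-testing, diagnostic, and power functions, so treat this only as an illustration of the underlying technique.

```r
library(nlme)

set.seed(1)
# Invented design: 4 treatment groups x 8 mice, tumor volume measured every 4 days
dat <- expand.grid(id = factor(1:32), day = seq(0, 20, by = 4))
dat$group <- factor(c("control", "drugA", "drugB", "combo"))[(as.integer(dat$id) - 1) %% 4 + 1]
growth    <- c(control = 0.20, drugA = 0.15, drugB = 0.14, combo = 0.05)
dat$log_tv <- 1 + growth[as.character(dat$group)] * dat$day +
  rnorm(32, sd = 0.2)[as.integer(dat$id)] +   # animal-specific baseline shift
  rnorm(nrow(dat), sd = 0.15)                 # measurement noise

# Common baseline, group-specific growth rates, random intercept per animal
fit <- lme(log_tv ~ day:group, random = ~ 1 | id, data = dat)
fixef(fit)   # estimated log-scale growth rate for each treatment arm
```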

The SynergyLMM workflow

vigicaen v0.15.6: Implements a toolbox to perform the analysis of the World Health Organization (WHO) Pharmacovigilance database, VigiBase, with functions to load data, perform data management, disproportionality analysis, and descriptive statistics. Intended for routine pharmacovigilance use or studies. There are eight vignettes, including basic workflow and routine pharmacovigilance.

Example of vigibase analysis

Psychology

cogirt v1.0.0: Provides tools to psychometrically analyze latent individual differences related to tasks, interventions, or maturational/aging effects in the context of experimental or longitudinal cognitive research using methods first described by Thomas et al. (2020). See the vignette.

Plot showing subjects’ time-score data

Statistics

DiscreteDLM v1.0.0: Provides tools for fitting Bayesian distributed lag models (DLMs) to count or binary longitudinal response data. Count data are fit using negative binomial regression and binary data using quantile regression; the lag contribution is fit via B-splines. See Dempsey and Wyse (2025) for background and README for examples.
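To unpack what 'lag contribution is fit via B-splines' means, the sketch below builds a distributed lag design in base R: the lagged copies of an exposure are projected onto a low-dimensional B-spline basis over the lags, so the whole lag curve is estimated from only a few coefficients. A plain Poisson GLM stands in for the package's Bayesian negative binomial fit; this is the generic construction, not the DiscreteDLM API.

```r
library(splines)

set.seed(7)
n <- 200; max_lag <- 12
x <- rnorm(n + max_lag)                                   # exposure series
L <- sapply(0:max_lag, function(k) x[(max_lag + 1 - k):(n + max_lag - k)])  # n x (max_lag + 1) lag matrix
true_effect <- dnorm(0:max_lag, mean = 3, sd = 2)         # smooth true lag curve
y <- rpois(n, exp(0.5 + drop(L %*% true_effect)))         # count response

B <- bs(0:max_lag, df = 4, intercept = TRUE)              # B-spline basis over the lags
Z <- L %*% B                                              # reduced design matrix (n x 4)
fit <- glm(y ~ Z, family = poisson)                       # stand-in for the Bayesian NB fit

lag_curve <- drop(B %*% coef(fit)[-1])                    # smooth estimated lag contributions
round(cbind(lag = 0:max_lag, estimate = lag_curve, truth = true_effect), 3)
```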

A regression slopes ridge plot

oneinfl v1.0.1: Provides functions to estimate one-inflated positive Poisson and one-inflated zero-truncated negative binomial regression models, as well as positive Poisson and zero-truncated negative binomial models, along with marginal effects and their standard errors. The models and applications are described in Godwin (2024). See README for an example.

Plot of distributions vs actual data

Time Series

BayesChange v2.0.0: Provides functions for change point detection on univariate and multivariate time series according to the methods presented by Martinez & Mena (2014) and Corradin et al. (2022), along with methods for clustering time-dependent data with common change points. See Corradin et al. (2024). There is a tutorial.

Time series clusters

echos v1.0.3: Provides a lightweight implementation of functions and methods for fast and fully automatic time series modeling and forecasting using Echo State Networks. See the vignettes Base functions and Tidy functions.

Example of time series forecast

quadVAR v0.1.2: Provides functions to estimate quadratic vector autoregression models with a strong hierarchy using the Regularization Algorithm under the Marginality Principle of Hao et al. (2018), to compare their performance with linear models, and to construct networks with partial derivatives. See README for examples.

Example model plot

Utilities

aftables v1.0.2: Provides tools to generate spreadsheet publications that follow best practice guidance from the UK government’s Analysis Function. There are four vignettes, including an Introduction and Accessibility.

watcher v0.1.2: Implements an R binding for libfswatch, a file system monitoring library, that enables users to watch files or directories recursively for changes in the background. Log activity or run an R function every time a change event occurs. See the README for an example.

Visualization

jellyfisher v1.0.4: Generates interactive Jellyfish plots to visualize spatiotemporal tumor evolution by integrating sample and phylogenetic trees into a unified plot. This approach provides an intuitive way to analyze tumor heterogeneity and evolution over time and across anatomical locations. The Jellyfish plot visualization design was first introduced by Lahtinen et al. (2023). See the vignette.

Example Jellyfisher Plot

xdvir v0.1-2: Provides high-level functions to render LaTeX fragments as labels and data symbols in ggplot2 plots, plus low-level functions to author, produce, and typeset LaTeX documents, and to produce, read, and render DVI files. See the vignette.

Plot of normal distribution with defining equation

Weather

RFplus v1.4-0: Implements a machine learning algorithm that merges satellite and ground precipitation data using Random Forest for spatial prediction, residual modeling for bias correction, and quantile mapping for adjustment, ensuring accurate estimates across temporal scales and regions. See the vignette.

SPIChanges v0.1.0: Provides methods to improve the interpretation of the Standardized Precipitation Index under changing climate conditions. It implements the nonstationary approach of Blain et al. (2022) to detect trends in rainfall quantities and quantify the effect of such trends on the probability of a drought event occurring. There is an Introduction and a vignette Monte Carlo Experiments and Case Studies.

Year-to-year changes in drought frequency


Analysis and Future Implications of February 2025 New CRAN Packages

Over the course of February 2025, 159 new packages made it to the Comprehensive R Archive Network (CRAN). With immense advancements in dynamic fields such as Artificial Intelligence, Genomics, Machine Learning, and others, this represents another leap into a future powered by groundbreaking data-analytics tools. But what does this mean for users of these packages? What longer-term implications do these hold?

Artificial Intelligence-Based Packages

Artificial Intelligence has shown significant advancements recently. The newly released packages, such as chores v0.1.0, gander v0.1.0, and GitAI v0.1.0, showcase versatile features like language model assistants, Copilot completion experience, and functions to scan Git repositories. Considering the increasing importance of automating tasks and the capabilities these packages offer, they’re expected to gain more popularity.

Actionable Advice:

Artificial Intelligence is an ever-evolving field. Stay updated with the latest advancements like large language models and more efficient programming. Learning to use new packages like chores, gander, and GitAI could help improve efficiency in automating tasks.

Computational Methods-Based Packages

New tools like nlpembeds v1.0.0, NLPwavelet v1.0, and rmcmc v0.1.1 are milestones in the evolution of computational methods. Such packages demonstrate the community’s focus on computational efficiency and modeling, even with very large data sets.

Actionable Advice:

Consider updating your skills to effectively handle large volumes of data and make sense of complex data sets using packages like nlpembeds and rmcmc.

Data Packages

Twelve new data packages, including acledR v0.1.0 and Horsekicks v1.0.2, provide the community with preloaded datasets and functions to handle specific types of data efficiently. They offer researchers the potential to undertake complex studies without the hassle of preprocessing big data.

Actionable Advice:

Stay updated with the latest data packages available on CRAN to improve the efficiency of your studies and to provide a robust framework for your research.

Machine Learning Packages

A new package like tall v0.1.1 exemplifies a user-friendly approach to analyzing textual data using machine learning, and it points to a clear trend towards user-friendly, visual, and interactive tools for applied machine learning in textual data analysis.

Actionable Advice:

As a data scientist or analyst, consider deploying machine learning tools like tall in your work. It would streamline the process of extracting insights from raw textual data.

Visualization Packages

Visualization tools like jellyfisher v1.0.4 and xdvir v0.1-2 provide intuitive ways to analyze and present data, which is a crucial aspect of data analysis.

Actionable Advice:

Should you be presenting complex data sets to an audience, consider using such visualization tools to simplify consumption and interpretation.

Long-term Implications and Future Developments

CRAN’s latest package releases suggest exciting developments in the fields of Artificial Intelligence, Computational Methods, Machine Learning, Data, and Visualization. With the pace at which these fields are growing, professionals relying on data analysis and researchers should anticipate even more sophisticated tools and computations in the pipeline. This further indicates a clear need to maintain one’s understanding of, and ability to deploy, these constantly evolving tools.

Actionable Advice:

Continually learning and applying newly released packages should be a part of your long-term strategy. This will ensure you stay ahead in the data science world, leveraging the most effective and sophisticated tools at your disposal.

Read the original article

“Improving Multimedia Quality Model Evaluation with Constrained Concordance Index”

arXiv:2411.05794v1 Announce Type: new
Abstract: This study investigates the evaluation of multimedia quality models, focusing on the inherent uncertainties in subjective Mean Opinion Score (MOS) ratings due to factors like rater inconsistency and bias. Traditional statistical measures such as Pearson’s Correlation Coefficient (PCC), Spearman’s Rank Correlation Coefficient (SRCC), and Kendall’s Tau (KTAU) often fail to account for these uncertainties, leading to inaccuracies in model performance assessment. We introduce the Constrained Concordance Index (CCI), a novel metric designed to overcome the limitations of existing metrics by considering the statistical significance of MOS differences and excluding comparisons where MOS confidence intervals overlap. Through comprehensive experiments across various domains including speech and image quality assessment, we demonstrate that CCI provides a more robust and accurate evaluation of instrumental quality models, especially in scenarios of low sample sizes, rater group variability, and restriction of range. Our findings suggest that incorporating rater subjectivity and focusing on statistically significant pairs can significantly enhance the evaluation framework for multimedia quality prediction models. This work not only sheds light on the overlooked aspects of subjective rating uncertainties but also proposes a methodological advancement for more reliable and accurate quality model evaluation.

Expert Analysis: Evaluating Multimedia Quality Models and Overcoming Subjective Rating Uncertainties

As multimedia systems continue to evolve, it becomes increasingly important to develop reliable and accurate quality models to assess the performance of these systems. However, traditional statistical measures often fall short in accounting for the uncertainties inherent in subjective Mean Opinion Score (MOS) ratings, leading to inaccurate model assessments. This study proposes a novel metric, the Constrained Concordance Index (CCI), to address these limitations and provide a more robust evaluation framework for multimedia quality prediction models.

One of the key challenges in evaluating multimedia quality models is the presence of rater inconsistency and bias. MOS ratings can vary significantly among different raters, making it difficult to assess the true performance of a model. Additionally, rater bias can introduce systematic errors into the evaluation process. The CCI takes these factors into account by considering the statistical significance of MOS differences and excluding comparisons where MOS confidence intervals overlap. This approach helps to mitigate the impact of rater variability and provides a more accurate assessment of model performance.
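Reading the description above literally, a constrained concordance index can be sketched as the fraction of concordant pairs among only those item pairs whose MOS confidence intervals do not overlap. The R snippet below is an illustrative reading of that definition, not the authors' reference implementation; all names and numbers are invented.

```r
# Concordance restricted to pairs whose MOS confidence intervals do not overlap
cci <- function(pred, mos, ci_low, ci_high) {
  pairs <- combn(length(mos), 2)
  concordant <- 0; considered <- 0
  for (k in seq_len(ncol(pairs))) {
    i <- pairs[1, k]; j <- pairs[2, k]
    # keep only pairs whose MOS differ significantly (non-overlapping CIs)
    if (ci_low[i] > ci_high[j] || ci_low[j] > ci_high[i]) {
      considered <- considered + 1
      if (sign(pred[i] - pred[j]) == sign(mos[i] - mos[j])) concordant <- concordant + 1
    }
  }
  concordant / considered
}

# Invented example: 5 stimuli with MOS, 95% CI half-widths, and model predictions
mos  <- c(2.1, 3.4, 3.5, 4.2, 4.8)
half <- c(0.3, 0.4, 0.5, 0.2, 0.2)
pred <- c(2.0, 3.6, 3.2, 4.1, 4.9)
cci(pred, mos, mos - half, mos + half)
```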

Moreover, this study demonstrates the effectiveness of the CCI through comprehensive experiments across various domains, including speech and image quality assessment. The CCI’s ability to provide reliable evaluations even in scenarios of low sample sizes, rater group variability, and restricted ranges makes it a valuable tool for assessing the performance of multimedia quality prediction models.

This research also highlights the multi-disciplinary nature of the concepts involved in evaluating multimedia quality models. The study draws on statistical methods, such as Pearson’s Correlation Coefficient (PCC), Spearman’s Rank Correlation Coefficient (SRCC), and Kendall’s Tau (KTAU), to analyze the limitations of these traditional measures and build upon them. By incorporating subjectivity and addressing uncertainties in subjective ratings, the CCI brings together concepts from psychology, human perception, and statistical analysis to enhance quality model evaluation.

From a broader perspective, this work aligns with the field of multimedia information systems, which aims to develop techniques for organizing, processing, and retrieving multimedia data. Quality models play a crucial role in assessing the effectiveness of these systems, and the CCI offers a methodological advancement that can contribute to more reliable and accurate evaluations. Furthermore, the concepts presented in this study have implications beyond traditional multimedia systems. Animations, artificial reality, augmented reality, and virtual realities are all areas where multimedia quality is of utmost importance, and the CCI can provide a valuable tool for evaluating and improving the user experience.

In conclusion, the evaluation of multimedia quality models is a complex task that requires an understanding of statistical analysis, human perception, and subjective rating uncertainties. The Constrained Concordance Index (CCI) introduced in this study offers a promising solution to overcome the limitations of traditional metrics and enhance the evaluation framework. By focusing on statistically significant pairs and considering the inherent uncertainties in subjective ratings, this research makes a valuable contribution to the field of multimedia information systems and has the potential to impact various domains, such as animations, artificial reality, augmented reality, and virtual realities.

Read the original article

“Mastering Supply Chain Management with R and the planr Package”

[This article was first published on business-science.io, and kindly contributed to R-bloggers.]



Hey guys, welcome back to my R-tips newsletter. Supply chain management is essential in making sure that your company’s business runs smoothly. One of the key elements is managing inventory efficiently. Today, I’m going to show you how to estimate inventory and forecast inventory levels using the planr package in R. Let’s dive in!

Table of Contents

Here’s what you’ll learn in this article:

Supply Chain Analysis with R Using the planr Package

Get the Code (In the R-Tip 087 Folder)


SPECIAL ANNOUNCEMENT: ChatGPT for Data Scientists Workshop on October 23rd

Inside the workshop I’ll share how I built a Machine Learning Powered Production Shiny App with ChatGPT (extends this data analysis to an insane production app):

ChatGPT for Data Scientists

What: ChatGPT for Data Scientists

When: Wednesday October 23rd, 2pm EST

How It Will Help You: Whether you are new to data science or are an expert, ChatGPT is changing the game. There’s a ton of hype. But how can ChatGPT actually help you become a better data scientist and help you stand out in your career? I’ll show you inside my free chatgpt for data scientists workshop.

Price: Does Free sound good?

How To Join: 👉 Register Here


R-Tips Weekly

This article is part of R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks. Pretty cool, right?

Here is the link to get set up. 👇

How to Project Inventories with the planr Package

Why Inventory Projections Are Crucial to Supply Chain Management

Supply chain management is all about balancing supply and demand to ensure that inventory levels are optimized. Overestimating demand leads to excess stock, while underestimating it causes shortages. Accurate inventory projections allow businesses to plan ahead, make data-driven decisions, and avoid costly errors like over-buying inventory or running into a stock-out with no inventory to meet demand.

Enter the planr Package

The planr package simplifies inventory management by projecting future inventory levels based on supply, demand, and current stock levels.

Planr Github

Supply Chain Analysis with planr

Let’s take a look at how to use planr to optimize your supply chain. We’ll go through a quick tutorial to get you started using planr to project and manage inventories.

Step 1: Load Libraries and Data

First, you need to install the required packages and load the libraries. Run this code:

Libraries

Data

Get the Code (In the R-Tip 087 Folder)
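The original code chunks are distributed as screenshots and in the R-Tip code folder, so here is a hedged stand-in: it loads the libraries the tutorial relies on and builds a one-item example table with the columns described just below. Apart from the 6,570-unit opening stock mentioned in the post, the numbers are invented.

```r
library(planr)
library(dplyr)
library(timetk)
library(reactable)
library(reactablefmtr)

# Small stand-in for the post's dataset (the real data covers 10 items)
demand_db <- tibble::tibble(
  DFU     = "Item 000001",
  Period  = seq(as.Date("2024-01-01"), by = "month", length.out = 6),
  Demand  = c(360, 458, 300, 264, 140, 233),
  Opening = c(6570, 0, 0, 0, 0, 0),
  Supply  = c(0, 0, 2500, 0, 0, 2500)
)
glimpse(demand_db)
```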

This data contains supply and demand information for various demand fulfillment units (DFUs) over a period of time.

  • Demand Fulfillment Unit (DFU): A product identifier, here labeled as “Item 000001” (there are 10 items total).
  • Period: Monthly periods corresponding to supply and demand.
  • Demand: Customer purchases, which reduce on-hand inventory.
  • Opening: An initial inventory of 6570 units in the first period for Item 000001.
  • Supply: New supplies arriving in subsequent months.

Step 2: Visualizing Demand Over Time

The first step in understanding supply chain performance is visualizing demand trends. We can use timetk::plot_time_series() to get a clear view of the demand fluctuations. Run this code:

timetk::plot_time_series() code

Get the Code (In the R-Tip 087 Folder)
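A hedged reconstruction of this step with the stand-in data from Step 1: group by item so that each DFU gets its own facet. The argument values are illustrative; timetk's plot_time_series() accepts many more options.

```r
# One demand panel per DFU
demand_db |>
  group_by(DFU) |>
  plot_time_series(
    .date_var    = Period,
    .value       = Demand,
    .facet_ncol  = 3,
    .smooth      = FALSE,
    .interactive = TRUE
  )
```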

This code will produce a set of time series plots that show how demand changes over time for each DFU. By visualizing these trends, you can identify patterns and outliers that may impact your projections.

timetk plot time series plot

Step 3: Projecting Inventory Levels

Once you have a good understanding of demand, the next step is to project your future inventory levels. The planr::light_proj_inv() function helps you do this. Run this code:

Light Inventory Projection

Get the Code (In the R-Tip 087 Folder)
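A hedged sketch of the projection call using the stand-in data; the argument names below simply mirror the column roles described in the next paragraph, so check ?planr::light_proj_inv for the exact interface before running it.

```r
# Project inventory levels by item (argument names assumed from the column roles)
projected_inventories <- light_proj_inv(
  dataset = demand_db,
  DFU     = DFU,
  Period  = Period,
  Demand  = Demand,
  Opening = Opening,
  Supply  = Supply
)
head(projected_inventories)
```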

This function takes in the DFU, Period, Demand, Opening stock, and Supply as inputs to project inventory levels over time by item. The output is a data frame that contains the projected inventories for each period and DFU.

Step 4: Creating an Interactive Table for Projected Inventories

To make your projections more interactive and accessible, you can create an interactive table using reactable and reactablefmtr. I’ve made a function to automate the process for you based on planr’s awesome documentation. Run this code:

Interactive Table Code

Projected Inventory Table

Get the Code (In the R-Tip 087 Folder)
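A minimal hedged version of that idea, dropping the post's helper function so only the core reactable call remains:

```r
# Interactive, filterable table of the projected inventories
projected_inventories |>
  reactable(
    filterable      = TRUE,
    searchable      = TRUE,
    defaultPageSize = 12
  )
```

From here, reactablefmtr functions such as data_bars() or color_scales() can be layered on through colDef() styles, which is essentially what the post's helper function automates.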

This generates a beautiful interactive table where you can filter and sort the projected inventories. Interactive tables make it easier to analyze your data and share insights with your team.

Conclusion

By using the planr package, you can project inventory levels with ease, helping you manage your supply chain more effectively. This leads to better decision-making, reduced stockouts, and lower carrying costs.

But there’s more to mastering supply chain analysis in R.

If you would like to grow your Business Data Science skills with R, then please read on…

Need to advance your business data science skills?

I’ve helped 6,107+ students learn data science for business from an elite business consultant’s perspective.

I’ve worked with Fortune 500 companies like S&P Global, Apple, MRM McCann, and more.

And I built a training program that gets my students life-changing data science careers (don’t believe me? see my testimonials here):

6-Figure Data Science Job at CVS Health ($125K)

Senior VP Of Analytics At JP Morgan ($200K)

50%+ Raises & Promotions ($150K)

Lead Data Scientist at Northwestern Mutual ($175K)

2X-ed Salary (From $60K to $120K)

2 Competing ML Job Offers ($150K)

Promotion to Lead Data Scientist ($175K)

Data Scientist Job at Verizon ($125K+)

Data Scientist Job at CitiBank ($100K + Bonus)

Whenever you are ready, here’s the system they are taking:

Here’s the system that has gotten aspiring data scientists, career transitioners, and life long learners data science jobs and promotions…

What They're Doing - 5 Course R-Track


Join My 5-Course R-Track Program Now!
(And Become The Data Scientist You Were Meant To Be…)

P.S. – Samantha landed her NEW Data Science R Developer job at CVS Health (Fortune 500). This could be you.

Success Samantha Got The Job


Understanding Supply Chain Analysis With R Using The Planr Package

Efficient supply chain management is the backbone of any successful company and proper inventory management is a key component. In this regard, the ‘planr’ package in R is a crucial tool aiding in inventory estimation and forecasting. Understanding and optimizing the supply chain through this package can lead to better decision-making, fewer stockouts, and reduced carrying costs.

Why are Inventory Projections Important?

Managing a supply chain effectively includes balancing supply and demand to optimize inventory levels. Constant management and analysis of these levels help businesses avoid oversights such as over-purchasing or having insufficient stock to meet demand. This necessitates the use of tools for accurate inventory projection, which can assist businesses in making informed decisions to prevent major, potentially costly errors.

The Planr Package in R

The ‘planr’ package simplifies inventory management through its ability to project future inventory levels. Because it bases its projections on supply, demand, and current stock levels, it can greatly assist your company’s supply chain optimization efforts.

Using the Planr Package for Supply Chain Analysis

Briefly, the process of using the ‘planr’ package involves loading libraries and data related to supply and demand for various demand fulfillment units (DFUs). Visualization of demand trends over time can assist in identifying patterns and outliers that might impact projections. You can then project future inventory levels based on all of these insights and create an interactive table for these projections for ease of interpretation and sharing.

Application in Businesses and Future Developments

As businesses continue to strive for efficiency, the use of tools such as planr will likely become more widespread. Accurate inventory projections can significantly reduce the chances of costly errors, enhance decision-making, and improve overall supply chain management. In the future, these tools may also evolve to incorporate more complex variables and more accurate prediction models to further enhance their utility.

Actionable Advice

Considering the above, here are some steps that businesses can take:

  1. Adopt Tools for Inventory Projections: Implement tools like the ‘planr’ package in R for efficient inventory management and accurate projection. This small step can make a significant difference in supply chain management and decision-making processes.
  2. Invest in Training: Encourage employees to learn and understand how to use data science tools for business optimizations. This can increase in-house capabilities and reduce dependency on external resources.
  3. Keep Abreast with Technological Advancements: Stay informed about developments in data science. New tools and resources regularly emerge that can improve various business processes.
  4. Consider a Data Science Career: For individuals, considering a career in data science can be rewarding and beneficial, as evidenced by various testimonials. Data science skills are in demand and can lead to promising career opportunities.

In conclusion, tools like the planr package in R show how data science can come to the rescue of businesses, helping optimize their supply chain management. Its adoption and mastery can lead to multiple benefits in the long run.

Read the original article

What Would Happen Next? Predicting Consequences from An Event Causality Graph

arXiv:2409.17480v1 Announce Type: new Abstract: Existing script event prediction tasks forecast the subsequent event based on an event script chain. However, the evolution of historical events is more complicated in real-world scenarios, and the limited information provided by the event script chain also makes it difficult to accurately predict subsequent events. This paper introduces a Causality Graph Event Prediction (CGEP) task that forecasts consequential events based on an Event Causality Graph (ECG). We propose a Semantic Enhanced Distance-sensitive Graph Prompt Learning (SeDGPL) Model for the CGEP task. In SeDGPL, (1) we design a Distance-sensitive Graph Linearization (DsGL) module to reformulate the ECG into a graph prompt template as the input of a PLM; (2) propose an Event-Enriched Causality Encoding (EeCE) module to integrate both event contextual semantic and graph schema information; (3) propose a Semantic Contrast Event Prediction (ScEP) module to enhance the event representation among numerous candidate events and predict the consequential event following the prompt learning paradigm. We construct two CGEP datasets based on the existing MAVEN-ERE and ESC corpora for experiments. Experiment results validate that our proposed SeDGPL model outperforms the advanced competitors for the CGEP task.
The article “Causality Graph Event Prediction: A Semantic Enhanced Distance-sensitive Graph Prompt Learning Model” addresses the limitations of existing script event prediction tasks in accurately forecasting subsequent events. It introduces a new task called Causality Graph Event Prediction (CGEP), which uses an Event Causality Graph (ECG) to forecast consequential events. To tackle this task, the authors propose a Semantic Enhanced Distance-sensitive Graph Prompt Learning (SeDGPL) Model. The SeDGPL Model consists of three key modules: a Distance-sensitive Graph Linearization (DsGL) module, an Event-Enriched Causality Encoding (EeCE) module, and a Semantic Contrast Event Prediction (ScEP) module. These modules aim to reformulate the ECG, integrate event contextual semantic and graph schema information, and enhance event representation, respectively. The authors validate their proposed SeDGPL model through experiments conducted on two CGEP datasets, demonstrating its superior performance compared to advanced competitors in the field.

Exploring Consequential Event Prediction with Causality Graphs

In the realm of event prediction, existing approaches often rely on event script chains to forecast subsequent events. However, the complexities of historical events in real-world scenarios and the limited information offered by event script chains have made it challenging to accurately predict future events. To tackle these issues, this article introduces a new task called Causality Graph Event Prediction (CGEP), which aims to forecast consequential events based on an Event Causality Graph (ECG).

Introducing the SeDGPL Model for CGEP Task

Addressing the aforementioned challenges, we propose the Semantic Enhanced Distance-sensitive Graph Prompt Learning (SeDGPL) Model for the CGEP task. The SeDGPL model incorporates several key components to improve event prediction accuracy:

  1. Distance-sensitive Graph Linearization (DsGL) Module: This module reformulates the ECG, representing the causal relationships, into a graph prompt template. This template serves as the input for a Pre-trained Language Model (PLM) to capture contextual information.
  2. Event-Enriched Causality Encoding (EeCE) Module: By integrating both event contextual semantics and graph schema information, the EeCE module enhances the understanding of the relationships between events, enriching the representation of each event.
  3. Semantic Contrast Event Prediction (ScEP) Module: This module aims to improve event representation and predict consequential events by leveraging a prompt learning paradigm. It effectively compares and contrasts numerous candidate events to make accurate predictions.

Through the combination of these three modules, the SeDGPL model offers an innovative approach to the CGEP task.

Validating the SeDGPL Model

To assess the performance of our proposed SeDGPL model, we constructed two CGEP datasets based on the existing MAVEN-ERE and ESC corpus. These datasets allowed us to conduct experiments and evaluate the efficacy of our model.

The experimental results demonstrated that the SeDGPL model outperformed advanced competitors in the CGEP task. Its ability to effectively incorporate contextual semantics, graph schema information, and prompt learning significantly improved the accuracy of consequential event prediction.

Innovation in Event Prediction

The introduction of the CGEP task, along with the SeDGPL model, represents a novel approach to event prediction. By moving beyond traditional event script chains and incorporating causality graphs, our model enhances the understanding of complex historical events. This innovation opens up new possibilities for accurately predicting subsequent events in real-world scenarios.

Furthermore, the SeDGPL model’s ability to incorporate semantic contrast event prediction and distance-sensitive graph linearization provides a framework for future advancements in event prediction models. As researchers continue to explore the potential of causality graphs, we can expect further improvements in forecasting consequential events.

In conclusion, the Causality Graph Event Prediction task, coupled with the Semantic Enhanced Distance-sensitive Graph Prompt Learning model, represents an innovative solution for accurate event prediction. By leveraging the power of causality graphs and integrating contextual semantics, graph schema information, and prompt learning, our model opens up new avenues in accurately forecasting subsequent events. As we continue to explore and refine these methodologies, we can expect even greater advancements in the field of event prediction.

The paper introduces a new task called Causality Graph Event Prediction (CGEP), which aims to forecast subsequent events based on an Event Causality Graph (ECG). The authors argue that the existing script event prediction task, which relies on event script chains, is not sufficient for accurately predicting subsequent events in real-world scenarios due to the complexity of the evolution of historical events and the limited information provided by event script chains.

To address these limitations, the authors propose a Semantic Enhanced Distance-sensitive Graph Prompt Learning (SeDGPL) Model for the CGEP task. The SeDGPL model consists of three main components:

1. Distance-sensitive Graph Linearization (DsGL) module: This module reformulates the ECG into a graph prompt template, which serves as the input for a Pre-trained Language Model (PLM). By linearizing the graph, the model can effectively capture the dependencies and relationships between events.

2. Event-Enriched Causality Encoding (EeCE) module: This module integrates both event contextual semantic information and graph schema information. By incorporating the contextual information of events, the model can better understand the relationships between events and make more accurate predictions.

3. Semantic Contrast Event Prediction (ScEP) module: This module aims to enhance the event representation among numerous candidate events and predict the consequential event following a prompt learning paradigm. By contrasting different candidate events, the model can identify the most probable subsequent event.

The authors conducted experiments on two CGEP datasets constructed based on existing MAVEN-ERE and ESC corpus. The experimental results validate the effectiveness of the proposed SeDGPL model, as it outperforms advanced competitors for the CGEP task.

Overall, this paper introduces a novel approach to event prediction by leveraging Event Causality Graphs and proposes a SeDGPL model that incorporates semantic information and graph schema to improve the accuracy of subsequent event forecasting. The experimental results provide evidence of the model’s superiority over existing competitors in the CGEP task. This work opens up new possibilities for more accurate event prediction in real-world scenarios and has potential applications in various domains, such as natural language understanding and event forecasting systems.
Read the original article

Improving AlphaFlow for Efficient Protein Ensembles Generation

arXiv:2407.12053v1 Announce Type: new Abstract: Investigating conformational landscapes of proteins is a crucial way to understand their biological functions and properties. AlphaFlow stands out as a sequence-conditioned generative model that introduces flexibility into structure prediction models by fine-tuning AlphaFold under the flow-matching framework. Despite the advantages of efficient sampling afforded by flow-matching, AlphaFlow still requires multiple runs of AlphaFold to finally generate one single conformation. Due to the heavy consumption of AlphaFold, its applicability is limited in sampling larger set of protein ensembles or the longer chains within a constrained timeframe. In this work, we propose a feature-conditioned generative model called AlphaFlow-Lit to realize efficient protein ensembles generation. In contrast to the full fine-tuning on the entire structure, we focus solely on the light-weight structure module to reconstruct the conformation. AlphaFlow-Lit performs on-par with AlphaFlow and surpasses its distilled version without pretraining, all while achieving a significant sampling acceleration of around 47 times. The advancement in efficiency showcases the potential of AlphaFlow-Lit in enabling faster and more scalable generation of protein ensembles.
The article “AlphaFlow-Lit: Efficient Generation of Protein Ensembles with a Feature-Conditioned Generative Model” explores the conformational landscapes of proteins and the importance of understanding their biological functions. The authors introduce AlphaFlow-Lit, a feature-conditioned generative model that aims to efficiently generate protein ensembles. While previous models like AlphaFlow required multiple runs of AlphaFold to generate a single conformation, AlphaFlow-Lit focuses on the light-weight structure module, achieving comparable results with a significant sampling acceleration of around 47 times. This advancement in efficiency has the potential to enable faster and more scalable generation of protein ensembles.

Exploring Protein Conformational Landscapes with AlphaFlow-Lit

Understanding the conformational landscapes of proteins is crucial for unraveling their biological functions and properties. Researchers have long relied on computational methods to predict protein structures, and AlphaFold has emerged as a leading approach in this domain. However, AlphaFold’s need for multiple runs to generate a single conformation limits its practicality when dealing with larger protein ensembles or longer chains within a constrained timeframe.

In an effort to address this limitation, we present AlphaFlow-Lit, a feature-conditioned generative model that enables efficient protein ensembles generation while maintaining accuracy. Unlike the traditional approach of fine-tuning the entire structure, we focus solely on the light-weight structure module to reconstruct the conformation. This targeted approach allows AlphaFlow-Lit to perform on-par with AlphaFlow and surpass its distilled version without pretraining, all while achieving a significant sampling acceleration of approximately 47 times.

The key innovation of AlphaFlow-Lit lies in its ability to leverage the efficiency of the light-weight structure module while still maintaining the high accuracy of AlphaFold. By focusing on this module, which carries crucial information about local interactions, we can drastically reduce the computational burden without sacrificing quality. This reduction in computational demands opens up new possibilities for studying larger protein ensembles or longer chains within practical timeframes.

The results obtained with AlphaFlow-Lit demonstrate its potential for enabling faster and more scalable generation of protein ensembles. This breakthrough in efficiency not only accelerates the research process but also empowers researchers to explore a wider range of protein structures and their conformational landscapes. With faster and more accessible protein structure prediction, scientists can gain deeper insights into the functioning and properties of these essential molecules.

Moreover, the scalability of AlphaFlow-Lit opens up avenues for studying complex protein systems that were previously inaccessible. With the ability to generate protein ensembles efficiently, researchers can now investigate the dynamics and interactions of larger protein complexes, shedding light on how they function in various cellular processes.

In conclusion, AlphaFlow-Lit represents a significant advancement in the field of protein structure prediction. By leveraging the light-weight structure module, this feature-conditioned generative model delivers efficient and accurate protein ensembles generation. The newfound scalability and speed offered by AlphaFlow-Lit hold promise for accelerating scientific discoveries and unlocking deeper insights into the complex world of proteins.

The paper arXiv:2407.12053v1 introduces a new generative model called AlphaFlow-Lit that aims to improve the efficiency of generating protein ensembles. The authors highlight the importance of studying the conformational landscapes of proteins in understanding their biological functions and properties.

The existing model, AlphaFlow, is a sequence-conditioned generative model that incorporates flexibility into structure prediction models by fine-tuning AlphaFold under the flow-matching framework. While AlphaFlow has the advantage of efficient sampling, it still requires multiple runs of AlphaFold to generate a single conformation. This limitation restricts its applicability in sampling larger sets of protein ensembles or longer chains within a constrained timeframe.

To address this limitation, the authors propose AlphaFlow-Lit, a feature-conditioned generative model. Unlike AlphaFlow, which performs full fine-tuning on the entire structure, AlphaFlow-Lit focuses solely on the lightweight structure module for reconstructing the conformation. Despite this simplified approach, AlphaFlow-Lit performs on-par with AlphaFlow and even surpasses its distilled version without pretraining. Moreover, AlphaFlow-Lit achieves a significant sampling acceleration of approximately 47 times compared to AlphaFlow.

This advancement in efficiency is crucial as it enables faster and more scalable generation of protein ensembles. By reducing the computational resources required, researchers can now analyze larger sets of protein structures or longer chains within a reasonable timeframe. This has the potential to enhance our understanding of protein structures and their functions.

However, it is important to note that while AlphaFlow-Lit improves efficiency, it may sacrifice some accuracy compared to the full fine-tuning approach of AlphaFlow. It would be interesting to explore the trade-off between efficiency and accuracy and determine the specific scenarios where AlphaFlow-Lit is most beneficial. Additionally, further research could focus on optimizing the lightweight structure module to enhance the accuracy of AlphaFlow-Lit without compromising its efficiency.
Read the original article