Capybara: Efficient GLM Estimation with High-Dimensional Fixed Effects

[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers.]

About

Capybara is fast, small-footprint software that provides efficient functions for demeaning variables before conducting a GLM estimation via Iteratively Weighted Least Squares (IWLS). This technique is particularly useful when estimating linear models with multiple group fixed effects.

The software can estimate GLMs from the exponential family as well as Negative Binomial models, but the focus here is the Poisson estimator because it is the one used for structural counterfactual analysis in International Trade. It is worth adding that the IWLS estimator is equivalent to the PPML estimator of Santos Silva and Tenreyro (2006).
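
As a rough sanity check of that equivalence, the sketch below (my own illustration, using the trade_panel data and the feglm() call from the demo further down, and assuming dist, lang, cntg, and clny are stored as numeric columns so that the coefficient names line up) compares the slope coefficients of an explicit-dummy glm() fit with the feglm() fit. The glm() call is exactly the kind of fit that becomes slow and memory-hungry as the number of fixed effects grows.

library(capybara)

# Poisson GLM with explicit fixed-effect dummies (slow and memory-hungry)
m_glm <- glm(
  trade ~ dist + lang + cntg + clny + factor(exp_year) + factor(imp_year),
  data = trade_panel, family = poisson(link = "log")
)

# Same specification via IWLS with the fixed effects projected out
m_cap <- feglm(
  trade ~ dist + lang + cntg + clny | exp_year + imp_year,
  trade_panel, family = poisson(link = "log")
)

# The common slope coefficients should agree up to numerical tolerance
all.equal(
  unname(coef(m_glm)[c("dist", "lang", "cntg", "clny")]),
  unname(coef(m_cap)),
  tolerance = 1e-6
)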

Traditional QR estimation can be infeasible due to its additional memory requirements. The method, which is based on Halperin's 1962 article on vector projections, offers important time and memory savings without compromising numerical stability in the estimation process.
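
To give a flavour of the projection idea, here is a toy sketch (my own illustration, not Capybara's internal code, which is written in C++) of demeaning a variable with respect to two sets of fixed effects by alternating projections, i.e., repeatedly subtracting group means until the variable stops changing:

# Toy illustration of alternating projections (Halperin 1962):
# subtract the group means of each fixed effect in turn until convergence
demean_two_fe <- function(x, g1, g2, tol = 1e-8, max_iter = 10000L) {
  for (i in seq_len(max_iter)) {
    x_old <- x
    x <- x - ave(x, g1)  # project out the first set of group means
    x <- x - ave(x, g2)  # project out the second set of group means
    if (max(abs(x - x_old)) < tol) break
  }
  x
}

# e.g. demean_two_fe(y, exporter_year, importer_year), applying the same
# transform to the regressors before the weighted least squares step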

The software heavily borrows from the work of Gaure (2013) and Stammann (2018) on the OLS and IWLS estimators with large k-way fixed effects (i.e., the lfe and alpaca packages). The difference is that Capybara takes an elementary approach with minimal C++ code and no parallelization, which achieves very good results considering its simplicity. I hope it is easy to maintain.

The summary tables are nothing like R's defaults and instead borrow from the broom package and Stata's output. The default summary from this package is a Markdown table that you can insert into R Markdown/Quarto documents or copy and paste into Jupyter.

Demo

Estimating the coefficients of a gravity model with importer-time and exporter-time fixed effects.

library(capybara)

mod <- feglm(
  trade ~ dist + lang + cntg + clny | exp_year + imp_year,
  trade_panel,
  family = poisson(link = "log")
)

summary(mod)
Formula: trade ~ dist + lang + cntg + clny | exp_year + imp_year

Family: Poisson

Estimates:

|      | Estimate | Std. error | z value    | Pr(> |z|)  |
|------|----------|------------|------------|------------|
| dist |  -0.0006 |     0.0000 | -9190.4389 | 0.0000 *** |
| lang |  -0.1187 |     0.0006 |  -199.7562 | 0.0000 *** |
| cntg |  -1.3420 |     0.0005 | -2588.1870 | 0.0000 *** |
| clny |  -1.0226 |     0.0009 | -1134.1855 | 0.0000 *** |

Significance codes: *** 99.9%; ** 99%; * 95%; . 90%

Number of observations: Full 28566; Missing 0; Perfect classification 0

Number of Fisher Scoring iterations: 9 

Installation

You can install the development version of capybara like so:

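# remotes is assumed to be installed already: install.packages("remotes")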
remotes::install_github("pachadotdev/capybara")

Examples

See the documentation in progress: https://pacha.dev/capybara.

Benchmarks

Median time for the different models in the book An Advanced Guide to Trade Policy Analysis.

| package  | PPML     | Trade Diversion | Endogeneity | Reverse Causality | Non-linear/Phasing Effects | Globalization |
|----------|----------|-----------------|-------------|-------------------|----------------------------|---------------|
| Alpaca   | 282ms    | 1.78s           | 1.1s        | 1.34s             | 2.18s                      | 4.48s         |
| Base R   | 36.2s    | 36.87s          | 9.81m       | 10.03m            | 10.41m                     | 10.4m         |
| Capybara | 159.2ms  | 97.96ms         | 81.38ms     | 86.77ms           | 104.69ms                   | 130.22ms      |
| Fixest   | 33.6ms   | 191.04ms        | 64.38ms     | 75.2ms            | 102.18ms                   | 162.28ms      |

Memory allocation for the same models

| package  | PPML     | Trade Diversion | Endogeneity | Reverse Causality | Non-linear/Phasing Effects | Globalization |
|----------|----------|-----------------|-------------|-------------------|----------------------------|---------------|
| Alpaca   | 282.78MB | 321.5MB         | 270.4MB     | 308MB             | 366.5MB                    | 512.1MB       |
| Base R   | 2.73GB   | 2.6GB           | 11.9GB      | 11.9GB            | 11.9GB                     | 12GB          |
| Capybara | 339.13MB | 196.3MB         | 162.6MB     | 169.1MB           | 181.1MB                    | 239.9MB       |
| Fixest   | 44.79MB  | 36.6MB          | 28.1MB      | 32.4MB            | 41.1MB                     | 62.9MB        |

Continue reading: Introducing Capybara: Fast and Memory Efficient Fitting of Linear Models With High-Dimensional Fixed Effects

Capybara: A Robust Tool for GLM Estimation

Capybara represents a key advancement in software for estimating Generalized Linear Models (GLMs) from the Exponential Family and Negative Binomial models. Its key points of differentiation hinge on time and memory savings, minimal C++ code usage, and ease of maintenance. These facets of development have significant implications for its long-term adoption and use within the context of structural counterfactual analysis in International Trade, as well as other research fields that make broad use of GLMs.

Implications of Capybara’s Approaches

One of the most noteworthy underpinnings of Capybara is the efficiency it provides for demeaning variables before conducting a GLM estimation via Iteratively Weighted Least Squares (IWLS). This is highly advantageous when estimating linear models with multiple group fixed effects.

The speed and small memory footprint are particularly impressive, underlining the benefits of the method based on Halperin's 1962 work on vector projections. Traditional QR estimation, which can become infeasible due to its additional memory requirements, is thus surpassed by Capybara. In doing so, Capybara establishes itself as a tool that could provide significant benefits for researchers going forward.

Future Developments: A Potential Goldmine of Enhancements

The present differentiators between Capybara and other software packages such as Alpaca and Base R suggest promising potential for enhancements. Given its comparatively lower need for memory allocation and faster processing times, Capybara could evolve to cater to more elaborate statistical analyses without the risk of compromising numerical stability.

The output quality of summary tables generated by Capybara is also an advantage, being similar to those from the Broom package and Stata outputs. This feature might encourage adoption by those who prefer cleaner, easily interpretable outputs. Future additions to this feature could be more customizations and improvements in formatting functions.

Actionable Advice

For Researchers: If you estimate GLMs from the exponential family or Negative Binomial models, or are directly involved in structural counterfactual analysis, adopting Capybara can likely enhance your productivity. Its succinct and efficient approach saves time, and its reliance on iterations rather than large memory allocations means it can perform well even on low-memory systems.

For Developers: Although Capybara already delivers exemplary estimation performance, it still uses elementary C++ code without parallelization. Integrating parallel workflows, or optimizing the C++ code further for faster execution, could bring out even better results.

Conclusion

As a compact and memory-efficient tool, Capybara has much to offer in GLM estimations and beyond. Its novelty lies in its lean nature, the amenability of its processes and its systematic simplicity. Adapting it for mainstream use and tailoring it meticulously could reshape the way we view GLM estimations – for the better.

Read the original article

Deepening Your Understanding of Git: Insights from ‘Pro Git’ by Scott Chacon

[This article was first published on Maëlle's R blog on Maëlle Salmon's personal website, and kindly contributed to R-bloggers.]

As mentioned about a million times on this blog, last year I read Git in practice by Mike McQuaid and it changed my life – not only giving me bragging rights about the reading itself. 😅 I decided to give Pro Git by Scott Chacon a go too. It is listed in the resources section of the excellent “Happy Git with R” by Jenny Bryan, Jim Hester and others. For unclear reasons I bought the first edition instead of the second one.

Git as an improved filesystem

In Chapter 1 (Getting Started), I underlined:

“[a] mini filesystem with some incredibly powerful tools built on top of it”.

Awesome diagrams

One of my favorite parts of the book were the diagrams such as the one illustrating “Git stores data as snapshots of the project over time”.

A reminder of why we use Git

“after you commit a snapshot into Git, it is very difficult to lose, especially if you regularly push your database to another repository.”

The last chapter in the book, “Git internals”, includes a “Data recovery” section about git reflog and git fsck.

One can negate patterns in .gitignore

I did not know about this pattern format. In hindsight it is not particularly surprising.

git log options

The format option lets one tell Git how to, well, format the log. I mostly interact with the Git log through a GUI or the gert R package, but that’s good to know.

The book also describes how to filter commits in the log (by date, author, committer). I also never do that with Git itself, but who knows when it might become useful.
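
As a small aside (my own sketch, not an example from the book), this is roughly what reading and then filtering the log looks like from R with gert, assuming the working directory is a Git repository and using the column names documented for git_log():

library(gert)

# last 50 commits of the current repository as a data frame
log <- git_log(max = 50)

# filter by date and author with ordinary data frame operations
log[log$time > as.POSIXct("2024-01-01") & grepl("Maëlle", log$author),
    c("commit", "author", "time", "message")]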

A better understanding of branches

I remember reading “branches are cheap” years ago and accepting this as fact without questioning the reason behind the statement. Now thanks to reading the “Git Branching” chapter, but also Julia Evans’ blog post “git branches: intuition & reality”, I know they are cheap because they are just a pointer to a commit.

Likewise, the phrase “fast forward” makes more sense after reading “Git moves the pointer forward”.

The “Git Branching” chapter is also a place where diagrams really shine.

Is rebase better than merge?

“(…) rebasing makes for a cleaner history. If you examine the log of a rebased branch, it looks like a linear history.”

“Rebasing replays changes from one line of work onto another in the order they were introduced, whereas merging takes the endpoints and merges them together.”

Reading this reminds me of the (newish?) option for merging pull requests on GitHub, “Rebase and merge your commits” – as opposed to merge or squash&merge.

Don’t rebase commits and force push to a shared branch

The opportunity to plug another blog post by Julia Evans, “git rebase: what can go wrong?”.

The chance of getting the same SHA-1 twice in your repository

“A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.”

Ancestry references

The name of references with “^” or “~” (or both!) is “ancestry references”. Both HEAD~3 and HEAD^^^ are “the first parent of the first parent of the first parent”.

Double dot and triple dot

I do not intend to try and learn this but…

The double-dot syntax “asks Git to resolve a range of commits that are reachable from one commit but aren’t reachable from another”.

The triple-dot syntax “specifies all the commits that are reachable by either of the two references but not by both of them”.

New-to-me aspects of git rebase -i

git rebase -i lists commits in the reverse order compared to git log. I am not sure why I did not make a note of this before.

I had not really realized one could edit single commits in git rebase -i. When writing “edit”, rebasing will stop at the commit one wants to edit. That strategy is actually featured in the GitHub blog post “Write Better Commits, Build Better Projects”.

Plumbing vs porcelain

I had seen these terms before but never taken the time to look them up. Plumbing commands are the low-level commands, porcelain commands are the more user-friendly commands. At this stage, I do not think I need to be super familiar with plumbing commands, although I did click around in a .git folder out of curiosity.

Removing objects

There is a section in the “Git internals” chapter called “Removing objects”. I might come back to it if I ever need to do that… Or I’d use git obliterate from the git-extras utilities!

Conclusion

Pro Git was a good read, although I do wish I had bought the second edition. I probably missed good stuff because of this! My next (and last?) Git book purchase will be Julia Evans’ new Git zine when it’s finished. I can’t wait!


Continue reading: Reading notes on Pro Git by Scott Chacon

Understanding Git: From Reading Notes on ‘Pro Git’ by Scott Chacon

Git, the all-important version control system, is often noted for its learning curve. A recent blog post delves into insights gained from reading ‘Pro Git’ by Scott Chacon, expanding our understanding of Git’s capabilities and tips for using it efficiently.

Git as an Improved Filesystem

“[a] mini filesystem with some incredibly powerful tools built on top of it.”

This sentiment resonates through the book, emphasizing the unique features of Git, most notably the way it stores data as snapshots over time. This nuanced way of data handling ensures it’s difficult to lose a commit and furthermore strengthens the structure with regular pushes to another repository.

The Importance of Good Comprehension

  • Branches: The blog post highlights the importance of understanding Git-specific terms like ‘branches,’ referring to pointers to a commit.
  • Fast Forward: This term, used when Git moves the pointer forward, is a sign that the user has a good grasp of basic Git operations.
  • Rebase vs Merge: ‘Rebase’ is preferable for a cleaner history, resulting in a linear look when examining the log of a rebased branch. On the other hand, ‘merge’ takes the endpoints and merges them together.

The blog post urged caution towards rebasing commits and force pushing to a shared branch, stating that there could be potential risks involved in such tasks.

Familiarizing Oneself with Ancestry References

Ancestry references entail advanced usage of Git involving symbols like “^” or “~”. For instance, both HEAD~3 and HEAD^^^ point to “the first parent of the first parent of the first parent”.

Understanding Commit Syntax

Differentiating between double-dot and triple-dot syntax is important in Git. The former instructs Git to resolve a range of commits reachable from one commit but not from another, while the latter involves all commits reachable by either of two references, but not by both.

Editing with git rebase -i

‘git rebase -i’ enables editing of individual commits – this strategy is extensively covered in the GitHub blog post “Write Better Commits, Build Better Projects”. The command also lists commits in reverse order compared to git log.

Plumbing vs Porcelain Commands

These are Git’s low-level and user-friendly commands, respectively. Mastering both is crucial for a comprehensive understanding of Git.

Conclusion

Reading ‘Pro Git’ offers invaluable insights into Git management, highlighting aspects like rebasing, branching, commit editing and understanding various commands. Such knowledge can deeply improve one’s ability to navigate through version control systems when programming. As part of constant improvement and knowledge expansion, it’s advisable to keep exploring new resources and guides available in this area.

Read the original article

The New Functions in tidyAML: extract_wflw_pred() and extract_regression_residuals()

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers.]

Introduction

Yesterday I discussed the use of the function internal_make_wflw_predictions() in the tidyAML R package. Today I will discuss the use of the function extract_wflw_pred() and the brand new function extract_regression_residuals() in the tidyAML R package. We briefly saw yesterday the output of the function internal_make_wflw_predictions(), which is a list of tibbles that typically sits inside a list column in the final output of fast_regression() and fast_classification(). The function extract_wflw_pred() takes this list of tibbles and extracts them from that output. The function extract_regression_residuals() also extracts those tibbles and has the added feature of returning the residuals. Let’s see how these functions work.

The new function

First, we will go over the syntax of the new function extract_regression_residuals().

extract_regression_residuals(.model_tbl, .pivot_long = FALSE)

The function takes two arguments. The first argument is .model_tbl which is the output of fast_regression() or fast_classification(). The second argument is .pivot_long which is a logical argument that defaults to FALSE. If TRUE then the output will be in a long format. If FALSE then the output will be in a wide format. Let’s see how this works.

Example

# Load packages
library(tidyAML)
library(tidymodels)
library(tidyverse)
library(multilevelmod) # for the gee model

tidymodels_prefer() # good practice when using tidyAML

rec_obj <- recipe(mpg ~ ., data = mtcars)
frt_tbl <- fast_regression(
  .data = mtcars,
  .rec_obj = rec_obj,
  .parsnip_eng = c("lm","glm","stan","gee"),
  .parsnip_fns = "linear_reg"
  )

Let’s break down the R code step by step:

  1. Loading Libraries:
library(tidyAML)
library(tidymodels)
library(tidyverse)
library(multilevelmod) # for the gee model

Here, the code is loading several R packages. These packages provide functions and tools for data analysis, modeling, and visualization. tidyAML and tidymodels are particularly relevant for modeling, while tidyverse is a collection of packages for data manipulation and visualization. multilevelmod is included for the Generalized Estimating Equations (gee) model.

  2. Setting Preferences:

    tidymodels_prefer() # good practice when using tidyAML

This line of code is setting preferences for the tidy modeling workflow using tidymodels_prefer(). It ensures that when using tidyAML, the tidy modeling conventions are followed. Tidy modeling involves an organized and consistent approach to modeling in R.

  3. Creating a Recipe Object:

    rec_obj <- recipe(mpg ~ ., data = mtcars)

Here, a recipe object (rec_obj) is created using the recipe function from the tidymodels package. The formula mpg ~ . specifies that we want to predict the mpg variable based on all other variables in the dataset (mtcars).

  4. Performing Fast Regression:

    frt_tbl <- fast_regression(
      .data = mtcars,
      .rec_obj = rec_obj,
      .parsnip_eng = c("lm","glm","stan","gee"),
      .parsnip_fns = "linear_reg"
    )

This part involves using the fast_regression function. It performs a fast regression analysis using various engines specified by .parsnip_eng and specific functions specified by .parsnip_fns. In this case, it includes linear models (lm), generalized linear models (glm), Stan models (stan), and the Generalized Estimating Equations model (gee). The results are stored in the frt_tbl table.

In summary, the code is setting up a tidy modeling workflow, creating a recipe for predicting mpg based on other variables in the mtcars dataset, and then performing a fast regression using different engines and functions. The choice of engines and functions allows flexibility in exploring different modeling approaches.

Now that we have the output of fast_regression() stored in frt_tbl, we can use the function extract_wflw_pred() to extract the predictions from the output. Let’s see how this works. First, the syntax:

extract_wflw_pred(.data, .model_id = NULL)

The function takes two arguments. The first argument is .data which is the output of fast_regression() or fast_classification(). The second argument is .model_id which is a numeric vector that defaults to NULL. If NULL then the function will extract none of the predictions from the output. If a numeric vector is provided then the function will extract the predictions for the models specified by the numeric vector. Let’s see how this works.

extract_wflw_pred(frt_tbl, 1)
# A tibble: 64 × 4
   .model_type     .data_category .data_type .value
   <chr>           <chr>          <chr>       <dbl>
 1 lm - linear_reg actual         actual       15.2
 2 lm - linear_reg actual         actual       10.4
 3 lm - linear_reg actual         actual       33.9
 4 lm - linear_reg actual         actual       32.4
 5 lm - linear_reg actual         actual       16.4
 6 lm - linear_reg actual         actual       21.5
 7 lm - linear_reg actual         actual       15.8
 8 lm - linear_reg actual         actual       15
 9 lm - linear_reg actual         actual       14.7
10 lm - linear_reg actual         actual       10.4
# ℹ 54 more rows
extract_wflw_pred(frt_tbl, 1:2)
# A tibble: 128 × 4
   .model_type     .data_category .data_type .value
   <chr>           <chr>          <chr>       <dbl>
 1 lm - linear_reg actual         actual       15.2
 2 lm - linear_reg actual         actual       10.4
 3 lm - linear_reg actual         actual       33.9
 4 lm - linear_reg actual         actual       32.4
 5 lm - linear_reg actual         actual       16.4
 6 lm - linear_reg actual         actual       21.5
 7 lm - linear_reg actual         actual       15.8
 8 lm - linear_reg actual         actual       15
 9 lm - linear_reg actual         actual       14.7
10 lm - linear_reg actual         actual       10.4
# ℹ 118 more rows
extract_wflw_pred(frt_tbl, 1:nrow(frt_tbl))
# A tibble: 256 × 4
   .model_type     .data_category .data_type .value
   <chr>           <chr>          <chr>       <dbl>
 1 lm - linear_reg actual         actual       15.2
 2 lm - linear_reg actual         actual       10.4
 3 lm - linear_reg actual         actual       33.9
 4 lm - linear_reg actual         actual       32.4
 5 lm - linear_reg actual         actual       16.4
 6 lm - linear_reg actual         actual       21.5
 7 lm - linear_reg actual         actual       15.8
 8 lm - linear_reg actual         actual       15
 9 lm - linear_reg actual         actual       14.7
10 lm - linear_reg actual         actual       10.4
# ℹ 246 more rows

The first line of code extracts the predictions for the first model in the output. The second line of code extracts the predictions for the first two models in the output. The third line of code extracts the predictions for all models in the output.

Now, let’s visualize the predictions for the models in the output and the actual values. We will use the ggplot2 package for visualization. First, we will extract the predictions for all models in the output and store them in a table called pred_tbl. Then, we will use ggplot2 to visualize the predictions and actual values.

pred_tbl <- extract_wflw_pred(frt_tbl, 1:nrow(frt_tbl))

pred_tbl |>
  group_split(.model_type) |>
  map(\(x) x |>
        group_by(.data_category) |>
        mutate(x = row_number()) |>
        ungroup() |>
        pivot_wider(names_from = .data_type, values_from = .value) |>
        ggplot(aes(x = x, y = actual, group = .data_category)) +
        geom_line(color = "black") +
        geom_line(aes(x = x, y = training), linetype = "dashed", color = "red",
                  linewidth = 1) +
        geom_line(aes(x = x, y = testing), linetype = "dashed", color = "blue",
                  linewidth = 1) +
        theme_minimal() +
        labs(
          x = "",
          y = "Observed/Predicted Value",
          title = "Observed vs. Predicted Values by Model Type",
          subtitle = x$.model_type[1]
        )
      )
[[1]]

[[2]]

[[3]]

[[4]]

Or we can facet them by model type:

pred_tbl |>
  group_by(.model_type, .data_category) |>
  mutate(x = row_number()) |>
  ungroup() |>
  ggplot(aes(x = x, y = .value)) +
  geom_line(data = . %>% filter(.data_type == "actual"), color = "black") +
  geom_line(data = . %>% filter(.data_type == "training"),
            linetype = "dashed", color = "red") +
  geom_line(data = . %>% filter(.data_type == "testing"),
            linetype = "dashed", color = "blue") +
  facet_wrap(~ .model_type, ncol = 2, scales = "free") +
  labs(
    x = "",
    y = "Observed/Predicted Value",
    title = "Observed vs. Predicted Values by Model Type"
  ) +
  theme_minimal()

Ok, so what about this new function I talked about above? Well, let’s go over it here. We have already discussed its syntax so no need to go over it again. Let’s just jump right into an example. This function returns the residuals for all models. We will slice off just the first model for demonstration purposes.

extract_regression_residuals(.model_tbl = frt_tbl, .pivot_long = FALSE)[[1]]
# A tibble: 32 × 4
   .model_type     .actual .predicted .resid
   <chr>             <dbl>      <dbl>  <dbl>
 1 lm - linear_reg    15.2       17.3 -2.09
 2 lm - linear_reg    10.4       11.9 -1.46
 3 lm - linear_reg    33.9       30.8  3.06
 4 lm - linear_reg    32.4       28.0  4.35
 5 lm - linear_reg    16.4       15.0  1.40
 6 lm - linear_reg    21.5       22.3 -0.779
 7 lm - linear_reg    15.8       17.2 -1.40
 8 lm - linear_reg    15         15.1 -0.100
 9 lm - linear_reg    14.7       10.9  3.85
10 lm - linear_reg    10.4       10.8 -0.445
# ℹ 22 more rows

Now let’s set .pivot_long = TRUE:

extract_regression_residuals(.model_tbl = frt_tbl, .pivot_long = TRUE)[[1]]
# A tibble: 96 × 3
   .model_type     name       value
   <chr>           <chr>      <dbl>
 1 lm - linear_reg .actual    15.2
 2 lm - linear_reg .predicted 17.3
 3 lm - linear_reg .resid     -2.09
 4 lm - linear_reg .actual    10.4
 5 lm - linear_reg .predicted 11.9
 6 lm - linear_reg .resid     -1.46
 7 lm - linear_reg .actual    33.9
 8 lm - linear_reg .predicted 30.8
 9 lm - linear_reg .resid      3.06
10 lm - linear_reg .actual    32.4
# ℹ 86 more rows

Now let’s visualize the data:

resid_tbl <- extract_regression_residuals(frt_tbl, TRUE)

resid_tbl |>
  map(\(x) x |>
        group_by(name) |>
        mutate(x = row_number()) |>
        ungroup() |>
        mutate(plot_group = ifelse(name == ".resid", "Residuals", "Actual and Predictions")) |>
        ggplot(aes(x = x, y = value, group = name, color = name)) +
        geom_line() +
        theme_minimal() +
        facet_wrap(~ plot_group, ncol = 1, scales = "free") +
        labs(
          x = "",
          y = "Value",
          title = "Actual, Predicted, and Residual Values by Model Type",
          subtitle = x$.model_type[1],
          color = "Data Type"
        )
      )
[[1]]

[[2]]

[[3]]

[[4]]

And that’s it!

Thank you for reading and I would love to hear your feedback. Please feel free to reach out to me.


Continue reading: The new function on the block with tidyAML extract_regression_residuals()

Understanding the New Functions in the tidyAML R Package

The post discusses two new functions that have been introduced in the tidyAML R package, namely extract_wflw_pred() and extract_regression_residuals(). These functions are useful for extracting certain data from the output of some specific predictive analysis models.

Function Overview

The function extract_wflw_pred() takes a list of tibbles, typically found in the final output of fast_regression() and fast_classification() functions, and extracts them. Furthermore, the function extract_regression_residuals() performs a similar task but has an additional feature of returning residuals. The feature consequently aids in the further analysis process by revealing the difference between predicted and actual values.

Using the New Functions

The primary use of these added functions involves invoking them after implementing respective predictive models. After using the fast_regression() or fast_classification() function, you can directly use these functions on the output data. The users can select the data format in extract_regression_residuals(); this mainly involves selecting between wide or long formats.

Implementation Example

The blog post provides a comprehensive example of utilizing these functions with the mtcars dataset, starting from preparing the data for modeling using a tidymodels and tidyAML workflow, up to performing the regression and extracting the predictions.

Three scenarios demonstrated the use of extract_wflw_pred(). The first saw an extraction of predictions for the first model in the output, while the second extracted predictions for the first two models. The third example extracted predictions of all models in the output.

The Long-term Implications and Possible Future Developments

With the addition of these new functions, users can anticipate a better data extraction process from their predictive analysis models. The ability of extract_wflw_pred() to target the model from which you wish to extract predictions, together with extract_regression_residuals()’s optional output formatting and provision of residuals, can be considered a significant stride in predictive data analysis.

As for future developments, there may be an introduction of more functions that further enhance the extraction process or provide more data insights. Moreover, improvements or updates on these functions can also be looked upon. Additionally, an interesting prospect for upcoming additions could be a function that automatically optimizes or selects the best predictive model based on certain criteria.

Actionable Advice

In light of these insights, it is crucial for users working with predictive data analysis models to understand how these new functions operate and hence gain maximum benefit out of them:

  1. Be sure to understand both the extract_wflw_pred() and extract_regression_residuals() functions and what they can offer.
  2. Explore different scenarios for using these functions, such as extracting predictions from different models or varying the data output format.
  3. Exploit the residuals provided by extract_regression_residuals() to enhance your understanding and prediction capability of your models.

By doing so, you will be able to harness these additions fully to optimize your work with the tidyAML R package.

Read the original article

The Path to a Free Self-Taught Education in Data Science: Implications and Future Trends

Path to a Free Self-Taught Education in Data Science for Everyone.

Implications and Future Trends in Self-Taught Data Science Education

The surge of interest in data science has sparked a revolution in self-teaching methods, driven by the enormous appetite for this field of study in the ever-evolving tech industry. This free, accessible, and self-directed education in data science has profound long-term implications and accelerates emerging trends.

Long-term Implications

Democratizing education, particularly in a highly technical field like data science, can shift the workforce landscape significantly. By providing free resources and tools to anyone interested, we foster a larger, more diverse talent pool. These self-taught data scientists offer unique perspectives and problem-solving approaches based on their myriad backgrounds and experiences.

“The more variety we have in our problem solvers, the more we will see of innovative solutions to the complex issues riddling our world.”

Future Trends

As more people turn to self-guided learning paths, we can expect an increased transformation in the way education is delivered. Traditional brick and mortar institutions may give way to online platforms that offer flexible learning schedules and customized curriculums. Industries will continue to seek professionals who are proactive, self-motivated, and capable of learning autonomously.

Actionable Advice

If you’re considering self-education in data science, consider these steps:

  1. Start with Basics: Begin with fundamental concepts such as statistics and programming before delving into more complex data science domains.
  2. Use Free Resources: Leverage open-source platforms and free resources available online to guide your learning journey.
  3. Engage with Community: Be active in online data science communities and discussion boards. Networking with industry professionals and peers can offer guidance and support.
  4. Stay Updated: Constantly update and upskill yourself in this rapidly evolving field. Attend webinars, read articles, and follow trends and developments.
  5. Apply Knowledge: Look for opportunities to apply what you’ve learned in real-world scenarios. Participating in data science competitions can help sharpen your skills.

Conclusion

Overall, the encouraging tendency towards democratizing data science education is ushering in a new era of learning and problem-solving. It not only offers tools to empower individuals but also helps create a more diverse and innovative workforce.

Read the original article

Reviving the R-Bloggers Bot: A Free Bluesky Bot in R

[This article was first published on Johannes B. Gruber on Johannes B. Gruber, and kindly contributed to R-bloggers.]

Bluesky is shaping up to be a nice, “billionaire-proof”1 replacement of what Twitter once was.
To bring back a piece of the thriving R community that once existed on ex-Twitter, I decided to bring back the R-Bloggers bot, which spread the word about blog posts from many R users and developers.
Especially when first learning R, this was a very important resource for me and I created my first package using a post from R-Bloggers.
Since I have recently published the atrrr package with a few friends, I thought it was a good opportunity to promote that package and show how you can write a completely free bot with it.

You can find the bot at https://github.com/JBGruber/r-bloggers-bluesky.
This post describes how the parts fit together.

Writing the R-bot

The first part of my bot is a minimal RSS parser to get new posts from http://r-bloggers.com.
You can parse the content of an RSS feed with packages like tidyRSS, but I wanted to keep it minimal and not have too many packages in the script.2
I won’t spend too much time on this part, because it will be different for other bots.
However, if you want to build a bot to promote content on your own website or your podcast, RSS is well-suited for that and often easier to parse than HTML.

## packages
library(atrrr)
library(anytime)
library(dplyr)
library(stringr)
library(glue)
library(purrr)
library(xml2)

## Part 1: read RSS feed
feed <- read_xml("http://r-bloggers.com/rss")
# minimal custom RSS reader
rss_posts <- tibble::tibble(
  title = xml_find_all(feed, "//item/title") |>
    xml_text(),

  creator = xml_find_all(feed, "//item/dc:creator") |>
    xml_text(),

  link = xml_find_all(feed, "//item/link") |>
    xml_text(),

  ext_link = xml_find_all(feed, "//item/guid") |>
    xml_text(),

  timestamp = xml_find_all(feed, "//item/pubDate") |>
    xml_text() |>
    utctime(tz = "UTC"),

  description = xml_find_all(feed, "//item/description") |>
    xml_text() |>
    # strip html from description
    vapply(function(d) {
      read_html(d) |>
        xml_text() |>
        trimws()
    }, FUN.VALUE = character(1))
)

To create the posts for Bluesky, we have to keep in mind that the platform has a 300 character limit per post.
I want the posts to look like this:

title

first sentences of post

post URL

The first sentences of the post then need to be trimmed to 300 characters minus the length of the title and the URL.
I calculate the remaining number of characters and truncate the post description, which contains the entire text of the post in most cases.

## Part 2: create posts from feed
posts <- rss_posts |>
  # measure length of title and link and truncate description
  mutate(desc_preview_len = 294 - nchar(title) - nchar(link),
         desc_preview = map2_chr(description, desc_preview_len, function(x, y) str_trunc(x, y)),
         post_text = glue('{title}\n\n"{desc_preview}"\n\n{link}'))

I’m pretty proud of part 3 of the bot:
it checks the posts on the timeline (excuse me, I meant skyline) of the bot (with the handle r-bloggers.bsky.social) and discards all posts that are identical to posts already on the timeline.
This means the bot does not need to keep any storage of previous runs.
It essentially uses the actual timeline as its database of previous posts.
Don’t mind the Sys.setenv and auth part, I will talk about them below.

## Part 3: get already posted updates and de-duplicate
Sys.setenv(BSKY_TOKEN = "r-bloggers.rds")
auth(user = "r-bloggers.bsky.social",
     password = Sys.getenv("ATR_PW"),
     overwrite = TRUE)
old_posts <- get_skeets_authored_by("r-bloggers.bsky.social", limit = 5000L)
posts_new <- posts |>
  filter(!post_text %in% old_posts$text)

To post from an account on Bluesky, the bot uses the function post_skeet (a portmanteau of “sky” and “twee…”, I mean, “posting”).
Unlike most social networks, Bluesky allows users to backdate posts (the technical reasons are too much to go into here).
So I thought it would be nice to make it look like the publication date of the blog post was also when the post on Bluesky was made.

## Part 4: Post skeets!
for (i in seq_len(nrow(posts_new))) {
  post_skeet(text = posts_new$post_text[i],
             created_at = posts_new$timestamp[i])
}

Update: after a day of working well, the bot ran into a problem where a specific post used a malformed GIF image as header image, resulting in:

## ✖ Something went wrong [605ms]
## Error: insufficient image data in file `/tmp/Rtmp8Gat9r/file7300766c1e29c.gif' @ error/gif.c/ReadGIFImage/1049

So I introduced some error handling with try:

## Part 4: Post skeets!
for (i in seq_len(nrow(posts_new))) {
  # if people upload broken preview images, this fails
  resp <- try(post_skeet(text = posts_new$post_text[i],
                         created_at = posts_new$timestamp[i]))
  if (methods::is(resp, "try-error")) post_skeet(text = posts_new$post_text[i],
                                                 created_at = posts_new$timestamp[i],
                                                 preview_card = FALSE)
}

Deploying the bot on GitHub

Now I can run this script on my computer and r-bloggers.bsky.social will post about all blog posts currently in the feed at http://r-bloggers.com/rss!
But for an actual bot, this needs to run not once but repeatedly!

So one choice is to deploy this on a computer that is on 24/7, like a server.
You can get very cheap computers to do that for you, but you can also do it completely for free by running it on someone else’s server (like a pro).
One such way is through GitHub Actions.

To do that, you need to create a free account and move the bot script into a repo.
You then need to define an “Action”, which is a pre-defined script that sets up all the necessary dependencies and then executes a task.
You can copy and paste the action file from https://github.com/JBGruber/r-bloggers-bluesky/blob/main/.github/workflows/bot.yml into the folder .github/workflows/ of your repo:

name: "Update Bot"
on:
  schedule:
    - cron: '0 * * * *' # run the bot once an hour (at every minute 0 on the clock)
  push: # also run the action when something new is pushed
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  blog-updates:
    name: bot
    runs-on: ubuntu-latest
    steps:

        # you can use this action to install R
      - uses: r-lib/actions/setup-r@v2
        with:
          r-version: 'release'

        # this one makes sure the files from your repo are accessible
      - name: Setup - Checkout repo
        uses: actions/checkout@v2

        # these dependencies are needed for pak to install packages
      - name: System dependencies
        run: sudo apt-get install -y libcurl4-openssl-dev

        # I created this custom installation of dependencies since the pre-packaged one
        # from https://github.com/r-lib/actions only works for repos containing R packages
      - name: "Install Packages"
        run: |
          install.packages(c("pak", "renv"))
          deps <- unique(renv::dependencies(".")$Package)
          # use github version for now
          deps[deps == "atrrr"] <- "JBGruber/atrrr"
          deps <- c(deps, "jsonlite", "magick", "dplyr")
          # should handle remaining system requirements automatically
          pak::pkg_install(deps)
        shell: Rscript {0}

        # after all the preparation, it's time to run the bot
      - name: "Bot - Run"
        run: Rscript 'bot.r'
        env:
          ATR_PW: ${{ secrets.ATR_PW }} # to authenticate, store your app pw as a secret

Authentication

We paid close attention to make it as easy as possible to authenticate yourself using atrrr.
However, on a server, you do not have a user interface and can’t enter a password.
At the same time, you do not want to make your key public!
So after following the authentication steps, you want to put your bot’s password into .Renviron file (e.g., by using usethis::edit_r_environ()).
Then you can use Sys.getenv("ATR_PW") to get the password in R.
Using the auth function, you can explicitly provide your username and password to authenticate your bot to Bluesky without manual intervention.
To not interfere with my main Bluesky account, I also set the variable BSKY_TOKEN which defines the file name of your token in the current session.
Which leads us to the code you saw earlier.

Sys.setenv(BSKY_TOKEN = "r-bloggers.rds")
auth(user = "r-bloggers.bsky.social",
     password = Sys.getenv("ATR_PW"),
     overwrite = TRUE)

Then, the final thing to do before uploading everything and running your bot on GitHub for the first time is to make sure the Action script has access to the environment variable (NEVER commit your .Renviron to GitHub!).
The way you do this is by navigating to /settings/secrets/actions in your repository and defining a repository secret with the name ATR_PW and your Bluesky App password as the value.

And that is it.
A free Bluesky bot in R!


  1. Once the protocol fulfills its vision that one can always take their follower network and posts to a different site using the protocol.↩

  2. While you can do some caching, packages need to be installed on each GitHub Actions run, meaning that every extra package increases the run time quite a bit.↩


Continue reading: Building the R-Bloggers Bluesky Bot with atrrr and GitHub Actions

R-Blogger’s Bluesky Rise: An Analysis of the Advancements

In an effort to recreate the former bustling R community on Twitter, the R-Bloggers bot has found a new social platform: Bluesky. Billed as a billionaire-proof social networking space, Bluesky seems poised to be a potential alternative to Twitter. This bot assimilates user blog posts and works towards spreading and promoting them amongst the community, acting as an integral resource, particularly for beginners learning R.

Distinctive Features of the Bot

The R-Blogger bot is based on an RSS parser that extracts new blog posts from a source like r-bloggers.com. The bot utilizes several packages, such as atrrr, anytime, dplyr, stringr, glue, purrr, and xml2, keeping extra packages to a minimum to reduce script length. The RSS parser is designed in such a way that it can be adapted to personal websites or podcasts to promote them effectively. The user-friendly setup makes the process of creating bot posts straightforward.

Challenges and Solutions

The bot’s efficiency faces challenges when it encounters broken images uploaded by users. A faulty GIF image led to a situation where the bot responded with an error message. To deal with such errors, the developer has included an error handling step with ‘try’ which ensures smooth operation even when faced with corrupted images.

Deployment through GitHub: An Easy Solution for Continual Operation

To sustain continual operation of the bot, rather than running it just once on a personal computer, deploying it on GitHub via Actions is recommended. This not only keeps the bot running but does so free of cost on GitHub’s own infrastructure. One must remember, however, that even with some caching the required packages are installed on each GitHub Actions run, so every extra package extends the bot’s run time.

Protecting User Credentials

GitHub deployment also requires careful handling of authentication. The server does not have a user interface for entering a password, so to keep the key from becoming public, the bot’s password is stored in an .Renviron file locally and supplied as a repository secret on GitHub. This also lets the bot authenticate to Bluesky without any manual intervention.

Future Implications

Considering its ease of use and much-needed resource for R beginners, the revival of the R-Bloggers bot on Bluesky marks a significant development. The convenience of deploying the bot through GitHub further increases its feasibility.

Actionable Advice

Budding developers or those looking to learn R could utilize this bot to ease their learning process and create repositories. With this efficient bot, users can better promote their content and engage more effectively with the community. It is advisable to keep a close watch on the development of this project and its adaptations for other platforms in the near future.

Read the original article

Enhancing SQL Skills: Cheat Sheets and Future Developments

Want to refresh your SQL skills? Bookmark these useful cheat sheets covering SQL basics, joins, window functions, and more.

Future Developments and Long-Term Implications: SQL Skills Enhancement

As technological advancements and data-driven decision making continue to reshape the business landscape, SQL (Structured Query Language) remains an essential skill for anyone dealing with data. A substantial article recently suggested reading and bookmarking cheat sheets to help refresh SQL skills covering basics, joins, window functions and more. This post will delve deeper into the long-term implications of this ever-growing need for refined SQL skills and discuss future possibilities in this domain.

Long-Term Implications

The crux of data management, rooted in SQL, extends way beyond ordinary database management. Knowledge and proficiency in SQL can significantly enhance one’s capacity to handle complex data manipulation tasks, thereby contributing to an organization’s decision-making process. In the long run, companies and individuals who master SQL can expect an upward trajectory in their ability to analyze data, resulting in improved business operations and strategy.

Furthermore, SQL is not just a tool for programmers or data analysts. Various job roles, including digital marketers, product managers, and even UX designers, are increasingly finding SQL skills advantageous in their daily tasks. Therefore, the prevalence of SQL is expected to broaden as we move further into a data-centric era.

Possible Future Developments

The significance of SQL in the future is anticipated to be immense as we continue to enter a data-dominant world. Potential advancements may involve expanded use-case scenarios and the integration of SQL with emerging technologies.

An example of such a development could be the introduction of SQL capabilities into machine learning frameworks, thus enabling more efficient data analysis capabilities. Additionally, the rise of cybersecurity concerns may stimulate increased demand for SQL professionals adept at ensuring data integrity and security.

Actionable Advice

To adapt to the growing importance of SQL skills, key actionable advice includes:

  1. Continual Learning: Regularly updating and sharpening your SQL skills will ensure you remain relevant and valuable in your industry. The recommendation to use cheat sheets is an excellent directive towards achieving this.
  2. Integration with Other Skills: Explore the integration of SQL with other arenas such as machine learning, to expand your professional capabilities. This will set the stage for future growth and opportunities.
  3. Data Security: With increasing cybersecurity threats, learning how SQL can protect and ensure the integrity of data would be a valuable addition to your skills.

The increasing prevalence of data in almost all aspects of business justifies the need to continually enhance SQL skills. By staying proactive in mastering SQL, individuals can prepare themselves for a myriad of future possibilities and advancements in the field of data and technology.

Read the original article