by jsendak | Jan 20, 2024 | DS Articles
[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers.]
About
Capybara is a fast, small-footprint package that provides efficient functions for demeaning variables before conducting a GLM estimation via Iteratively Weighted Least Squares (IWLS). This technique is particularly useful when estimating linear models with multiple group fixed effects.
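To make the idea of demeaning concrete, here is a toy sketch in plain R (not capybara's internal code): removing a single group fixed effect amounts to subtracting group means from the variable, and with several overlapping fixed effects the projections are applied iteratively, which is the Halperin (1962) alternating-projections idea mentioned below.

# toy demeaning for one fixed effect: subtract the group mean from y
set.seed(1)
d <- data.frame(y = rnorm(6), g = rep(c("a", "b", "c"), each = 2))
d$y_demeaned <- d$y - ave(d$y, d$g, FUN = mean)  # ave() returns per-group means
d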
The software can estimate GLMs from the Exponential Family as well as Negative Binomial models, but the focus here is the Poisson estimator because it is the one used for structural counterfactual analysis in International Trade. It is worth adding that the IWLS estimator is equivalent to the PPML estimator from Santos Silva and Tenreyro (2006).
Traditional QR estimation can be unfeasible due to additional memory requirements. The method, which is based on Halperin's 1962 article on vector projections, offers important time and memory savings without compromising numerical stability in the estimation process.
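For intuition about why the naive route runs out of memory, here is a hedged sketch of what capybara avoids (it reuses the trade_panel data and gravity specification from the demo below): in base R, every fixed-effect level becomes a dummy column, so the design matrix, and the QR factorization built on top of it, grows with the number of exporter-year and importer-year combinations.

# naive PPML via base R glm(): fixed effects expanded into dummy columns
library(capybara)  # only for the trade_panel example data
mod_glm <- glm(
  trade ~ dist + lang + cntg + clny + factor(exp_year) + factor(imp_year),
  data   = trade_panel,
  family = quasipoisson(link = "log")  # same point estimates as poisson/PPML
)

On a large panel this call can exhaust memory long before it finishes, which is exactly the scenario the demeaning approach is designed for.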
The software heavily borrows from the Gaure (2013) and Stammann (2018) works on the OLS and IWLS estimators with large k-way fixed effects (i.e., the lfe and alpaca packages). The difference is that Capybara takes an elementary approach with minimal C++ code and no parallelization, which achieves very good results considering its simplicity. I hope it is easy to maintain.
The summary tables are nothing like R's defaults and borrow from the broom package and Stata's output. The default summary from this package is a Markdown table that you can insert into R Markdown/Quarto documents or copy and paste into Jupyter.
Demo
Estimating the coefficients of a gravity model with importer-time and exporter-time fixed effects.
library(capybara)
mod <- feglm(
  trade ~ dist + lang + cntg + clny | exp_year + imp_year,
  trade_panel,
  family = poisson(link = "log")
)
summary(mod)
Formula: trade ~ dist + lang + cntg + clny | exp_year + imp_year
Family: Poisson
Estimates:
| | Estimate | Std. error | z value | Pr(> |z|) |
|------|----------|------------|------------|------------|
| dist | -0.0006 | 0.0000 | -9190.4389 | 0.0000 *** |
| lang | -0.1187 | 0.0006 | -199.7562 | 0.0000 *** |
| cntg | -1.3420 | 0.0005 | -2588.1870 | 0.0000 *** |
| clny | -1.0226 | 0.0009 | -1134.1855 | 0.0000 *** |
Significance codes: *** 99.9%; ** 99%; * 95%; . 90%
Number of observations: Full 28566; Missing 0; Perfect classification 0
Number of Fisher Scoring iterations: 9
Installation
You can install the development version of capybara like so:
remotes::install_github("pachadotdev/capybara")
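If the remotes package is not installed yet, a guarded variant avoids an error (a small convenience sketch, not from the original post):

# install 'remotes' first if missing, then grab capybara from GitHub
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("pachadotdev/capybara")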
Benchmarks
Median time for the different models in the book An Advanced Guide to Trade Policy Analysis.
| Package  | Model 1  | Model 2  | Model 3 | Model 4 | Model 5  | Model 6  |
|----------|----------|----------|---------|---------|----------|----------|
| Alpaca   | 282ms    | 1.78s    | 1.1s    | 1.34s   | 2.18s    | 4.48s    |
| Base R   | 36.2s    | 36.87s   | 9.81m   | 10.03m  | 10.41m   | 10.4m    |
| Capybara | 159.2ms  | 97.96ms  | 81.38ms | 86.77ms | 104.69ms | 130.22ms |
| Fixest   | 33.6ms   | 191.04ms | 64.38ms | 75.2ms  | 102.18ms | 162.28ms |
Memory allocation for the same models
| Package  | Model 1  | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|----------|----------|---------|---------|---------|---------|---------|
| Alpaca   | 282.78MB | 321.5MB | 270.4MB | 308MB   | 366.5MB | 512.1MB |
| Base R   | 2.73GB   | 2.6GB   | 11.9GB  | 11.9GB  | 11.9GB  | 12GB    |
| Capybara | 339.13MB | 196.3MB | 162.6MB | 169.1MB | 181.1MB | 239.9MB |
| Fixest   | 44.79MB  | 36.6MB  | 28.1MB  | 32.4MB  | 41.1MB  | 62.9MB  |
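For readers who want to run this kind of comparison themselves, a hedged sketch with bench::mark() is below. It reuses the gravity specification from the demo above rather than the book's exact models, so the numbers will differ from the tables.

# compare time and memory of capybara and fixest on the demo specification
library(bench)

bench::mark(
  capybara = capybara::feglm(
    trade ~ dist + lang + cntg + clny | exp_year + imp_year,
    capybara::trade_panel, family = poisson(link = "log")
  ),
  fixest = fixest::fepois(
    trade ~ dist + lang + cntg + clny | exp_year + imp_year,
    data = capybara::trade_panel
  ),
  check = FALSE,   # the two functions return different object classes
  iterations = 5
)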
Continue reading: Introducing Capybara: Fast and Memory Efficient Fitting of Linear Models With High-Dimensional Fixed Effects
Capybara: A Robust Tool for GLM Estimation
Capybara represents a key advancement in software for estimating Generalized Linear Models (GLMs) from the Exponential Family and Negative Binomial models. Its key points of differentiation hinge on time and memory savings, minimal C++ code usage, and ease of maintenance. These facets of development have significant implications for its long-term adoption and use within the context of structural counterfactual analysis in International Trade, as well as other research fields that make broad use of GLMs.
Implications of Capybara’s Approaches
One of the most noteworthy underpinnings of Capybara is the efficiency it provides for demeaning variables before conducting a GLM estimation via Iteratively Weighted Least Squares (IWLS). This is highly advantageous when estimating linear models with multiple group fixed effects.
The speed and small memory footprint are particularly impressive, underlining the benefits of the Halperin (1962) vector-projection-based method. Traditional QR estimation, which can become unfeasible due to its additional memory requirements, is thus sidestepped by Capybara. In doing so, Capybara positions itself as a tool that could provide significant benefits for applied research going forward.
Future Developments: A Potential Goldmine of Enhancements
The present differentiators between Capybara and other software packages such as Alpaca and Base R suggest promising potential for enhancements. Given its comparatively lower need for memory allocation and faster processing times, Capybara could evolve to cater to more elaborate statistical analyses without the risk of compromising numerical stability.
The output quality of summary tables generated by Capybara is also an advantage, being similar to those from the Broom package and Stata outputs. This feature might encourage adoption by those who prefer cleaner, easily interpretable outputs. Future additions to this feature could be more customizations and improvements in formatting functions.
Actionable Advice
For Researchers: If you deal with estimation of GLMs from the Exponential Family and Negative Binomial models, or are directly involved in structural counterfactual analysis, adopting Capybara can likely enhance your productivity. Its succinct and efficient approach can save time, and its use of iterative projections in lieu of larger memory requirements means it can function well even on low-memory systems.
For Developers: Although Capybara already delivers strong estimation performance, it still uses elementary C++ code without parallelization. Work on integrating parallel workflows, or on optimizing the C++ code further for faster execution, could bring out even better results.
Conclusion
As a compact and memory-efficient tool, Capybara has much to offer in GLM estimations and beyond. Its novelty lies in its lean nature, the amenability of its processes and its systematic simplicity. Adapting it for mainstream use and tailoring it meticulously could reshape the way we view GLM estimations – for the better.
Read the original article
by jsendak | Jan 19, 2024 | DS Articles
As mentioned about a million times on this blog, last year I read Git in practice by Mike McQuaid and it changed my life – not only giving me bragging rights about the reading itself. I decided to give Pro Git by Scott Chacon a go too. It is listed in the resources section of the excellent “Happy Git with R” by Jenny Bryan, Jim Hester and others. For unclear reasons I bought the first edition instead of the second one.
Git as an improved filesystem
In Chapter 1 (Getting Started), I underlined:
“[a] mini filesystem with some incredibly powerful tools built on top of it”.
Awesome diagrams
One of my favorite parts of the book were the diagrams such as the one illustrating “Git stores data as snapshots of the project over time”.
A reminder of why we use Git
“after you commit a snapshot into Git, it is very difficult to lose, especially if you regularly push your database to another repository.”
The last chapter in the book, "Git internals", includes a "Data recovery" section about git reflog and git fsck.
One can negate patterns in .gitignore
I did not know about this pattern format. In hindsight it is not particularly surprising.
git log options
The format option lets one tell Git how to, well, format the log. I mostly interact with the Git log through a GUI or the gert R package, but that's good to know.
The book also describes how to filter commits in the log (by date, author, committer). I also never do that with Git itself, but who knows when it might become useful.
A better understanding of branches
I remember reading “branches are cheap” years ago and accepting this as fact without questioning the reason behind the statement. Now thanks to reading the “Git Branching” chapter, but also Julia Evans’ blog post “git branches: intuition & reality”, I know they are cheap because they are just a pointer to a commit.
Likewise, the phrase “fast forward” makes more sense after reading “Git moves the pointer forward”.
The “Git Branching” chapter is also a place where diagrams really shine.
Is rebase better than merge?
“(…) rebasing makes for a cleaner history. If you examine the log of a rebased branch, it looks like a linear history.”
“Rebasing replays changes from one line of work onto another in the order they were introduced, whereas merging takes the endpoints and merges them together.”
Reading this reminds me of the (newish?) option for merging pull requests on GitHub, “Rebase and merge your commits” – as opposed to merge or squash&merge.
Don’t rebase commits and force push to a shared branch
The opportunity to plug another blog post by Julia Evans, “git rebase: what can go wrong?”.
The chance of getting the same SHA-1 twice in your repository
“A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.”
Ancestry references
The name of references with "^" or "~" (or both!) is "ancestry references". Both HEAD~3 and HEAD^^^ are "the first parent of the first parent of the first parent".
Double dot and triple dot
I do not intend to try and learn this but…
The double-dot syntax “asks Git to resolve a range of commits that are reachable from one commit but aren’t reachable from another”.
The triple-dot syntax “specifies all the commits that are reachable by either of the two references but not by both of them”.
New-to-me aspects of git rebase -i
git rebase -i lists commits in the reverse order compared to git log. I am not sure why I did not make a note of this before.
I had not really realized one could edit single commits in git rebase -i. When writing "edit", rebasing will stop at the commit one wants to edit. That strategy is actually featured in the GitHub blog post "Write Better Commits, Build Better Projects".
Plumbing vs porcelain
I had seen these terms before but never taken the time to look them up. Plumbing commands are the low-level commands, porcelain commands are the more user-friendly commands. At this stage, I do not think I need to be super familiar with plumbing commands, although I did click around in a .git folder out of curiosity.
Removing objects
There is a section in the “Git internals” chapter called “Removing objects”. I might come back to it if I ever need to do that… Or I’d use git obliterate from the git-extras utilities!
Conclusion
Pro Git was a good read, although I do wish I had bought the second edition. I probably missed good stuff because of this! My next (and last?) Git book purchase will be Julia Evans’ new Git zine when it’s finished. I can’t wait!
Continue reading: Reading notes on Pro Git by Scott Chacon
Understanding Git: From Reading Notes on ‘Pro Git’ by Scott Chacon
Git, the all-important version control system, is often noted for its learning curve. A recent blog post delves into insights gained from reading ‘Pro Git’ by Scott Chacon, expanding our understanding of Git’s capabilities and tips for using it efficiently.
Git as an Improved Filesystem
“[a] mini filesystem with some incredibly powerful tools built on top of it.”
This sentiment resonates through the book, emphasizing the unique features of Git, most notably the way it stores data as snapshots over time. This nuanced way of data handling ensures it’s difficult to lose a commit and furthermore strengthens the structure with regular pushes to another repository.
The Importance of Good Comprehension
- Branches: The blog post highlights the importance of understanding Git-specific terms like ‘branches,’ referring to pointers to a commit.
- Fast Forward: This term makes more sense once you know it simply means Git moves the branch pointer forward.
- Rebase vs Merge: ‘Rebase’ is preferable for a cleaner history, resulting in a linear look when examining the log of a rebased branch. On the other hand, ‘merge’ takes the endpoints and merges them together.
The blog post urged caution towards rebasing commits and force pushing to a shared branch, stating that there could be potential risks involved in such tasks.
Familiarizing Oneself with Ancestry References
Ancestry references entail advanced usage of Git involving symbols like “^” or “~”. For instance, both HEAD~3 and HEAD^^^ point to “the first parent of the first parent of the first parent”.
Understanding Commit Syntax
Differentiating between double-dot and triple-dot syntax is important in Git. The former instructs Git to resolve a range of commits reachable from one commit but not from another, while the latter involves all commits reachable by either of two references, but not by both.
Editing with git rebase -i
‘git rebase -i’ enables editing of individual commits – this strategy is extensively covered in the GitHub blog post “Write Better Commits, Build Better Projects”. The command also lists commits in reverse order compared to git log.
Plumbing vs Porcelain Commands
These are Git’s low-level and user-friendly commands, respectively. Mastering both is crucial for a comprehensive understanding of Git.
Conclusion
Reading ‘Pro Git’ offers invaluable insights into Git management, highlighting aspects like rebasing, branching, commit editing and understanding various commands. Such knowledge can deeply improve one’s ability to navigate through version control systems when programming. As part of constant improvement and knowledge expansion, it’s advisable to keep exploring new resources and guides available in this area.
Read the original article
by jsendak | Jan 19, 2024 | DS Articles
Introduction
Yesterday I discussed the use of the function internal_make_wflw_predictions() in the tidyAML R package. Today I will discuss the use of the function extract_wflw_pred() and the brand new function extract_regression_residuals() in the tidyAML R package. We briefly saw yesterday the output of the function internal_make_wflw_predictions(), which is a list of tibbles that are typically inside of a list column in the final output of fast_regression() and fast_classification(). The function extract_wflw_pred() takes this list of tibbles and extracts them from that output. The function extract_regression_residuals() also extracts those tibbles and has the added feature of also returning the residuals. Let's see how these functions work.
The new function
First, we will go over the syntax of the new function extract_regression_residuals().
extract_regression_residuals(.model_tbl, .pivot_long = FALSE)
The function takes two arguments. The first argument is .model_tbl, which is the output of fast_regression() or fast_classification(). The second argument is .pivot_long, which is a logical argument that defaults to FALSE. If TRUE, then the output will be in a long format. If FALSE, then the output will be in a wide format. Let's see how this works.
Example
# Load packages
library(tidyAML)
library(tidymodels)
library(tidyverse)
library(multilevelmod) # for the gee model
tidymodels_prefer() # good practice when using tidyAML
rec_obj <- recipe(mpg ~ ., data = mtcars)
frt_tbl <- fast_regression(
  .data = mtcars,
  .rec_obj = rec_obj,
  .parsnip_eng = c("lm", "glm", "stan", "gee"),
  .parsnip_fns = "linear_reg"
)
Let’s break down the R code step by step:
- Loading Libraries:
library(tidyAML)
library(tidymodels)
library(tidyverse)
library(multilevelmod) # for the gee model
Here, the code is loading several R packages. These packages provide functions and tools for data analysis, modeling, and visualization. tidyAML and tidymodels are particularly relevant for modeling, while tidyverse is a collection of packages for data manipulation and visualization. multilevelmod is included for the Generalized Estimating Equations (gee) model.
- Setting Preferences:
tidymodels_prefer() # good practice when using tidyAML
This line of code is setting preferences for the tidy modeling workflow using tidymodels_prefer(). It ensures that when using tidyAML, the tidy modeling conventions are followed. Tidy modeling involves an organized and consistent approach to modeling in R.
- Creating a Recipe Object:
rec_obj <- recipe(mpg ~ ., data = mtcars)
Here, a recipe object (rec_obj) is created using the recipe function from the tidymodels package. The formula mpg ~ . specifies that we want to predict the mpg variable based on all other variables in the dataset (mtcars).
- Performing Fast Regression:
frt_tbl <- fast_regression(
  .data = mtcars,
  .rec_obj = rec_obj,
  .parsnip_eng = c("lm", "glm", "stan", "gee"),
  .parsnip_fns = "linear_reg"
)
This part involves using the fast_regression function. It performs a fast regression analysis using various engines specified by .parsnip_eng and specific functions specified by .parsnip_fns. In this case, it includes linear models (lm), generalized linear models (glm), Stan models (stan), and the Generalized Estimating Equations model (gee). The results are stored in the frt_tbl table.
In summary, the code is setting up a tidy modeling workflow, creating a recipe for predicting mpg based on other variables in the mtcars dataset, and then performing a fast regression using different engines and functions. The choice of engines and functions allows flexibility in exploring different modeling approaches.
Now that we have the output of fast_regression() stored in frt_tbl, we can use the function extract_wflw_pred() to extract the predictions from the output. Let's see how this works. First, the syntax:
extract_wflw_pred(.data, .model_id = NULL)
The function takes two arguments. The first argument is .data, which is the output of fast_regression() or fast_classification(). The second argument is .model_id, which is a numeric vector that defaults to NULL. If NULL, then the function will extract none of the predictions from the output. If a numeric vector is provided, then the function will extract the predictions for the models specified by the numeric vector. Let's see how this works.
extract_wflw_pred(frt_tbl, 1)
# A tibble: 64 × 4
.model_type .data_category .data_type .value
<chr> <chr> <chr> <dbl>
1 lm - linear_reg actual actual 15.2
2 lm - linear_reg actual actual 10.4
3 lm - linear_reg actual actual 33.9
4 lm - linear_reg actual actual 32.4
5 lm - linear_reg actual actual 16.4
6 lm - linear_reg actual actual 21.5
7 lm - linear_reg actual actual 15.8
8 lm - linear_reg actual actual 15
9 lm - linear_reg actual actual 14.7
10 lm - linear_reg actual actual 10.4
# ℹ 54 more rows
extract_wflw_pred(frt_tbl, 1:2)
# A tibble: 128 × 4
.model_type .data_category .data_type .value
<chr> <chr> <chr> <dbl>
1 lm - linear_reg actual actual 15.2
2 lm - linear_reg actual actual 10.4
3 lm - linear_reg actual actual 33.9
4 lm - linear_reg actual actual 32.4
5 lm - linear_reg actual actual 16.4
6 lm - linear_reg actual actual 21.5
7 lm - linear_reg actual actual 15.8
8 lm - linear_reg actual actual 15
9 lm - linear_reg actual actual 14.7
10 lm - linear_reg actual actual 10.4
# ℹ 118 more rows
extract_wflw_pred(frt_tbl, 1:nrow(frt_tbl))
# A tibble: 256 × 4
.model_type .data_category .data_type .value
<chr> <chr> <chr> <dbl>
1 lm - linear_reg actual actual 15.2
2 lm - linear_reg actual actual 10.4
3 lm - linear_reg actual actual 33.9
4 lm - linear_reg actual actual 32.4
5 lm - linear_reg actual actual 16.4
6 lm - linear_reg actual actual 21.5
7 lm - linear_reg actual actual 15.8
8 lm - linear_reg actual actual 15
9 lm - linear_reg actual actual 14.7
10 lm - linear_reg actual actual 10.4
# ℹ 246 more rows
The first line of code extracts the predictions for the first model in the output. The second line of code extracts the predictions for the first two models in the output. The third line of code extracts the predictions for all models in the output.
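A quick sanity check (a hedged addition, not from the original post) is to count the extracted rows per model and data type with dplyr's count(), which confirms what was pulled out:

# how many actual/training/testing rows were extracted per engine
extract_wflw_pred(frt_tbl, 1:nrow(frt_tbl)) |>
  count(.model_type, .data_type)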
Now, let’s visualize the predictions for the models in the output and the actual values. We will use the ggplot2
package for visualization. First, we will extract the predictions for all models in the output and store them in a table called pred_tbl
. Then, we will use ggplot2
to visualize the predictions and actual values.
pred_tbl <- extract_wflw_pred(frt_tbl, 1:nrow(frt_tbl))

pred_tbl |>
  group_split(.model_type) |>
  map(\(x) x |>
        group_by(.data_category) |>
        mutate(x = row_number()) |>
        ungroup() |>
        pivot_wider(names_from = .data_type, values_from = .value) |>
        ggplot(aes(x = x, y = actual, group = .data_category)) +
        geom_line(color = "black") +
        geom_line(aes(x = x, y = training), linetype = "dashed", color = "red",
                  linewidth = 1) +
        geom_line(aes(x = x, y = testing), linetype = "dashed", color = "blue",
                  linewidth = 1) +
        theme_minimal() +
        labs(
          x = "",
          y = "Observed/Predicted Value",
          title = "Observed vs. Predicted Values by Model Type",
          subtitle = x$.model_type[1]
        )
  )
Or we can facet them by model type:
pred_tbl |>
  group_by(.model_type, .data_category) |>
  mutate(x = row_number()) |>
  ungroup() |>
  ggplot(aes(x = x, y = .value)) +
  geom_line(data = . %>% filter(.data_type == "actual"), color = "black") +
  geom_line(data = . %>% filter(.data_type == "training"),
            linetype = "dashed", color = "red") +
  geom_line(data = . %>% filter(.data_type == "testing"),
            linetype = "dashed", color = "blue") +
  facet_wrap(~ .model_type, ncol = 2, scales = "free") +
  labs(
    x = "",
    y = "Observed/Predicted Value",
    title = "Observed vs. Predicted Values by Model Type"
  ) +
  theme_minimal()
Ok, so what about this new function I talked about above? Well, let's go over it here. We have already discussed its syntax, so no need to go over it again. Let's just jump right into an example. This function will return the residuals for all models. We will slice off just the first model for demonstration purposes.
extract_regression_residuals(.model_tbl = frt_tbl, .pivot_long = FALSE)[[1]]
# A tibble: 32 × 4
.model_type .actual .predicted .resid
<chr> <dbl> <dbl> <dbl>
1 lm - linear_reg 15.2 17.3 -2.09
2 lm - linear_reg 10.4 11.9 -1.46
3 lm - linear_reg 33.9 30.8 3.06
4 lm - linear_reg 32.4 28.0 4.35
5 lm - linear_reg 16.4 15.0 1.40
6 lm - linear_reg 21.5 22.3 -0.779
7 lm - linear_reg 15.8 17.2 -1.40
8 lm - linear_reg 15 15.1 -0.100
9 lm - linear_reg 14.7 10.9 3.85
10 lm - linear_reg 10.4 10.8 -0.445
# ℹ 22 more rows
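Before pivoting, here is a hedged follow-up (not from the original post): because each element of the returned list already pairs .actual, .predicted, and .resid, a per-model RMSE is only a short map_dfr() call away.

# per-model RMSE computed from the wide-format residuals shown above
extract_regression_residuals(frt_tbl) |>
  map_dfr(\(x) x |>
            group_by(.model_type) |>
            summarise(rmse = sqrt(mean(.resid^2)), .groups = "drop"))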
Now let’s set .pivot_long = TRUE
:
extract_regression_residuals(.model_tbl = frt_tbl, .pivot_long = TRUE)[[1]]
# A tibble: 96 × 3
.model_type name value
<chr> <chr> <dbl>
1 lm - linear_reg .actual 15.2
2 lm - linear_reg .predicted 17.3
3 lm - linear_reg .resid -2.09
4 lm - linear_reg .actual 10.4
5 lm - linear_reg .predicted 11.9
6 lm - linear_reg .resid -1.46
7 lm - linear_reg .actual 33.9
8 lm - linear_reg .predicted 30.8
9 lm - linear_reg .resid 3.06
10 lm - linear_reg .actual 32.4
# ℹ 86 more rows
Now let’s visualize the data:
resid_tbl <- extract_regression_residuals(frt_tbl, TRUE)

resid_tbl |>
  map(\(x) x |>
        group_by(name) |>
        mutate(x = row_number()) |>
        ungroup() |>
        mutate(plot_group = ifelse(name == ".resid", "Residuals", "Actual and Predictions")) |>
        ggplot(aes(x = x, y = value, group = name, color = name)) +
        geom_line() +
        theme_minimal() +
        facet_wrap(~ plot_group, ncol = 1, scales = "free") +
        labs(
          x = "",
          y = "Value",
          title = "Actual, Predicted, and Residual Values by Model Type",
          subtitle = x$.model_type[1],
          color = "Data Type"
        )
  )
And that’s it!
Thank you for reading and I would love to hear your feedback. Please feel free to reach out to me.
Continue reading: The new function on the block with tidyAML extract_regression_residuals()
Understanding the New Functions in the tidyAML R Package
The post discusses two new functions that have been introduced in the tidyAML R package, namely extract_wflw_pred() and extract_regression_residuals(). These functions are useful for extracting certain data from the output of some specific predictive analysis models.
Function Overview
The function extract_wflw_pred() takes a list of tibbles, typically found in the final output of fast_regression() and fast_classification() functions, and extracts them. Furthermore, the function extract_regression_residuals() performs a similar task but has an additional feature of returning residuals. The feature consequently aids in the further analysis process by revealing the difference between predicted and actual values.
Using the New Functions
The primary use of these added functions involves invoking them after implementing respective predictive models. After using the fast_regression() or fast_classification() function, you can directly use these functions on the output data. The users can select the data format in extract_regression_residuals(); this mainly involves selecting between wide or long formats.
Implementation Example
The blog post provides a comprehensive example of utilizing these functions using the mtcars dataset. Starting from preparing the data for modeling using a tidymodels and tidyAML workflow up to performing the regression and extraction of predictions.
Three scenarios demonstrated the use of extract_wflw_pred(). The first saw an extraction of predictions for the first model in the output, while the second extracted predictions for the first two models. The third example extracted predictions of all models in the output.
The Long-term Implications and Possible Future Developments
With the addition of these new functions, users can anticipate a better data extraction process from their predictive analysis models. The ability of extract_wflw_pred() to specify the model from which you wish to extract predictions, and extract_regression_residuals()'s optional output formatting and provision of residuals, can be considered significant strides in predictive data analysis.
As for future developments, there may be an introduction of more functions that further enhance the extraction process or provide more data insights. Moreover, improvements or updates on these functions can also be looked upon. Additionally, an interesting prospect for upcoming additions could be a function that automatically optimizes or selects the best predictive model based on certain criteria.
Actionable Advice
In light of these insights, it is crucial for users working with predictive data analysis models to understand how these new functions operate and hence gain maximum benefit out of them:
- Be sure to understand both the extract_wflw_pred() and extract_regression_residuals() functions and what they can offer.
- Explore different scenarios for using these functions, like extracting predictions from different models, varying the data output format.
- Exploit the residuals provided by extract_regression_residuals() to enhance your understanding and prediction capability of your models.
By doing so, you will be able to harness these additions fully to optimize your work with the tidyAML R package.
Read the original article
by jsendak | Jan 19, 2024 | DS Articles
Path to a Free Self-Taught Education in Data Science for Everyone.
Implications and Future Trends in Self-Taught Data Science Education
The surge of interest in data science has sparked a revolution in self-teaching methods, driven by the enormous appetite for this field of study in the ever-evolving tech industry. This free, accessible, and self-directed education in data science has profound long-term implications and accelerates emerging trends.
Long-term Implications
Democratizing education, particularly in a highly technical field like data science, can shift the workforce landscape significantly. By providing free resources and tools to anyone interested, we foster a larger, more diverse talent pool. These self-taught data scientists offer unique perspectives and problem-solving approaches based on their myriad backgrounds and experiences.
“The more variety we have in our problem solvers, the more we will see of innovative solutions to the complex issues riddling our world.”
Future Trends
As more people turn to self-guided learning paths, we can expect an increased transformation in the way education is delivered. Traditional brick and mortar institutions may give way to online platforms that offer flexible learning schedules and customized curriculums. Industries will continue to seek professionals who are proactive, self-motivated, and capable of learning autonomously.
Actionable Advice
If you’re considering self-education in data science, consider these steps:
- Start with Basics: Begin with fundamental concepts such as statistics and programming before delving into more complex data science domains.
- Use Free Resources: Leverage open-source platforms and free resources available online to guide your learning journey.
- Engage with Community: Be active in online data science communities and discussion boards. Networking with industry professionals and peers can offer guidance and support.
- Stay Updated: Constantly update and upskill yourself in this rapidly evolving field. Attend webinars, read articles, and follow trends and developments.
- Apply Knowledge: Look for opportunities to apply what you’ve learned in real-world scenarios. Participating in data science competitions can help sharpen your skills.
Conclusion
Overall, the encouraging tendency towards democratizing data science education is ushering in a new era of learning and problem-solving. It not only offers tools to empower individuals but also helps create a more diverse and innovative workforce.
Read the original article
by jsendak | Jan 19, 2024 | DS Articles
Bluesky is shaping up to be a nice, “billionaire-proof” replacement of what Twitter once was.
To bring back a piece of the thriving R community that once existed on ex-Twitter, I decided to bring back the R-Bloggers bot, which spread the word about blog posts from many R users and developers.
Especially when first learning R, this was a very important resource for me and I created my first package using a post from R-Bloggers.
Since I have recently published the atrrr package with a few friends, I thought it was a good opportunity to promote that package and show how you can write a completely free bot with it.
You can find the bot at https://github.com/JBGruber/r-bloggers-bluesky.
This post describes how the parts fit together.
Writing the R-bot
The first part of my bot is a minimal RSS parser to get new posts from http://r-bloggers.com.
You can parse the content of an RSS feed with packages like tidyRSS, but I wanted to keep it minimal and not have too many packages in the script.
I won’t spend too much time on this part, because it will be different for other bots.
However, if you want to build a bot to promote content on your own website or your podcast, RSS is well-suited for that and often easier to parse than HTML.
## packages
library(atrrr)
library(anytime)
library(dplyr)
library(stringr)
library(glue)
library(purrr)
library(xml2)
## Part 1: read RSS feed
feed <- read_xml("http://r-bloggers.com/rss")
# minimal custom RSS reader
rss_posts <- tibble::tibble(
  title = xml_find_all(feed, "//item/title") |>
    xml_text(),
  creator = xml_find_all(feed, "//item/dc:creator") |>
    xml_text(),
  link = xml_find_all(feed, "//item/link") |>
    xml_text(),
  ext_link = xml_find_all(feed, "//item/guid") |>
    xml_text(),
  timestamp = xml_find_all(feed, "//item/pubDate") |>
    xml_text() |>
    utctime(tz = "UTC"),
  description = xml_find_all(feed, "//item/description") |>
    xml_text() |>
    # strip html from description
    vapply(function(d) {
      read_html(d) |>
        xml_text() |>
        trimws()
    }, FUN.VALUE = character(1))
)
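A quick way to eyeball the parsed feed (a small addition, not in the original post) is dplyr's glimpse(), which prints one line per column:

# inspect the parsed feed: column types plus the first few titles, links, etc.
glimpse(rss_posts)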
To create the posts for Bluesky, we have to keep in mind that the platform has a 300 character limit per post.
I want the posts to look like this:
title
first sentences of post
post URL
The first sentences of the post then need to be trimmed to 300 characters minus the length of the title and URL.
I calculate the remaining number of characters and truncate the post description, which contains the entire text of the post in most cases.
## Part 2: create posts from feed
posts <- rss_posts |>
  # measure length of title and link and truncate description
  mutate(desc_preview_len = 294 - nchar(title) - nchar(link),
         desc_preview = map2_chr(description, desc_preview_len,
                                 function(x, y) str_trunc(x, y)),
         post_text = glue("{title}\n\n\"{desc_preview}\"\n\n{link}"))
I’m pretty proud of part 3 of the bot: it checks the posts on the timeline (excuse me, I meant skyline) of the bot (with the handle r-bloggers.bsky.social) and discards all posts that are identical to posts already on the timeline.
This means the bot does not need to keep any storage of previous runs.
It essentially uses the actual timeline as its database of previous posts.
Don’t mind the Sys.setenv and auth parts, I will talk about them below.
## Part 3: get already posted updates and de-duplicate
Sys.setenv(BSKY_TOKEN = "r-bloggers.rds")
auth(user = "r-bloggers.bsky.social",
     password = Sys.getenv("ATR_PW"),
     overwrite = TRUE)

old_posts <- get_skeets_authored_by("r-bloggers.bsky.social", limit = 5000L)

posts_new <- posts |>
  filter(!post_text %in% old_posts$text)
To post from an account on Bluesky, the bot uses the function post_skeet (a portmanteau of "sky" + "twee…", I mean "posting").
Unlike most social networks, Bluesky allows users to backdate posts (the technical reasons are too much to go into here).
So I thought it would be nice to make it look like the publication date of the blog post was also when the post on Bluesky was made.
## Part 4: Post skeets!
for (i in seq_len(nrow(posts_new))) {
  post_skeet(text = posts_new$post_text[i],
             created_at = posts_new$timestamp[i])
}
Update: after a day of working well, the bot ran into a problem where a specific post used a malformed GIF image as header image, resulting in:
## ✖ Something went wrong [605ms]
## Error: insufficient image data in file `/tmp/Rtmp8Gat9r/file7300766c1e29c.gif' @ error/gif.c/ReadGIFImage/1049
So I introduced some error handling with try:
## Part 4: Post skeets!
for (i in seq_len(nrow(posts_new))) {
  # if people upload broken preview images, this fails
  resp <- try(post_skeet(text = posts_new$post_text[i],
                         created_at = posts_new$timestamp[i]))
  if (methods::is(resp, "try-error")) post_skeet(text = posts_new$post_text[i],
                                                 created_at = posts_new$timestamp[i],
                                                 preview_card = FALSE)
}
Deploying the bot on GitHub
Now I can run this script on my computer and the r-bloggers.bsky.social account will post about all blog posts currently in the feed at http://r-bloggers.com/rss!
But for an actual bot, this needs to run not once but repeatedly!
So the choice is to either deploy this on a computer that is on 24/7, like a server (you can get very cheap computers to do that for you), or do it completely for free by running it on someone else’s server (like a pro).
One such way is through GitHub Actions.
To do that, you need to create a free account and move the bot script into a repo.
You then need to define an “Action”, which is a pre-defined script that sets up all the necessary dependencies and then executes a task.
You can copy and paste the action file from https://github.com/JBGruber/r-bloggers-bluesky/blob/main/.github/workflows/bot.yml into the folder .github/workflows/ of your repo:
name: "Update Bot"
on:
schedule:
- cron: '0 * * * *' # run the bot once an hour (at every minute 0 on the clock)
push: # also run the action when something on a new commit
branches:
- main
pull_request:
branches:
- main
jobs:
blog-updates:
name: bot
runs-on: ubuntu-latest
steps:
# you can use this action to install R
- uses: r-lib/actions/setup-r@v2
with:
r-version: 'release'
# this one makes sure the files from your repo are accessible
- name: Setup - Checkout repo
uses: actions/checkout@v2
# these dependencies are needed for pak to install packages
- name: System dependencies
run: sudo apt-get install -y libcurl4-openssl-dev
# I created this custom installation of depenencies since the pre-pacakged one
# from https://github.com/r-lib/actions only works for repos containing R packages
- name: "Install Packages"
run: |
install.packages(c("pak", "renv"))
deps <- unique(renv::dependencies(".")$Package)
# use github version for now
deps[deps == "atrrr"] <- "JBGruber/atrrr"
deps <- c(deps, "jsonlite", "magick", "dplyr")
# should handle remaining system requirements automatically
pak::pkg_install(deps)
shell: Rscript {0}
# after all the preparation, it's time to run the bot
- name: "Bot - Run"
run: Rscript 'bot.r'
env:
ATR_PW: ${{ secrets.ATR_PW }} # to authenticat, store your app pw as a secret
Authentication
We paid close attention to make it as easy as possible to authenticate yourself using atrrr.
However, on a server, you do not have a user interface and can’t enter a password, and you also do not want to make your key public!
So after following the authentication steps, you want to put your bot’s password into an .Renviron file (e.g., by using usethis::edit_r_environ()). Then you can use Sys.getenv("ATR_PW") to get the password in R.
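For illustration (the value below is a placeholder, not a real credential), the relevant line in .Renviron would look something like this:

ATR_PW=your-bluesky-app-password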
Using the auth function, you can explicitly provide your username and password to authenticate your bot to Bluesky without manual intervention.
To not interfere with my main Bluesky account, I also set the variable BSKY_TOKEN, which defines the file name of your token in the current session.
Which leads us to the code you saw earlier.
Sys.setenv(BSKY_TOKEN = "r-bloggers.rds")
auth(user = "r-bloggers.bsky.social",
password = Sys.getenv("ATR_PW"),
overwrite = TRUE)
Then, the final thing to do before uploading everything and running your bot on GitHub for the first time is to make sure the Action script has access to the environment variable (NEVER commit your .Renviron to GitHub!).
The way you do this is by navigating to /settings/secrets/actions in your repository and defining a repository secret with the name ATR_PW and your Bluesky App password as the value.
And that is it. A free Bluesky bot in R!
Continue reading: Building the R-Bloggers Bluesky Bot with atrrr and GitHub Actions
R-Blogger’s Bluesky Rise: An Analysis of the Advancements
In an effort to recreate the former bustling R community on Twitter, the R-Bloggers bot has found a new social platform: Bluesky. Billed as a billionaire-proof social networking space, Bluesky seems poised to be a potential alternative to Twitter. This bot assimilates user blog posts and works towards spreading and promoting them amongst the community, acting as an integral resource, particularly for beginners learning R.
Distinctive Features of the Bot
The R-Blogger bot is based on an RSS parser that extracts new blog posts from a source like r-bloggers.com. The bot utilizes several packages, such as atrrr, anytime, dplyr, stringr, glue, purrr, and xml2, keeping extra packages to a minimum to reduce script length. The RSS parser is designed in such a way that it can be adapted to personal websites or podcasts to promote them effectively. The user-friendly setup makes the process of creating bot posts straightforward.
Challenges and Solutions
The bot’s efficiency faces challenges when it encounters broken images uploaded by users. A faulty GIF image led to a situation where the bot responded with an error message. To deal with such errors, the developer has included an error handling step with ‘try’ which ensures smooth operation even when faced with corrupted images.
Deployment through GitHub: An Easy Solution for Continual Operation
To sustain continual operation of the bot and to ensure it does not just run once on a personal computer, deploying it on GitHub via Actions is recommended. This not only ensures the bot’s incessant operation but also does so free of cost on GitHub’s own server. However, one must remember that every package added for caching needs to be installed on each GitHub Actions run, extending the run time of the bot.
Protecting User Credentials
GitHub deployment also requires careful handling of authentication. The server has no user interface to input a password, so to keep the key from becoming public, the bot’s password is put into a .Renviron file locally and exposed to the workflow as a repository secret. This also lets the bot authenticate to Bluesky without any manual intervention.
Future Implications
Considering its ease of use and much-needed resource for R beginners, the revival of the R-Bloggers bot on Bluesky marks a significant development. The convenience of deploying the bot through GitHub further increases its feasibility.
Actionable Advice
Budding developers or those looking to learn R could utilize this bot to ease their learning process and create repositories. With this efficient bot, users can better promote their content and engage more effectively with the community. It is advisable to keep a close watch on the development of this project and its adaptations for other platforms in the near future.
Read the original article
by jsendak | Jan 19, 2024 | DS Articles
Want to refresh your SQL skills? Bookmark these useful cheat sheets covering SQL basics, joins, window functions, and more.
Future Developments and Long-Term Implications: SQL Skills Enhancement
As technological advancements and data-driven decision making continue to reshape the business landscape, SQL (Structured Query Language) remains an essential skill for anyone dealing with data. A substantial article recently suggested reading and bookmarking cheat sheets to help refresh SQL skills covering basics, joins, window functions and more. This post will delve deeper into the long-term implications of this ever-growing need for refined SQL skills and discuss future possibilities in this domain.
Long-Term Implications
The crux of data management, rooted in SQL, extends way beyond ordinary database management. Knowledge and proficiency in SQL can significantly enhance one’s capacity to handle complex data manipulation tasks, thereby contributing to an organization’s decision-making process. In the long run, companies and individuals who master SQL can expect an upward trajectory in their ability to analyze data, resulting in improved business operations and strategy.
Furthermore, SQL is not just a tool for programmers or data analysts. Various job roles, including digital marketers, product managers, and even UX designers, are increasingly finding SQL skills advantageous in their daily tasks. Therefore, the prevalence of SQL is expected to broaden as we move further into a data-centric era.
Possible Future Developments
The significance of SQL in the future is anticipated to be immense as we continue to enter a data-dominant world. Potential advancements may involve expanded use-case scenarios and the integration of SQL with emerging technologies.
An example of such a development could be the introduction of SQL capabilities into machine learning frameworks, thus enabling more efficient data analysis capabilities. Additionally, the rise of cybersecurity concerns may stimulate increased demand for SQL professionals adept at ensuring data integrity and security.
Actionable Advice
To adapt to the growing importance of SQL skills, key actionable advice includes:
- Continual Learning: Regularly updating and sharpening your SQL skills will ensure you remain relevant and valuable in your industry. The recommendation to use cheat sheets is an excellent directive towards achieving this.
- Integration with Other Skills: Explore the integration of SQL with other arenas such as machine learning, to expand your professional capabilities. This will set the stage for future growth and opportunities.
- Data Security: With increasing cybersecurity threats, learning how SQL can protect and ensure the integrity of data would be a valuable addition to your skills.
The increasing prevalence of data in almost all aspects of business justifies the need to continually enhance SQL skills. By staying proactive in mastering SQL, individuals can prepare themselves for a myriad of future possibilities and advancements in the field of data and technology.
Read the original article