“Capybara: Efficient GLM Estimation with High-Dimensional Fixed Effects”

This article was first published on pacha.dev/blog and kindly contributed to R-bloggers.

About

Capybara is fast, small-footprint software that provides efficient functions for demeaning variables before estimating a GLM via Iteratively Weighted Least Squares (IWLS). This technique is particularly useful when estimating linear models with multiple group fixed effects.

The software can estimate GLMs from the Exponential Family and also Negative Binomial models, but the focus here is the Poisson estimator because it is the one used for structural counterfactual analysis in International Trade. Note that the IWLS estimator is equivalent to the PPML estimator from Santos Silva and Tenreyro 2006.
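To illustrate what IWLS does for a Poisson model, here is a minimal base R sketch on simulated data. It is not Capybara's implementation; the variable names and the simulated coefficients are assumptions made for the example, and there are no fixed effects yet.

```r
# Minimal IWLS sketch for a Poisson model with log link (base R only).
# The data are simulated; nothing here comes from the capybara package.
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))
X <- cbind(1, x)  # design matrix with an intercept

beta <- c(log(mean(y)), 0)  # start from an intercept-only guess
for (iter in 1:50) {
  eta <- drop(X %*% beta)   # linear predictor
  mu  <- exp(eta)           # conditional mean under the log link
  z   <- eta + (y - mu) / mu  # working response
  w   <- mu                   # working weights
  # Weighted least squares step: solve (X'WX) beta = X'Wz
  beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
  done <- max(abs(beta_new - beta)) < 1e-10
  beta <- beta_new
  if (done) break
}

beta_iwls <- unname(beta)
beta_glm  <- unname(coef(glm(y ~ x, family = poisson(link = "log"))))
```

Both vectors should agree up to numerical tolerance, since `glm()` also fits by iteratively reweighted least squares.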

Traditional QR estimation can be infeasible due to additional memory requirements. The method, which is based on Halperin's 1962 article on vector projections, offers important time and memory savings without compromising numerical stability in the estimation process.
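The projection idea can be sketched in a few lines of base R: subtract group means for each fixed effect in turn and repeat until nothing changes. This is only an illustration under assumed, simulated data with hypothetical factors `f1` and `f2`; it is not the package's C++ code.

```r
# Sketch of demeaning by alternating projections (base R only).
# f1 and f2 are hypothetical grouping factors; the data are simulated.
set.seed(1)
n  <- 500
f1 <- factor(sample(1:10, n, replace = TRUE))
f2 <- factor(sample(1:8,  n, replace = TRUE))
x  <- rnorm(n) + as.integer(f1) / 5
y  <- 2 * x + as.integer(f1) - as.integer(f2) / 2 + rnorm(n)

demean <- function(v, groups, tol = 1e-10, max_iter = 1000) {
  for (i in seq_len(max_iter)) {
    v_old <- v
    for (j in seq_along(groups)) {
      v <- v - ave(v, groups[[j]])  # remove the means of one factor at a time
    }
    if (max(abs(v - v_old)) < tol) break  # stop when a full sweep changes nothing
  }
  v
}

y_dm <- demean(y, list(f1, f2))
x_dm <- demean(x, list(f1, f2))

# Slope from the demeaned data versus the slope from explicit dummy variables
beta_demeaned <- unname(coef(lm(y_dm ~ x_dm - 1)))
beta_dummies  <- unname(coef(lm(y ~ x + f1 + f2))["x"])
```

The two slopes should match up to the demeaning tolerance, but the demeaned regression never builds the dummy-variable matrix, which is where the memory savings come from.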

The software borrows heavily from Gaure 2013 and Stammann 2018, which cover the OLS and IWLS estimators with large k-way fixed effects (i.e., the lfe and alpaca packages). The differences are that Capybara takes an elementary approach and uses minimal C++ code without parallelization, which achieves very good results considering its simplicity. I hope it is easy to maintain.

The summary tables are nothing like R’s default and borrow from the broom package and Stata outputs. The default summary from this package is a Markdown table that you can insert in RMarkdown/Quarto or copy and paste into Jupyter.

Demo

Estimating the coefficients of a gravity model with importer-time and exporter-time fixed effects.

library(capybara)

mod <- feglm(
  trade ~ dist + lang + cntg + clny | exp_year + imp_year,
  trade_panel,
  family = poisson(link = "log")
)

summary(mod)

Formula: trade ~ dist + lang + cntg + clny | exp_year + imp_year

Family: Poisson

Estimates:

|      | Estimate | Std. error | z value    | Pr(> |z|)  |
|------|----------|------------|------------|------------|
| dist |  -0.0006 |     0.0000 | -9190.4389 | 0.0000 *** |
| lang |  -0.1187 |     0.0006 |  -199.7562 | 0.0000 *** |
| cntg |  -1.3420 |     0.0005 | -2588.1870 | 0.0000 *** |
| clny |  -1.0226 |     0.0009 | -1134.1855 | 0.0000 *** |

Significance codes: *** 99.9%; ** 99%; * 95%; . 90%

Number of observations: Full 28566; Missing 0; Perfect classification 0

Number of Fisher Scoring iterations: 9
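To make the role of the fixed effects in a fit like this more concrete, the following base R sketch combines weighted demeaning with IWLS for a Poisson model with two fixed effects. It is an illustration on simulated data, with a hypothetical helper `wdemean()` and hypothetical factors `f1` and `f2`; it should not be read as what `feglm()` does internally.

```r
# Sketch: Poisson IWLS where the fixed effects are removed by weighted
# demeaning in each iteration (base R only, simulated data).
set.seed(7)
n  <- 2000
f1 <- factor(sample(1:15, n, replace = TRUE))
f2 <- factor(sample(1:12, n, replace = TRUE))
x  <- rnorm(n)
y  <- rpois(n, exp(0.4 * x + as.integer(f1) / 15 + as.integer(f2) / 12))

# Hypothetical helper: subtract weighted group means until convergence
wdemean <- function(v, w, groups, tol = 1e-10, max_iter = 1000) {
  for (i in seq_len(max_iter)) {
    v_old <- v
    for (j in seq_along(groups)) {
      g <- groups[[j]]
      v <- v - ave(w * v, g, FUN = sum) / ave(w, g, FUN = sum)
    }
    if (max(abs(v - v_old)) < tol) break
  }
  v
}

groups <- list(f1, f2)
mu   <- (y + mean(y)) / 2  # strictly positive starting values
eta  <- log(mu)
beta <- 0
for (iter in 1:100) {
  z <- eta + (y - mu) / mu          # working response
  w <- mu                           # working weights
  z_dm <- wdemean(z, w, groups)
  x_dm <- wdemean(x, w, groups)
  beta_new <- sum(w * x_dm * z_dm) / sum(w * x_dm^2)  # weighted LS slope
  eta <- z - (z_dm - x_dm * beta_new)  # fitted values = response minus residual
  mu  <- exp(eta)
  done <- abs(beta_new - beta) < 1e-10
  beta <- beta_new
  if (done) break
}

beta_glm <- unname(coef(glm(y ~ x + f1 + f2, family = poisson(link = "log")))["x"])
```

The slope `beta` should agree with the `glm()` estimate that includes both factors as dummies, while the loop itself only ever works with vectors of length `n`.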

Installation

You can install the development version of capybara like so:

remotes::install_github("pachadotdev/capybara")

Examples

See the documentation in progress: https://pacha.dev/capybara.

Benchmarks

Median time for the different models in the book An Advanced Guide to Trade Policy Analysis.

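| Package  | PPML    | Trade Diversion | Endogeneity | Reverse Causality | Non-linear/Phasing Effects | Globalization |
|----------|---------|-----------------|-------------|-------------------|----------------------------|---------------|
| Alpaca   | 282ms   | 1.78s           | 1.1s        | 1.34s             | 2.18s                      | 4.48s         |
| Base R   | 36.2s   | 36.87s          | 9.81m       | 10.03m            | 10.41m                     | 10.4m         |
| Capybara | 159.2ms | 97.96ms         | 81.38ms     | 86.77ms           | 104.69ms                   | 130.22ms      |
| Fixest   | 33.6ms  | 191.04ms        | 64.38ms     | 75.2ms            | 102.18ms                   | 162.28ms      |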

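Memory allocation for the same models.

| Package  | PPML     | Trade Diversion | Endogeneity | Reverse Causality | Non-linear/Phasing Effects | Globalization |
|----------|----------|-----------------|-------------|-------------------|----------------------------|---------------|
| Alpaca   | 282.78MB | 321.5MB         | 270.4MB     | 308MB             | 366.5MB                    | 512.1MB       |
| Base R   | 2.73GB   | 2.6GB           | 11.9GB      | 11.9GB            | 11.9GB                     | 12GB          |
| Capybara | 339.13MB | 196.3MB         | 162.6MB     | 169.1MB           | 181.1MB                    | 239.9MB       |
| Fixest   | 44.79MB  | 36.6MB          | 28.1MB      | 32.4MB            | 41.1MB                     | 62.9MB        |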
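The post does not say how these figures were collected. As a rough illustration of how a median timing could be obtained, here is a base R sketch that times a dummy-variable Poisson fit several times on simulated data. The data, the number of repetitions and the helper `time_fit()` are all assumptions, and memory allocation is not measured here.

```r
# Sketch: median elapsed time over repeated fits (base R only).
# The model and data are simulated stand-ins, not the book's applications.
set.seed(123)
n <- 2000
d <- data.frame(x = rnorm(n), g = factor(sample(1:20, n, replace = TRUE)))
d$y <- rpois(n, exp(0.2 * d$x + as.integer(d$g) / 20))

# Hypothetical helper: elapsed seconds for a single fit
time_fit <- function() {
  system.time(
    glm(y ~ x + g, family = poisson(link = "log"), data = d)
  )[["elapsed"]]
}

times <- vapply(1:5, function(i) time_fit(), numeric(1))
median_time <- median(times)  # seconds
```

Reporting the median rather than the mean makes the summary less sensitive to an occasional slow run.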

Capybara: A Robust Tool for GLM Estimation

Capybara is software for estimating Generalized Linear Models (GLMs) from the Exponential Family and Negative Binomial models. It differs from alternatives in its time and memory savings, its minimal C++ code, and its ease of maintenance. These design choices matter for its long-term adoption in structural counterfactual analysis in International Trade, as well as in other research fields that rely on GLMs.

Implications of Capybara’s Approaches

A central feature of Capybara is how efficiently it demeans variables before estimating a GLM via Iteratively Weighted Least Squares (IWLS). This is especially useful when estimating linear models with multiple group fixed effects.

The speed and small memory footprint follow from a method based on Halperin's 1962 work on vector projections. Traditional QR estimation can become infeasible due to its additional memory requirements, and Capybara avoids that cost. This makes it a tool that could benefit applied economic research going forward.

Future Developments: A Potential Goldmine of Enhancements

The benchmark differences between Capybara and alternatives such as Alpaca and base R suggest room for further development. Because it generally allocates less memory and runs faster than those two, Capybara could evolve to handle more elaborate statistical analyses without compromising numerical stability.

The summary tables generated by Capybara are also an advantage, since they resemble those from the broom package and Stata outputs. This might encourage adoption by those who prefer clean, easily interpretable output. Future work on this feature could add more customization and better formatting functions.

Actionable Advice

For Researchers: If you estimate GLMs from the Exponential Family or Negative Binomial models, or work directly on structural counterfactual analysis, adopting Capybara can likely improve your productivity. Its efficient approach can save time, and because it relies on iterations rather than large memory allocations, it can work well even on low-memory systems.

For Developers: Although its estimation results are already good, Capybara still uses elementary C++ code without parallelization. Adding parallel workflows, or further optimizing the C++ code for speed, could yield even better results.

Conclusion

As a compact and memory-efficient tool, Capybara has much to offer for GLM estimation. Its novelty lies in its lean design and its simplicity. Adapting it for mainstream use and refining it carefully could improve how GLM estimation with fixed effects is done in practice.