From time to time, the following questions pop up:
- How to calculate grouped counts and (weighted) means?
- What are fast ways to do it in R?
This blog post presents a couple of approaches and then compares their speed with a naive benchmark.

Base R
There are many ways to calculate grouped counts and means in base R, e.g., aggregate(), tapply(), by(), or split() + lapply(). In my experience, the fastest way is a combination of tabulate() and rowsum(); a quick sketch of the classic alternatives follows the example below.
# Make data
set.seed(1)
n <- 1e6
y <- rexp(n)
w <- runif(n)
g <- factor(sample(LETTERS[1:3], n, TRUE))
df <- data.frame(y = y, g = g, w = w)

# Grouped counts
tabulate(g)
# 333469 333569 332962

# Grouped means
rowsum(y, g) / tabulate(g)
#       [,1]
# A 1.000869
# B 1.001043
# C 1.000445

# Grouped weighted mean
ws <- rowsum(data.frame(y = y * w, w), g)
ws[, 1L] / ws[, 2L]
# 1.0022749 1.0017816 0.9997058
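For comparison, the classic approaches mentioned above look roughly like this. This is only a minimal sketch reusing df, y, g, and w from the block above; on large data these are typically much slower than tabulate()/rowsum().

# Grouped counts
tapply(y, g, length)

# Grouped means
aggregate(y ~ g, data = df, FUN = mean)
unlist(lapply(split(y, g), mean))

# Grouped weighted means
mapply(weighted.mean, split(y, g), split(w, g))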
But: tabulate() ignores missing values. To avoid problems, create an explicit missing level via factor(x, exclude = NULL).
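To illustrate with a tiny sketch on a toy vector containing a missing value:

x <- c("A", "B", NA, "A")

tabulate(factor(x))                  # 2 1     -> the NA is silently dropped
tabulate(factor(x, exclude = NULL))  # 2 1 1   -> NA becomes its own level and is counted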
Let’s turn to some other approaches.
dplyr
Not optimized for speed or memory, but the de facto standard for data processing in R. I love its syntax.
library(tidyverse)

df <- tibble(df)

# Grouped counts
dplyr::count(df, g)

# Grouped means
df |>
  group_by(g) |>
  summarize(mean(y))

# Grouped weighted means
df |>
  group_by(g) |>
  summarize(sum(w * y) / sum(w))
data.table
Needs no introduction: since 2006, the package for fast data manipulation, written in C.
library(data.table)

dt <- data.table(df)

# Grouped counts (use keyby for sorted output)
dt[, .N, by = g]
#         g      N
#    <fctr>  <int>
# 1:      C 332962
# 2:      B 333569
# 3:      A 333469

# Grouped means
dt[, mean(y), by = g]

# Grouped weighted means
dt[, sum(w * y) / sum(w), by = g]
dt[, weighted.mean(y, w), by = g]
DuckDB
An extremely powerful query engine and database system written in C++, with an initial release in 2019 and R bindings since 2020. It allows larger-than-RAM calculations.
library(duckdb)

con <- dbConnect(duckdb())
duckdb_register(con, name = "df", df = df)

# Grouped counts
dbGetQuery(con, "SELECT g, COUNT(*) N FROM df GROUP BY g")

# Grouped means
dbGetQuery(con, "SELECT g, AVG(y) AS mean FROM df GROUP BY g")

# Grouped weighted means
con |>
  dbGetQuery(
    "
    SELECT g, SUM(y * w) / sum(w) as wmean
    FROM df
    GROUP BY g
    "
  )
#   g     wmean
# 1 A 1.0022749
# 2 B 1.0017816
# 3 C 0.9997058
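The example above registers an in-memory data frame, but the larger-than-RAM capability comes from querying files on disk directly. A rough sketch (the path df.parquet is only a placeholder):

# Write the registered data frame to a Parquet file, then aggregate
# straight from disk without loading everything into memory.
dbExecute(con, "COPY df TO 'df.parquet' (FORMAT PARQUET)")
dbGetQuery(con, "SELECT g, AVG(y) AS mean FROM 'df.parquet' GROUP BY g")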
collapse
A C/C++-based package for data transformation and statistical computing, first released on CRAN in 2020. It can do much more than grouped calculations; check it out!
library(collapse)

# Grouped counts
fcount(g)
fnobs(g, g)  # Faster and does not need memory, but ignores missing values

# Grouped means
fmean(y, g = g)

# Grouped weighted means
fmean(y, g = g, w = w)
#         A         B         C
# 1.0022749 1.0017816 0.9997058
Polars
R bindings for the fantastic Polars project, which started in 2020; the first R release followed in 2022. The bindings are about to be overhauled into the R package {neopolars}.
# Sys.setenv(NOT_CRAN = "true")
# install.packages("polars", repos = "https://community.r-multiverse.org")
library(polars)

dfp <- as_polars_df(df)

# Grouped counts
dfp$get_column("g")$value_counts()

# Faster, but eats more memory
dfp$select("g")$with_columns(pl$lit(1L))$group_by("g")$sum()

# Grouped means
dfp$select(c("g", "y"))$group_by("g")$mean()

# Grouped weighted means
(
  dfp
  $with_columns(pl$col("y") * pl$col("w"))
  $group_by("g")
  $sum()
  $with_columns(pl$col("y") / pl$col("w"))
  $drop("w")
)
# shape: (3, 2)
# ┌─────┬──────────┐
# │ g   ┆ y        │
# │ --- ┆ ---      │
# │ cat ┆ f64      │
# ╞═════╪══════════╡
# │ A   ┆ 1.002275 │
# │ B   ┆ 1.001782 │
# │ C   ┆ 0.999706 │
# └─────┴──────────┘
Naive Benchmark
Let’s compare the speed of these approaches for sample sizes up to 10^8 using a Windows system with an Intel i7-13700H CPU.
# We run the code in a fresh session
library(tidyverse)
library(duckdb)
library(data.table)
library(collapse)
library(polars)

polars_info()  # 8 threads
setDTthreads(8)
con <- dbConnect(duckdb(config = list(threads = "8")))

set.seed(1)

N <- 10^(5:8)
m_queries <- 3
results <- vector("list", length(N) * m_queries)

for (i in seq_along(N)) {
  n <- N[i]

  # Create data
  y <- rexp(n)
  w <- runif(n)
  g <- factor(sample(LETTERS, n, TRUE))
  df <- tibble(y = y, g = g, w = w)
  dt <- data.table(df)
  dfp <- as_polars_df(df)
  duckdb_register(con, name = "df", df = df, overwrite = TRUE)

  # Grouped counts
  results[[1 + (i - 1) * m_queries]] <- bench::mark(
    base = tabulate(g),
    dplyr = dplyr::count(df, g),
    data.table = dt[, .N, by = g],
    polars = dfp$get_column("g")$value_counts(),
    collapse = fcount(g),
    duckdb = dbGetQuery(con, "SELECT g, COUNT(*) N FROM df GROUP BY g"),
    check = FALSE,
    min_iterations = 3,
  ) |>
    bind_cols(n = n, query = "counts")

  results[[2 + (i - 1) * m_queries]] <- bench::mark(
    base = rowsum(y, g) / tabulate(g),
    dplyr = df |> group_by(g) |> summarize(mean(y)),
    data.table = dt[, mean(y), by = g],
    polars = dfp$select(c("g", "y"))$group_by("g")$mean(),
    collapse = fmean(y, g = g),
    duckdb = dbGetQuery(con, "SELECT g, AVG(y) AS mean FROM df GROUP BY g"),
    check = FALSE,
    min_iterations = 3
  ) |>
    bind_cols(n = n, query = "means")

  results[[3 + (i - 1) * m_queries]] <- bench::mark(
    base = {
      ws <- rowsum(data.frame(y = y * w, w), g)
      ws[, 1L] / ws[, 2L]
    },
    dplyr = df |> group_by(g) |> summarize(sum(w * y) / sum(w)),
    data.table = dt[, sum(w * y) / sum(w), by = g],
    polars = (
      dfp
      $with_columns(pl$col("y") * pl$col("w"))
      $group_by("g")
      $sum()
      $with_columns(pl$col("y") / pl$col("w"))
      $drop("w")
    ),
    collapse = fmean(y, g = g, w = w),
    duckdb = dbGetQuery(
      con, "SELECT g, SUM(y * w) / sum(w) as wmean FROM df GROUP BY g"
    ),
    check = FALSE,
    min_iterations = 3
  ) |>
    bind_cols(n = n, query = "weighted means")
}

results_df <- bind_rows(results) |>
  group_by(n, query) |>
  mutate(
    time = median,
    approach = as.character(expression),
    relative = as.numeric(time / min(time))
  ) |>
  ungroup()

ggplot(results_df, aes(y = relative, x = query, group = approach, color = approach)) +
  geom_point() +
  geom_line() +
  facet_wrap("n", scales = "free_y") +
  labs(x = element_blank(), y = "Relative timings") +
  theme_gray(base_size = 14)

ggplot(results_df, aes(y = time, x = query, group = approach, color = approach)) +
  geom_point() +
  geom_line() +
  facet_wrap("n", scales = "free_y") +
  labs(x = element_blank(), y = "Absolute time in seconds") +
  theme_gray(base_size = 14)

Memory
What about memory? {dplyr}, {data.table}, and rowsum() require a lot of it, as does collapse::fcount(). For the other approaches, almost no memory is required, or profmem can't measure it.
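As a rough illustration, allocations of a single call can be measured with {profmem}. This is just a minimal sketch reusing y and g from the first code block; exact numbers will differ by system and R build.

library(profmem)  # requires R with memory profiling enabled (the CRAN builds are)

p <- profmem(rowsum(y, g) / tabulate(g))
total(p) / 1e6  # total allocations in MB for the base R grouped mean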
Final words
- {collapse} is incredibly fast for all sample sizes and tasks. In other benchmarks it appears slower because the grouping variable there has to be a string rather than a factor.
- {duckdb} is incredibly fast for large data.
- {polars} looks really cool.
- rowsum() and tabulate() provide fast solutions with base R.
- Don't trust my benchmarks!
Analysis of Fast Grouped Counts and Means in R
In an attempt to find viable solutions to recurring data processing questions around grouped counts and means, this blog post presents and tests different methods and packages available in R for the task: base R, dplyr, data.table, DuckDB, collapse, and Polars. These approaches were compared and benchmarked for speed and efficiency, yielding several key insights for those working with large data sets in R. We then discuss the longer-term implications and future developments of these findings.
Functional Outputs
In the current R ecosystem, many approaches can be used to calculate grouped counts and means, including base R's aggregate(), tapply(), by(), and split() + lapply() mechanisms, as well as combinations of them. Among these, tabulate() and rowsum() give the fastest results in base R; however, the data must be prepared to account for missing values.
The analysis took us through other approaches, including dplyr (although not optimized for speed, it is popular due to its syntax), data.table, DuckDB (a high-powered query engine), collapse (a C/C++-based package), and Polars (R bindings of the famed Polars project).
Benchmarking: Speed and Performance
The blog post presented a naive benchmark, comparing the speed of each of these approaches for small to large (up to 10^8) sample sizes. Here are the observations:
- ‘collapse’ was incredibly fast for all sample sizes and tasks.
- DuckDB showed significant speed for large data.
- Polars also showed promise.
- rowsum() and tabulate() provided quick solutions in base R.
It was also noted that dplyr, data.table and rowsum() require a significant amount of memory, as does collapse::fcount().
Future Implications and Developments
The findings in the blog provide insights into the tools data scientists can consider when working with large data sets in R, depending on the size of their data and the computational resources at their disposal.
Considering DuckDB's performance on large data, it is well positioned as the demand for processing large data sets continues to grow. Going forward, efficient engines and packages written in C++ and C, such as DuckDB and collapse, are likely to increase in popularity and demand. These tools therefore merit continued support, enhancement, and optimization for massive computational tasks, leading to more efficient data processing systems.
It’s important to mention that the benchmarks shouldn’t be taken as definitive but as suggestive. Performance can vary based on factors such as machine specifications, data structures, and other environmental variables.
Actionable Advice
- Consider ‘collapse’ for fast performance across all tasks and sample sizes.
- If working with large data sizes, DuckDB is recommended due to its significant speed.
- For base R users, ‘tabulate()’ and ‘rowsum()’ are efficient but make sure to account for missing values.
- Keep in mind that dplyr, data.table, and base R's rowsum() can be memory-intensive.
- Lastly, remember that the performance of these approaches can vary depending on the execution environment, so it is important to perform custom benchmark tests for specific tasks and machines.