[This article was first published on gacatag, and kindly contributed to R-bloggers.]

The aggregate function can be very useful in R, allowing one to run a function (e.g. mean) within groups of rows, for each column of a matrix or data frame, and organize the results in an easy-to-read table. However, the function takes a long time to run on very wide matrices and data frames, i.e. those with a large number of columns. In this post I demonstrate the issue and show a couple of nice solutions that, at least for this example, cut the run time down to 15% of that of aggregate, and even less.

I first created a wide matrix with 100 rows and 10,000 columns, holding 1,000,000 values randomly generated from a normal distribution.

# The necessity to avoid wide matrices (with lots of columns)!
matWide= matrix(rnorm(1e+06),nrow=100, ncol=10000)

# Transform matrix to data frame
dfWide=as.data.frame(matWide)

I used the aggregate function to take the mean within groups of rows, for each column, and realized that it takes about 4 seconds to run.

t1=Sys.time()
aggRes=aggregate(dfWide, list(rep(1:10, each=10)), mean)
(timeDifAggr=difftime(Sys.time(), t1, units = "secs"))
#Time difference of 3.807029 secs

Here are the first 5 rows and columns of the resulting data frame, and its dimensions.

aggRes[1:5,1:5]
#  Group.1           V1          V2         V3          V4
#1       1  0.008815372  0.56920407  0.2195522  0.68183883
#2       2  0.046319580  0.07915253  0.2732586  0.30970451
#3       3  0.154718798 -0.09157008 -0.3676212 -0.02970137
#4       4  0.491208585  0.53066464 -0.1407269  0.49633703
#5       5 -0.397868879 -0.09793382  0.4154764 -0.17150871

dim(aggRes)
#[1]    10 10001
 

Then I used a nested ‘apply’ approach (technically a tapply inside an apply function) to run the same analysis. It took significantly less time (about half a second).

t1=Sys.time()
nestApplyRes=apply(dfWide, 2, function(x){
  return(tapply(x, rep(1:10, each=10), mean))})
nestApplyRes=data.frame(Group.1=rownames(nestApplyRes),
                        nestApplyRes)
(timeDifNest=difftime(Sys.time(), t1, units = "secs"))
#Time difference of 0.5010331 secs

#Check if it provides exactly the same result as aggregate
all(aggRes==nestApplyRes)
#[1] TRUE
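
As an aside (not from the original post), exact equality of floating-point results can be fragile in general; a tolerance-based check on the numeric columns, for instance with all.equal, is a safer variant of the same verification:

#Compare the numeric columns with a tolerance rather than exact equality
isTRUE(all.equal(as.matrix(aggRes[,-1]), as.matrix(nestApplyRes[,-1]),
                 check.attributes=FALSE))
#Should also return TRUE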

Finally, I used the data.table package, as has been suggested in several forums. It took even less time to run: about 0.26 seconds.

 
library(data.table)
t1=Sys.time()
#Convert to data.table and compute the group mean of each column (as aggregate does)
dtRes <- as.data.table(dfWide)[, lapply(.SD, mean),
                               by = .(Group.1 = rep(1:10, each = 10))]
dtRes=as.data.frame(dtRes)
(timeDifDt=difftime(Sys.time(), t1, units = "secs"))
#Time difference of 0.268255 secs

all(aggRes==dtRes)
#[1] TRUE

I also plotted the run time of each of the approaches!

 
jpeg("TimeDif.jpg", res=300, width=800, height=800)
par(mar = c(6.5, 2.5, 1.5, 0.5))
barplot(height = as.numeric(c(timeDifAggr, timeDifNest, timeDifDt)),
        names.arg = c("Aggregate", "Nested apply", "Data table"),
        las=2, ylim=c(0,4), col=heat.colors(3), ylab="Sec")
dev.off()

So now I’ll think twice before using the aggregate function 😒.



Efficiency of the Aggregate Function in R

The aggregate function in R offers a convenient way to run a function within groups of rows, for each column of a matrix or data frame, and it returns well-organized results that are easy to interpret. However, it takes a long time to process wide matrices and data frames with many columns. The demonstration above shows that the function took about 4 seconds to process a matrix with 100 rows and 10,000 columns, holding 1,000,000 values randomly generated from a normal distribution. There are alternative solutions that significantly reduce this run time.
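
A minimal sketch of this pattern, using a placeholder wide data frame df and grouping vector groups that mirror the example in the post above:

#Group-wise column means with aggregate; df and groups are placeholders
df= as.data.frame(matrix(rnorm(1e+06), nrow=100, ncol=10000))
groups= rep(1:10, each=10)
aggMeans= aggregate(df, by=list(Group.1=groups), FUN=mean)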

Alternative Solutions

The Nested ‘Apply’ Approach

The nested ‘apply’ approach, which places a tapply call inside an apply over the columns, speeds up the same analysis considerably. In the test example it cut the run time from about 4 seconds to roughly half a second, while producing exactly the same result as the aggregate function.
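
A minimal sketch of this pattern, assuming the same placeholder df and groups as above:

#For each column, compute the group means with tapply; apply iterates over columns
nestMeans= apply(df, 2, function(x) tapply(x, groups, mean))
nestMeans= data.frame(Group.1=rownames(nestMeans), nestMeans)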

The Data Table Approach

Using the data.table package, as suggested in several forums, improves the processing time even further: the same computation took about 0.26 seconds in this case. This is a significant improvement in time efficiency when processing wide matrices and data frames.
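
A minimal sketch of the data.table pattern, again assuming the placeholder df and groups from above:

library(data.table)
#Group by the grouping vector and take the mean of every column in .SD
dtMeans= as.data.table(df)[, lapply(.SD, mean),
                           by=.(Group.1=groups)]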

Long-term implications and Future Developments

These alternative ways of processing wide matrices and data frames in R can yield substantial time savings in the long run. Reduced processing time increases productivity, especially in environments that require regular and continuous data processing, and for the large datasets common in machine learning and data science projects it frees up compute time and resources.

In the future, even more efficient methods, or improvements to existing ones, may emerge, further raising productivity in data analysis across many fields. For students, researchers, and professionals who frequently deal with large datasets, this means simpler workflows and better time management, and it encourages a culture of continuously seeking out advances in current tools and techniques.

Actionable Advice

  • Explore more efficient ways: Take time to explore more efficient methods of processing large data in R. This can significantly reduce processing time and improve productivity.
  • Keep up with developments: Stay updated on new developments and improvements to existing methods. The dynamic world of data analysis means that better approaches can emerge at any time.
  • Share your findings: If you come across a better way of doing things, share it with others to help improve their productivity as well.
  • Embrace alternative methods: Don’t get stuck on a single way of performing an operation. Being familiar with several approaches helps you adapt to different situations.

Efficiency in data processing is not just about obtaining the correct results; it’s also about doing so in the shortest time possible!
