[This article was first published on **shikokuchuo{net}**, and kindly contributed to R-bloggers].


#### A surprise

I came to write this post because I was surprised by its findings.

The seed for it came from a somewhat obscure source: the Tokyo.R Slack channel. This is actually a fairly vibrant R community. In any case, there was a post by an R user surprised to find parallel computation much slower than the sequential alternative – even though he had thousands, even tens of thousands, of ‘embarrassingly parallel’ iterations to compute.

He demonstrated this with a simple benchmarking exercise, which showed a variety of parallel map functions from various packages (which shall remain nameless), each slower than the last, and (much) slower than the non-parallel versions.

The replies to this post could be anticipated, and mostly aimed to impart some of the received wisdom: namely that computations need to be sufficiently complex to benefit from parallel processing, due to the extra overhead of sending and coordinating information to and from workers. For simple functions, it is just not worth the effort.

And this is indeed the ‘received wisdom’… and I thought about it… and the benchmarking results continued to look really embarrassing.

The implicit answer was just not particularly satisfying: ‘sometimes it works, you just have to judge when’. And it didn’t really answer the original poster either – he had only used a simple example to expose the problem, not because his real usage was that simple.

The parallel methods just didn’t work. Or rather, didn’t ‘just work’™.

And this is what sparked off The Investigation.

#### The Investigation

It didn’t seem right that there should be such a high bar before parallel computations become beneficial in R.

My starting point would be `mirai`, somewhat naturally (as I’m the author). I also knew that `mirai` would be fast, as it was designed to be minimalist.

`mirai`? That’s みらい, or Japanese for ‘future’. All you need to know for now is that it’s a package that can create its own special type of parallel cluster.

I had not done such a benchmarking exercise before, as performance itself was not its *raison d’être*. More than anything else, it was built as a reliable scheduler for distributed computing. It is the engine that powers crew, the high-performance computing element of targets, where it is used in industrial-scale reproducible pipelines.

And this is what I found:

##### Applying the statistical function `rpois()` over 10,000 iterations:

```r
library(parallel)
library(mirai)

base <- parallel::makeCluster(4)
mirai <- mirai::make_cluster(4)

x <- 1:10000
res <- microbenchmark::microbenchmark(
  parLapply(base, x, rpois, n = 1),
  lapply(x, rpois, n = 1),
  parLapply(mirai, x, rpois, n = 1)
)
ggplot2::autoplot(res) + ggplot2::theme_minimal()
```

Using the ‘mirai’ cluster resulted in faster results than the simple non-parallel `lapply()`, which in turn was much faster than the base default parallel cluster.

Faster!

I’m only showing the comparison with base R functions. They’re often the most performant, after all. The other packages that featured in the original benchmarking suffer from an even greater overhead than that of base R, so there’s little point showing them above.

Let’s confirm with an even simpler function…

##### Applying the base function `sum()` over 10,000 iterations:

```r
res <- microbenchmark::microbenchmark(
  parLapply(base, x, sum),
  lapply(x, sum),
  parLapply(mirai, x, sum)
)
ggplot2::autoplot(res) + ggplot2::theme_minimal()
```

`mirai` holds its own! Not much faster than sequential, but not slower either.

But what if the data being transmitted back and forth is larger – would that make a difference? Well, let’s change up the original `rpois()` example: instead of iterating over lambda, have it return increasingly large vectors instead.

##### Applying the statistical function `rpois()` to generate random vectors around length 10,000:

```r
x <- 9900:10100
res <- microbenchmark::microbenchmark(
  parLapplyLB(base, x, rpois, lambda = 1),
  lapply(x, rpois, lambda = 1),
  parLapplyLB(mirai, x, rpois, lambda = 1)
)
ggplot2::autoplot(res) + ggplot2::theme_minimal()
```

The advantage is maintained! ^{1}

So ultimately, what does this all mean?

Well, quite significantly, that virtually any place you have ‘embarrassingly parallel’ code where you would use `lapply()` or `purrr::map()`, you can now confidently replace it with a parallel `parLapply()` using a ‘mirai’ cluster.

The answer is **no longer** *‘sometimes it works, you just have to judge when’*, but:

‘yes, it works!’.
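To make that swap concrete, here is a minimal sketch – the toy function and the cluster size of four are illustrative choices, not from the benchmarks above:

```r
library(parallel)
library(mirai)

cl <- make_cluster(4)  # a 'mirai' cluster, usable wherever a 'parallel' cluster is

# sequential version
res_seq <- lapply(1:10000, function(i) rpois(1, lambda = i))

# drop-in parallel replacement
res_par <- parLapply(cl, 1:10000, function(i) rpois(1, lambda = i))

stop_cluster(cl)  # shut down the background workers when done
```

The only change is supplying the cluster as the first argument; results come back as a list, just as with `lapply()`.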

#### What is this Magic?

`mirai` uses the latest NNG (Nanomsg Next Generation) technology, a lightweight messaging library and concurrency framework ^{2} – which means that the communications layer is so fast that it no longer creates a bottleneck.

The package leverages new connection types, such as IPC (inter-process communications), that are not available to base R. As part of R Project Sprint 2023, R Core invited participants to provide alternative communications backends for the `parallel` package, and ‘mirai’ clusters were born as a result.

A ‘mirai’ cluster is simply another type of ‘parallel’ cluster: persistent background processes utilising cores on your own machine, or on other machines across the network (HPCs or even the cloud).
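As a small illustration of those persistent background processes – a local-machine sketch, with an arbitrary worker count of two:

```r
library(parallel)
library(mirai)

cl <- make_cluster(2)                         # two persistent background processes
parLapply(cl, 1:4, function(i) Sys.getpid())  # returns the workers' process IDs,
                                              # distinct from the main session's
stop_cluster(cl)                              # shut the workers down when done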

I’ll leave it here for this post. You’re welcome to give `mirai` a try; it’s available on CRAN and at https://github.com/shikokuchuo/mirai.


## Understanding Parallel Computing and Future Developments in R

This post is a follow-up to an informative article from R-bloggers which illustrated that parallel computing is sometimes mistakenly assumed to be slower than sequential computing. The article presented some insightful benchmark tests and shed light on how the mirai package performs better. Its key argument concerns simple, or ‘embarrassingly parallel’, code: functions with iterations or loops that can be executed independently and therefore simultaneously. It points out that, contrary to common belief, parallel computations in such cases can be faster than non-parallel ones.

## Long-term Implications

The value of understanding when and how to leverage parallel computing is enormous. As computational workloads grow, whether due to more complex analysis demands or increases in dataset size, the use of parallel processing will become more and more common.

Regular users of the R ecosystem in particular – researchers, data scientists, and business analysts – will find such knowledge and tools of tremendous value. They can significantly cut down the time it takes to process large datasets and complex calculations by properly leveraging parallel computing resources and tools like mirai.

The development and acceptance of mirai-like packages will also incentivise similar intuitive, user-friendly software packages that democratise access to high-performance computing.

### Potential Future Developments

- Mirai could prompt updates in the default R system to optimise for parallel computations as it has evidently showcased its speed in benchmark tests.
- Mirai’s rise to broader usage could inspire the development of more specific packages leveraging parallelism for specific computations.
- These learnings may further feed into the development of intelligent systems capable of deciding when best to leverage parallel computation, optimising efficiency, and user experience.

## Actionable Advice

Users who often compute ‘embarrassingly parallel’ tasks should try to harness the power of mirai.

- Start by installing and using the package via CRAN or GitHub for parallel processing in lieu of `lapply()` or `purrr::map()`.
- Keep an eye on the evolution of tools like mirai and adapt your workflows accordingly. As software continues to evolve, so must our best practices.
- Take some time to understand when it is best to make use of parallel versus sequential computing. The article gives us some insights; a rule of thumb is to look at the complexity and size of the computation job at hand.
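As a quick reference for the first point above – the GitHub path is the repository linked in the original post, and `remotes` is one common way to install from GitHub:

```r
install.packages("mirai")                       # released version from CRAN
# remotes::install_github("shikokuchuo/mirai")  # or the development version

library(mirai)
cl <- make_cluster(2)
parallel::parLapply(cl, 1:4, sqrt)  # in lieu of lapply(1:4, sqrt)
stop_cluster(cl)
```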

### Notes

A benchmarking exercise in the original article showed that the mirai package resulted in faster computations than both the default parallel cluster in R and non-parallel computations for execution of a simple function over 10,000 iterations.

The mirai package benefits from NNG (Nanomsg Next Generation) technology, a high-performance messaging library and concurrency framework. It applies new connection types such as IPC (inter-process communications), not available to base R.