Pipeline parallelism is one of the key components for large-scale distributed
training, yet its efficiency suffers from pipeline bubbles which were deemed
inevitable. In this work, we introduce a scheduling strategy that, to our
knowledge, is the first to successfully achieve zero pipeline bubbles under
synchronous training semantics. The key idea behind this improvement is to
split the backward computation into two parts: one that computes the gradient
with respect to the input and another that computes the gradients with respect
to the parameters. Based on this idea, we
handcraft novel pipeline schedules that significantly outperform the baseline
methods. We further develop an algorithm that automatically finds an optimal
schedule based on specific model configuration and memory limit. Additionally,
to truly achieve zero bubble, we introduce a novel technique to bypass
synchronizations during the optimizer step. Experimental evaluations show that
our method outperforms the 1F1B schedule by up to 23% in throughput under a
similar memory limit. This number can be further pushed to 31% when the memory
constraint is relaxed. We believe our results mark a major step forward in
harnessing the true potential of pipeline parallelism. We have open-sourced our
implementation, based on the popular Megatron-LM repository, at
https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
Pipeline parallelism is an integral part of large-scale distributed training, but its efficiency has long been hindered by pipeline bubbles that were considered unavoidable. This work introduces a scheduling strategy that, according to the authors, is the first to achieve zero pipeline bubbles while preserving synchronous training semantics.
The authors propose splitting the backward computation into two parts: one that computes the gradient with respect to the input and another that computes the gradients with respect to the parameters (see the sketch below). Decoupling the two lets them handcraft novel pipeline schedules that outperform existing methods by a significant margin.
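To make the split concrete, here is a minimal sketch for a single linear layer, written purely for illustration; the function names `backward_input` and `backward_weight` are ours and do not come from the paper or the Megatron-LM code.

```python
import torch

# Split the backward pass of a linear layer y = x @ W.T into two independent passes.

def backward_input(grad_output: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # dL/dx = dL/dy @ W: needed immediately, so the previous pipeline stage
    # can start its own backward pass (the input-gradient, or "B", pass).
    return grad_output @ weight

def backward_weight(grad_output: torch.Tensor, saved_input: torch.Tensor) -> torch.Tensor:
    # dL/dW = (dL/dy).T @ x: only needed before the optimizer step, so it can
    # be deferred to fill what would otherwise be a bubble (the "W" pass).
    return grad_output.t() @ saved_input

# Example shapes: batch 4, in_features 8, out_features 16.
x = torch.randn(4, 8)            # saved input from the forward pass
W = torch.randn(16, 8)           # layer weight
dy = torch.randn(4, 16)          # upstream gradient dL/dy

dx = backward_input(dy, W)       # shape (4, 8), sent upstream right away
dW = backward_weight(dy, x)      # shape (16, 8), can be computed later
```

Because `dx` is all the upstream stage needs in order to proceed, the weight-gradient computation can be rescheduled freely, which is exactly what creates the slack used to fill the bubbles.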
Moreover, the researchers have developed an algorithm that automatically finds an optimal schedule based on the model configuration and memory limit. This means that practitioners can leverage this approach without needing to manually fine-tune the scheduling strategy for their specific setup.
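The abstract does not spell out the search procedure, so the following is only a toy illustration of the constraint it works under: a per-stage greedy rule in which a cap on live activations decides whether a stage runs a forward (F), an input-gradient backward (B), or a deferred weight-gradient pass (W). The `Stage` class and its priority rule are our own simplification, not the paper's algorithm.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Stage:
    """One pipeline stage with a cap on how many activations it may hold."""
    mem_limit: int
    fwd_ready: deque = field(default_factory=deque)  # micro-batches ready for F
    bwd_ready: deque = field(default_factory=deque)  # micro-batches ready for B
    w_pending: deque = field(default_factory=deque)  # deferred W passes
    live_activations: int = 0

    def next_op(self):
        # Prefer F while memory allows (keeps downstream stages fed),
        # then B (frees an activation and unblocks the upstream stage),
        # and use W to fill slots that would otherwise sit idle.
        if self.fwd_ready and self.live_activations < self.mem_limit:
            self.live_activations += 1
            return ("F", self.fwd_ready.popleft())
        if self.bwd_ready:
            self.live_activations -= 1
            mb = self.bwd_ready.popleft()
            self.w_pending.append(mb)
            return ("B", mb)
        if self.w_pending:
            return ("W", self.w_pending.popleft())
        return None  # nothing runnable: this step is a bubble
```

A real scheduler additionally accounts for per-pass runtimes and communication, and searches over the whole pipeline rather than applying a local rule, but the memory limit plays the same role of bounding how far the forward passes may run ahead.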
To truly reach zero bubbles, the authors also introduce a technique that bypasses the synchronization normally performed during the optimizer step, removing the last source of idle time in the pipeline.
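The abstract does not detail the mechanism. One common way to avoid a blocking pre-step synchronization (for example, a global gradient-norm all-reduce used for clipping and NaN checks) is to apply the update optimistically and validate afterwards, rolling back if the check fails. The sketch below is our own illustration along those lines; `optimistic_step` is a hypothetical helper, not an API from the paper's repository.

```python
import torch

def optimistic_step(optimizer, params, max_norm=1.0):
    """Apply the optimizer update without waiting on a global check, then
    validate and roll the parameters back if the check fails.
    (Restoring optimizer internal state, e.g. Adam moments, is omitted.)"""
    snapshot = [p.detach().clone() for p in params]

    optimizer.step()  # do not block on a synchronization before updating

    # Validation can overlap with other pipeline work. In distributed
    # training, local_norm_sq would be all-reduced across ranks; here we
    # only check the local value for illustration.
    local_norm_sq = sum(
        p.grad.pow(2).sum() for p in params if p.grad is not None
    )
    grad_norm = local_norm_sq.sqrt()

    if (not torch.isfinite(grad_norm)) or grad_norm > max_norm:
        with torch.no_grad():
            for p, saved in zip(params, snapshot):
                p.copy_(saved)      # undo the optimistic update
        return False                # caller may skip or redo this step
    return True
```

The point of this structure is that the expensive global agreement happens after the update, off the critical path, instead of stalling every pipeline stage before the optimizer step.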
The experimental evaluations clearly demonstrate the superiority of this new method. It outperforms the 1F1B schedule by up to 23% in throughput, even under a similar memory constraint. When the memory constraint is relaxed, this improvement increases to an impressive 31%. These results showcase the potential of pipeline parallelism, and this research represents a significant step forward in harnessing its true power.
The work is also notably interdisciplinary, combining ideas from distributed systems, deep learning, and optimization: the scheduling strategy, the automatic search algorithm, and the synchronization bypass each draw on a different one of these areas. This kind of multi-disciplinary approach is becoming increasingly important as researchers push the limits of large-scale training.
To encourage collaboration and further progress, the authors have open-sourced their implementation, built on the popular Megatron-LM repository, on GitHub, allowing other researchers and practitioners to reproduce the results and build on the work.