arXiv:2407.13885v1 Announce Type: new Abstract: When implementations of the Transformer’s self-attention layer utilize SRAM instead of DRAM, they can achieve significant speedups. The Tenstorrent Grayskull architecture provides a large SRAM, distributed across a grid of cores. This work presents a fused kernel for Grayskull that exclusively utilizes its large SRAM by combining matrix multiplication, attention score scaling and Softmax operations. Additionally, a dedicated Softmax kernel utilizing the SRAM and a CPU implementation serving as a baseline are presented. The Softmax operation consumes most of the runtime in the computation of attention weights from queries and keys on Grayskull. The speedup of the dedicated Softmax kernel compared to the CPU implementation is up to $10\times$, and the Softmax implementation inside the fused kernel is approximately $1.8\times$ faster than the dedicated Softmax kernel. The time and memory complexity of all implementations is quadratic in sequence length. Currently, the Grayskull e150 is approximately $30\times$ cheaper for the general public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers approximately $1.5\times$ more SRAM.
Title: Maximizing Speed and Efficiency: Harnessing SRAM for Enhanced Transformer Performance
Introduction:
In a recent study, researchers present an approach to significantly accelerate the self-attention layer of the Transformer model. By leveraging SRAM (Static Random-Access Memory) instead of DRAM (Dynamic Random-Access Memory), notable speedups can be achieved. This is made possible by the Tenstorrent Grayskull architecture, which distributes a large SRAM across a grid of cores.
The study introduces a fused kernel designed specifically for Grayskull that operates entirely within its large SRAM. By combining matrix multiplication, attention score scaling, and Softmax operations in a single kernel, it avoids round trips to slower off-chip memory. The researchers also present a dedicated Softmax kernel utilizing the SRAM, and a CPU implementation serving as a baseline for comparison.
Notably, the computation of attention weights from queries and keys on Grayskull is dominated by the Softmax operation. The dedicated Softmax kernel addresses this, achieving a speedup of up to 10 times compared to the CPU implementation. The Softmax implementation within the fused kernel goes further, running approximately 1.8 times faster than the dedicated Softmax kernel.
It is worth noting that all implementations, despite their improved performance, have time and memory complexity that is quadratic in sequence length. Additionally, the Grayskull e150, at a significantly lower cost than the Nvidia H100 PCIe (a state-of-the-art GPU) and with approximately 1.5 times more SRAM, presents an enticing option for the general public.
This research not only showcases the potential of utilizing SRAM in the Transformer’s self-attention layer but also highlights what the Tenstorrent Grayskull architecture makes possible. With its measured speedups and cost-effectiveness, the study points toward improved performance and accessibility in the field of deep learning.
Unlocking the Potential of SRAM: The Grayskull Architecture
When it comes to implementing the self-attention layer of the Transformer model, SRAM (Static Random Access Memory) can make a substantial difference. With its faster access times compared to DRAM (Dynamic Random Access Memory), SRAM opens the door to significant speed improvements. The Tenstorrent Grayskull architecture takes full advantage of this, offering a large SRAM distributed across a grid of cores.
In this work, a fused kernel designed exclusively for the Grayskull architecture is introduced. By combining matrix multiplication, attention score scaling, and Softmax operations, this fused kernel maximizes the utilization of the architecture’s large SRAM. Additionally, a dedicated Softmax kernel leveraging the SRAM, and a CPU implementation serving as a baseline, are presented.
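To make the combined operations concrete, here is a minimal NumPy sketch of the attention-weight computation that the fused kernel covers: a query-key matrix multiplication, scaling by the square root of the head dimension (the standard Transformer choice), and a row-wise Softmax. This is only an illustrative reference, not the Grayskull kernel itself; the function name, the shapes, and the use of NumPy are assumptions for the example, and the real kernel runs on the chip's core grid with all data held in SRAM.

```python
import numpy as np

def attention_weights(Q, K):
    """Illustrative sketch of the operations the fused kernel combines:
    matrix multiplication, attention-score scaling, and row-wise Softmax.
    Q, K: (seq_len, d_head) arrays of queries and keys."""
    d_head = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_head)           # matmul + scaling -> (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)  # normalize each row

# Hypothetical example: 1024-token sequence, head dimension 64
Q = np.random.randn(1024, 64).astype(np.float32)
K = np.random.randn(1024, 64).astype(np.float32)
A = attention_weights(Q, K)  # shape (1024, 1024); each row sums to 1
```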
While all operations in the self-attention layer contribute to the overall computation time, the Softmax operation stands out as the major bottleneck when computing attention weights from queries and keys on Grayskull. Optimizing this operation therefore offers the largest potential speed gains.
Dedicated Softmax Kernel: Unleashing Speed
Comparing the dedicated Softmax kernel with the CPU implementation reveals substantial speed improvements: the dedicated Softmax kernel achieves a speedup of up to 10 times. This leap is made possible by keeping the data in the Grayskull architecture’s SRAM.
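As a point of reference for what the dedicated kernel computes, the row-wise Softmax over the score matrix can be written out explicitly, which also shows why it is costly: every row needs a max reduction, an exponentiation pass, and a sum-plus-normalization pass. The loop below is a plain CPU-style sketch under that framing; the function name and the max-subtraction for numerical stability are assumptions, since the abstract does not describe the kernels' internals.

```python
import numpy as np

def rowwise_softmax(scores):
    """Numerically stable row-wise Softmax over an (n, n) score matrix.
    Each row requires a max reduction, an elementwise exponentiation,
    and a sum-plus-normalize pass; this sketch runs the per-row work
    sequentially on the CPU as a baseline-style illustration."""
    out = np.empty_like(scores)
    for i, row in enumerate(scores):
        m = row.max()              # reduction 1: row maximum (for stability)
        e = np.exp(row - m)        # elementwise pass: exponentiate shifted row
        out[i] = e / e.sum()       # reduction 2: sum, then normalize
    return out

# Example on a random 1024 x 1024 score matrix
scores = np.random.randn(1024, 1024).astype(np.float32)
weights = rowwise_softmax(scores)  # each row of `weights` now sums to 1
```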
But the gains do not stop there. Within the fused kernel, the Softmax operation runs approximately 1.8 times faster than in the dedicated Softmax kernel. By combining matrix multiplication, scaling, and Softmax into a single kernel, intermediate results never leave SRAM, which further improves efficiency.
Quadratic Complexity, Affordable Solution
It’s important to note that both the time and memory complexity of all implementations presented in this work are quadratic in sequence length, since the full attention-score matrix, with one entry per query-key pair, is computed and stored. At the same time, the Grayskull e150 offers an affordable option for the general public.
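To make the quadratic growth concrete, the short calculation below estimates the size of the full score matrix for a few sequence lengths. The 4-byte (fp32) entry size is an assumption for the sake of the example; the abstract does not state which data type the kernels use.

```python
# The attention-score matrix has seq_len x seq_len entries, so its footprint
# grows quadratically with sequence length. Assuming 4 bytes per entry (fp32):
for seq_len in (1024, 2048, 4096):
    score_bytes = seq_len * seq_len * 4
    print(f"seq_len={seq_len}: {score_bytes / 2**20:.0f} MiB for the score matrix")
```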
Currently, the Grayskull e150 is approximately 30 times cheaper than the Nvidia H100 PCIe, a state-of-the-art GPU, while providing approximately 1.5 times more SRAM. This affordability combined with the speed advantages makes Grayskull a compelling choice for those looking to enhance their Transformer models.
With the Tenstorrent Grayskull architecture and its dedicated Softmax kernel, the limitations of DRAM-bound implementations can be sidestepped. Utilizing SRAM unlocks substantial speed improvements, paving the way for more efficient and cost-effective Transformer models. As research continues to push the boundaries of machine learning, work like this highlights the potential of rethinking the underlying hardware architecture.
The paper discusses the use of SRAM (Static Random Access Memory) instead of DRAM (Dynamic Random Access Memory) in the implementation of the self-attention layer of the Transformer model. The authors specifically focus on the Tenstorrent Grayskull architecture, which provides a large SRAM distributed across a grid of cores.
The researchers propose a fused kernel for Grayskull that combines matrix multiplication, attention score scaling, and Softmax operations, all of which are essential components of the self-attention mechanism. By exclusively utilizing the large SRAM of Grayskull, they aim to achieve significant speedups compared to traditional implementations that rely on DRAM.
The paper also introduces a dedicated Softmax kernel that utilizes the SRAM and a CPU implementation serving as a baseline for comparison. The Softmax operation, which calculates attention weights from queries and keys, is found to be the most time-consuming part of the computation on Grayskull.
The results demonstrate impressive speedups achieved by the dedicated Softmax kernel compared to the CPU implementation, with a maximum speedup of up to 10 times. Furthermore, the Softmax implementation inside the fused kernel is approximately 1.8 times faster than the dedicated Softmax kernel. These findings highlight the advantages of utilizing SRAM and optimizing the Softmax operation for improved performance.
It is worth noting that the time and memory complexity of all implementations discussed in the paper are quadratic in sequence length. This means that as the sequence length increases, the computational requirements also grow significantly. However, the authors do not provide further insights into potential optimizations or scalability considerations for longer sequences.
The paper concludes by highlighting the cost-effectiveness of the Grayskull e150 architecture compared to the Nvidia H100 PCIe, a state-of-the-art GPU. The Grayskull e150 is approximately 30 times cheaper for the general public and offers 1.5 times more SRAM. This cost advantage, combined with the performance improvements achieved through SRAM utilization, makes Grayskull an attractive option for researchers and practitioners working with Transformer models.
In terms of future directions, it would be interesting to see how the proposed optimizations and SRAM utilization techniques can be applied to larger-scale Transformer models and datasets. Additionally, exploring potential trade-offs between SRAM and DRAM usage, as well as investigating the impact of different attention mechanisms, could further enhance the performance and efficiency of the Grayskull architecture.