arXiv:2403.15226v1 Announce Type: new
Abstract: In this paper, we propose a novel parameter and computation efficient tuning method for Multi-modal Large Language Models (MLLMs), termed Efficient Attention Skipping (EAS). Concretely, we first reveal that multi-head attentions (MHAs), the main computational overhead of MLLMs, are often redundant to downstream tasks. Based on this observation, EAS evaluates the attention redundancy and skips the less important MHAs to speed up inference. Besides, we also propose a novel propagation-of-information adapter (PIA) to serve the attention skipping of EAS and keep parameter efficiency, which can be further re-parameterized into feed-forward networks (FFNs) for zero extra latency. To validate EAS, we apply it to a recently proposed MLLM called LaVIN and a classic VL pre-trained model called METER, and conduct extensive experiments on a set of benchmarks. The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference. For instance, LaVIN-EAS achieves 89.98% accuracy on ScienceQA while speeding up inference by 2.2 times compared to LaVIN.
Efficient Attention Skipping (EAS): Enhancing Multi-modal Large Language Models
In the field of multimedia information systems, there has been significant interest in making large language models more efficient to run. Multi-modal Large Language Models (MLLMs) extend such models to handle inputs beyond text, such as images, and have shown promise in applications including natural language processing, image captioning, and visual question answering.
One of the main computational overheads of MLLMs comes from multi-head attentions (MHAs), the components that let the model weigh interactions between tokens across its input modalities. However, recent research has revealed that many of these MHAs are redundant, contributing little to downstream task performance.
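To see why skipping attention pays off, it helps to compare rough per-layer FLOP counts for the attention and feed-forward sub-blocks of a standard transformer layer. The sketch below uses the textbook estimates (constants and softmax cost omitted); the function name and the specific shapes are illustrative, not from the paper.

```python
def transformer_layer_flops(n_tokens, d_model, d_ff):
    """Rough per-layer FLOP estimates for a standard transformer layer.

    MHA cost: 4 projection matmuls (Q, K, V, output), each ~n*d^2,
    plus the score and value-mixing matmuls, each ~n^2*d.
    FFN cost: two matmuls of size n*d*d_ff.
    Constants (e.g. the factor of 2 for multiply-add) are omitted.
    """
    mha = 4 * n_tokens * d_model**2 + 2 * n_tokens**2 * d_model
    ffn = 2 * n_tokens * d_model * d_ff
    return mha, ffn


# For a typical setting, attention is a large share of the layer cost,
# and its n^2 term grows fastest as sequences get longer.
mha, ffn = transformer_layer_flops(n_tokens=512, d_model=768, d_ff=3072)
```

Because the quadratic term belongs to the attention block, every skipped MHA removes that cost entirely for the layer in question.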
In this paper, the authors propose a parameter and computation efficient tuning method for MLLMs, termed Efficient Attention Skipping (EAS). The core idea is to measure how redundant each MHA is for the downstream task and to skip the less important ones at inference time.
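The abstract does not spell out the redundancy metric, so the following is only a minimal sketch of the general recipe: score each attention block by some measure of its contribution on a probe set, then keep the top-scoring fraction and skip the rest. The norm-based score and the `keep_ratio` parameter here are assumptions for illustration, not the paper's actual criterion.

```python
import numpy as np

def rank_mha_redundancy(layer_outputs, keep_ratio=0.5):
    """Toy redundancy ranking for attention blocks.

    layer_outputs: one (n_tokens, d_model) array per MHA layer, holding
    that block's residual contribution on a small probe set.
    Scores each layer by its average contribution magnitude (a stand-in
    for the paper's unstated metric) and returns the sorted indices of
    the layers to KEEP; all other MHAs would be skipped at inference.
    """
    scores = [np.linalg.norm(h) / h.size for h in layer_outputs]
    n_keep = max(1, int(len(scores) * keep_ratio))
    order = np.argsort(scores)[::-1]  # descending by contribution score
    return sorted(int(i) for i in order[:n_keep])


# Example: four layers whose contributions have clearly different magnitudes.
outs = [np.full((4, 8), v) for v in (0.1, 5.0, 0.2, 3.0)]
kept = rank_mha_redundancy(outs, keep_ratio=0.5)  # keeps the two strongest layers
```

At inference, the model's forward pass would simply bypass the attention sub-block in every layer not in the kept set, passing the residual stream straight through.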
To support attention skipping, the authors also introduce a propagation-of-information adapter (PIA) that preserves parameter efficiency. After tuning, the adapter can be re-parameterized into the feed-forward networks (FFNs), so it adds zero extra latency at inference.
The authors validate the effectiveness of EAS by applying it to two different MLLMs: LaVIN, a recently proposed model, and METER, a classic vision and language pre-trained model. They conduct extensive experiments on a set of benchmarks and evaluate the performance and speed of the models with and without EAS.
The results demonstrate that EAS not only retains high performance and parameter efficiency but also significantly speeds up inference. For example, LaVIN-EAS achieves 89.98% accuracy on the ScienceQA benchmark while speeding up inference by 2.2 times compared to LaVIN without EAS.
This research showcases the multi-disciplinary nature of the concepts discussed. It combines elements from natural language processing, computer vision, and machine learning to optimize the performance of MLLMs. The efficiency gained through attention skipping and the use of propagation-of-information adapters can greatly enhance the usability of MLLMs in real-world applications.
In the wider field of multimedia information systems, techniques like Efficient Attention Skipping and the advancements made in MLLMs contribute to the development of more efficient and effective multimedia processing algorithms. These algorithms can be utilized in various multimedia applications, such as virtual reality and augmented reality systems, where the real-time processing of both textual and visual information is crucial.
Overall, this research presents a significant step forward in the optimization of MLLMs and paves the way for future advancements in multimedia information systems and related areas such as animation, augmented reality, and virtual reality.