Analysis: Structured Neuron-level Pruning for Vision Transformers

The article discusses the computational cost and memory footprint of Vision Transformers (ViTs), which make them difficult to deploy on devices with limited resources. Conventional pruning approaches can compress and accelerate the multi-head self-attention (MSA) module in ViTs, but they do not take the structure of the MSA module into account.
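For context, the sketch below shows a standard ViT-style MSA block in PyTorch (my illustration, not code from the paper; DeiT-Small-like widths are assumed). It highlights the structure that generic pruning ignores: attention scores are dot products over the query/key neuron dimension, so query and key neurons are coupled, while value neurons interact only with the output projection.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Standard ViT-style MSA block (illustrative; DeiT-Small-like widths assumed)."""
    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads        # per-head query/key/value width
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)      # fused query/key/value projections
        self.proj = nn.Linear(dim, dim)         # output projection

    def forward(self, x):                       # x: (batch, tokens, dim)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)    # each: (batch, heads, tokens, head_dim)
        # Attention scores are dot products over the query/key neuron dimension,
        # so query and key neurons are structurally coupled through this product.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)             # (batch, heads, tokens, tokens)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)                   # value neurons pair with proj columns instead
```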

To address this, the authors propose Structured Neuron-level Pruning (SNP), which prunes neurons associated with less informative attention scores and eliminates redundancy among heads. Concretely, SNP prunes graphically connected query and key layers that carry the least informative attention scores while preserving the overall attention scores; value layers, which can be pruned independently, are pruned to reduce inter-head redundancy.
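A simplified sketch of what this pruning pattern could look like in PyTorch is shown below. It is not the authors' implementation: the importance criterion here is a plain weight-norm placeholder, whereas SNP ranks neurons by how little they contribute to the attention scores. The point is the coupling: query and key neurons are removed in matched pairs so the remaining attention scores stay consistent, while value neurons (and the matching output-projection columns) are removed independently.

```python
import torch

def prune_qk_jointly(W_q, W_k, keep_ratio=0.75):
    """Prune the same neurons (rows) from one head's query and key projections.

    W_q, W_k: (head_dim, embed_dim) weight matrices. Because attention is
    q @ k^T, removing row d from both matrices drops one term of every dot
    product consistently, so the remaining attention scores stay well formed.
    The importance score here is a placeholder (joint L2 norm); SNP instead
    ranks query/key neuron pairs by how little they contribute to attention.
    """
    importance = W_q.norm(dim=1) + W_k.norm(dim=1)         # one score per Q/K pair
    n_keep = max(1, int(keep_ratio * W_q.shape[0]))
    keep = importance.topk(n_keep).indices.sort().values   # neuron indices to keep
    return W_q[keep], W_k[keep]

def prune_v_independently(W_v, W_proj_head, keep_ratio=0.75):
    """Prune value neurons on their own to reduce inter-head redundancy.

    W_v: (head_dim, embed_dim) value weights of one head;
    W_proj_head: (embed_dim, head_dim) slice of the output projection that
    consumes this head. The matching projection columns are removed so that
    shapes still agree after pruning.
    """
    importance = W_v.norm(dim=1)                            # placeholder criterion
    n_keep = max(1, int(keep_ratio * W_v.shape[0]))
    keep = importance.topk(n_keep).indices.sort().values
    return W_v[keep], W_proj_head[:, keep]
```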

The results of applying SNP to Transformer-based models are promising. For example, DeiT-Small compressed with SNP runs 3.1 times faster than the original model while being 21.94% faster and 1.12% more accurate than DeiT-Tiny. Additionally, SNP can be combined with conventional head or block pruning approaches, yielding significant reductions in parameters and computational cost and faster inference on a range of hardware platforms.

Overall, SNP presents a novel approach to compressing and accelerating Vision Transformers by exploiting the structure of the MSA module. By selectively pruning neurons and eliminating redundancy, it makes ViTs more suitable for deployment on resource-limited edge devices while also speeding up inference on server processors.

Expert Insights:

As an expert in the field, I find the proposed SNP method to be a valuable contribution to the optimization of Vision Transformers. The use of structured neuron-level pruning, which takes into account the graph connections within the MSA module, helps to identify and remove redundant information while preserving overall attention scores. This not only leads to significant computational cost reduction but also improves inference speed without sacrificing performance.
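To make "preserving overall attention scores" concrete, one way to probe it is to compare a head's attention map before and after dropping query/key neuron pairs. The toy check below is my illustration of that idea under random inputs, not the paper's procedure; SNP would choose which pairs to drop so that this change stays small.

```python
import torch

def attention_map(q, k):
    """Softmax attention scores for one head; q, k: (tokens, head_dim)."""
    return torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)

torch.manual_seed(0)
q, k = torch.randn(197, 64), torch.randn(197, 64)   # 197 tokens, head_dim 64 (DeiT-like)
keep = torch.arange(48)                              # pretend we keep 48 of 64 Q/K neuron pairs
full = attention_map(q, k)
pruned = attention_map(q[:, keep], k[:, keep])
print("mean absolute change in attention scores:", (full - pruned).abs().mean().item())
```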

The results presented, such as the 3.1 times faster inference speed of DeiT-Small with SNP compared to the original model, demonstrate the effectiveness of the proposed method. Moreover, the successful combination of SNP with head or block pruning approaches further highlights its versatility and potential for even greater compression and speed improvements.

With the increasing demand for deploying vision models on edge devices and the need for efficient use of server processors, techniques like SNP are crucial for making Vision Transformers more practical and accessible. The ability to compress and accelerate such models without compromising their performance opens up new possibilities for a wide range of applications, including real-time computer vision tasks and resource-constrained scenarios.

I believe that the SNP method has the potential to inspire further research in pruning techniques for Vision Transformers, which can lead to the development of more optimized and efficient models. Additionally, future work could explore the application of SNP to other attention-based models or investigate the impact of different pruning strategies on specific vision tasks to identify the most effective combinations.

Overall, the proposed SNP method addresses the challenges of computational cost and memory footprint in Vision Transformers by leveraging structured neuron-level pruning. This approach shows promising results in terms of speed improvement and parameter reduction, making ViTs more suitable for deployment on resource-constrained devices while maintaining or even enhancing performance.