arXiv:2403.18063v1
Abstract: Transformers used in vision have been investigated through diverse architectures such as ViT, PVT, and Swin. These have worked to improve the attention mechanism and make it more efficient. Separately, the need to include local information led to incorporating convolutions in transformers such as CPVT and CvT. Global information is captured using a complex Fourier basis to achieve global token mixing through various methods, such as AFNO, GFNet, and Spectformer. We advocate combining three diverse views of data: local, global, and long-range dependence. We also investigate the simplest global representation using only the real domain spectral representation, obtained through the Hartley transform. We use a convolutional operator in the initial layers to capture local information. Through these two contributions, we are able to optimize and obtain a spectral convolution transformer (SCT) that provides improved performance over the state-of-the-art methods while reducing the number of parameters. Through extensive experiments, we show that SCT-C-small gives state-of-the-art performance on the ImageNet dataset and reaches 84.5% top-1 accuracy, while SCT-C-Large reaches 85.9% and SCT-C-Huge reaches 86.4%. We evaluate SCT on transfer learning on datasets such as CIFAR-10, CIFAR-100, Oxford Flower, and Stanford Car. We also evaluate SCT on downstream tasks, i.e., instance segmentation on the MSCOCO dataset. The project page is available at https://github.com/badripatro/sct

The Multidisciplinary Nature of Spectral Convolution Transformers

In recent years, transformers have become a popular choice for various tasks in the field of multimedia information systems, including computer vision. This article discusses the advancements made in transformer architectures for vision tasks, specifically focusing on the incorporation of convolutions and spectral representations.

Transformers, originally introduced for natural language processing, have shown promising results in vision tasks as well. Vision Transformer (ViT), PVT, and Swin are among the architectures that have refined the attention mechanism and made it more efficient. Attention alone, however, does not explicitly model local structure, which motivated architectures such as CPVT and CvT that bring convolutions into the transformer.

In addition to local information, capturing global information is also crucial in vision tasks. Several methods achieve global token mixing with a complex Fourier basis: AFNO, GFNet, and Spectformer all mix tokens in the frequency domain. The authors advocate combining three views of the data (local, global, and long-range dependence) rather than relying on any one of them.
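To make "global token mixing with a complex Fourier basis" concrete, here is a minimal NumPy sketch in the spirit of frequency-domain mixers. It is an illustration under stated assumptions, not the implementation of AFNO, GFNet, or Spectformer: the function name `fourier_token_mixing`, the filter shape, and the random filter values are all hypothetical, and a real model would learn the filter.

```python
import numpy as np


def fourier_token_mixing(tokens, freq_filter):
    """Mix tokens globally by filtering in the 2D Fourier domain.

    tokens:      real array of shape (H, W, C), a grid of H*W tokens with C channels
    freq_filter: complex array of shape (H, W, C), one weight per frequency and channel
                 (hypothetical stand-in for a learned filter)
    """
    spectrum = np.fft.fft2(tokens, axes=(0, 1))       # complex spectrum over the token grid
    filtered = spectrum * freq_filter                  # element-wise product in frequency space
    return np.fft.ifft2(filtered, axes=(0, 1)).real    # back to token space, keep the real part


rng = np.random.default_rng(0)
H, W, C = 8, 8, 4
tokens = rng.standard_normal((H, W, C))

# An all-ones filter leaves the tokens unchanged.
identity = np.ones((H, W, C), dtype=complex)
same = fourier_token_mixing(tokens, identity)

# With a generic filter, changing ONE token changes the output at every position
# of that channel: each frequency coefficient depends on all tokens.
random_filter = rng.standard_normal((H, W, C)) + 1j * rng.standard_normal((H, W, C))
bumped = tokens.copy()
bumped[0, 0, 0] += 1.0
delta = (fourier_token_mixing(bumped, random_filter)
         - fourier_token_mixing(tokens, random_filter))
```

The point of the sketch is that a single element-wise multiply in the frequency domain couples every token with every other token in the same channel, which is why these methods need complex-valued weights and intermediate tensors.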

The paper investigates the simplest form of global representation: a spectral representation that stays entirely in the real domain, obtained through the Hartley transform. Local information is captured by a convolutional operator in the initial layers. Together, these two contributions yield a new transformer architecture, the Spectral Convolution Transformer (SCT).
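The Hartley transform can be computed from the FFT as Re(FFT) minus Im(FFT), and its output is real for real input. The NumPy sketch below shows this, along with one possible way to use it for token mixing. The helper names `hartley2d` and `hartley_token_mixing` are hypothetical, and the element-wise filter is an assumption made for illustration rather than the SCT authors' actual layer.

```python
import numpy as np


def hartley2d(x):
    """2D discrete Hartley transform of a real array over its first two axes.

    Uses the identity DHT(x) = Re(FFT(x)) - Im(FFT(x)), which corresponds to the
    kernel cas(t) = cos(t) + sin(t). The result is real-valued.
    """
    spectrum = np.fft.fft2(x, axes=(0, 1))
    return spectrum.real - spectrum.imag


def hartley_token_mixing(tokens, real_filter):
    """Illustrative global mixing with real-valued weights only (assumed design).

    tokens:      real array of shape (H, W, C)
    real_filter: real array of shape (H, W, C), hypothetical stand-in for learned weights
    """
    n = tokens.shape[0] * tokens.shape[1]
    h = hartley2d(tokens) * real_filter   # filter in the Hartley domain
    return hartley2d(h) / n               # the DHT is its own inverse up to a factor of H*W


rng = np.random.default_rng(0)
H, W, C = 8, 8, 4
tokens = rng.standard_normal((H, W, C))

h = hartley2d(tokens)                      # real-valued spectrum
roundtrip = hartley2d(h) / (H * W)         # applying the transform twice recovers the input
mixed = hartley_token_mixing(tokens, rng.standard_normal((H, W, C)))
```

Because the transform and the weights are real, no complex tensors are needed. Note that an element-wise product in the Hartley domain is a global linear mixing but is not exactly a circular convolution, unlike the Fourier case.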

According to the authors, SCT improves on state-of-the-art methods while reducing the number of parameters. On ImageNet, they report top-1 accuracies of 84.5% for SCT-C-small, 85.9% for SCT-C-Large, and 86.4% for SCT-C-Huge. They also evaluate SCT on transfer learning using CIFAR-10, CIFAR-100, Oxford Flower, and Stanford Car, and on a downstream task, instance segmentation on the MSCOCO dataset.

The multidisciplinary nature of this research is noteworthy. It combines concepts from computer vision, artificial intelligence, information systems, and signal processing. By integrating convolutions and spectral representations into transformers, the authors bring classical signal-processing tools into a modern vision backbone.

As multimedia information systems continue to evolve, transformer architectures like SCT may open up new possibilities in animation, artificial reality, augmented reality, and virtual reality. These fields rely heavily on efficient and effective processing of visual data, so more accurate backbones with fewer parameters could influence how such systems are built and used.

In conclusion, spectral convolution transformers are a notable development for multimedia information systems. Combining convolutions with a real-valued spectral representation lets the model use local, global, and long-range dependence information, which the authors report leads to improved performance with fewer parameters. Further exploration and application of these architectures hold promise for multimedia applications such as animation, artificial reality, augmented reality, and virtual reality.

References:

  • ViT: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
  • PVT: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
  • Swin: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  • CPVT: Conditional Positional Encodings for Vision Transformers
  • CvT: CvT: Introducing Convolutions to Vision Transformers
  • AFNO: Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers
  • GFNet: Global Filter Networks for Image Classification
  • Spectformer: SpectFormer: Frequency and Attention is what you need in a Vision Transformer
