“Spectral Convolution Transformers: Enhancing Vision with Local, Global, and Long-Range Dependence”

arXiv:2403.18063v1 Announce Type: cross
Abstract: Transformers used in vision have been investigated through diverse architectures – ViT, PVT, and Swin. These have worked to improve the attention mechanism and make it more efficient. Differently, the need for including local information was felt, leading to incorporating convolutions in transformers such as CPVT and CvT. Global information is captured using a complex Fourier basis to achieve global token mixing through various methods, such as AFNO, GFNet, and Spectformer. We advocate combining three diverse views of data – local, global, and long-range dependence. We also investigate the simplest global representation using only the real domain spectral representation – obtained through the Hartley transform. We use a convolutional operator in the initial layers to capture local information. Through these two contributions, we are able to optimize and obtain a spectral convolution transformer (SCT) that provides improved performance over the state-of-the-art methods while reducing the number of parameters. Through extensive experiments, we show that SCT-C-small gives state-of-the-art performance on the ImageNet dataset and reaches 84.5% top-1 accuracy, while SCT-C-Large reaches 85.9% and SCT-C-Huge reaches 86.4%. We evaluate SCT on transfer learning on datasets such as CIFAR-10, CIFAR-100, Oxford Flower, and Stanford Car. We also evaluate SCT on downstream tasks i.e. instance segmentation on the MSCOCO dataset. The project page is available on this webpage.url{https://github.com/badripatro/sct}

The Multidisciplinary Nature of Spectral Convolution Transformers

In recent years, transformers have become a popular choice for various tasks in the field of multimedia information systems, including computer vision. This article discusses the advancements made in transformer architectures for vision tasks, specifically focusing on the incorporation of convolutions and spectral representations.

Transformers, originally introduced for natural language processing, have shown promising results in vision tasks as well. Vision Transformer (ViT), PVT, and Swin are among the architectures that have improved the attention mechanism and made it more efficient. However, researchers recognized the need to include local information alongside attention, which led to the development of CPVT and CvT, transformer architectures that incorporate convolutions.

In addition to local information, capturing global information is also crucial in vision tasks. Various methods have been proposed to achieve global token mixing, including using a complex Fourier basis. Architectures like AFNO, GFNet, and Spectformer have implemented this global mixing of information. The combination of local, global, and long-range dependence views of data has proven to be effective in improving performance.

This article focuses on investigating the simplest form of global representation: the real-domain spectral representation obtained through the Hartley transform. A convolutional operator in the initial layers captures local information. Together, these two contributions led to a new transformer architecture called the Spectral Convolution Transformer (SCT).
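
To make the "simplest global representation" concrete, the real-valued Hartley transform can be obtained directly from the FFT, and a global token mixer can be sketched around it. This is an illustrative sketch, not SCT's actual layer: the single-channel token grid, the element-wise real `weights`, and the normalization are assumptions made here.

```python
import numpy as np

def hartley_transform_2d(x):
    """2D discrete Hartley transform computed from the FFT:
    H = Re(F) - Im(F). The output is entirely real-valued."""
    f = np.fft.fft2(x)
    return f.real - f.imag

def hartley_token_mix(tokens, weights):
    """Toy global token mixing: transform, scale by learnable real
    weights, transform back. The Hartley transform is its own inverse
    up to a factor of H * W, so no separate inverse routine is needed."""
    h, w = tokens.shape
    mixed = hartley_transform_2d(tokens) * weights
    return hartley_transform_2d(mixed) / (h * w)
```

Because the transform is an involution up to scale, the entire mixing pipeline stays in the real domain, which is exactly what makes it attractive as a lighter alternative to complex Fourier mixing.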

SCT has shown improved performance over state-of-the-art methods while also reducing the number of parameters. The results on the ImageNet dataset are impressive, with SCT-C-small achieving 84.5% top-1 accuracy, SCT-C-Large reaching 85.9%, and SCT-C-Huge reaching 86.4%. The authors have also evaluated SCT on transfer learning tasks using datasets like CIFAR-10, CIFAR-100, Oxford Flower, and Stanford Car. Additionally, SCT has been tested on downstream tasks such as instance segmentation on the MSCOCO dataset.

The multidisciplinary nature of this research is noteworthy. It combines concepts from various fields such as computer vision, artificial intelligence, information systems, and signal processing. By integrating convolutions and spectral representations into transformers, the authors have pushed the boundaries of what transformers can achieve in vision tasks.

As multimedia information systems continue to evolve, the innovations in transformer architectures like SCT open up new possibilities for advancements in animations, artificial reality, augmented reality, and virtual realities. These fields heavily rely on efficient and effective processing of visual data, and transformer architectures have the potential to revolutionize how these systems are developed and utilized.

In conclusion, the introduction of spectral convolution transformers is an exciting development in the field of multimedia information systems. The combination of convolutions and spectral representations allows for the incorporation of local, global, and long-range dependence information, leading to improved performance and reduced parameters. Further exploration and application of these architectures hold great promise for multimedia applications such as animations, artificial reality, augmented reality, and virtual realities.

References:

  • ViT: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
  • PVT: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
  • Swin: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  • CPVT: Conditional Positional Encodings for Vision Transformers
  • CvT: CvT: Introducing Convolutions to Vision Transformers
  • AFNO: Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers
  • GFNet: Global Filter Networks for Image Classification
  • Spectformer: SpectFormer: Frequency and Attention is what you need in a Vision Transformer

Read the original article

“Optimizing RF Receiver Performance with Circuit-centric Genetic Algorithm”

This paper presents a highly efficient method for optimizing parameters in analog/high-frequency circuits, specifically targeting the performance parameters of a radio-frequency (RF) receiver. The goal is to maximize the receiver’s performance by reducing power consumption and noise figure while increasing conversion gain. The authors propose a novel approach called the Circuit-centric Genetic Algorithm (CGA) to address the limitations observed in the traditional Genetic Algorithm (GA).

One of the key advantages of the CGA is its simplicity and computational efficiency compared to existing deep learning models. Deep learning models often require significant computational resources and extensive training data, which may not always be readily available in the context of analog/high-frequency circuit optimization. The CGA, on the other hand, offers a simpler inference process that can more effectively leverage available circuit parameters to optimize the performance of the RF receiver.

Furthermore, the CGA offers significant advantages over manual design and the conventional GA in finding optimal operating points. Manual design is a time-consuming, iterative process in which the designer experiments with many circuit parameters to identify the best combination. The conventional GA, while automated, can still be computationally expensive and does not always converge to superior optima. The CGA, with its circuit-centric approach, aims to reduce the designer’s workload by automating the search for the best parameter values while also improving the likelihood of finding superior optimum points.
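
The commentary does not spell out the CGA's operators, so the sketch below is a plain genetic algorithm over three hypothetical circuit knobs (bias current, device width, load resistance), scored by a made-up surrogate figure of merit that rewards conversion gain and penalizes noise figure and power. A real flow would replace the surrogate with circuit-simulator results; all formulas and ranges here are illustrative assumptions, not the paper's CGA.

```python
import random

def surrogate_metrics(bias_ma, width_um, r_load_kohm):
    """Hypothetical surrogate mapping circuit knobs to performance;
    a real flow would call a circuit simulator here instead."""
    gain = 10.0 * (bias_ma * r_load_kohm) ** 0.5   # conversion gain grows with bias * load
    noise_figure = 8.0 / (1.0 + width_um / 20.0)   # NF drops as device width grows
    power = bias_ma * 1.2                          # power grows with bias current
    return gain, noise_figure, power

def fitness(ind):
    gain, nf, power = surrogate_metrics(*ind)
    return gain - 2.0 * nf - 0.5 * power

def evolve(pop_size=20, generations=50, seed=0):
    rng = random.Random(seed)
    bounds = [(0.1, 10.0), (1.0, 100.0), (0.1, 5.0)]  # bias (mA), width (um), R_load (kOhm)
    def clamp(v, lo, hi):
        return min(max(v, lo), hi)
    pop = [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]     # elitism: keep the top half unchanged
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            # Averaging crossover plus Gaussian mutation, clamped to bounds.
            children.append(tuple(
                clamp((x + y) / 2 + rng.gauss(0, 0.1 * (hi - lo)), lo, hi)
                for (x, y), (lo, hi) in zip(zip(a, b), bounds)))
        pop = parents + children
    return max(pop, key=fitness)
```

Calling `evolve()` returns the best (bias, width, load) triple found under the surrogate; elitism guarantees the best fitness never decreases across generations.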

Looking ahead, it would be interesting to see the CGA being applied to more complex analog/high-frequency circuits beyond RF receivers. The authors demonstrate the feasibility of the method in optimizing a receiver, but its potential application in other circuit types could greatly benefit the field. Additionally, future research could explore the combination of CGA with other optimization techniques, further enhancing its efficiency and effectiveness in tuning circuit parameters.

Read the original article

“New Approach for Multi-Sound Source Localization Without Prior Knowledge”

arXiv:2403.17420v1 Announce Type: new
Abstract: The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources. To achieve this goal, we propose an iterative object identification (IOI) module, which can recognize sound-making objects in an iterative manner. After finding the regions of sound-making objects, we devise object similarity-aware clustering (OSC) loss to guide the IOI module to effectively combine regions of the same object but also distinguish between different objects and backgrounds. It enables our method to perform accurate localization of sound-making objects without any prior knowledge. Extensive experimental results on the MUSIC and VGGSound benchmarks show the significant performance improvements of the proposed method over the existing methods for both single and multi-source. Our code is available at: https://github.com/VisualAIKHU/NoPrior_MultiSSL

Expert Commentary: Advancements in Multi-Sound Source Localization

Multi-sound source localization is a crucial task in the field of multimedia information systems, as it enables the identification and localization of sound sources in a given environment. The ability to accurately localize sound sources has wide-ranging applications, including audio scene analysis, surveillance systems, and virtual reality experiences.

The mentioned article introduces a novel method for multi-sound source localization that overcomes the limitation of requiring prior knowledge about the number of sound sources to be separated. This is a significant advancement, as it allows for more flexible and adaptable localization in real-world scenarios where prior information is often unavailable.

One notable feature of the proposed method is the iterative object identification (IOI) module. This module leverages an iterative approach to identify sound-making objects in the mixture. By iteratively refining the object identification process, the method can improve the accuracy of localization without the need for prior knowledge. This iterative approach is a testament to the multi-disciplinary nature of this research, combining concepts from signal processing, machine learning, and computer vision.

To further enhance the accuracy of localization, the authors introduce the object similarity-aware clustering (OSC) loss. This loss function guides the IOI module to effectively combine regions of the same object while also distinguishing between different objects and backgrounds. By incorporating object similarity awareness into the clustering process, the proposed method achieves better discrimination and localization performance.
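
The exact OSC formulation is not reproduced in the commentary; the sketch below captures the stated intent with a cosine-similarity contrastive loss over region features, pulling same-object pairs together and pushing different-object (or background) pairs below a margin. The feature shapes and the margin value are assumptions, not the paper's definition.

```python
import numpy as np

def osc_style_loss(features, labels, margin=0.5):
    """Hedged sketch of an object-similarity-aware clustering loss:
    same-label region features should have high cosine similarity,
    different-label pairs should fall below `margin`."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T
    n = len(labels)
    loss, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                loss += 1.0 - sim[i, j]               # pull same-object regions together
            else:
                loss += max(0.0, sim[i, j] - margin)  # push different objects/background apart
            pairs += 1
    return loss / pairs
```

The loss is zero when same-object regions align perfectly and cross-object similarities stay under the margin, which is the behavior the IOI module needs for grouping regions without knowing the source count.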

The experimental results on the MUSIC and VGGSound benchmarks demonstrate the significant performance improvements of the proposed method over existing methods for both single and multi-source localization. This suggests that the method can accurately identify and localize sound sources in various scenarios, making it suitable for real-world applications.

In the wider field of multimedia information systems, the advancements in multi-sound source localization have implications for the fields of animations, artificial reality, augmented reality, and virtual realities. Accurate localization of sound sources in these contexts can greatly enhance the immersive experiences and realism of multimedia content. For example, in virtual reality applications, precise localization of virtual sound sources can create a more realistic and engrossing environment for users.

In conclusion, the proposed method for multi-sound source localization without prior knowledge in the mentioned article showcases the continual progress in the field of multimedia information systems. The multi-disciplinary nature of this research, alongside the significant performance improvements, paves the way for enhanced multimedia experiences in various domains, including animations, artificial reality, augmented reality, and virtual realities.

Read the original article

“Evaluating AI Systems in Medicine Without Ground-Truth Annotations: Introducing the SUDO Framework”

Artificial intelligence (AI) systems are being increasingly used in the medical field to assist in diagnosis and treatment decisions. However, one of the challenges in evaluating the performance of these AI systems is the lack of ground-truth annotations in real-world data. This means that when the AI system is deployed in a clinical setting and encounters data that is different from the data it was trained on, it may not perform as expected.

In this article, the authors introduce a framework called SUDO, which stands for Supervised to Unsupervised Data Optimization. SUDO addresses the issue of evaluating AI systems without ground-truth annotations by assigning temporary labels to data points in the wild. Multiple models are then trained, one per candidate temporary label, and the label whose model performs best is taken as the most likely one.
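
The labeling loop described above can be sketched with a deliberately tiny stand-in model, here a nearest-centroid classifier. The single-point update, the accuracy criterion, and the data shapes are illustrative assumptions for exposition, not the paper's exact protocol.

```python
import numpy as np

def centroid_accuracy(train_x, train_y, val_x, val_y):
    """Tiny stand-in model: nearest-centroid classifier accuracy
    on a held-out labeled validation set."""
    classes = np.unique(train_y)
    cents = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    dists = ((val_x[:, None, :] - cents[None]) ** 2).sum(-1)
    preds = classes[np.argmin(dists, axis=1)]
    return (preds == val_y).mean()

def sudo_label(point, train_x, train_y, val_x, val_y):
    """SUDO-style pseudo-labelling sketch: try each candidate label for
    an unlabeled point, retrain, and keep the label whose model scores
    best on held-out labeled data."""
    scores = {}
    for c in np.unique(train_y):
        aug_x = np.vstack([train_x, point])
        aug_y = np.append(train_y, c)
        scores[c] = centroid_accuracy(aug_x, aug_y, val_x, val_y)
    return max(scores, key=scores.get)
```

The intuition is that a wrong temporary label drags its class centroid toward the new point and degrades held-out accuracy, so the surviving label is the plausible one.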

The authors conducted experiments using AI systems developed for dermatology images, histopathology patches, and clinical reports. They found that SUDO can reliably assess model performance and identify unreliable predictions. By triaging unreliable predictions for further inspection, SUDO can help improve the integrity of research findings and the deployment of ethical AI systems in medicine.

One of the key benefits of SUDO is its ability to assess algorithmic bias in AI systems without ground-truth annotations. Algorithmic bias, where an AI system produces unfair or discriminatory outcomes, is a growing concern in healthcare. By using SUDO to evaluate algorithmic bias, researchers and developers can gain insights into potential biases in AI systems and take steps to address them.

This framework has the potential to significantly enhance the evaluation and deployment of AI systems in the medical field. By providing a reliable proxy for model performance and enabling the assessment of algorithmic bias, SUDO can help ensure the safety, reliability, and ethical use of AI systems in healthcare.

Read the original article

Efficient Network-Assisted Video Streaming for High-Resolution Content

arXiv:2403.16951v1 Announce Type: new
Abstract: Multimedia applications, mainly video streaming services, are currently the dominant source of network load worldwide. In recent Video-on-Demand (VoD) and live video streaming services, traditional streaming delivery techniques have been replaced by adaptive solutions based on the HTTP protocol. Current trends toward high-resolution (e.g., 8K) and/or low-latency VoD and live video streaming pose new challenges to end-to-end (E2E) bandwidth demand and have stringent delay requirements. To do this, video providers typically rely on Content Delivery Networks (CDNs) to ensure that they provide scalable video streaming services. To support future streaming scenarios involving millions of users, it is necessary to increase the CDNs’ efficiency. It is widely agreed that these requirements may be satisfied by adopting emerging networking techniques to present Network-Assisted Video Streaming (NAVS) methods. Motivated by this, this thesis goes one step beyond traditional pure client-based HAS algorithms by incorporating (an) in-network component(s) with a broader view of the network to present completely transparent NAVS solutions for HAS clients.

Expert Commentary:

This article discusses the challenges faced by multimedia applications, specifically video streaming services, in terms of network load and delivery techniques. With the increasing popularity of high-resolution and low-latency video streaming, there is a need to ensure sufficient bandwidth and minimize delays. Content Delivery Networks (CDNs) have been utilized to support these streaming scenarios and provide scalable video streaming services.

However, as the demand for streaming services continues to grow and involve millions of users, CDNs need to become more efficient. This is where the concept of Network-Assisted Video Streaming (NAVS) methods comes into play. By incorporating in-network components with a broader view of the network, NAVS solutions can enhance the performance of HTTP-based adaptive streaming (HAS) algorithms used by clients.
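
As background for what the in-network components assist, a HAS client's simplest rate decision is throughput-based: pick the highest rung of the bitrate ladder that fits under a safety margin of the measured throughput. The ladder values and safety factor below are illustrative assumptions, not part of the thesis; a NAVS deployment could refine the throughput estimate with in-network measurements, while the sketch here is purely client-side.

```python
def select_bitrate(throughput_kbps, ladder_kbps, safety=0.8):
    """Toy throughput-based HAS rate selection: choose the highest
    representation whose bitrate fits within a safety fraction of the
    measured throughput; fall back to the lowest rung otherwise."""
    budget = throughput_kbps * safety
    eligible = [b for b in sorted(ladder_kbps) if b <= budget]
    return eligible[-1] if eligible else min(ladder_kbps)

# Example ladder (kbps) loosely resembling a DASH representation set.
ladder = [300, 750, 1500, 3000, 6000]
```

With 4000 kbps of measured throughput, the 0.8 safety factor yields a 3200 kbps budget, so the player selects the 3000 kbps representation rather than risking stalls at 6000 kbps.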

The multi-disciplinary nature of this concept lies in the combination of networking techniques and multimedia information systems. It is not just about optimizing delivery techniques, but also considering the overall network infrastructure to improve the quality of video streaming services.

This article highlights the importance of adopting emerging networking techniques and implementing NAVS solutions to address the bandwidth and delay requirements of modern video streaming services. It is a step forward in the evolution of multimedia systems, as it combines the fields of networking, multimedia, and information systems.

In relation to animations, artificial reality, augmented reality, and virtual realities, the concept of NAVS can play a significant role in enhancing the delivery of multimedia content in these scenarios. As these technologies heavily rely on real-time and high-quality streaming, optimizing the network infrastructure through NAVS solutions can greatly improve the overall user experience.

Overall, the article brings attention to the need for efficient content delivery in multimedia applications and proposes the adoption of NAVS methods as a solution. By incorporating networking techniques and considering the wider context of the network, it aims to improve video streaming services and meet the growing demands of the industry.

Read the original article

Detecting Psychological Stressors in Persian Tweets: A Capsule Based Approach

Analysis of the Study on Detecting Psychological Stress from Persian Tweets

As an expert commentator, I will delve into the details and implications of a study that focuses on detecting psychological stress related to suicide from Persian tweets using learning-based methods. The study highlights the significance of identifying psychological stressors in an at-risk population, as it can potentially contribute to early prevention of suicidal behaviors.

The researchers acknowledge the growing popularity and widespread use of social media platforms, particularly Twitter, as a means of real-time information sharing. This provides a unique opportunity for early intervention and detection of psychological stressors in both large and small populations. However, most of the existing research in this area has focused on non-Persian languages, thereby limiting the applicability of the findings to Persian-speaking individuals.

The proposed approach in this study utilizes a capsule-based method to extract and classify psychological stressors from Persian tweets. Capsule networks have shown promise in various natural language processing tasks, and their application in this context can potentially yield valuable insights.
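
For readers unfamiliar with capsule networks, their characteristic non-linearity is the "squash" function from Sabour et al. (2017), which capsule-based classifiers like the one discussed presumably build on. The sketch below is the standard formulation, not code from the study.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Capsule 'squash' non-linearity: scales a vector's length into
    [0, 1) while preserving its direction, so the length can serve as
    the probability that the capsule's entity is present."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)
```

For a stressor-detection capsule, a squashed output vector with length near 1 would signal a confidently detected stressor while keeping the direction free to encode its attributes.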

The results of the study reveal a binary classification accuracy of 0.83, indicating that the capsule-based approach is effective in detecting psychological stress related to suicide in Persian tweets. This level of accuracy is promising and suggests the potential usefulness of machine learning techniques in identifying individuals at risk of suicidal tendencies.

By training the model on a large dataset of Persian tweets, the researchers have been able to achieve a relatively high accuracy in detecting psychological stress. This highlights the importance of utilizing a comprehensive and diverse dataset to develop robust machine learning models.

Further research in this area could focus on refining the capsule-based approach and exploring additional linguistic features specific to Persian tweets that could enhance the accuracy of the classification. Additionally, investigating the generalizability of the model to other Persian-speaking populations in different cultural contexts would be a valuable direction for future studies.

In conclusion, this study demonstrates the potential of utilizing learning-based methods, specifically capsule networks, to detect psychological stress from Persian tweets. The findings contribute to the field of suicide prevention by highlighting the importance of early intervention and leveraging social media platforms for identifying individuals at risk of suicidal behaviors. Further research is needed to refine and expand upon these techniques for better detection and prevention of suicide in Persian-speaking populations.

Read the original article