“Introducing ConvBench: A New Benchmark for Evaluating Large Vision-Language Models in Multi-Turn Conversations”

arXiv:2403.20194v1 Announce Type: new
Abstract: This paper presents ConvBench, a novel multi-turn conversation evaluation benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing benchmarks that assess individual capabilities in single-turn dialogues, ConvBench adopts a three-level multimodal capability hierarchy, mimicking human cognitive processes by stacking up perception, reasoning, and creativity. Each level focuses on a distinct capability, mirroring the cognitive progression from basic perception to logical reasoning and ultimately to advanced creativity. ConvBench comprises 577 meticulously curated multi-turn conversations encompassing 215 tasks reflective of real-world demands. Automatic evaluations quantify response performance at each turn and overall conversation level. Leveraging the capability hierarchy, ConvBench enables precise attribution of conversation mistakes to specific levels. Experimental results reveal a performance gap between multi-modal models, including GPT4-V, and human performance in multi-turn conversations. Additionally, weak fine-grained perception in multi-modal models contributes to reasoning and creation failures. ConvBench serves as a catalyst for further research aimed at enhancing visual dialogues.

ConvBench: A Multi-Turn Conversation Evaluation Benchmark for Large Vision-Language Models

In the field of multimedia information systems, the development of Large Vision-Language Models (LVLMs) has gained significant attention. These models are designed to understand and generate text while also incorporating visual information. ConvBench, a novel benchmark presented in this paper, focuses on evaluating the performance of LVLMs in multi-turn conversations.

Unlike existing benchmarks that assess the capabilities of models in single-turn dialogues, ConvBench takes a multi-level approach. It mimics the cognitive processes of humans by dividing the evaluation into three levels: perception, reasoning, and creativity. This multi-modal capability hierarchy allows for a more comprehensive assessment of LVLM performance.

ConvBench comprises 577 carefully curated multi-turn conversations, covering 215 real-world tasks. Each conversation is automatically evaluated at every turn, as well as at the overall conversation level. This precise evaluation enables researchers to attribute mistakes to specific levels, facilitating a deeper understanding of model performance.
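
To make the per-turn scoring and error-attribution idea concrete, here is a minimal Python sketch. The TurnResult structure, the 0.5 pass threshold, and the mean-based overall score are illustrative assumptions rather than ConvBench's actual judge-based protocol; the sketch only shows how a conversation failure can be attributed to the lowest capability level whose turn falls short.

```python
from dataclasses import dataclass
from typing import List, Optional

LEVELS = ["perception", "reasoning", "creativity"]  # ConvBench's three-level hierarchy

@dataclass
class TurnResult:
    level: str    # which capability this turn targets
    score: float  # judge score in [0, 1]; the scoring backend is an assumption

def attribute_failure(turns: List[TurnResult], threshold: float = 0.5) -> Optional[str]:
    """Return the lowest capability level whose turn falls below the threshold.

    Mirrors the idea that a weak perception turn can explain downstream
    reasoning/creativity failures; the 0.5 threshold is illustrative.
    """
    for turn in sorted(turns, key=lambda t: LEVELS.index(t.level)):
        if turn.score < threshold:
            return turn.level
    return None  # no level failed

def conversation_score(turns: List[TurnResult]) -> float:
    """Overall conversation score as the mean of per-turn scores (an assumption)."""
    return sum(t.score for t in turns) / len(turns)

# Example: a conversation that fails at the perception level
turns = [TurnResult("perception", 0.3),
         TurnResult("reasoning", 0.6),
         TurnResult("creativity", 0.7)]
print(attribute_failure(turns))   # -> "perception"
print(conversation_score(turns))  # -> 0.533...
```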

Experiments conducted with ConvBench reveal a clear performance gap between multi-modal models, including GPT-4V, and humans in multi-turn conversations. This suggests there is still substantial room for improvement in LVLMs; in particular, weak fine-grained perception contributes to downstream failures in reasoning and creativity.

The concepts presented in ConvBench have far-reaching implications in the wider field of multimedia information systems. By incorporating both visual and textual information, LVLMs have the potential to revolutionize various applications such as animations, artificial reality, augmented reality, and virtual reality. These technologies heavily rely on the seamless integration of visuals and language, and ConvBench provides a benchmark for evaluating and improving the performance of LVLMs in these domains.

Furthermore, the multi-disciplinary nature of ConvBench, with its combination of perception, reasoning, and creativity, highlights the complex cognitive processes involved in human conversation. By studying and enhancing these capabilities in LVLMs, researchers can advance the field of artificial intelligence and develop models that come closer to human-level performance in engaging and meaningful conversations.

Conclusion

ConvBench is a pioneering multi-turn conversation evaluation benchmark that provides deep insights into the performance of Large Vision-Language Models. With its multi-modal capability hierarchy and carefully curated conversations, ConvBench enables precise evaluation and attribution of errors. The results of ConvBench experiments reveal the existing performance gap and the need for improvement in multi-modal models. The concepts presented in ConvBench have significant implications for multimedia information systems, animations, artificial reality, augmented reality, and virtual reality. By advancing LVLMs, researchers can pave the way for more engaging and meaningful interactions between humans and machines.

Read the original article

“Exploring the Role of Language and Vision in Learning: Insights from Vision-Language Models”

Language and vision are undoubtedly two essential components of human intelligence. While humans have traditionally been the only example of intelligent beings, recent developments in artificial intelligence have provided us with new opportunities to study the contributions of language and vision to learning about the world. Through the creation of sophisticated Vision-Language Models (VLMs), researchers have gained insights into the role of these modalities in understanding the visual world.

The study discussed in this article focused on examining the impact of language on learning tasks using VLMs. By systematically removing different components from the cognitive architecture of these models, the researchers aimed to identify the specific contributions of language and vision to the learning process. Notably, they found that even without visual input, a language model leveraging all components was able to recover a majority of the VLM’s performance.

This finding suggests that language plays a crucial role in accessing prior knowledge and reasoning, enabling learning from limited data. It highlights the power of language in facilitating the transfer of knowledge and abstract understanding without relying solely on visual input. This insight not only has implications for the development of AI systems but also provides a deeper understanding of how humans utilize language to make sense of the visual world.
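
To see how the ablation methodology described above works in practice, the following Python sketch evaluates a model with every subset of components disabled and compares the resulting scores. The component names and the toy evaluator are hypothetical placeholders; the actual cognitive architecture and tasks in the study differ.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, List

# Hypothetical component names; the paper's cognitive architecture is different.
COMPONENTS = ["vision_encoder", "prior_knowledge", "reasoning_module"]

def ablation_study(evaluate: Callable[[List[str]], float],
                   components: Iterable[str] = COMPONENTS) -> Dict[tuple, float]:
    """Evaluate the model with every subset of components disabled.

    `evaluate(disabled)` is assumed to return task accuracy with the listed
    components removed; comparing scores isolates each component's contribution.
    """
    results = {}
    comps = list(components)
    for r in range(len(comps) + 1):
        for disabled in combinations(comps, r):
            results[disabled] = evaluate(list(disabled))
    return results

# Toy stand-in evaluator: accuracy drops by a fixed amount per disabled component.
toy_contribution = {"vision_encoder": 0.10, "prior_knowledge": 0.25, "reasoning_module": 0.20}
scores = ablation_study(lambda disabled: 0.9 - sum(toy_contribution[c] for c in disabled))
print(scores[()])                   # full model
print(scores[("vision_encoder",)])  # "language-only" variant with vision removed
```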

Moreover, this research leads us to ponder the broader implications of the relationship between language and vision in intelligence. How does language influence our perception and interpretation of visual information? Can language shape our understanding of the world even in the absence of direct sensory experiences? These are vital questions that warrant further investigation.

Furthermore, the findings of this study have practical implications for the development of AI systems. By understanding the specific contributions of language and vision, researchers can optimize the performance and efficiency of VLMs. Leveraging language to access prior knowledge can potentially enhance the learning capabilities of AI models, even when visual input is limited.

In conclusion, the emergence of Vision-Language Models presents an exciting avenue for studying the interplay between language and vision in intelligence. By using ablation techniques to dissect the contributions of different components, researchers are gaining valuable insights into how language enables learning from limited visual data. This research not only advances our understanding of AI systems but also sheds light on the fundamental nature of human intelligence and the role of language in shaping our perception of the visual world.

Read the original article

“Enhancing Active Speaker Detection in Noisy Environments with Audio-Visual Speech Separation”

arXiv:2403.19002v1 Announce Type: new
Abstract: This paper addresses the issue of active speaker detection (ASD) in noisy environments and formulates a robust active speaker detection (rASD) problem. Existing ASD approaches leverage both audio and visual modalities, but non-speech sounds in the surrounding environment can negatively impact performance. To overcome this, we propose a novel framework that utilizes audio-visual speech separation as guidance to learn noise-free audio features. These features are then utilized in an ASD model, and both tasks are jointly optimized in an end-to-end framework. Our proposed framework mitigates residual noise and audio quality reduction issues that can occur in a naive cascaded two-stage framework that directly uses separated speech for ASD, and enables the two tasks to be optimized simultaneously. To further enhance the robustness of the audio features and handle inherent speech noises, we propose a dynamic weighted loss approach to train the speech separator. We also collected a real-world noise audio dataset to facilitate investigations. Experiments demonstrate that non-speech audio noises significantly impact ASD models, and our proposed approach improves ASD performance in noisy environments. The framework is general and can be applied to different ASD approaches to improve their robustness. Our code, models, and data will be released.

Active Speaker Detection in Noisy Environments: A Robust Approach

Active speaker detection (ASD) is an essential task in multimedia information systems: given an audio-visual stream, the goal is to determine which visible person, if any, is speaking at each moment. However, in real-world scenarios, ambient noise can significantly degrade the performance of ASD models. This paper introduces a robust approach, termed robust active speaker detection (rASD), which addresses the challenge of accurately detecting the active speaker in noisy environments.

Existing ASD approaches leverage both audio and visual modalities to improve accuracy. However, non-speech sounds in the surrounding environment can interfere with the speaker’s voice, leading to performance degradation. To overcome this, the proposed rASD framework introduces a novel strategy that utilizes audio-visual speech separation as guidance to learn noise-free audio features. These features are then fed into an ASD model in an end-to-end framework, where both the speech separation and ASD tasks are jointly optimized.

This multi-disciplinary approach combines concepts from multimedia information systems, animations, artificial reality, augmented reality, and virtual realities. The integration of audio and visual modalities aligns with the principles of multimedia information systems, where the goal is to process and analyze different forms of media simultaneously. Additionally, the use of audio-visual speech separation techniques relates to the field of animations, as it involves separating speech from non-speech sounds, similar to isolating dialogues from background noises in animated films.

The proposed rASD framework also emphasizes the importance of addressing the audio quality reduction issues that can occur in a naive cascaded two-stage framework. By jointly optimizing the speech separation and ASD tasks, the framework mitigates residual noise and improves the overall audio quality. The dynamic weighted loss introduced to train the speech separator further enhances the robustness of the audio features, making the framework more resilient to inherent speech noises.
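
A minimal sketch of how the two objectives might be combined is shown below. The specific loss forms, the inverse-noise weighting rule, and the lambda trade-off are assumptions made for illustration; the paper's dynamic weighted loss and end-to-end training setup are defined in the original work.

```python
from typing import List

def dynamic_weight(noise_energy: float, eps: float = 1e-6) -> float:
    """Down-weight the separation loss for samples whose reference speech is
    itself noisy (an assumption about how a dynamic weighted loss could behave)."""
    return 1.0 / (1.0 + noise_energy + eps)

def joint_loss(asd_losses: List[float],
               sep_losses: List[float],
               noise_energies: List[float],
               lam: float = 0.5) -> float:
    """Total objective = ASD loss + lam * dynamically weighted separation loss.

    Jointly optimizing both terms (rather than cascading a frozen separator into
    an ASD model) is the core idea of the rASD framework; the exact formula here
    is illustrative.
    """
    asd_term = sum(asd_losses) / len(asd_losses)
    weights = [dynamic_weight(e) for e in noise_energies]
    sep_term = sum(w * l for w, l in zip(weights, sep_losses)) / len(sep_losses)
    return asd_term + lam * sep_term

# Example batch of three samples
print(joint_loss(asd_losses=[0.7, 0.4, 0.9],
                 sep_losses=[0.5, 0.2, 0.8],
                 noise_energies=[0.1, 0.0, 2.0]))
```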

To validate the effectiveness of the rASD framework, the authors conducted experiments using a real-world noise audio dataset they collected. The experiments demonstrate that non-speech audio noises have a significant impact on ASD models, confirming the need for robust approaches. The proposed rASD framework outperforms existing methods in noisy environments, offering improved accuracy and robustness.

In conclusion, this paper presents a robust approach, the rASD framework, for active speaker detection in noisy environments. The integration of audio-visual speech separation and the joint optimization of both tasks contribute to its effectiveness. The paper’s contribution extends to the wider field of multimedia information systems, animations, artificial reality, augmented reality, and virtual realities by addressing the challenges posed by ambient noise in active speaker detection.

Read the original article

Assessing Code Live-Load Models for Bridge Substructures

The present paper examines how effectively code live-load models estimate the vehicular loads acting on bridge substructures. The study utilizes realistic traffic data from four Weigh-in-Motion (WIM) databases, which record the weights and configurations of actual vehicles, so the assessment of the studied bridges is grounded in real-world data.

The evaluation includes various bridge models, ranging from single-span girder bridges to two-, three-, and four-span continuous pinned-support girder bridges. By comparing the extreme force values obtained from the Weigh-in-Motion databases with those predicted by selected code live-load models, the study assesses the accuracy of the models.

The exceedance rates, which indicate how often the force effects derived from the measured traffic exceed those predicted by the code models, are presented as spectra organized by span length. The analysis reveals significant variations in these exceedance rates, underscoring the need to improve code live-load models so that the forces transferred to bridge substructures are estimated more accurately.
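
The exceedance-rate computation itself is simple to state; the following Python sketch illustrates it with made-up span lengths, WIM-derived force effects, and code predictions (none of the numbers come from the paper).

```python
from typing import Dict, List

def exceedance_rate(wim_forces: List[float], code_force: float) -> float:
    """Fraction of measured (WIM-derived) force effects that exceed the code
    live-load model's prediction for the same span."""
    return sum(f > code_force for f in wim_forces) / len(wim_forces)

# Hypothetical maximum moments (kN*m) per span length, and code predictions.
wim_by_span: Dict[int, List[float]] = {
    20: [510, 495, 530, 470, 560],
    40: [1480, 1525, 1390, 1610, 1450],
}
code_by_span = {20: 540, 40: 1500}

spectrum = {span: exceedance_rate(forces, code_by_span[span])
            for span, forces in wim_by_span.items()}
print(spectrum)  # -> {20: 0.2, 40: 0.4}
```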

Enhancing the accuracy of these models is crucial in achieving more consistent reliability levels for a range of limit states, such as resistance, fatigue, serviceability, and cracking. By refining code live-load models, engineers and policymakers can ensure that bridges are designed to withstand the actual loads they will experience, leading to improved bridge safety and longevity.

Read the original article

“Spectral Convolution Transformers: Enhancing Vision with Local, Global, and Long-Range Dependence”

arXiv:2403.18063v1 Announce Type: cross
Abstract: Transformers used in vision have been investigated through diverse architectures – ViT, PVT, and Swin. These have worked to improve the attention mechanism and make it more efficient. Differently, the need for including local information was felt, leading to incorporating convolutions in transformers such as CPVT and CvT. Global information is captured using a complex Fourier basis to achieve global token mixing through various methods, such as AFNO, GFNet, and Spectformer. We advocate combining three diverse views of data – local, global, and long-range dependence. We also investigate the simplest global representation using only the real domain spectral representation – obtained through the Hartley transform. We use a convolutional operator in the initial layers to capture local information. Through these two contributions, we are able to optimize and obtain a spectral convolution transformer (SCT) that provides improved performance over the state-of-the-art methods while reducing the number of parameters. Through extensive experiments, we show that SCT-C-small gives state-of-the-art performance on the ImageNet dataset and reaches 84.5% top-1 accuracy, while SCT-C-Large reaches 85.9% and SCT-C-Huge reaches 86.4%. We evaluate SCT on transfer learning on datasets such as CIFAR-10, CIFAR-100, Oxford Flower, and Stanford Car. We also evaluate SCT on downstream tasks i.e. instance segmentation on the MSCOCO dataset. The project page is available at https://github.com/badripatro/sct

The Multidisciplinary Nature of Spectral Convolution Transformers

In recent years, transformers have become a popular choice for various tasks in the field of multimedia information systems, including computer vision. This article discusses the advancements made in transformer architectures for vision tasks, specifically focusing on the incorporation of convolutions and spectral representations.

Transformers, originally introduced for natural language processing, have shown promising results in vision tasks as well. Vision Transformer (ViT), PVT, and Swin are some of the architectures that have improved the attention mechanism and made it more efficient. However, researchers realized that there is a need to include local information in the attention mechanism, which led to the development of CPVT and CvT – transformer architectures that incorporate convolutions.

In addition to local information, capturing global information is also crucial in vision tasks. Various methods have been proposed to achieve global token mixing, including using a complex Fourier basis. Architectures like AFNO, GFNet, and Spectformer have implemented this global mixing of information. The combination of local, global, and long-range dependence views of data has proven to be effective in improving performance.

In this article, the focus is on investigating the simplest form of global representation – the real domain spectral representation obtained through the Hartley transform. By using a convolutional operator in the initial layers, local information is captured. These two contributions have led to the development of a new transformer architecture called Spectral Convolution Transformer (SCT).
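
The appeal of the Hartley transform is that it yields a purely real spectrum, so global token mixing can stay in the real domain. The NumPy sketch below shows how a discrete Hartley transform can be computed from the FFT (real part minus imaginary part) and used for a toy spectral gating step; it is an illustrative stand-in, not the exact SCT block from the paper.

```python
import numpy as np

def dht(x: np.ndarray, axis: int = -2) -> np.ndarray:
    """Discrete Hartley transform via the FFT: DHT(x) = Re(FFT(x)) - Im(FFT(x)).
    For real input the result is real-valued, unlike the complex Fourier spectrum."""
    X = np.fft.fft(x, axis=axis)
    return X.real - X.imag

def hartley_token_mixing(tokens: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Toy global token mixing: transform along the token axis, apply per-frequency
    gates, and transform back (the DHT is its own inverse up to a factor of 1/N)."""
    n = tokens.shape[-2]
    spectrum = dht(tokens)        # (batch, n_tokens, dim), real-valued
    mixed = spectrum * weights    # element-wise spectral gating
    return dht(mixed) / n         # inverse DHT

rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 8, 16))   # (batch, tokens, channels)
weights = rng.standard_normal((8, 16))     # per-frequency, per-channel gates
out = hartley_token_mixing(tokens, weights)
print(out.shape)  # (2, 8, 16)
```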

SCT has shown improved performance over state-of-the-art methods while also reducing the number of parameters. The results on the ImageNet dataset are impressive, with SCT-C-small achieving 84.5% top-1 accuracy, SCT-C-Large reaching 85.9%, and SCT-C-Huge reaching 86.4%. The authors have also evaluated SCT on transfer learning tasks using datasets like CIFAR-10, CIFAR-100, Oxford Flower, and Stanford Car. Additionally, SCT has been tested on downstream tasks such as instance segmentation on the MSCOCO dataset.

The multidisciplinary nature of this research is noteworthy. It combines concepts from various fields such as computer vision, artificial intelligence, information systems, and signal processing. By integrating convolutions and spectral representations into transformers, the authors have pushed the boundaries of what transformers can achieve in vision tasks.

As multimedia information systems continue to evolve, the innovations in transformer architectures like SCT open up new possibilities for advancements in animations, artificial reality, augmented reality, and virtual realities. These fields heavily rely on efficient and effective processing of visual data, and transformer architectures have the potential to revolutionize how these systems are developed and utilized.

In conclusion, the introduction of spectral convolution transformers is an exciting development in the field of multimedia information systems. The combination of convolutions and spectral representations allows local, global, and long-range dependence information to be incorporated, leading to improved performance with fewer parameters. Further exploration and application of these architectures hold great promise for multimedia applications such as animations, artificial reality, augmented reality, and virtual reality.

References:

  • ViT: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
  • PVT: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
  • Swin: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  • CPVT: Conditional Positional Encodings for Vision Transformers
  • CvT: CvT: Introducing Convolutions to Vision Transformers
  • AFNO: Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers
  • GFNet: Global Filter Networks for Image Classification
  • SpectFormer: SpectFormer: Frequency and Attention is What You Need in a Vision Transformer

Read the original article

“Optimizing RF Receiver Performance with Circuit-centric Genetic Algorithm”

This paper presents a highly efficient method for optimizing parameters in analog/high-frequency circuits, specifically targeting the performance parameters of a radio-frequency (RF) receiver. The goal is to maximize the receiver’s performance by reducing power consumption and noise figure while increasing conversion gain. The authors propose a novel approach called the Circuit-centric Genetic Algorithm (CGA) to address the limitations observed in the traditional Genetic Algorithm (GA).

One of the key advantages of the CGA is its simplicity and computational efficiency compared to existing deep learning models. Deep learning models often require significant computational resources and extensive training data, which may not always be readily available in the context of analog/high-frequency circuit optimization. The CGA, on the other hand, offers a simpler inference process that can more effectively leverage available circuit parameters to optimize the performance of the RF receiver.

Furthermore, the CGA offers significant advantages over both manual design and the conventional GA in finding optimal operating points. Manual design is a time-consuming, iterative process that requires the designer to experiment with many circuit parameters to identify the best combination. The conventional GA, while automated, can still be computationally expensive and does not always converge to superior optima. The CGA's circuit-centric approach reduces the designer's workload by automating the search for the best parameter values while also increasing the likelihood of reaching superior optima.
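
For readers unfamiliar with how a genetic algorithm searches a circuit's parameter space, the Python sketch below runs a generic GA over three hypothetical RF receiver parameters with a toy closed-form fitness. It is not the authors' CGA: a real implementation would obtain conversion gain, noise figure, and power consumption from circuit simulation rather than from the made-up expressions used here.

```python
import random

# Hypothetical tuning parameters: (bias current in mA, LNA width in um, LO power in dBm).
BOUNDS = [(0.5, 5.0), (10.0, 200.0), (-10.0, 5.0)]

def fitness(params):
    """Toy surrogate fitness: reward conversion gain, penalize noise figure and
    power consumption. All expressions below are made up for illustration."""
    bias, width, lo_power = params
    gain = 10.0 + 2.0 * width ** 0.5 + 0.5 * lo_power  # dB
    noise_figure = 6.0 - 0.8 * bias + 0.01 * width     # dB
    power = 1.2 * bias                                  # mW
    return gain - 2.0 * noise_figure - 1.5 * power

def evolve(pop_size=30, generations=40, mutation=0.1, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                      # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2.0 for x, y in zip(a, b)]   # arithmetic crossover
            child = [min(max(v + rng.gauss(0.0, mutation * (hi - lo)), lo), hi)
                     for v, (lo, hi) in zip(child, BOUNDS)]  # bounded Gaussian mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print("best parameters:", best, "fitness:", round(fitness(best), 2))
```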

Looking ahead, it would be interesting to see the CGA being applied to more complex analog/high-frequency circuits beyond RF receivers. The authors demonstrate the feasibility of the method in optimizing a receiver, but its potential application in other circuit types could greatly benefit the field. Additionally, future research could explore the combination of CGA with other optimization techniques, further enhancing its efficiency and effectiveness in tuning circuit parameters.

Read the original article