Detecting Multimodal Implicit Toxicity: Introducing ShieldVLM

arXiv:2505.14035v1 Announce Type: new
Abstract: Toxicity detection in multimodal text-image content faces growing challenges, especially with multimodal implicit toxicity, where each modality appears benign on its own but conveys hazard when combined. Multimodal implicit toxicity appears not only as formal statements in social platforms but also prompts that can lead to toxic dialogs from Large Vision-Language Models (LVLMs). Despite the success in unimodal text or image moderation, toxicity detection for multimodal content, particularly the multimodal implicit toxicity, remains underexplored. To fill this gap, we comprehensively build a taxonomy for multimodal implicit toxicity (MMIT) and introduce an MMIT-dataset, comprising 2,100 multimodal statements and prompts across 7 risk categories (31 sub-categories) and 5 typical cross-modal correlation modes. To advance the detection of multimodal implicit toxicity, we build ShieldVLM, a model which identifies implicit toxicity in multimodal statements, prompts and dialogs via deliberative cross-modal reasoning. Experiments show that ShieldVLM outperforms existing strong baselines in detecting both implicit and explicit toxicity. The model and dataset will be publicly available to support future researches. Warning: This paper contains potentially sensitive contents.

Expert Commentary

As an expert commentator in the field of multimedia information systems and artificial realities, I find this study on toxicity detection in multimodal text-image content highly relevant and timely. With the rise of social platforms and the proliferation of Large Vision-Language Models (LVLMs), detecting toxicity in multimodal content has become more complex, largely because of implicit toxicity.

The concept of multimodal implicit toxicity, where each modality appears harmless on its own but becomes toxic when combined, is a multi-disciplinary challenge that requires a holistic approach. By creating a taxonomy for multimodal implicit toxicity (MMIT) and developing an MMIT dataset of 2,100 multimodal statements and prompts spanning 7 risk categories, 31 sub-categories, and 5 cross-modal correlation modes, the researchers have taken a crucial step towards understanding and detecting toxic behavior in multimedia content.
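
To make the taxonomy concrete, a hypothetical record schema for one dataset entry might look like the sketch below; the field names and label values are illustrative assumptions, since the abstract does not specify the dataset's exact format.

```python
# Hypothetical schema for a single MMIT-style dataset entry.
# Field names and label values are illustrative only; the paper's actual
# dataset format is not specified in the abstract.
from dataclasses import dataclass

@dataclass
class MMITItem:
    text: str                # textual half of the pair (benign on its own)
    image_path: str          # visual half of the pair (benign on its own)
    risk_category: str       # one of the 7 top-level risk categories
    risk_subcategory: str    # one of the 31 sub-categories
    correlation_mode: str    # one of the 5 cross-modal correlation modes
    item_type: str           # "statement" or "prompt"
    is_toxic: bool           # toxicity emerges only from the combination
```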

The introduction of ShieldVLM, a model that uses deliberative cross-modal reasoning to identify implicit toxicity in multimodal statements, prompts, and dialogs, is a significant advancement in this field. By outperforming existing strong baselines in detecting both implicit and explicit toxicity, ShieldVLM showcases the power of multi-disciplinary research in tackling complex issues like toxicity detection in multimedia content.
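
While the paper's exact prompting and label space are not described in the abstract, the general shape of a deliberative cross-modal check can be sketched as follows; the staged prompt wording, the `query_lvlm` helper, and the verdict parsing are assumptions for illustration, not ShieldVLM's actual implementation.

```python
# Sketch of a deliberative cross-modal moderation pass. `query_lvlm` stands in
# for whatever vision-language model client is used; the staged prompt and the
# verdict parsing are illustrative, not ShieldVLM's actual implementation.
DELIBERATIVE_PROMPT = """\
Step 1: Describe, literally, what the image depicts.
Step 2: Restate, literally, what the accompanying text says.
Step 3: Reason about what image and text imply when combined,
        including indirect or coded meanings.
Step 4: Conclude with "VERDICT: safe" or "VERDICT: unsafe" and a one-sentence reason.

Text: {text}
"""

def moderate(image_path: str, text: str, query_lvlm) -> dict:
    """Run one deliberative pass; query_lvlm(image_path, prompt) must return a string."""
    trace = query_lvlm(image_path, DELIBERATIVE_PROMPT.format(text=text))
    verdict = "unsafe" if "verdict: unsafe" in trace.lower() else "safe"
    return {"verdict": verdict, "reasoning": trace}
```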

Overall, this study not only contributes to the field of multimedia information systems but also has implications for the wider field of artificial realities, augmented realities, and virtual realities. As we continue to navigate the digital landscape, understanding and detecting toxic behaviors in multimodal content will be essential for creating safe and inclusive online environments.

Read the original article

“Expanding Bigraphical Reactive Systems for Real-Time Systems”

Expert Commentary: Enhancing Bigraphical Reactive Systems for Real-Time Systems

In this article, the authors discuss the extension of Bigraphical Reactive Systems (BRSs) to support real-time systems, a significant advancement in the field of graph-rewriting formalisms. BRSs have been widely used in various domains such as communication protocols, agent programming, biology, and security due to their ability to model systems evolving in two dimensions: spatially and non-spatially.

One of the key contributions of this work is the introduction of multiple perspectives to represent digital clocks in BRSs, enabling the modelling of real-time systems. By using Action BRSs, which result in a Markov Decision Process (MDP), the authors are able to naturally represent choices in each system state, allowing for the passage of time or the execution of specific actions.
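
To give a flavour of the "let time pass or act" choice that such an MDP exposes, here is a toy sketch; it is plain Python, not BigraphER syntax, and the states, actions, and probabilities are invented for illustration.

```python
# Toy illustration of the "tick or act" choice an Action-BRS-style MDP offers.
# This is not BigraphER syntax; states, actions and probabilities are made up.
from typing import Dict, List, Tuple

State = Tuple[str, int]           # (control state, digital clock value)
Transition = Tuple[State, float]  # (next state, probability)

def actions(state: State) -> Dict[str, List[Transition]]:
    """Each state offers 'tick' (advance the digital clock) or a named action."""
    ctrl, clock = state
    choices: Dict[str, List[Transition]] = {
        "tick": [((ctrl, clock + 1), 1.0)],  # time passes deterministically
    }
    if ctrl == "waiting" and clock >= 2:
        # after two ticks a request may be served or dropped (probabilistic effect)
        choices["serve"] = [(("served", clock), 0.9), (("dropped", clock), 0.1)]
    return choices

# Example: enumerate the choices available after two clock ticks.
print(actions(("waiting", 2)))
```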

The implementation of this proposed approach using the BigraphER toolkit showcases its effectiveness through the modelling of cloud system requests and other examples. This extension opens up new possibilities for the application of BRSs in real-time systems, providing researchers and practitioners with a powerful tool for modelling and analyzing complex systems.

Future Directions

  • Further research could explore the application of this extended BRS framework to other domains beyond cloud computing, such as IoT devices, cyber-physical systems, or real-time monitoring systems.
  • It would be interesting to investigate the scalability and performance of the proposed approach when dealing with large-scale systems with multiple interconnected components.
  • Exploring the integration of formal verification techniques with Action BRSs could enhance the reliability and correctness of real-time systems modelled using this approach.

Overall, the extension of BRSs to support real-time systems represents a significant step forward in the evolution of graph-rewriting formalisms, opening up exciting new possibilities for modelling and analyzing complex systems in a wide range of application domains.

Read the original article

“CMFusion: A Novel Model for Multimodal Hate Video Detection”

arXiv:2505.12051v1 Announce Type: new
Abstract: The rapid rise of video content on platforms such as TikTok and YouTube has transformed information dissemination, but it has also facilitated the spread of harmful content, particularly hate videos. Despite significant efforts to combat hate speech, detecting these videos remains challenging due to their often implicit nature. Current detection methods primarily rely on unimodal approaches, which inadequately capture the complementary features across different modalities. While multimodal techniques offer a broader perspective, many fail to effectively integrate temporal dynamics and modality-wise interactions essential for identifying nuanced hate content. In this paper, we present CMFusion, an enhanced multimodal hate video detection model utilizing a novel Channel-wise and Modality-wise Fusion Mechanism. CMFusion first extracts features from text, audio, and video modalities using pre-trained models and then incorporates a temporal cross-attention mechanism to capture dependencies between video and audio streams. The learned features are then processed by channel-wise and modality-wise fusion modules to obtain informative representations of videos. Our extensive experiments on a real-world dataset demonstrate that CMFusion significantly outperforms five widely used baselines in terms of accuracy, precision, recall, and F1 score. Comprehensive ablation studies and parameter analyses further validate our design choices, highlighting the model’s effectiveness in detecting hate videos. The source codes will be made publicly available at https://github.com/EvelynZ10/cmfusion.

Expert Commentary: The Rise of Multimodal Approaches in Hate Video Detection

The proliferation of video content on social media platforms has brought about both positive and negative consequences. While it has democratized information dissemination and fostered creativity, it has also facilitated the spread of harmful content, such as hate videos. These videos often contain implicit messages that can be challenging to detect using traditional methods.

Current hate video detection approaches predominantly rely on unimodal techniques, which may not fully capture the complexity of multimedia content. Multimodal methods, on the other hand, leverage information from multiple modalities, such as text, audio, and video, to provide a more comprehensive understanding of the content. However, integrating temporal dynamics and modality-wise interactions in these approaches remains a challenge.

The CMFusion model introduced in this paper takes a step towards addressing this issue by utilizing a Channel-wise and Modality-wise Fusion Mechanism. By extracting features from different modalities and incorporating a temporal cross-attention mechanism, CMFusion aims to capture the nuanced relationships between video and audio streams. The model then processes these features using fusion modules to generate informative representations of hate videos.
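
The rough shape of these ingredients, temporal cross-attention between audio and video followed by channel-wise gating and modality-wise concatenation, can be sketched in a few lines of PyTorch; the dimensions, module choices, and pooling below are assumptions rather than CMFusion's actual architecture.

```python
# Minimal PyTorch sketch of the ingredients the paper describes: cross-attention
# between video and audio sequences, then channel-wise and modality-wise fusion.
# Dimensions and module details are assumptions, not CMFusion's actual design.
import torch
import torch.nn as nn

class CrossModalFusionSketch(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.video_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.channel_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(3 * dim, 2)  # hate / non-hate

    def forward(self, text, audio, video):
        # text/audio/video: (batch, seq_len, dim) features from pre-trained encoders
        v_att, _ = self.video_to_audio(video, audio, audio)  # video attends to audio
        a_att, _ = self.audio_to_video(audio, video, video)  # audio attends to video
        v_pool, a_pool, t_pool = v_att.mean(1), a_att.mean(1), text.mean(1)
        # channel-wise gating re-weights feature channels per modality
        v_pool = v_pool * self.channel_gate(v_pool)
        a_pool = a_pool * self.channel_gate(a_pool)
        # modality-wise fusion: concatenate gated modality representations
        fused = torch.cat([t_pool, a_pool, v_pool], dim=-1)
        return self.classifier(fused)

# Example with random features standing in for encoder outputs.
model = CrossModalFusionSketch()
logits = model(torch.randn(2, 10, 256), torch.randn(2, 20, 256), torch.randn(2, 20, 256))
print(logits.shape)  # torch.Size([2, 2])
```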

Notably, the effectiveness of CMFusion is demonstrated through extensive experiments on a real-world dataset, where it outperforms five popular baselines in terms of accuracy, precision, recall, and F1 score. Ablation studies and parameter analyses further validate the design choices of the model, emphasizing its robustness in hate video detection.

From a multidisciplinary perspective, the development of CMFusion touches upon various fields, including multimedia information systems, animations, artificial reality, augmented reality, and virtual realities. As hate videos can manifest in different forms across these modalities, a holistic approach that combines insights from diverse disciplines is essential in combating harmful content online.

In conclusion, the integration of multimodal techniques, like CMFusion, represents a promising direction in addressing the challenges of hate video detection. By leveraging the complementary features of different modalities and incorporating advanced fusion mechanisms, researchers can enhance the accuracy and effectiveness of automated content moderation systems in the digital age.

Read the original article

Enhancing Reinforcement Learning in Large Language Models with Response Diversity

Expert Commentary: Enhancing Reinforcement Learning in Large Language Models

Reinforcement Learning (RL) has become a key technique in improving the reasoning abilities of large language models (LLMs) such as DeepSeek-R1. One popular RL method, Group Relative Policy Optimization (GRPO), has been successful in training these models, but faces challenges when all sampled responses in a group are incorrect, leading to what is known as an “all-negative-sample” group. This can hinder learning progress as GRPO fails to update the policy in such cases.
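
The mechanics behind that failure are easy to see: GRPO typically computes each response's advantage relative to its own group's reward statistics, so a group in which every response earns the same (zero) reward yields all-zero advantages and therefore no gradient. A minimal numeric sketch, using a common formulation rather than necessarily the paper's exact one:

```python
# Why an all-negative group stalls GRPO: group-relative advantages are computed
# against the group's own reward statistics, so identical rewards give zero
# advantage and hence no policy-gradient signal. (Simplified; eps avoids /0.)
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_advantages([0, 0, 0, 0]))   # all-negative group -> all-zero advantages
print(grpo_advantages([0, 0, 1, 0]))   # one diverse/correct sample restores signal
```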

The recent paper introduces a novel framework to address this issue by introducing response diversity within these all-negative-sample groups using AI feedback. The addition of this diversification not only improves learning dynamics, as shown through theoretical analysis, but also leads to enhanced performance across different model sizes and learning settings in offline and online scenarios.

This research contributes significantly to the understanding of learning dynamics in RL for LLMs, building upon recent insights from related work. By showing the feasibility and benefits of learning from all-negative-sample groups, this work opens up new avenues for enhancing the performance and capabilities of language models through reinforcement learning techniques.

Read the original article

Efficient Multimodal Metaphor Identification with CDGLT

arXiv:2505.11237v1 Announce Type: new
Abstract: Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Additionally, generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces Concept Drift Guided LayerNorm Tuning (CDGLT), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that leverages Spherical Linear Interpolation (SLERP) of cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding. This drifted concept helps to alleviate the gap between literal features and the figurative task. (2) A prompt construction strategy that adapts the method of feature extraction and fusion using pre-trained language models for the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LN Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available: https://github.com/Qianvenh/CDGLT.

Expert Commentary

The ability to understand and convey metaphors is a crucial aspect of human communication and cognition. When it comes to multimodal metaphors, such as those seen in internet memes, the challenges are unique due to their unconventional expressions and implied meanings. This paper introduces the CDGLT framework, which aims to address these challenges in a training-efficient manner.

The CDGLT framework incorporates innovative concepts like Concept Drift, which leverages cross-modal embeddings to generate new, divergent concept embeddings. This helps bridge the gap between literal features and the figurative task of identifying multimodal metaphors. Additionally, the prompt construction strategy utilized in CDGLT adapts feature extraction and fusion methods using pre-trained language models, further enhancing the framework’s effectiveness.
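
For readers unfamiliar with SLERP, the drift step amounts to interpolating two embeddings along the unit sphere rather than linearly; a minimal sketch, with the interpolation weight and vector dimensions chosen arbitrarily:

```python
# Sketch of the spherical linear interpolation (SLERP) behind "Concept Drift":
# blending a CLIP image embedding and text embedding along the unit sphere to
# obtain a drifted concept vector. The weight t and dimensions are illustrative.
import numpy as np

def slerp(u: np.ndarray, v: np.ndarray, t: float) -> np.ndarray:
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    omega = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))  # angle between embeddings
    if np.isclose(omega, 0.0):
        return u  # embeddings already aligned; nothing to interpolate
    return (np.sin((1 - t) * omega) * u + np.sin(t * omega) * v) / np.sin(omega)

# Toy vectors standing in for CLIP image/text embeddings.
drifted = slerp(np.random.randn(512), np.random.randn(512), t=0.5)
print(np.linalg.norm(drifted))  # stays (approximately) on the unit sphere
```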

From a multidisciplinary perspective, this research combines concepts from natural language processing, computer vision, and cognitive psychology to develop a solution for multimodal metaphor identification. By tapping into the fields of multimedia information systems, animations, artificial reality, augmented reality, and virtual realities, the CDGLT framework showcases the potential for interdisciplinary collaboration in advancing understanding of complex cognitive processes.

Furthermore, the state-of-the-art performance of CDGLT on the MET-Meme benchmark highlights its efficacy in tackling the challenges posed by multimodal metaphors. The reduced training costs compared to existing generative methods make CDGLT a promising tool for researchers and practitioners interested in multimodal metaphor understanding.

In conclusion, the CDGLT framework represents a significant contribution to the field of multimodal metaphor identification, paving the way for more efficient and accurate methods of analyzing complex and layered forms of communication.

Read the original article

Calculating Tail Probabilities for M|D|∞ Queue Using FORTRAN Program

Expert Commentary: Analyzing Queuing System Distribution Functions

Queuing systems play a crucial role in various real-world applications, such as telecommunications, traffic management, and customer service. Understanding the distribution functions of busy periods and busy cycles in these systems is essential for optimizing their performance and resource utilization.

In the case of the M|G|∞ queue system, the lack of closed-form formulas for these distribution functions presents a significant challenge. The M|D|∞ queue, however, stands out because the Laplace transforms of these distribution functions are available in closed form.

Platzman, Ammons, and Bartholdi III have developed an algorithm that leverages these closed-form expressions to compute tail probabilities efficiently. This algorithm opens up opportunities for precise calculations and analysis of distribution functions in the M|D|$infty$ queue system.

Implementation via a FORTRAN Program

By implementing the algorithm in a FORTRAN program, researchers and practitioners can harness its computational power to explore complex queuing system scenarios. The program enables them to calculate tail probabilities accurately and derive valuable insights into system performance under different conditions.
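
For readers without access to the FORTRAN implementation, the same quantities can be sanity-checked by simulation; the sketch below estimates busy-period tail probabilities for an M|D|∞ queue by Monte Carlo, with illustrative parameter values, and is not the closed-form Laplace-transform algorithm of Platzman, Ammons, and Bartholdi III.

```python
# Monte Carlo cross-check for M|D|inf busy-period tail probabilities.
# This is a simulation sketch, not the closed-form Laplace-transform algorithm
# referenced in the article; lam (arrival rate) and D (deterministic service
# time) are illustrative values.
import random

def busy_period(lam: float, D: float) -> float:
    """Length of one busy period: with deterministic service and infinitely
    many servers, the system stays busy while successive interarrival gaps
    do not exceed D."""
    last_arrival = 0.0
    while True:
        gap = random.expovariate(lam)
        if gap > D:
            return last_arrival + D  # last customer in the period departs here
        last_arrival += gap

def tail_probability(x: float, lam: float, D: float, n: int = 100_000) -> float:
    """Estimate P(busy period > x) by simulation."""
    return sum(busy_period(lam, D) > x for _ in range(n)) / n

print(tail_probability(x=3.0, lam=1.0, D=1.0))
```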

Overall, the development and implementation of algorithms like the one proposed by Platzman, Ammons, and Bartholdi III are instrumental in advancing our understanding of queuing systems and enhancing their efficiency and reliability in practical applications.

Read the original article