arXiv:2505.14035v1
Abstract: Toxicity detection in multimodal text-image content faces growing challenges, especially with multimodal implicit toxicity, where each modality appears benign on its own but conveys harm when combined. Multimodal implicit toxicity appears not only in formal statements on social platforms but also in prompts that can elicit toxic dialogs from Large Vision-Language Models (LVLMs). Despite successes in unimodal text and image moderation, toxicity detection for multimodal content, particularly multimodal implicit toxicity, remains underexplored. To fill this gap, we build a comprehensive taxonomy for multimodal implicit toxicity (MMIT) and introduce an MMIT-dataset comprising 2,100 multimodal statements and prompts across 7 risk categories (31 sub-categories) and 5 typical cross-modal correlation modes. To advance the detection of multimodal implicit toxicity, we build ShieldVLM, a model that identifies implicit toxicity in multimodal statements, prompts, and dialogs via deliberative cross-modal reasoning. Experiments show that ShieldVLM outperforms strong existing baselines in detecting both implicit and explicit toxicity. The model and dataset will be publicly available to support future research. Warning: This paper contains potentially sensitive content.

Expert Commentary

As an expert commentator in the field of multimedia information systems and artificial realities, I find this study on toxicity detection in multimodal text-image content highly relevant and timely. With the rise of social platforms and the proliferation of Large Vision-Language Models (LVLMs), detecting toxicity in multimodal content has become more complex, precisely because much of it is implicit rather than overt.

The concept of multimodal implicit toxicity, where each modality appears harmless on its own but becomes toxic when combined, is a multi-disciplinary challenge that resists single-modality moderation. By building a taxonomy for multimodal implicit toxicity (MMIT) and an MMIT-dataset of 2,100 multimodal statements and prompts spanning 7 risk categories, 31 sub-categories, and 5 cross-modal correlation modes, the researchers have taken a crucial step toward systematically characterizing and detecting such content.
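To make the dataset's structure concrete, the sketch below shows what a single MMIT-style record could look like in Python. The field names, label values, and example content are illustrative assumptions based only on the abstract's description (statements vs. prompts, 7 risk categories, 31 sub-categories, 5 cross-modal correlation modes); the released MMIT-dataset may use a different schema.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical record layout for one multimodal implicit-toxicity example.
# Field names and label strings are assumptions for illustration only;
# they are not taken from the released MMIT-dataset.
@dataclass
class MMITExample:
    text: str                                   # benign-looking text in isolation
    image_path: str                             # benign-looking image in isolation
    risk_category: str                          # one of the 7 top-level risk categories
    sub_category: str                           # one of the 31 fine-grained sub-categories
    correlation_mode: str                       # one of the 5 cross-modal correlation modes
    item_type: Literal["statement", "prompt"]   # social-platform statement vs. LVLM prompt
    is_toxic: bool                              # toxicity emerges only when modalities combine


example = MMITExample(
    text="Here's the perfect spot for it.",     # harmless on its own
    image_path="images/crowded_station.jpg",    # harmless on its own
    risk_category="illegal_activity",           # hypothetical label
    sub_category="public_endangerment",         # hypothetical label
    correlation_mode="text_refers_to_image",    # hypothetical mode name
    item_type="prompt",
    is_toxic=True,
)
```

The point of such a record is that neither `text` nor `image_path` would trigger a unimodal filter, yet the pairing carries the risk label.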

The introduction of ShieldVLM, a model that uses deliberative cross-modal reasoning to identify implicit toxicity in multimodal statements, prompts, and dialogs, is a significant advancement. By outperforming strong existing baselines on both implicit and explicit toxicity, ShieldVLM suggests that reasoning explicitly over how the modalities interact, rather than moderating each modality in isolation, is what makes this kind of content detectable.
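To illustrate what deliberative cross-modal reasoning might look like in practice, here is a minimal moderation-loop sketch. It is not the authors' ShieldVLM pipeline: `query_lvlm` is a placeholder for whatever vision-language model API is available, and the prompt wording and verdict parsing are simplified assumptions.

```python
# Minimal sketch of a deliberative cross-modal moderation loop.
# NOT the authors' ShieldVLM implementation; `query_lvlm` stands in for
# any vision-language model callable that accepts a prompt and an image.

DELIBERATION_PROMPT = """You are a content-safety reviewer.
Step 1: Describe what the image shows, on its own.
Step 2: Describe what the text says, on its own.
Step 3: Reason about what the text and image imply when combined.
Step 4: Answer with exactly one label, SAFE or TOXIC, followed by a short rationale.

Text: {text}
"""


def detect_implicit_toxicity(query_lvlm, text: str, image_path: str) -> tuple[str, str]:
    """Run one deliberative pass and (naively) parse the final SAFE/TOXIC verdict."""
    response = query_lvlm(prompt=DELIBERATION_PROMPT.format(text=text),
                          image=image_path)
    # Naive parsing: take the verdict from the final non-empty line of the response.
    lines = [ln for ln in response.splitlines() if ln.strip()]
    verdict = "TOXIC" if lines and "TOXIC" in lines[-1].upper() else "SAFE"
    return verdict, response
```

The design choice this sketch tries to capture, which is one plausible reading of "deliberative cross-modal reasoning", is that the model describes each modality separately before judging their combination, so toxicity that emerges only from the pairing is surfaced by the reasoning step rather than missed by per-modality checks.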

Overall, this study contributes not only to the field of multimedia information systems but also to the wider fields of artificial, augmented, and virtual realities, where text and imagery are routinely combined. As we continue to navigate the digital landscape, understanding and detecting toxic behavior in multimodal content will be essential for creating safe and inclusive online environments.

Read the original article