arXiv:2505.11237v1
Abstract: Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding of linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Additionally, generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces Concept Drift Guided LayerNorm Tuning (CDGLT), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that applies Spherical Linear Interpolation (SLERP) to cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding; this drifted concept helps alleviate the gap between literal features and the figurative task. (2) A prompt construction strategy that adapts feature extraction and fusion with pre-trained language models to the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LayerNorm (LN) Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available at: https://github.com/Qianvenh/CDGLT
Expert Commentary
The ability to understand and convey metaphors is a crucial aspect of human communication and cognition. When it comes to multimodal metaphors, such as those seen in internet memes, the challenges are unique due to their unconventional expressions and implied meanings. This paper introduces the CDGLT framework, which aims to address these challenges in a training-efficient manner.
The CDGLT framework incorporates two key innovations. The first is Concept Drift, which applies spherical linear interpolation (SLERP) to CLIP's cross-modal image and text embeddings to generate a new, divergent concept embedding; this drifted concept helps bridge the gap between literal features and the figurative task of identifying multimodal metaphors. The second is a prompt construction strategy that adapts feature extraction and fusion methods built on pre-trained language models, further enhancing the framework's effectiveness.
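To make the interpolation step concrete, here is a minimal sketch of SLERP between an image embedding and a text embedding, as one might obtain from a CLIP encoder. The function name, the blend factor t, and the random placeholder tensors are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical linear interpolation between embeddings a and b.

    Interpolates along the great circle on the unit hypersphere rather
    than along a straight line, keeping the blended vector on the same
    sphere as the normalized inputs.
    """
    a_n = F.normalize(a, dim=-1)
    b_n = F.normalize(b, dim=-1)
    # Angle between the two unit vectors, clamped for numerical safety.
    omega = torch.acos((a_n * b_n).sum(dim=-1).clamp(-1 + 1e-7, 1 - 1e-7))
    sin_omega = torch.sin(omega)
    w_a = (torch.sin((1.0 - t) * omega) / sin_omega).unsqueeze(-1)
    w_b = (torch.sin(t * omega) / sin_omega).unsqueeze(-1)
    return w_a * a_n + w_b * b_n

# Placeholder embeddings standing in for CLIP image/text features
# (e.g., 512-dimensional for a ViT-B/32 CLIP encoder). The factor t
# controls how far the "drifted" concept moves from the image embedding
# toward the text embedding.
img_emb = torch.randn(1, 512)
txt_emb = torch.randn(1, 512)
drifted = slerp(img_emb, txt_emb, t=0.5)
```

Because SLERP traverses the hypersphere rather than cutting through it, the drifted embedding remains well-formed for downstream similarity comparisons, unlike naive linear averaging, which shrinks vector norms.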
From a multidisciplinary perspective, this research combines concepts from natural language processing, computer vision, and cognitive science to develop a solution for multimodal metaphor identification, showing how insights from these fields can be brought together to model a complex cognitive process.
Furthermore, the state-of-the-art performance of CDGLT on the MET-Meme benchmark highlights its efficacy in tackling the challenges posed by multimodal metaphors. The reduced training costs compared to existing generative methods make CDGLT a promising tool for researchers and practitioners interested in multimodal metaphor understanding.
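The training efficiency stems from tuning only a small subset of parameters. As an illustration of the general LayerNorm-tuning recipe (a sketch of the standard technique, not CDGLT's exact code), one can freeze every weight in a pre-trained backbone except the LayerNorm affine parameters:

```python
import torch.nn as nn

def tune_layernorm_only(model: nn.Module) -> int:
    """Freeze all parameters except LayerNorm affine weights/biases.

    Returns the number of trainable parameters remaining, typically a
    tiny fraction of the full model.
    """
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad = True
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example on a toy transformer encoder: only LayerNorm parameters stay
# trainable, so gradients and optimizer state remain small.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
)
n_trainable = tune_layernorm_only(encoder)
```

Since LayerNorm layers account for well under one percent of a typical transformer's parameters, this kind of tuning cuts memory and compute dramatically compared to full fine-tuning or generative pipelines.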
In conclusion, the CDGLT framework represents a significant contribution to the field of multimodal metaphor identification, paving the way for more efficient and accurate methods of analyzing complex and layered forms of communication.