arXiv:2403.13480v1 Announce Type: cross
Abstract: Cross-modal retrieval (CMR) aims to establish interaction between different modalities, among which supervised CMR is emerging due to its flexibility in learning semantic category discrimination. Despite the remarkable performance of previous supervised CMR methods, much of their success can be attributed to well-annotated data. However, even for unimodal data, precise annotation is expensive and time-consuming, and it becomes more challenging in the multimodal scenario. In practice, massive multimodal data are collected from the Internet with coarse annotation, which inevitably introduces noisy labels. Training with such misleading labels brings two key challenges: enforcing the multimodal samples to align incorrect semantics and widen the heterogeneous gap, resulting in poor retrieval performance. To tackle these challenges, this work proposes UOT-RCL, a Unified framework based on Optimal Transport (OT) for Robust Cross-modal Retrieval. First, we propose a semantic alignment based on partial OT to progressively correct the noisy labels, where a novel cross-modal consistent cost function is designed to blend different modalities and provide precise transport costs. Second, to narrow the discrepancy in multi-modal data, an OT-based relation alignment is proposed to infer the semantic-level cross-modal matching. Both components leverage the inherent correlation among multi-modal data to construct effective cost functions. Experiments on three widely-used cross-modal retrieval datasets demonstrate that UOT-RCL surpasses state-of-the-art approaches and significantly improves robustness against noisy labels.

Cross-Modal Retrieval and Supervised CMR

Cross-modal retrieval (CMR) establishes interactions between different modalities, such as text, images, and videos, so that a query in one modality can retrieve relevant items in another. Within CMR, supervised CMR has emerged as a popular approach because labeled data lets it learn semantic category discrimination directly.

Supervised CMR methods have shown remarkable performance, but their success relies heavily on well-annotated data. Precise annotation is expensive and time-consuming even for unimodal data, and it becomes harder still in the multimodal setting. In practice, massive multimodal data are collected from the Internet with only coarse annotation, which inevitably introduces noisy labels and makes effective training difficult. This is the problem that UOT-RCL, a Unified framework based on Optimal Transport (OT) for Robust Cross-modal Retrieval, is designed to address.

The Challenges and Solutions

Two key challenges arise when training with noisy labels in cross-modal retrieval. The first is that mislabeled samples force the model to align incorrect semantics: the labels no longer reflect the true semantic content, so supervision pulls samples toward the wrong categories and degrades retrieval performance. The second is that noisy labels widen the heterogeneous gap between modalities, making it harder to establish meaningful cross-modal connections.

The UOT-RCL framework tackles these challenges with two main components. The first is a semantic alignment based on partial OT, which progressively corrects the noisy labels. It relies on a cross-modal consistent cost function that blends information from different modalities to provide a more precise transport cost, so that the corrected labels align the semantics of multimodal samples more accurately.
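To make the idea concrete, here is a minimal NumPy sketch of OT-based soft label correction in the spirit of this component. The function names, the uniform marginals, the 0.5 cost-blending weight, and the interpolation coefficient alpha are all illustrative assumptions, and it uses a standard full-marginal Sinkhorn solver rather than the partial OT formulation used by UOT-RCL, which transports only a fraction of the total mass.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.05, n_iters=200):
    """Entropic-regularized OT (Sinkhorn iterations). Returns the transport plan."""
    K = np.exp(-C / reg)                 # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u + 1e-12)        # scale columns to match marginal b
        u = a / (K @ v + 1e-12)          # scale rows to match marginal a
    return u[:, None] * K * v[None, :]   # transport plan, shape (N, K)

def soft_label_correction(img_feat, txt_feat, prototypes, noisy_onehot, alpha=0.5):
    """Illustrative OT-based label correction (not the paper's exact method).

    img_feat, txt_feat : (N, d) L2-normalized features from the two modalities
    prototypes         : (K, d) L2-normalized class prototypes
    noisy_onehot       : (N, K) observed, possibly noisy labels
    """
    # Cross-modal consistent cost: blend distances from both modalities, so a
    # class is only cheap for a sample when both views agree with it.
    cost_img = 1.0 - img_feat @ prototypes.T      # cosine distance, (N, K)
    cost_txt = 1.0 - txt_feat @ prototypes.T
    C = 0.5 * (cost_img + cost_txt)

    a = np.full(C.shape[0], 1.0 / C.shape[0])     # uniform mass over samples
    b = np.full(C.shape[1], 1.0 / C.shape[1])     # uniform mass over classes
    plan = sinkhorn(a, b, C)

    # Row-normalize the plan into soft class assignments and interpolate with
    # the observed labels to correct them progressively.
    soft = plan / plan.sum(axis=1, keepdims=True)
    return alpha * noisy_onehot + (1.0 - alpha) * soft
```

In this reading, the transport plan plays the role of a pseudo-label distribution, and alpha controls how aggressively the observed labels are overwritten as training proceeds.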

The second component of UOT-RCL is an OT-based relation alignment that narrows the discrepancy between modalities. It infers semantic-level cross-modal matching, establishing meaningful connections between samples from different modalities. Both components leverage the inherent correlation among multimodal data to construct effective cost functions.
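The sketch below illustrates one way such a relation alignment could look within a mini-batch, again as an assumption-laden simplification rather than the paper's actual formulation. It reuses the same entropic Sinkhorn solver as above, blends feature distance with label agreement into a single cost (the 0.5 weight and uniform marginals are assumptions), and returns a soft image-to-text matching matrix.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.05, n_iters=200):
    # Same entropic Sinkhorn solver as in the previous sketch.
    K, u = np.exp(-C / reg), np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u + 1e-12)
        u = a / (K @ v + 1e-12)
    return u[:, None] * K * v[None, :]

def relation_alignment_targets(img_feat, txt_feat, labels_soft, reg=0.05):
    """Illustrative OT-based relation alignment over a mini-batch.

    img_feat, txt_feat : (B, d) L2-normalized features from the two modalities
    labels_soft        : (B, K) corrected soft labels encoding semantic relations
    Returns a (B, B) soft matching matrix between image and text samples.
    """
    # Cost combines instance-level feature dissimilarity with semantic-relation
    # dissimilarity, so pairs from the same semantic category become cheap to
    # match even when their raw features are far apart.
    feat_cost = 1.0 - img_feat @ txt_feat.T          # (B, B) cosine distance
    sem_cost = 1.0 - labels_soft @ labels_soft.T     # (B, B) label disagreement
    C = 0.5 * (feat_cost + sem_cost)

    B = C.shape[0]
    a = np.full(B, 1.0 / B)    # uniform marginals over images and texts
    b = np.full(B, 1.0 / B)
    plan = sinkhorn(a, b, C, reg)

    # Row-normalize the plan into per-image distributions over texts; these can
    # serve as soft targets for a cross-modal matching or contrastive loss.
    return plan / plan.sum(axis=1, keepdims=True)
```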

Relation to Multimedia Information Systems

The UOT-RCL framework has strong ties to the field of multimedia information systems. Multimedia information systems deal with managing and retrieving different types of media, including images, videos, and text. Cross-modal retrieval is a fundamental problem in this field, as it enables users to search and retrieve relevant information from multiple modalities.

UOT-RCL adds to the existing techniques and methods used in multimedia information systems by providing a framework specifically designed for robust cross-modal retrieval. By addressing the challenges of aligning semantics and narrowing the gap between modalities, UOT-RCL improves the retrieval performance of multimodal data. This has practical implications for multimedia information systems, as it allows for more accurate and efficient retrieval of relevant information across different types of media.

Connections to Animations, Artificial Reality, Augmented Reality, and Virtual Realities

While the UOT-RCL framework itself does not directly deal with animations, artificial reality, augmented reality, or virtual realities, its principles and techniques can have broader implications in these fields.

Animations, artificial reality, augmented reality, and virtual realities often involve the integration of different modalities, such as visual and auditory cues. Cross-modal retrieval techniques like UOT-RCL can help improve the integration and synchronization of these modalities, leading to more immersive and realistic experiences. The framework’s focus on aligning semantics and narrowing the gap between modalities also contributes to creating more coherent and meaningful experiences in these fields.

Furthermore, the UOT-RCL framework’s reliance on unimodal and multimodal data also aligns with the data sources commonly used in animations, artificial reality, augmented reality, and virtual realities. As these fields continue to advance, the ability to retrieve and manage multimodal data effectively becomes increasingly important. The UOT-RCL framework’s approach to handling noisy labels and leveraging inherent correlations can be valuable in improving the quality and reliability of the data used in these fields.