Existing multimodal sentiment analysis tasks rely heavily on the
assumption that the training and test sets consist of complete multimodal data,
yet this assumption often fails to hold: multimodal data are frequently
incomplete in real-world scenarios. A multimodal model that remains robust
when modalities are randomly missing is therefore highly desirable. Recently,
CLIP-based multimodal foundational models have demonstrated impressive
performance on numerous multimodal tasks by learning the aligned cross-modal
semantics of image and text pairs, but these foundational models are
still unable to directly address scenarios involving modality absence. To
alleviate this issue, we propose a simple and effective framework, namely TRML,
Toward Robust Multimodal Learning using Multimodal Foundational Models. TRML
employs generated virtual modalities to replace missing modalities, and aligns
the semantic spaces between the generated and missing modalities. Concretely,
we design a missing modality inference module that generates virtual modalities
to replace the missing ones. We also design a semantic matching learning
module to align the semantic spaces of the generated and missing modalities.
Prompted by the complete modality, our model captures the semantics of missing
modalities by leveraging the aligned cross-modal semantic space. Experiments
demonstrate the superiority of our approach on three multimodal sentiment
analysis benchmark datasets, CMU-MOSI, CMU-MOSEI, and MELD.

In this article, the authors propose a framework called TRML (Toward Robust Multimodal Learning using Multimodal Foundational Models) to address the issue of incomplete multimodal data in real-world scenarios. While existing multimodal sentiment analysis tasks assume complete multimodal data, this is often not the case in practice. Therefore, a robust multimodal model that can handle randomly missing modalities is highly desirable.

The authors highlight the success of CLIP-based multimodal foundational models in various multimodal tasks by learning the aligned cross-modal semantics of image and text pairs. However, these models are unable to directly tackle scenarios involving modality absence. To overcome this limitation, TRML introduces the concept of generated virtual modalities to replace missing modalities and align the semantic spaces between them.
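To make the idea of an aligned cross-modal semantic space more concrete, the sketch below shows how a CLIP-style model embeds an image and several texts into a shared space in which matching pairs receive the highest similarity. It uses the Hugging Face transformers CLIP API purely as an illustration; the checkpoint name and example inputs are assumptions, and this is not the authors' implementation.

```python
# Illustrative only: a CLIP-style aligned image-text embedding space,
# using the Hugging Face transformers API (not the TRML code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                 # any local image file
texts = ["a happy crowd", "an empty street"]      # candidate descriptions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image
# embedding and each text embedding in the shared semantic space.
print(outputs.logits_per_image.softmax(dim=-1))
```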

The framework consists of two main modules: the missing modality inference module and the semantic matching learning module. The missing modality inference module generates virtual modalities to replace the missing ones. This helps capture the semantics of the missing modalities by leveraging the aligned cross-modal semantic space. The semantic matching learning module is designed to align the semantic spaces of the generated and missing modalities, ensuring coherence in the representation.
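As a rough illustration of how these two modules could fit together, the PyTorch sketch below pairs a small generator that infers a virtual embedding for a missing modality with a cosine-based matching loss that pulls the generated embedding toward the real one. The module names, layer sizes, and loss choice are assumptions made for illustration, not the paper's exact design.

```python
# A minimal PyTorch sketch of the two-module idea described above.
# Architecture details and the cosine-based matching loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MissingModalityInference(nn.Module):
    """Generates a virtual embedding for a missing modality
    from the embedding of an available one (e.g. text -> image)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.generator = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, available_emb: torch.Tensor) -> torch.Tensor:
        return self.generator(available_emb)


def semantic_matching_loss(virtual_emb: torch.Tensor,
                           real_emb: torch.Tensor) -> torch.Tensor:
    """Aligns the generated (virtual) and real modality embeddings
    by maximizing their cosine similarity."""
    virtual_emb = F.normalize(virtual_emb, dim=-1)
    real_emb = F.normalize(real_emb, dim=-1)
    return 1.0 - (virtual_emb * real_emb).sum(dim=-1).mean()


# Toy usage: during training the real embedding supervises the generator;
# at test time the virtual embedding stands in for the missing modality.
text_emb = torch.randn(8, 512)    # embeddings of the available modality
image_emb = torch.randn(8, 512)   # modality that may be missing at test time
infer = MissingModalityInference(dim=512)
loss = semantic_matching_loss(infer(text_emb), image_emb)
loss.backward()
```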

To evaluate the effectiveness of TRML, the authors conducted experiments on three benchmark datasets for multimodal sentiment analysis: CMU-MOSI, CMU-MOSEI, and MELD. The results demonstrate the superiority of TRML compared to existing approaches.

This research showcases the multi-disciplinary nature of the concepts involved. It combines elements of natural language processing, computer vision, and machine learning to address the challenges of handling incomplete multimodal data. By leveraging CLIP-based multimodal foundational models and introducing virtual modalities, TRML expands the capabilities of existing models in tackling real-world scenarios with randomly missing modalities.

Future advancements in this field could further explore the integration of other modalities, such as audio or sensory data, and investigate the impact of missing modality inference on different types of multimodal tasks. Additionally, exploring the scalability of the proposed framework to larger datasets and real-time applications would be valuable in further validating its robustness and efficiency.
