arXiv:2403.10943v1
Abstract: Multimodal intent recognition poses significant challenges, requiring the incorporation of non-verbal modalities from real-world contexts to enhance the comprehension of human intentions. Existing benchmark datasets are limited in scale and suffer from difficulties in handling out-of-scope samples that arise in multi-turn conversational interactions. We introduce MIntRec2.0, a large-scale benchmark dataset for multimodal intent recognition in multi-party conversations. It contains 1,245 dialogues with 15,040 samples, each annotated within a new intent taxonomy of 30 fine-grained classes. Besides 9,304 in-scope samples, it also includes 5,736 out-of-scope samples appearing in multi-turn contexts, which naturally occur in real-world scenarios. Furthermore, we provide comprehensive information on the speakers in each utterance, enriching its utility for multi-party conversational research. We establish a general framework supporting the organization of single-turn and multi-turn dialogue data, modality feature extraction, multimodal fusion, as well as in-scope classification and out-of-scope detection. Evaluation benchmarks are built using classic multimodal fusion methods, ChatGPT, and human evaluators. While existing methods incorporating nonverbal information yield improvements, effectively leveraging context information and detecting out-of-scope samples remains a substantial challenge. Notably, large language models exhibit a significant performance gap compared to humans, highlighting the limitations of machine learning methods in the cognitive intent understanding task. We believe that MIntRec2.0 will serve as a valuable resource, providing a pioneering foundation for research in human-machine conversational interactions, and significantly facilitating related applications. The full dataset and codes are available at https://github.com/thuiar/MIntRec2.0.
Introduction
Multimodal intent recognition is a complex task: understanding human intentions requires incorporating non-verbal modalities alongside language in real-world contexts. Progress depends on large-scale benchmark datasets that capture the intricacies of multi-party conversational interactions, yet existing datasets are limited in scale and struggle to handle the out-of-scope samples that arise in multi-turn conversations.
MIntRec2.0: A Comprehensive Benchmark Dataset
MIntRec2.0 addresses these limitations by providing a large-scale benchmark for multimodal intent recognition in multi-party conversations. The dataset consists of 1,245 dialogues with a total of 15,040 samples, each annotated within a new intent taxonomy of 30 fine-grained classes. Alongside 9,304 in-scope samples, it includes 5,736 out-of-scope samples that naturally occur in multi-turn contexts, and every utterance is accompanied by speaker information to support multi-party conversational research.
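To make the dataset's structure concrete, the sketch below shows one way a single annotated utterance might be represented in code. The field names are illustrative assumptions rather than the dataset's actual schema; consult the official repository for the real file layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MIntRecSample:
    """One annotated utterance from a multi-party dialogue (hypothetical schema)."""
    dialogue_id: str             # which of the 1,245 dialogues the utterance belongs to
    turn_index: int              # position of the utterance within its dialogue
    speaker_id: str              # speaker annotation provided for multi-party research
    text: str                    # transcribed utterance
    video_path: str              # clip carrying the visual modality
    audio_path: str              # clip carrying the acoustic modality
    intent_label: Optional[str]  # one of 30 fine-grained intents, or None if out-of-scope
    out_of_scope: bool           # True for the 5,736 out-of-scope samples


def is_in_scope(sample: MIntRecSample) -> bool:
    # In-scope samples (9,304 in total) carry one of the 30 taxonomy labels.
    return not sample.out_of_scope and sample.intent_label is not None
```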
The Importance of Interdisciplinarity
Multimodal intent recognition is an inherently interdisciplinary problem, requiring expertise in areas such as natural language processing, computer vision, machine learning, and cognitive science. By incorporating non-verbal modalities and contextual information, researchers can build more accurate and comprehensive models of human intention in conversational interactions.
Relevance to Multimedia Information Systems
Multimedia information systems play a crucial role in multimodal intent recognition. Integrating modalities such as text, video, and audio enables a more comprehensive understanding of human intentions. The MIntRec2.0 dataset provides a valuable resource for exploring new techniques and algorithms in this area, and offers opportunities for advances in multimodal fusion, feature extraction, and classification.
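As a concrete illustration of multimodal fusion, the following sketch projects per-modality features into a shared space and concatenates them before classification. It is a minimal, generic baseline in the spirit of the classic fusion methods the benchmark evaluates, not the paper's own architecture; the feature dimensions and layer sizes are assumptions.

```python
import torch
import torch.nn as nn


class ConcatFusionClassifier(nn.Module):
    """Illustrative late-fusion baseline: project each modality, concatenate, classify."""

    def __init__(self, text_dim=768, video_dim=256, audio_dim=128,
                 hidden_dim=256, num_intents=30):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(3 * hidden_dim, num_intents),
        )

    def forward(self, text_feat, video_feat, audio_feat):
        # Fuse by concatenating per-modality projections, then score the 30 intents.
        fused = torch.cat([
            self.text_proj(text_feat),
            self.video_proj(video_feat),
            self.audio_proj(audio_feat),
        ], dim=-1)
        return self.classifier(fused)
```

Concatenation is only the simplest option; attention-based or tensor-based fusion methods replace the `torch.cat` step with a learned interaction between modalities.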
Animation, Artificial Reality, Augmented Reality, and Virtual Reality
In the context of animation, artificial reality, augmented reality, and virtual reality, multimodal intent recognition can greatly enhance user experiences. By understanding human intentions through multiple modalities, these technologies can tailor their responses and interactions to users' needs and preferences. In virtual reality environments, for example, accurately recognizing and interpreting human intentions can enable more realistic and immersive experiences.
Evaluation and Future Directions
The MIntRec2.0 dataset provides a solid foundation for evaluating multimodal intent recognition: its benchmarks cover classic multimodal fusion methods, large language models such as ChatGPT, and human evaluators. The results also highlight the challenges that remain, particularly in effectively leveraging context information and detecting out-of-scope samples. Notably, large language models still exhibit a significant performance gap compared to humans, underscoring the limitations of current machine learning methods in cognitive intent understanding.
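One common, generic approach to out-of-scope detection is to threshold the model's confidence: if the highest softmax probability over the 30 in-scope intents is low, the sample is flagged as out-of-scope. The sketch below illustrates this heuristic; it is not the benchmark's specific detection method, and the threshold value is an assumption that would normally be tuned on validation data.

```python
import torch
import torch.nn.functional as F


def detect_out_of_scope(logits: torch.Tensor, threshold: float = 0.5):
    """Flag samples whose maximum softmax confidence falls below a threshold.

    A generic confidence-based heuristic, not the paper's method; the 0.5
    threshold is illustrative and would be tuned on held-out data.
    """
    probs = F.softmax(logits, dim=-1)          # (batch, 30) intent probabilities
    confidence, predicted = probs.max(dim=-1)  # highest class probability per sample
    is_oos = confidence < threshold            # low confidence -> treat as out-of-scope
    return predicted, is_oos
```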
Future research in this field could focus on developing more advanced multimodal fusion methods, improving context understanding, and addressing the challenges of out-of-scope detection. Efforts to bridge the performance gap between machine learning methods and humans could likewise lead to significant advances in multimodal intent recognition.
Conclusion
The MIntRec2.0 dataset serves as a valuable resource for researchers and practitioners working on human-machine conversational interaction. By providing a large-scale benchmark and comprehensive annotations for multi-party conversations, it lays the groundwork for advances in multimodal intent recognition. The interdisciplinary nature of this field, along with its connections to multimedia information systems, animation, artificial reality, augmented reality, and virtual reality, further highlights its potential to transform a wide range of domains and applications.