arXiv:2412.18416v1 Announce Type: new
Abstract: Current conversational recommendation systems focus predominantly on text. However, real-world recommendation settings are generally multimodal, causing a significant gap between existing research and practical applications. To address this issue, we propose Muse, the first multimodal conversational recommendation dataset. Muse comprises 83,148 utterances from 7,000 conversations centered around the Clothing domain. Each conversation contains comprehensive multimodal interactions, rich elements, and natural dialogues. Data in Muse are automatically synthesized by a multi-agent framework powered by multimodal large language models (MLLMs). It innovatively derives user profiles from real-world scenarios rather than depending on manual design and historical data, which improves scalability; it then performs conversation simulation and optimization. Both human and LLM evaluations demonstrate the high quality of conversations in Muse. Additionally, fine-tuning experiments on three MLLMs demonstrate Muse's learnable patterns for recommendations and responses, confirming its value for multimodal conversational recommendation. Our dataset and code are available at https://anonymous.4open.science/r/Muse-0086.

Multimodal Conversational Recommendation Systems: Bridging the Gap Between Research and Practice

Current conversational recommendation systems focus primarily on text-based interactions, but real-world recommendation settings are generally multimodal, combining modalities such as text and images. This mismatch leaves a significant gap between existing research and practical applications. To address it, the authors introduce Muse, the first multimodal conversational recommendation dataset.

Muse consists of 83,148 utterances across 7,000 conversations centered on the Clothing domain. What sets Muse apart is its comprehensive multimodal interactions, rich elements, and natural dialogues. The dataset is automatically synthesized by a multi-agent framework powered by multimodal large language models (MLLMs). Rather than relying on manual design or historical interaction data, the framework derives user profiles from real-world scenarios, which makes the pipeline more scalable; it then simulates and optimizes each conversation, as sketched below.
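To make the synthesis process concrete, here is a minimal sketch of what such a multi-agent loop might look like. Everything in it is an assumption for illustration: the call_mllm function stands in for whatever MLLM API is used, and the profile fields, prompts, and turn structure are hypothetical rather than the authors' actual implementation.

```python
# Illustrative multi-agent dialogue synthesis loop (not the paper's code).
from dataclasses import dataclass

@dataclass
class UserProfile:
    # Derived from a real-world scenario description, not purchase history.
    scenario: str            # e.g. "packing for a winter hiking trip"
    preferences: list[str]   # e.g. ["waterproof", "bright colors"]

@dataclass
class Turn:
    speaker: str                  # "user" or "recommender"
    text: str
    image_url: str | None = None  # multimodal turns may attach an item image

def call_mllm(system_prompt: str, history: list[Turn]) -> str:
    """Hypothetical stand-in for a multimodal LLM API call."""
    # A real pipeline would send the system prompt plus the dialogue history
    # (text and item images) to an MLLM and return its generated reply.
    return f"[generated reply for turn {len(history) + 1}]"

def simulate_conversation(profile: UserProfile, max_turns: int = 10) -> list[Turn]:
    """Alternate a user agent and a recommender agent to synthesize one dialogue."""
    user_sys = (f"You are a shopper in this scenario: {profile.scenario}. "
                f"You care about: {', '.join(profile.preferences)}.")
    rec_sys = ("You are a clothing recommender. Ask clarifying questions, "
               "then suggest items, attaching product images where relevant.")
    history: list[Turn] = []
    for i in range(max_turns):
        if i % 2 == 0:  # the user agent speaks on even turns
            history.append(Turn("user", call_mllm(user_sys, history)))
        else:           # the recommender agent replies on odd turns
            history.append(Turn("recommender", call_mllm(rec_sys, history)))
    return history

dialogue = simulate_conversation(
    UserProfile(scenario="shopping for a rainy-season commute",
                preferences=["waterproof", "lightweight"]))
print(len(dialogue), "turns synthesized")
```

Separating the user and recommender into distinct system prompts mirrors the multi-agent idea: each agent sees only its own role, so the dialogue emerges from their interaction rather than from a single scripted template.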

Within this framework, each conversation is first simulated and then refined through an optimization step, keeping the dialogues close to real-world recommendation scenarios. The quality of the resulting conversations is verified through evaluations by both human annotators and LLMs, and both demonstrate the high quality of the Muse dataset.
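This summary does not spell out the evaluation protocol, but an LLM-based quality check typically looks something like the sketch below. The rubric dimensions, prompt wording, and call_llm client are all assumptions, not the paper's exact criteria.

```python
# Sketch of an LLM-as-judge quality check for synthesized dialogues.
# Rubric dimensions and prompt wording are illustrative assumptions.
import json

RUBRIC = ["naturalness", "coherence", "informativeness"]  # assumed dimensions

JUDGE_PROMPT = """Rate the following conversation from 1 (poor) to 5 (excellent)
on each dimension: {dims}. Reply with only a JSON object mapping each
dimension to its score.

Conversation:
{dialogue}"""

def judge(dialogue: str, call_llm) -> dict[str, int]:
    """Ask a judge LLM for per-dimension scores; call_llm is any text-in/text-out client."""
    prompt = JUDGE_PROMPT.format(dims=", ".join(RUBRIC), dialogue=dialogue)
    return json.loads(call_llm(prompt))

def mean_scores(dialogues: list[str], call_llm) -> dict[str, float]:
    """Average each rubric dimension over a sample of synthesized conversations."""
    totals = {dim: 0.0 for dim in RUBRIC}
    for dlg in dialogues:
        for dim, score in judge(dlg, call_llm).items():
            totals[dim] += score
    return {dim: total / len(dialogues) for dim, total in totals.items()}
```

Averaging per-dimension scores over a sample is a common way to compare synthetic-data quality against a human-evaluated baseline.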

Furthermore, the authors fine-tune three different MLLMs on Muse, showing that the dataset contains learnable patterns for both recommendations and responses. These experiments confirm Muse's value for training multimodal conversational recommendation models.
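As a rough illustration of what such fine-tuning involves, the sketch below converts a Muse-style conversation into a chat-format supervised fine-tuning record, with recommender turns as training targets. The turn schema and message layout are assumptions for illustration; the paper's actual format and training setup may differ.

```python
# Sketch: formatting a Muse-style dialogue for supervised fine-tuning of an MLLM.
# The turn schema and message layout below are illustrative assumptions.
import json

def to_chat_example(turns: list[dict]) -> dict:
    """Convert one conversation into a chat-format SFT example.

    Each turn is {"speaker": ..., "text": ..., "image": optional path or URL}.
    Recommender turns become training targets; user turns provide context.
    """
    messages = []
    for turn in turns:
        role = "assistant" if turn["speaker"] == "recommender" else "user"
        content = [{"type": "text", "text": turn["text"]}]
        if turn.get("image"):  # attach item images so the model learns visual grounding
            content.append({"type": "image", "image": turn["image"]})
        messages.append({"role": role, "content": content})
    return {"messages": messages}

# A toy two-turn conversation becomes one training record.
toy = [
    {"speaker": "user", "text": "I need a jacket for rainy commutes."},
    {"speaker": "recommender", "text": "This waterproof shell could work.",
     "image": "items/jacket_123.jpg"},
]
print(json.dumps(to_chat_example(toy), indent=2))
```

Records in this interleaved text-and-image form can then be fed to a standard supervised fine-tuning loop for whichever MLLM is being adapted.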

By pairing natural-language dialogue with item images, Muse also reflects the multi-disciplinary nature of multimodal conversational recommendation, sitting at the intersection of recommender systems, dialogue systems, and multimedia information systems.

To summarize, Muse is an innovative and comprehensive multimodal conversational recommendation dataset that bridges the gap between research and practical applications. Its combination of multimodal interactions and natural dialogues makes it a valuable resource for training and evaluating recommendation systems, and researchers and practitioners working on multimodal conversational recommendation stand to benefit from it.

Dataset and code: https://anonymous.4open.science/r/Muse-0086
