arXiv:2502.18495v1
Abstract: Composed Image Retrieval (CIR) is an emerging yet challenging task that allows users to search for target images using a multimodal query, comprising a reference image and a modification text specifying the user’s desired changes to the reference image. Given its significant academic and practical value, CIR has become a rapidly growing area of interest in the computer vision and machine learning communities, particularly with the advances in deep learning. To the best of our knowledge, there is currently no comprehensive review of CIR to provide a timely overview of this field. Therefore, we synthesize insights from over 120 publications in top conferences and journals, including ACM TOIS, SIGIR, and CVPR. In particular, we systematically categorize existing supervised CIR and zero-shot CIR models using a fine-grained taxonomy. For a comprehensive review, we also briefly discuss approaches for tasks closely related to CIR, such as attribute-based CIR and dialog-based CIR. Additionally, we summarize benchmark datasets for evaluation and analyze existing supervised and zero-shot CIR methods by comparing experimental results across multiple datasets. Furthermore, we present promising future directions in this field, offering practical insights for researchers interested in further exploration.

Composed Image Retrieval: A Comprehensive Review

Introduction

Composed Image Retrieval (CIR) is a challenging task that allows users to search for target images using a multimodal query. This query consists of a reference image and a modification text that specifies the user’s desired changes to the reference image. Because of its academic and practical value, CIR has attracted rapidly growing interest in the computer vision and machine learning communities.

Multi-disciplinary Nature of CIR

CIR is a multi-disciplinary field that requires expertise in various domains. It combines principles from computer vision, natural language processing, and information retrieval. The computer vision aspect involves understanding and analyzing the visual content of images, while natural language processing helps to interpret and analyze the modification text. Information retrieval techniques are utilized to match the query with relevant images in the database.
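To make the interplay of the three areas concrete, the sketch below composes a query from a reference-image embedding and a modification-text embedding, then ranks a small gallery by cosine similarity. It is only an illustration under stated assumptions: the three-dimensional toy embeddings, the additive fusion in compose_query, and all function and variable names are hypothetical stand-ins, not a method taken from the review. Real systems would obtain the embeddings from learned image and text encoders and typically use a learned fusion step.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def compose_query(ref_image_emb, text_emb):
    """Fuse the two query parts by element-wise addition.

    Assumption: additive fusion is a deliberately simple placeholder for
    whatever composition module a real CIR model would learn.
    """
    fused = [a + b for a, b in zip(ref_image_emb, text_emb)]
    return l2_normalize(fused)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_emb, gallery, top_k=3):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine(query_emb, emb)) for name, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (hypothetical); dimensions read as [dress, red, blue].
reference_image = [1.0, 0.0, 1.0]     # a blue dress
modification_text = [0.0, 1.0, -1.0]  # "make it red instead of blue"
gallery = {
    "blue_dress": [1.0, 0.0, 1.0],
    "red_dress": [1.0, 1.0, 0.0],
    "red_shoe": [0.0, 1.0, 0.0],
}

query = compose_query(reference_image, modification_text)
ranking = retrieve(query, gallery)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup the composed query lands closest to the red dress, which matches the intent of the modification text. The ranking step is the information-retrieval part; the two embeddings stand in for the vision and language parts.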

Advances in Deep Learning

The recent advances in deep learning have significantly impacted CIR research. Deep learning models, especially those based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have shown remarkable performance in various computer vision tasks. These models have also been successfully applied to CIR, enabling more accurate image retrieval based on both visual and textual information.

Existing CIR Models

The review synthesizes insights from over 120 publications in top conferences and journals, including ACM TOIS, SIGIR, and CVPR. The authors systematically categorize existing supervised and zero-shot CIR models using a fine-grained taxonomy, which gives an overview of the different methodologies employed in CIR.

Related Tasks

In addition to CIR, the review also briefly discusses related tasks such as attribute-based CIR and dialog-based CIR. Attribute-based CIR focuses on retrieving images based on specific attributes or characteristics specified in the modification text. Dialog-based CIR involves a conversational setting between the user and the system to search for images based on a series of queries and responses.

Evaluation and Analysis

The review summarizes benchmark datasets used for evaluating CIR models. It also compares the experimental results across multiple datasets for both supervised and zero-shot CIR methods. This analysis provides valuable insights into the strengths and weaknesses of different approaches and highlights areas for improvement in future research.
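As an illustration of how such comparisons are often scored, the sketch below computes Recall@K, a metric commonly reported for retrieval benchmarks. The summary above does not name a specific metric, so treat the choice of Recall@K as an assumption. The function name recall_at_k and the toy rankings are hypothetical.

```python
def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target appears among the top-k results.

    rankings: one ranked list of candidate ids per query, best first.
    targets:  the ground-truth target id for each query.
    """
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not targets:
        raise ValueError("at least one query is required")
    hits = sum(
        1 for ranked, target in zip(rankings, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Toy example (hypothetical data): three queries that share the target "a".
rankings = [
    ["a", "b", "c"],  # target ranked 1st
    ["b", "a", "c"],  # target ranked 2nd
    ["c", "b", "a"],  # target ranked 3rd
]
targets = ["a", "a", "a"]

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(rankings, targets, k):.3f}")
```

Reporting the metric at several values of K shows both how often a method places the target first and how often it places it near the top.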

Promising Future Directions

The review concludes by discussing promising future directions for CIR research. It suggests potential areas of exploration such as incorporating user feedback to improve retrieval accuracy, developing novel approaches for combining visual and textual modalities, and applying CIR in real-world scenarios such as e-commerce and content creation.

Conclusion

This comprehensive review of Composed Image Retrieval (CIR) provides a timely overview of this emerging field. It highlights the multi-disciplinary nature of CIR and its relation to computer vision, natural language processing, and information retrieval. The review categorizes and analyzes existing CIR models, discusses related tasks, and presents future research directions. These insights will be valuable to researchers interested in further exploration of CIR and its applications in multimedia information systems, animation, and artificial, augmented, and virtual reality.
