Text-to-image person re-identification (TIReID) aims to retrieve the target
person from an image gallery via a textual description query. Recently,
pre-trained vision-language models like CLIP have attracted significant
attention and have been widely utilized for this task due to their robust
capacity for semantic concept learning and rich multi-modal knowledge. However,
recent CLIP-based TIReID methods commonly rely on direct fine-tuning of the
entire network to adapt the CLIP model to the TIReID task. Although these
methods achieve competitive performance, they are suboptimal because they
require domain adaptation and task adaptation to happen simultaneously. To
address this issue, we decouple these two processes during training.
Specifically, we introduce the prompt tuning strategy to enable domain
adaptation and propose a two-stage training approach to disentangle domain
adaptation from task adaptation. In the first stage, we freeze the two CLIP
encoders and solely optimize the prompts to alleviate the domain gap between
CLIP's original training data and the downstream task. In the second stage, we
keep the prompts fixed and fine-tune the CLIP model so that it prioritizes
capturing the fine-grained information that the TIReID task requires.
Finally, we evaluate the effectiveness of our method on three widely used
datasets. Compared to direct fine-tuning of the entire network, our method
achieves significant improvements.

Text-to-Image Person Re-Identification and the Role of CLIP

Text-to-image person re-identification (TIReID) is a challenging task that involves retrieving target individuals from an image gallery using textual description queries. Recently, pre-trained vision-language models like CLIP have gained attention and have been widely applied to this task. CLIP models excel in semantic concept learning and possess rich multi-modal knowledge, making them suitable for TIReID.
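At a high level, retrieval works by embedding the text query and every gallery image into a shared space and ranking the images by similarity to the query. The following is a minimal sketch of that ranking step with toy 3-dimensional embeddings; the values and identifiers are illustrative only (real CLIP embeddings are 512-dimensional and produced by the text and image encoders):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(text_embedding, gallery):
    """Rank gallery images by similarity to the text query embedding."""
    ranked = sorted(gallery,
                    key=lambda item: cosine_similarity(text_embedding, item[1]),
                    reverse=True)
    return [image_id for image_id, _ in ranked]

# Toy query and gallery embeddings (hypothetical values).
query = [0.9, 0.1, 0.0]
gallery = [("person_A", [0.8, 0.2, 0.1]),
           ("person_B", [0.1, 0.9, 0.2]),
           ("person_C", [0.2, 0.1, 0.9])]

print(retrieve(query, gallery))  # person_A ranks first
```

The quality of the retrieval therefore depends entirely on how well the two encoders place matching text and image pairs close together in the shared space, which is what the training strategy below targets.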

However, existing CLIP-based TIReID methods face a limitation: they often directly fine-tune the entire CLIP network to adapt it for the TIReID task. While these methods exhibit competitive performance, they are suboptimal because they require simultaneous domain adaptation and task adaptation.

To overcome this limitation, a new approach is proposed that decouples the processes of domain adaptation and task adaptation during the training stage. This approach consists of two stages:

Prompt Tuning for Domain Adaptation

In the first stage, the two encoders from the CLIP model are frozen, and the focus shifts to optimizing the prompts. The prompts serve as instructions that steer the model toward generating relevant representations. Optimizing only the prompts addresses the domain gap between CLIP's original training data and the downstream TIReID data, allowing effective domain adaptation without interfering with task adaptation.
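Concretely, prompt tuning in this setting typically follows the CoOp-style recipe: learnable context vectors are prepended to the frozen token embeddings of the description before they enter the text encoder, and only those context vectors receive gradient updates. A toy illustration of the prompt construction (all names and values are hypothetical, not taken from the paper):

```python
def build_prompt(context_vectors, token_embeddings):
    """Prepend learnable context vectors to frozen token embeddings."""
    return context_vectors + token_embeddings  # list concatenation

# Two learnable 2-d context vectors (updated by gradients in stage one).
context = [[0.1, 0.2], [0.3, 0.4]]
# Two frozen word embeddings from the tokenized description.
tokens = [[1.0, 0.0], [0.0, 1.0]]

prompt = build_prompt(context, tokens)
print(len(prompt))  # 4 vectors are fed to the frozen text encoder
```

Because the encoder weights never change in this stage, the context vectors alone absorb the shift from CLIP's web-scale pre-training data to person-description data.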

Fine-Tuning for Task Adaptation

In the second stage, the prompts are fixed and the CLIP model itself is fine-tuned. The objective is to capture the fine-grained information the TIReID task requires, improving the model's ability to recognize and identify individuals from textual queries.
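The two stages amount to two selective-update phases over the same parameter set. Below is a deliberately minimal pure-Python sketch of that schedule, where scalar "weights" and hand-set gradients stand in for the real encoders, prompts, gradients, and losses:

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD update, but only to the parameters named in `trainable`."""
    return {name: (value - lr * grads[name] if name in trainable else value)
            for name, value in params.items()}

params = {"encoder_w": 1.0, "prompt": 0.0}
grads = {"encoder_w": 0.5, "prompt": 0.5}  # pretend gradients

# Stage 1: encoders frozen, only the prompt is optimized (domain adaptation).
after_stage1 = sgd_step(params, grads, trainable={"prompt"})
assert after_stage1["encoder_w"] == params["encoder_w"]  # encoder untouched

# Stage 2: prompt fixed, only the encoder is fine-tuned (task adaptation).
after_stage2 = sgd_step(after_stage1, grads, trainable={"encoder_w"})
assert after_stage2["prompt"] == after_stage1["prompt"]  # prompt untouched
```

In a real implementation the same effect is usually achieved by disabling gradients on the frozen parameter groups (e.g. `requires_grad = False` in PyTorch) rather than filtering updates by name, but the separation of the two phases is the same.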

The proposed method’s effectiveness is demonstrated through evaluations on three widely used datasets. Compared to direct fine-tuning, the two-stage training strategy achieves significant improvements.

Multi-disciplinary Nature of the Concepts

This research combines concepts from computer vision, natural language processing, and machine learning. The integration of vision and language enables the model to understand textual descriptions and translate them into meaningful visual representations. The use of pre-trained models like CLIP leverages the knowledge from large-scale datasets, fostering transfer learning across domains. Additionally, the training strategy involves both domain adaptation, essential for handling domain shifts between CLIP and TIReID, and task adaptation, targeting the specific requirements of person re-identification.

With further advancements in multi-modal learning and representation techniques, text-to-image person re-identification has the potential to find applications in surveillance systems, social media analysis, and more. Future research efforts should focus on exploring novel architectures, improving training methodologies, and expanding the datasets to enhance the accuracy and robustness of TIReID systems.

Read the original article