arXiv:2505.03420v1 Announce Type: new
Abstract: Hallucinations in vision-language models (VLMs) hinder reliability and real-world applicability, usually stemming from distribution shifts between pretraining data and test samples. Existing solutions, such as retraining or fine-tuning on additional data, demand significant computational resources and labor-intensive data collection, while ensemble-based methods incur additional costs by introducing auxiliary VLMs. To address these challenges, we propose a novel test-time adaptation framework using reinforcement learning to mitigate hallucinations during inference without retraining or any auxiliary VLMs. By updating only the learnable parameters in the layer normalization of the language model (approximately 0.003% of the model parameters), our method reduces distribution shifts between test samples and pretraining samples. A CLIP-based hallucination evaluation model is proposed to provide dual rewards to VLMs. Experimental results demonstrate a 15.4% and 17.3% reduction in hallucination rates on LLaVA and InstructBLIP, respectively. Our approach outperforms state-of-the-art baselines with a 68.3% improvement in hallucination mitigation, demonstrating its effectiveness.

Expert Commentary: Mitigating Hallucinations in Vision-Language Models

Hallucinations are a significant obstacle to the reliability and real-world applicability of vision-language models (VLMs). They typically arise from distribution shifts between the data used for pretraining and the samples encountered at test time. Previous mitigation strategies, such as retraining or fine-tuning on additional data, are computationally expensive and require labor-intensive data collection, while ensemble-based methods that introduce auxiliary VLMs add further inference cost.

The paper proposes a test-time adaptation framework that uses reinforcement learning to mitigate hallucinations during inference, without retraining and without auxiliary VLMs. Only the learnable parameters in the layer normalization of the language model are updated, roughly 0.003% of the model's parameters, which the authors find is enough to reduce the distribution shift between test samples and pretraining samples. A minimal sketch of this kind of constrained parameter selection is shown below.
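As a rough illustration of how restricting updates to layer normalization might look in practice, the PyTorch sketch below freezes every parameter of the language model and re-enables only the LayerNorm weights and biases. The attribute path `vlm.language_model` and the optimizer settings are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch: limit test-time adaptation to LayerNorm parameters of a VLM's
# language model. Assumes a PyTorch model; the exact module layout of the
# paper's VLMs (LLaVA, InstructBLIP) may differ.
import torch
import torch.nn as nn

def collect_layernorm_params(language_model: nn.Module):
    """Freeze all parameters, then re-enable only LayerNorm weights/biases."""
    for p in language_model.parameters():
        p.requires_grad = False

    ln_params = []
    for module in language_model.modules():
        if isinstance(module, nn.LayerNorm):
            for p in module.parameters():
                p.requires_grad = True
                ln_params.append(p)
    return ln_params

# Usage (hypothetical `vlm` object):
# ln_params = collect_layernorm_params(vlm.language_model)
# optimizer = torch.optim.AdamW(ln_params, lr=1e-4)
# Only a tiny fraction (~0.003% in the paper) of parameters is now trainable.
```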

The framework also introduces a CLIP-based hallucination evaluation model that provides dual rewards to the VLM. Experimental results report hallucination-rate reductions of 15.4% on LLaVA and 17.3% on InstructBLIP, and the approach outperforms state-of-the-art baselines with a 68.3% improvement in hallucination mitigation, highlighting its effectiveness. A hedged sketch of a CLIP-based reward follows.
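The abstract does not detail how the dual rewards are constructed, but a CLIP-based reward typically scores the agreement between the input image and the generated text. The sketch below, using the Hugging Face `transformers` CLIP API, shows one plausible reward term; treating it as the signal in a REINFORCE-style update over the LayerNorm parameters selected above is likewise an assumption for illustration, not the paper's exact reward design.

```python
# Sketch: image-text agreement score from CLIP, usable as one reward term.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_reward(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    outputs = clip(**inputs)
    img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

# Illustrative policy-gradient step: weight the log-likelihood of the
# generated caption by (reward - baseline) and backpropagate only into the
# LayerNorm parameters, e.g.
#   loss = -(clip_reward(image, caption) - baseline) * caption_log_prob
#   loss.backward(); optimizer.step()
```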

Multi-disciplinary Nature and Relevance to Multimedia Information Systems

Mitigating hallucinations in VLMs is highly relevant to the wider field of multimedia information systems, particularly for animations, artificial reality, augmented reality, and virtual reality. These systems depend on the seamless integration of visual and textual information, making reliable VLMs a crucial component for generating coherent, contextually relevant content.

By leveraging reinforcement learning and CLIP-based evaluation models, this framework showcases a multi-disciplinary approach that combines concepts from both machine learning and computer vision. This not only enhances the robustness of VLMs but also opens up new possibilities for improving the overall quality of multimedia content generated by these systems.

Overall, the proposed test-time adaptation framework offers a lightweight, practical way to address hallucinations in vision-language models, with potential implications across multimedia information systems and related technologies.

Read the original article