Visual question answering is a multimodal task that requires the joint
comprehension of visual and textual information. However, integrating visual
and textual semantics solely through attention layers is insufficient to
comprehensively understand and align information from both modalities.
Intuitively, object attributes can naturally serve as a bridge to unify them,
which has been overlooked in previous research. In this paper, we propose a
novel VQA approach from the perspective of utilizing object attributes, aiming
to achieve better object-level visual-language alignment and multimodal scene
understanding. Specifically, we design an attribute fusion module and a
contrastive knowledge distillation module. The attribute fusion module
constructs a multimodal graph neural network to fuse attributes and visual
features through message passing. The enhanced object-level visual features
contribute to solving fine-grained problems such as counting questions. The better
object-level visual-language alignment aids in understanding multimodal scenes,
thereby improving the model’s robustness. Furthermore, to augment scene
understanding and out-of-distribution performance, the contrastive knowledge distillation module introduces implicit knowledge. We
distill knowledge into attributes through contrastive loss, which further
strengthens the representation learning of attribute features and facilitates
visual-linguistic alignment. Extensive experiments on six datasets, COCO-QA,
VQAv2, VQA-CPv2, VQA-CPv1, VQAvs, and TDIUC, show the superiority of the
proposed method.

Visual question answering (VQA) is a challenging task that requires the integration of visual and textual information. Previous research has focused on attention layers to align and comprehend these modalities. However, this line of work has largely overlooked the potential of object attributes as a bridge between visual and textual semantics.

In this paper, the authors propose a novel VQA approach that leverages object attributes for better visual-language alignment and multimodal scene understanding. They introduce an attribute fusion module, which utilizes a multimodal graph neural network to fuse attributes and visual features through message passing. This module enhances object-level visual features, improving performance on fine-grained tasks such as counting questions. Additionally, the improved object-level visual-language alignment aids in understanding complex multimodal scenes, enhancing the model’s overall robustness.
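To make the idea concrete, here is a minimal sketch of what such an attribute fusion step could look like, assuming PyTorch, a fully connected object graph, and illustrative dimensions; the class name AttributeFusionGNN and all hyperparameters are hypothetical and not taken from the paper. Each detected object becomes a graph node whose visual feature is combined with pooled attribute embeddings, and one round of message passing lets objects exchange the fused information.

```python
# Hypothetical sketch of an attribute fusion module (not the authors' exact
# architecture): each detected object is a graph node carrying a visual
# feature and a pooled attribute embedding; one round of message passing
# over a fully connected object graph mixes the two sources.
import torch
import torch.nn as nn


class AttributeFusionGNN(nn.Module):
    def __init__(self, visual_dim=2048, attr_vocab=400, hidden_dim=512):
        super().__init__()
        self.attr_embed = nn.Embedding(attr_vocab, hidden_dim)  # attribute labels -> vectors
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)    # project region features
        self.message = nn.Linear(2 * hidden_dim, hidden_dim)    # edge message function
        self.update = nn.GRUCell(hidden_dim, hidden_dim)        # node update function

    def forward(self, visual_feats, attr_ids):
        # visual_feats: (num_objects, visual_dim) region features from a detector
        # attr_ids:     (num_objects, num_attrs) predicted attribute label ids
        v = self.visual_proj(visual_feats)          # (N, H)
        a = self.attr_embed(attr_ids).mean(dim=1)   # (N, H) pooled attribute embedding
        nodes = v + a                               # initial fused node states

        # One step of message passing over a fully connected object graph.
        n = nodes.size(0)
        senders = nodes.unsqueeze(0).expand(n, n, -1)    # (N, N, H), entry [i, j] = node j
        receivers = nodes.unsqueeze(1).expand(n, n, -1)  # (N, N, H), entry [i, j] = node i
        msgs = torch.relu(self.message(torch.cat([senders, receivers], dim=-1)))
        agg = msgs.mean(dim=1)                           # aggregate incoming messages per node
        return self.update(agg, nodes)                   # attribute-enhanced object features
```

The GRU-style node update is used here purely for illustration; any node-update function would fit the same message-passing pattern, and the key point is that attribute embeddings enter the graph alongside the visual features.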

To further enhance scene understanding and out-of-distribution performance, the authors introduce a contrastive knowledge distillation module. This module distills implicit knowledge into the attribute representations through a contrastive loss. This process strengthens the representation learning of attribute features and facilitates visual-linguistic alignment.
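The summary does not spell out the exact loss, but a common way to realize contrastive distillation is an InfoNCE-style objective. The sketch below (again PyTorch, with the hypothetical name contrastive_distillation_loss and a frozen teacher assumed to provide the "implicit knowledge" features) pulls each attribute feature toward its matching teacher feature and pushes it away from the other items in the batch.

```python
# Hypothetical sketch of contrastive knowledge distillation (an InfoNCE-style
# loss, not necessarily the paper's exact formulation): attribute features from
# the student are aligned with matching teacher features and contrasted against
# the features of other objects in the batch.
import torch
import torch.nn.functional as F


def contrastive_distillation_loss(attr_feats, teacher_feats, temperature=0.07):
    # attr_feats:    (B, D) attribute features produced by the VQA model
    # teacher_feats: (B, D) implicit-knowledge features from a frozen teacher
    attr = F.normalize(attr_feats, dim=-1)
    teacher = F.normalize(teacher_feats, dim=-1)

    logits = attr @ teacher.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(attr.size(0), device=attr.device)
    # Diagonal entries are the positive (matched) pairs; off-diagonal entries
    # act as in-batch negatives.
    return F.cross_entropy(logits, targets)


# Example usage with random tensors standing in for real features:
loss = contrastive_distillation_loss(torch.randn(8, 512), torch.randn(8, 512))
```

The temperature and the choice of negatives are the main design knobs in such a loss; a lower temperature sharpens the penalty on hard negatives.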

The proposed method is evaluated on six datasets: COCO-QA, VQAv2, VQA-CPv2, VQA-CPv1, VQAvs, and TDIUC. The experimental results demonstrate the superiority of the proposed approach in comparison to existing methods.

One of the notable aspects of this work is its multi-disciplinary nature. It combines techniques from computer vision, natural language processing, and graph neural networks to tackle the complex task of VQA. By incorporating object attributes, the authors bridge the semantic gap between text and visuals, enabling a more comprehensive understanding of multimodal data.

The attribute fusion module is an interesting contribution that addresses the limitations of previous attention-based approaches. By constructing a multimodal graph neural network, the module effectively integrates object attributes with visual features, facilitating better object-level visual-language alignment. This approach not only improves performance on fine-grained tasks but also enhances the model’s ability to understand complex scenes.

Additionally, the contrastive knowledge distillation module introduces a novel way of leveraging implicit knowledge. By distilling knowledge into attributes through contrastive loss, the module strengthens representation learning and improves visual-linguistic alignment. This approach not only augments scene understanding but also enhances the model’s out-of-distribution performance.

In conclusion, this paper presents a novel VQA approach that utilizes object attributes to achieve better visual-language alignment and multimodal scene understanding. The attribute fusion module and contrastive knowledge distillation module contribute to the superiority of the proposed method. This work highlights the importance of considering multiple modalities and employing multi-disciplinary techniques in VQA research.
