Audio and video are the two most common modalities on mainstream media
platforms such as YouTube. To learn from multimodal videos effectively, in this
work we propose a novel audio-video recognition approach, termed the
audio-video Transformer (AVT), which leverages the effective spatio-temporal
representation of the video Transformer to improve action recognition accuracy.
For multimodal fusion, simply concatenating multimodal tokens in a cross-modal
Transformer requires large computational and memory resources; instead, we
reduce the cross-modality complexity through an audio-video bottleneck
Transformer. To improve the learning efficiency of the multimodal Transformer,
we integrate
self-supervised objectives, i.e., audio-video contrastive learning, audio-video
matching, and masked audio and video learning, into AVT training, which maps
diverse audio and video representations into a common multimodal representation
space. We further propose a masked audio segment loss to learn semantic audio
activities in AVT. Extensive experiments and ablation studies on three public
datasets and two in-house datasets consistently demonstrate the effectiveness
of the proposed AVT. Specifically, AVT outperforms its previous
state-of-the-art counterparts on Kinetics-Sounds by 8%. AVT also surpasses one
of the previous state-of-the-art video Transformers [25] by 10% on VGGSound by
leveraging the audio signal. Compared to one of the previous state-of-the-art
multimodal methods, MBT [32], AVT is 1.3% more efficient in terms of FLOPs and
improves the accuracy by 3.8% on Epic-Kitchens-100.
In this article, the authors propose a novel approach called the audio-video Transformer (AVT) to learn effectively from multimodal videos. They aim to improve action recognition accuracy by leveraging the spatio-temporal representation provided by the video Transformer. Rather than simply concatenating multimodal tokens in a cross-modal Transformer, they introduce an audio-video bottleneck Transformer to reduce the computational and memory cost of multimodal fusion.
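To make the fusion idea concrete, below is a minimal PyTorch-style sketch of bottleneck fusion, in which audio and video tokens never attend to each other directly and instead exchange information through a small set of shared bottleneck tokens. The token counts, layer shapes, and class names here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of bottleneck fusion: cross-modal information flows only
# through a few shared bottleneck tokens, so attention cost stays near
# O(N_v^2 + N_a^2) instead of O((N_v + N_a)^2) for full token concatenation.
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=768, heads=12, num_bottleneck=4):
        super().__init__()
        self.video_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.audio_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.num_bottleneck = num_bottleneck

    def forward(self, video_tokens, audio_tokens, bottleneck):
        nb = self.num_bottleneck
        # Video stream attends over [video tokens; bottleneck tokens].
        v = self.video_block(torch.cat([video_tokens, bottleneck], dim=1))
        video_tokens, b_from_video = v[:, :-nb], v[:, -nb:]
        # Audio stream attends over [audio tokens; bottleneck tokens].
        a = self.audio_block(torch.cat([audio_tokens, bottleneck], dim=1))
        audio_tokens, b_from_audio = a[:, :-nb], a[:, -nb:]
        # Merge the two bottleneck updates (simple average here).
        bottleneck = 0.5 * (b_from_video + b_from_audio)
        return video_tokens, audio_tokens, bottleneck


# Example: 196 video tokens and 64 audio tokens exchange information
# through only 4 bottleneck tokens.
layer = BottleneckFusionLayer()
v = torch.randn(2, 196, 768)
a = torch.randn(2, 64, 768)
b = torch.randn(2, 4, 768)
v, a, b = layer(v, a, b)
```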
One interesting aspect of this approach is the integration of self-supervised objectives into AVT training. These include audio-video contrastive learning, audio-video matching, and masked audio and video learning. By mapping diverse audio and video representations into a common multimodal representation space, these objectives improve the learning efficiency of the multimodal Transformer.
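As a rough illustration of the contrastive objective, the sketch below implements a symmetric InfoNCE-style loss between pooled clip-level audio and video embeddings, pulling together pairs from the same clip and pushing apart mismatched pairs within the batch. The temperature value, pooling choice, and function names are assumptions for illustration, not the paper's exact formulation; audio-video matching would similarly be a binary classification over fused pairs.

```python
# Illustrative symmetric contrastive (InfoNCE-style) loss between pooled
# audio and video embeddings of the same batch of clips.
import torch
import torch.nn.functional as F


def audio_video_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    # video_emb, audio_emb: (batch, dim) pooled clip-level embeddings.
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric loss: video-to-audio and audio-to-video retrieval.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)


# Example with random embeddings standing in for AVT's pooled tokens.
loss = audio_video_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```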
The authors also propose a masked audio segment loss to learn semantic audio activities in AVT. This is a valuable addition, as it allows for a more nuanced understanding of the audio component of multimodal videos.
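The abstract does not spell out the exact form of this loss, so the following is only a hypothetical sketch of a masked-audio-segment objective: contiguous spans of audio (spectrogram) tokens are masked and the model is trained to reconstruct the original features at the masked positions. The segment length, masking ratio, and L2 reconstruction target are assumptions made for illustration.

```python
# Hypothetical masked-audio-segment objective: mask contiguous spans of audio
# tokens and reconstruct the original token features at masked positions only.
import torch
import torch.nn.functional as F


def mask_audio_segments(audio_tokens, segment_len=8, num_segments=2):
    # audio_tokens: (batch, seq, dim) audio token features.
    b, n, _ = audio_tokens.shape
    mask = torch.zeros(b, n, dtype=torch.bool)
    for i in range(b):
        for _ in range(num_segments):
            start = torch.randint(0, n - segment_len + 1, (1,)).item()
            mask[i, start:start + segment_len] = True
    masked = audio_tokens.clone()
    masked[mask] = 0.0                      # zero out masked spans (mask token)
    return masked, mask


def masked_audio_segment_loss(predicted, target, mask):
    # Reconstruction loss computed only over the masked positions.
    return F.mse_loss(predicted[mask], target[mask])


# Example: mask two 8-token segments in a 64-token audio sequence.
audio = torch.randn(2, 64, 768)
masked_audio, mask = mask_audio_segments(audio)
loss = masked_audio_segment_loss(torch.randn_like(audio), audio, mask)
```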
The experimental results and ablation studies on three public and two in-house datasets show the effectiveness of AVT. It outperforms previous state-of-the-art counterparts on Kinetics-Sounds by 8% and surpasses a state-of-the-art video Transformer on VGGSound by 10% by leveraging the audio signal. Additionally, compared to the previous multimodal method MBT, AVT is more efficient in terms of FLOPs and improves accuracy by 3.8% on Epic-Kitchens-100.
This work demonstrates the multi-disciplinary nature of multimedia information systems and its intersection with areas such as animation, artificial reality, augmented reality, and virtual reality. The effective recognition and understanding of audio and video content in multimodal videos have significant implications for fields including entertainment, education, healthcare, and communication.