Self-supervised representation learning for human action recognition has
developed rapidly in recent years. Most existing works are based on skeleton
data and use a multi-modality setup, yet they overlook the differences in
performance among modalities, which leads to the propagation of erroneous
knowledge between modalities. Moreover, only three fundamental modalities,
i.e., joints, bones, and motions, are used, and no additional modalities are
explored.

In this work, we first propose an Implicit Knowledge Exchange Module (IKEM)
which alleviates the propagation of erroneous knowledge between low-performance
modalities. Then, we further propose three new modalities to enrich the
complementary information between modalities. Finally, to maintain efficiency
when introducing new modalities, we propose a novel teacher-student framework,
named relational cross-modality knowledge distillation, which distills the
knowledge from the secondary modalities into the mandatory modalities while
considering the relationships constrained by anchors, positives, and negatives.
The
experimental results demonstrate the effectiveness of our approach, unlocking
the efficient use of skeleton-based multi-modality data. Source code will be
made publicly available at https://github.com/desehuileng0o0/IKEM.

Self-supervised representation learning for human action recognition has seen significant advancements in recent years. While most existing works in this field have focused on skeleton data and utilized a multi-modality setup, they have overlooked the variations in performance among different modalities. As a result, erroneous knowledge can be propagated between modalities. Additionally, these works have mainly explored three fundamental modalities: joints, bones, and motions, without investigating additional modalities.

In order to address these limitations, the authors of this work propose an Implicit Knowledge Exchange Module (IKEM). This module aims to mitigate the propagation of erroneous knowledge between low-performance modalities. Moreover, the authors introduce three new modalities to enhance the complementary information between different modalities.
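For readers unfamiliar with skeleton modalities, the three fundamental streams mentioned above are usually derived directly from the joint coordinates: the bone stream encodes offsets along the skeleton topology, and the motion stream encodes frame-to-frame displacement. The snippet below is a minimal sketch of these standard derivations, assuming NTU-style (frames, joints, coordinates) arrays and a hypothetical parent-joint table; the three new modalities proposed in the paper are not detailed here.

```python
import numpy as np

# Hypothetical parent index for each joint (depends on the skeleton topology,
# e.g. the 25-joint NTU RGB+D layout); shown here for a toy 5-joint chain.
PARENTS = [0, 0, 1, 2, 3]

def joint_to_bone(joints: np.ndarray) -> np.ndarray:
    """Bone modality: vector from each joint's parent to the joint itself.

    joints: array of shape (T, V, C) with T frames, V joints, C coordinates.
    """
    bones = np.zeros_like(joints)
    for v, parent in enumerate(PARENTS):
        bones[:, v] = joints[:, v] - joints[:, parent]
    return bones

def joint_to_motion(joints: np.ndarray) -> np.ndarray:
    """Motion modality: frame-to-frame displacement of each joint."""
    motion = np.zeros_like(joints)
    motion[:-1] = joints[1:] - joints[:-1]
    return motion

# Usage: a random 64-frame, 5-joint, 3D skeleton sequence.
sequence = np.random.randn(64, 5, 3).astype(np.float32)
bone_stream = joint_to_bone(sequence)
motion_stream = joint_to_motion(sequence)
```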

To ensure efficiency while incorporating new modalities, the authors also present a novel teacher-student framework called relational cross-modality knowledge distillation. It transfers knowledge from the secondary modalities to the mandatory modalities while respecting the relationships constrained by anchors, positives, and negatives.
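The exact distillation objective is not reproduced in this summary, but relational distillation of this kind is often phrased as matching the teacher's anchor-to-candidate similarity distribution with the student's. The sketch below illustrates that idea under assumptions of my own; the function name, the temperature tau, and the index layout are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def relational_kd_loss(student_emb, teacher_emb, anchor_idx, pos_idx, neg_idx, tau=0.1):
    """Illustrative relational distillation loss (not the paper's exact formulation).

    student_emb: (N, D) embeddings from a mandatory-modality (student) encoder.
    teacher_emb: (N, D) embeddings from a secondary-modality (teacher) encoder.
    anchor_idx, pos_idx: (B,) indices of anchor and positive samples.
    neg_idx: (B, K) indices of K negatives per anchor.
    """
    def relation(emb):
        emb = F.normalize(emb, dim=-1)
        anchors = emb[anchor_idx]                  # (B, D)
        pos = emb[pos_idx].unsqueeze(1)            # (B, 1, D)
        neg = emb[neg_idx]                         # (B, K, D)
        candidates = torch.cat([pos, neg], dim=1)  # (B, 1+K, D)
        # Anchor-to-candidate cosine similarities, softened by the temperature.
        return torch.einsum('bd,bkd->bk', anchors, candidates) / tau

    # The student mimics the teacher's relational structure (similarity
    # distribution over positives and negatives), not its raw features.
    teacher_rel = F.softmax(relation(teacher_emb), dim=-1)
    student_rel = F.log_softmax(relation(student_emb), dim=-1)
    return F.kl_div(student_rel, teacher_rel, reduction='batchmean')

# Usage with random embeddings and indices (shapes only, no real data).
N, D, B, K = 32, 128, 8, 4
s, t = torch.randn(N, D), torch.randn(N, D)
a = torch.randint(0, N, (B,))
p = torch.randint(0, N, (B,))
n = torch.randint(0, N, (B, K))
loss = relational_kd_loss(s, t, a, p, n)
```

In the setting described by the abstract, the student would be a mandatory-modality encoder (e.g. joints) and the teacher a secondary-modality encoder, so the secondary modalities can be dropped at inference time while their knowledge is retained.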

This work’s experimental results demonstrate the effectiveness of the proposed approach in leveraging skeleton-based multi-modality data efficiently for human action recognition. By addressing the limitations of previous approaches and introducing novel techniques, this research contributes to the wider field of multimedia information systems, with a specific focus on animations, artificial reality, augmented reality, and virtual realities.

The concepts explored in this work highlight the multi-disciplinary nature of multimedia information systems. The integration of various modalities and the development of novel frameworks require expertise in computer vision, machine learning, human-computer interaction, and graphics. Moreover, the proposed IKEM module and relational cross-modality knowledge distillation framework provide valuable insights into how knowledge can be effectively exchanged and distilled across different modalities. These insights can potentially be applied to other domains within multimedia information systems, such as object recognition, scene understanding, and video analysis.

In conclusion, by addressing the limitations of previous works, introducing new modalities, and proposing novel frameworks, this research shows how skeleton-based multi-modality data can be used efficiently for human action recognition. The concepts discussed here also carry implications for the broader field of multimedia information systems, including animations, artificial reality, augmented reality, and virtual realities.

Source code: https://github.com/desehuileng0o0/IKEM

Read the original article