Skeleton-based action recognition has attracted much attention, benefiting from its succinctness and robustness. However, the minimal inter-class variation in similar action sequences often leads to confusion. The inherent spatiotemporal coupling characteristics make it challenging to mine the subtle differences in joint motion trajectories, which is critical for distinguishing confusing fine-grained actions. To alleviate this problem, we propose a Wavelet-Attention Decoupling (WAD) module that utilizes discrete wavelet transform to effectively disentangle salient and subtle motion features in the time-frequency domain. Then, the decoupling attention adaptively recalibrates their temporal responses. To further amplify the discrepancies in these subtle motion features, we propose a Fine-grained Contrastive Enhancement (FCE) module to enhance attention towards trajectory features by contrastive learning. Extensive experiments are conducted on the coarse-grained dataset NTU RGB+D and the fine-grained dataset FineGYM. Our methods perform competitively compared to state-of-the-art methods and can discriminate confusing fine-grained actions well.

Succinctness and Robustness in Skeleton-based Action Recognition

Skeleton-based action recognition has gained significant attention in the field of multimedia information systems because skeleton data is a succinct and robust representation of human motion. The approach classifies actions by analyzing the motion trajectories of human skeleton joints. A major challenge, however, is that similar action sequences show minimal inter-class variation, which often causes them to be confused with one another.

The Challenge of Mining Subtle Differences

The spatiotemporal coupling characteristics inherent in skeleton-based action recognition make it difficult to mine the subtle differences in joint motion trajectories. These subtle differences are crucial for accurately distinguishing fine-grained actions that are otherwise confusingly similar. To address this challenge, the proposed Wavelet-Attention Decoupling (WAD) module utilizes discrete wavelet transform to effectively disentangle salient and subtle motion features in the time-frequency domain.
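To make the time-frequency split concrete, the sketch below applies a single-level Haar discrete wavelet transform along the time axis of a joint-trajectory array. The source does not state which wavelet, how many decomposition levels, or what tensor layout the WAD module uses, so the Haar choice, the (channels, joints, frames) layout, and the function names `haar_dwt_time` and `haar_idwt_time` are all assumptions made for illustration. The low-frequency (approximation) band roughly corresponds to slow, salient motion, and the high-frequency (detail) band to rapid, subtle changes.

```python
import numpy as np


def haar_dwt_time(x):
    """Single-level Haar DWT along the last (time) axis.

    x: array of shape (..., T) with even T, e.g. (channels, joints, frames).
    Returns (approx, detail), each of shape (..., T // 2).
    Illustrative only: the wavelet used by WAD is not given in the source.
    """
    x = np.asarray(x, dtype=np.float64)
    if x.shape[-1] % 2 != 0:
        raise ValueError("time length must be even")
    even, odd = x[..., 0::2], x[..., 1::2]
    approx = (even + odd) / np.sqrt(2.0)  # low-pass: smooth, slow motion
    detail = (even - odd) / np.sqrt(2.0)  # high-pass: frame-to-frame change
    return approx, detail


def haar_idwt_time(approx, detail):
    """Inverse of haar_dwt_time; reconstructs the original time axis."""
    even = (approx + detail) / np.sqrt(2.0)
    odd = (approx - detail) / np.sqrt(2.0)
    out = np.empty(approx.shape[:-1] + (approx.shape[-1] * 2,))
    out[..., 0::2] = even
    out[..., 1::2] = odd
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    skeleton = rng.standard_normal((3, 25, 8))  # hypothetical (C, V, T) clip
    low, high = haar_dwt_time(skeleton)
    print(low.shape, high.shape)
    print(np.allclose(haar_idwt_time(low, high), skeleton))
```

Because the Haar transform is orthogonal, the two bands together carry exactly the information in the original sequence, so the split separates the components without discarding anything.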

Recalibrating Temporal Responses with Decoupling Attention

Within the WAD module, decoupling attention adaptively recalibrates the temporal responses of the disentangled salient and subtle motion features. This recalibration helps amplify the discrepancies between subtle motion features, making it easier to discriminate fine-grained actions. The combination of the wavelet transform and decoupling attention draws on concepts from both signal processing and neural network architectures.
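One plausible reading of "recalibrating temporal responses" is a per-frame gate applied to each frequency band separately. The sketch below is a generic squeeze-and-excitation-style temporal gate, not the paper's decoupling attention, whose exact form the source does not describe. The function name `temporal_gate`, the single linear layer, the sigmoid gating, and the (channels, joints, frames) layout are all assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def temporal_gate(band, weight, bias):
    """Reweight each frame of one frequency band (hypothetical design).

    band:   (C, V, T) features of one band (salient or subtle).
    weight: (T, T) parameters of a linear layer over time (learned in practice).
    bias:   (T,) bias of that layer.
    Returns (recalibrated band with shape (C, V, T), gate with shape (T,)).
    """
    # Squeeze: one scalar per frame, averaged over channels and joints.
    descriptor = band.mean(axis=(0, 1))         # (T,)
    # Excite: map the descriptor to a gate in (0, 1) for every frame.
    gate = sigmoid(weight @ descriptor + bias)  # (T,)
    # Broadcasting multiplies every channel and joint at frame t by gate[t].
    return band * gate, gate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    subtle = rng.standard_normal((3, 25, 16))  # hypothetical high-frequency band
    w = 0.1 * rng.standard_normal((16, 16))    # stand-in for learned weights
    out, gate = temporal_gate(subtle, w, np.zeros(16))
    print(out.shape, gate.shape)
```

Giving each band its own gate parameters would let a model emphasize different frames for salient and subtle motion, which is the intuition behind treating the two components separately.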

Enhancing Attention with Fine-grained Contrastive Learning

To further enhance the attention towards trajectory features, the proposed Fine-grained Contrastive Enhancement (FCE) module employs contrastive learning. By contrasting feature representations against one another, the module amplifies the discrepancies in subtle motion features, enabling better discrimination of fine-grained actions. This use of contrastive learning shows how multimedia information systems draw on machine learning and computer vision techniques.
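As a rough illustration of how a contrastive objective can push confusable actions apart, the sketch below computes an InfoNCE-style loss for one anchor feature, one positive (same class), and several negatives (other classes). The source says only that FCE uses contrastive learning, so the InfoNCE form, the cosine similarity, the temperature value, and the function name `info_nce` are assumptions rather than the paper's actual loss.

```python
import numpy as np


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for a single anchor (illustrative).

    anchor:    (D,) feature of the sample being trained on.
    positive:  (D,) feature of a sample from the same action class.
    negatives: (N, D) features of samples from other (confusable) classes.
    """
    def l2_normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a = l2_normalize(np.asarray(anchor, dtype=np.float64))
    p = l2_normalize(np.asarray(positive, dtype=np.float64))
    n = l2_normalize(np.asarray(negatives, dtype=np.float64))

    # Cosine similarities scaled by temperature; index 0 is the positive pair.
    logits = np.concatenate(([a @ p], n @ a)) / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    # Negative log-probability of picking the positive among all candidates.
    return -(logits[0] - np.log(np.exp(logits).sum()))


if __name__ == "__main__":
    anchor = np.array([1.0, 0.0, 0.0])
    others = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    # Low loss: the positive matches the anchor, the negatives are dissimilar.
    print(info_nce(anchor, anchor, others))
```

The loss is small when the anchor is closer to its positive than to every negative, and grows when a negative from a confusable class lies closer, which is the pressure that separates fine-grained actions in feature space.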

Evaluating the Proposed Methods

To evaluate the effectiveness of the proposed methods, extensive experiments are conducted on two datasets: the coarse-grained dataset NTU RGB+D and the fine-grained dataset FineGYM. The results show that the proposed methods perform competitively compared to state-of-the-art methods in skeleton-based action recognition. The ability to discriminate confusing fine-grained actions well highlights the potential for these methods to improve various applications, such as video surveillance, motion analysis, and human-computer interaction.

In conclusion, this article presents a novel approach to address the challenges of skeleton-based action recognition. By incorporating wavelet transform, decoupling attention, and contrastive learning techniques, this approach offers enhanced discrimination capabilities for fine-grained actions. The integration of concepts from signal processing, neural networks, and machine learning showcases the multi-disciplinary nature of multimedia information systems. Future research may focus on exploring the application of these methods in other domains, such as virtual reality and augmented reality, where accurate recognition of human actions is crucial for immersive experiences.
