This paper addresses the task of counting human actions of interest using
sensor data from wearable devices. We propose a novel exemplar-based framework,
allowing users to provide exemplars of the actions they want to count by
vocalizing predefined sounds "one", "two", and "three". Our method first
localizes temporal positions of these utterances from the audio sequence. These
positions serve as the basis for identifying exemplars representing the action
class of interest. A similarity map is then computed between the exemplars and
the entire sensor data sequence, which is further fed into a density estimation
module to generate a sequence of estimated density values. Summing these
density values provides the final count. To develop and evaluate our approach,
we introduce a diverse and realistic dataset consisting of real-world data from
37 subjects and 50 action categories, encompassing both sensor and audio data.
The experiments on this dataset demonstrate the viability of the proposed
method in counting instances of actions from new classes and subjects that were
not part of the training data. On average, the discrepancy between the
predicted count and the ground truth value is 7.47, significantly lower than
the errors of the frequency-based and transformer-based methods. Our project,
code and dataset can be found at https://github.com/cvlab-stonybrook/ExRAC.

A Novel Exemplar-Based Framework for Counting Human Actions Using Wearable Devices

In this paper, the authors propose a novel exemplar-based framework for counting human actions using sensor data from wearable devices. The framework lets users specify the action they want to count by vocalizing the predefined sounds "one", "two", and "three"; these utterances mark exemplar occurrences of the action. Because exemplars are supplied by the user at test time, the framework can handle action classes and subjects that were never seen during training, without collecting or annotating new training data for them.

The first step of the proposed method is to localize the temporal positions of the vocalized utterances in the audio sequence. These positions serve as the basis for identifying exemplars, i.e., sensor-data segments that represent the action class of interest. Because the exemplars are anchored by vocalized sounds, the framework leverages audio in addition to inertial sensor data, making it a multimodal approach.
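To make this step concrete, here is a minimal, illustrative sketch of how utterance positions could be picked out of the audio track using short-time energy. The function name `localize_utterances`, the energy threshold, and the frame parameters are assumptions made for illustration only; the paper's actual localization procedure is not specified here, so this is just a rough stand-in.

```python
import numpy as np

def localize_utterances(audio, sr=16000, frame_len=0.025, hop=0.010,
                        energy_thresh=0.02, min_gap=0.5):
    """Return approximate start times (in seconds) of loud utterances.

    audio: 1-D float array of audio samples.
    A simple energy-threshold sketch; threshold and window sizes are
    illustrative, not values from the paper.
    """
    frame = int(frame_len * sr)          # samples per analysis frame
    step = int(hop * sr)                 # hop between frames
    # Short-time energy of each frame.
    energies = np.array([
        np.mean(audio[i:i + frame] ** 2)
        for i in range(0, len(audio) - frame, step)
    ])
    active = energies > energy_thresh
    starts, last = [], -np.inf
    for idx, flag in enumerate(active):
        t = idx * hop
        if flag and t - last > min_gap:  # new utterance, far from the previous one
            starts.append(t)
        if flag:
            last = t
    return starts
```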

Once the exemplars are identified, a similarity map is computed between the exemplars and the entire sensor data sequence. This map measures how closely each portion of the sensor sequence resembles the exemplars, so the regions where the target action is being performed stand out from the rest of the recording.
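One simple way to form such a map, sketched below, is to slide each exemplar over the sensor sequence and compute a cosine similarity at every position. The function name and the choice to compare raw signals are assumptions for illustration; the released ExRAC code may instead compare learned feature representations.

```python
import numpy as np

def similarity_map(sensor_seq, exemplars):
    """Cosine similarity between each exemplar and every window of the sequence.

    sensor_seq: (T, C) array of sensor readings (e.g., accelerometer axes).
    exemplars:  list of (L_k, C) arrays cut out around the vocalized markers.
    Returns a (len(exemplars), T) map; entries past T - L_k remain zero.
    """
    T = sensor_seq.shape[0]
    sim = np.zeros((len(exemplars), T))
    for k, ex in enumerate(exemplars):
        L = ex.shape[0]
        ex_flat = ex.ravel()
        ex_norm = np.linalg.norm(ex_flat) + 1e-8
        for t in range(T - L + 1):
            win = sensor_seq[t:t + L].ravel()
            sim[k, t] = (ex_flat @ win) / (ex_norm * (np.linalg.norm(win) + 1e-8))
    return sim
```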

The similarity map is then fed into a density estimation module, which produces an estimated density value for each position in the sequence. Summing these density values yields the final count of action instances.
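As a rough illustration of this stage, the following PyTorch sketch maps a similarity map to a non-negative per-timestep density and sums it into a count. The `DensityHead` name, the layer sizes, and the architecture are hypothetical and are not taken from the paper; in practice such a head would be trained with a density or count regression loss on the annotated training sequences.

```python
import torch
import torch.nn as nn

class DensityHead(nn.Module):
    """Turns a similarity map into a per-timestep density and a count.

    A minimal 1-D convolutional sketch; not the architecture used in the
    released ExRAC code.
    """
    def __init__(self, num_exemplars, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(num_exemplars, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=9, padding=4),
            nn.Softplus(),                       # keep densities non-negative
        )

    def forward(self, sim_map):
        # sim_map: (batch, num_exemplars, T)
        density = self.net(sim_map).squeeze(1)   # (batch, T)
        count = density.sum(dim=-1)              # predicted count per sequence
        return density, count
```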

To evaluate the effectiveness of their approach, the authors introduce a diverse and realistic dataset consisting of real-world data from 37 subjects and 50 action categories. This dataset encompasses both sensor and audio data, making it suitable for training and testing their exemplar-based framework.

The experiments conducted on this dataset demonstrate the viability of the proposed method in counting instances of actions from new classes and subjects that were not part of the training data. The average discrepancy between the predicted count and the ground truth is 7.47, significantly lower than the errors of the frequency-based and transformer-based baselines, indicating that the exemplar-based framework outperforms these alternatives on this task.

The multimodal nature of this approach is noteworthy. By combining sensor data and audio data, the framework takes into account both physical movements and vocal cues, resulting in a more accurate counting process. This opens up new possibilities for action recognition and counting in domains such as healthcare, sports, and surveillance.

In conclusion, the authors have presented a novel exemplar-based framework for counting human actions using wearable devices. Their approach leverages vocalized utterances to identify exemplars and combines sensor data with audio data to improve the accuracy of action counting. The experiments on their diverse dataset demonstrate the effectiveness of the proposed method, paving the way for further advances in multimodal action recognition and counting.

For more information, including the project code and dataset, visit https://github.com/cvlab-stonybrook/ExRAC.