Introducing TEMP3D: Enhancing 3D Pose Estimation in Video Sequences with Temporal Continuity and Human Motion Priors

Existing 3D human pose estimation methods have proven effective in both monocular and multi-view settings, but they struggle when faced with heavy occlusions, which limits their practical application. In this article, we explore how temporal continuity and human motion priors can improve 3D pose estimation in video sequences even when occlusions are present. Our approach, named TEMP3D, combines large-scale pre-training on 3D poses with self-supervised learning to produce temporally continuous 3D pose estimates on unlabelled in-the-wild videos. By aligning a motion prior model with the output of state-of-the-art single-image 3D pose estimation methods, TEMP3D produces accurate and continuous poses under occlusions. To validate the method, we tested it on the Occluded Human3.6M dataset, which includes significant human body occlusions. The results exceed the state of the art on this dataset as well as on the OcMotion dataset, while maintaining competitive performance on non-occluded data. For more details on this approach to enhancing 3D pose estimation in video sequences, see the original article linked below.

Abstract: Existing 3D human pose estimation methods perform remarkably well in both monocular and multi-view settings. However, their efficacy diminishes significantly in the presence of heavy occlusions, which limits their practical utility. For video sequences, temporal continuity can help infer accurate poses, especially in heavily occluded frames. In this paper, we aim to leverage this potential of temporal continuity through human motion priors, coupled with large-scale pre-training on 3D poses and self-supervised learning, to enhance 3D pose estimation in a given video sequence. This leads to a temporally continuous 3D pose estimate on unlabelled in-the-wild videos, which may contain occlusions, while exclusively relying on pre-trained 3D pose models. We propose an unsupervised method named TEMP3D that aligns a motion prior model on a given in-the-wild video using existing SOTA single image-based 3D pose estimation methods to give temporally continuous output under occlusions. To evaluate our method, we test it on the Occluded Human3.6M dataset, our custom-built dataset, which incorporates significant (up to 100%) human body occlusions into the Human3.6M dataset. We achieve SOTA results on Occluded Human3.6M and the OcMotion dataset while maintaining competitive performance on non-occluded data. URL: this https URL
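
To make the described pipeline more concrete, below is a minimal sketch of the test-time alignment idea: a temporal motion prior is fine-tuned in a self-supervised way so that its temporally continuous output agrees with confident per-frame estimates from a single-image 3D pose model, letting occluded frames be recovered from temporal context. The module names, confidence weighting, and hyperparameters here are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of aligning a motion prior to per-frame 3D pose estimates.
# Names (MotionPrior, align_motion_prior, the confidence weighting) are placeholders.
import torch
import torch.nn as nn


class MotionPrior(nn.Module):
    """Placeholder temporal model: maps a window of noisy/partial 3D poses
    to a smooth, temporally continuous 3D pose sequence."""

    def __init__(self, num_joints=17, hidden=256, layers=4):
        super().__init__()
        self.proj_in = nn.Linear(num_joints * 3, hidden)
        encoder_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.proj_out = nn.Linear(hidden, num_joints * 3)

    def forward(self, poses):            # poses: (B, T, J*3)
        x = self.proj_in(poses)
        x = self.temporal(x)
        return self.proj_out(x)          # (B, T, J*3)


def align_motion_prior(motion_prior, frame_poses, frame_conf, steps=100, lr=1e-4):
    """Self-supervised alignment on one unlabelled video: fit the prior's output
    to confident per-frame estimates; low-confidence (occluded) frames are
    down-weighted and filled in from temporal context."""
    opt = torch.optim.Adam(motion_prior.parameters(), lr=lr)
    for _ in range(steps):
        pred = motion_prior(frame_poses)                            # (B, T, J*3)
        # Weight the reconstruction loss by per-frame confidence so heavily
        # occluded frames do not corrupt the alignment.
        loss = (frame_conf.unsqueeze(-1) * (pred - frame_poses) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return motion_prior
```

In this sketch, `frame_poses` would come from any off-the-shelf single-image 3D pose estimator and `frame_conf` from its per-frame detection confidence; the key design choice mirrored from the abstract is that no 3D ground truth for the target video is used, only the pre-trained models and temporal continuity.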

Read the original article