In weakly supervised temporal video grounding, previous methods use predetermined single Gaussian proposals, which lack the ability to express the diverse events described by a sentence query. To enhance the expressive ability of a proposal, we propose a Gaussian mixture proposal (GMP) that can depict arbitrary shapes by learning the importance, centroid, and range of every Gaussian in the mixture. In learning the GMP, each Gaussian is not trained in a feature space but is defined over temporal locations, so conventional feature-based learning for Gaussian mixture models is not applicable in our case. In this setting, to learn a moderately coupled Gaussian mixture that captures diverse events, we propose a pull-push learning scheme with pulling and pushing losses, each of which plays a role opposite to the other. The effects of the components in our scheme are verified in depth with extensive ablation studies, and the overall scheme achieves state-of-the-art performance. Our code is available at https://github.com/sunoh-kim/pps.

In weakly supervised temporal video grounding, previous methods have used single Gaussian proposals to represent events described by a sentence query. However, these proposals lack the ability to express diverse events. To address this limitation, we propose a Gaussian mixture proposal (GMP) that can depict arbitrary shapes by learning the importance, centroid, and range of each Gaussian in the mixture.
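To make this concrete, the snippet below is a minimal PyTorch-style sketch of how a mixture of Gaussians defined over normalized temporal positions could be combined into a single proposal mask. The tensor shapes, the function name, and the fixed parameter values are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def gaussian_mixture_proposal(importance, centroid, width, num_frames):
    """Build a proposal mask over temporal positions from K Gaussians.

    importance, centroid, width: tensors of shape (K,); in practice these
    would be predicted by the grounding network. Returns a (num_frames,)
    mask in which each frame's weight is the importance-weighted sum of
    the K Gaussians.
    """
    # Normalized temporal positions t in [0, 1], one per frame.
    t = torch.linspace(0, 1, num_frames).unsqueeze(0)        # (1, T)
    mu = centroid.unsqueeze(1)                                # (K, 1)
    sigma = width.unsqueeze(1)                                # (K, 1)
    # Each Gaussian is evaluated directly at temporal positions,
    # not in a feature space.
    gaussians = torch.exp(-0.5 * ((t - mu) / sigma) ** 2)     # (K, T)
    # Mix the Gaussians with learned importance weights.
    weights = torch.softmax(importance, dim=0).unsqueeze(1)   # (K, 1)
    return (weights * gaussians).sum(dim=0)                   # (T,)

# Example: three Gaussians covering different parts of a 100-frame clip.
mask = gaussian_mixture_proposal(
    importance=torch.tensor([0.5, 1.0, 0.2]),
    centroid=torch.tensor([0.2, 0.5, 0.8]),
    width=torch.tensor([0.05, 0.10, 0.05]),
    num_frames=100,
)
```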

Traditionally, Gaussian mixture models are learned in a feature space, typically by fitting the mixture to feature samples. In our case, however, each Gaussian is defined over temporal locations in the video, so conventional feature-based learning does not apply; the mixture parameters must instead be learned as part of the grounding model itself, combining temporal modeling with Gaussian mixture modeling.
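As a rough illustration of this difference, the parameters of each Gaussian can be regressed by a small network head from a fused video-query representation and trained end-to-end by backpropagation, rather than fit to features with expectation-maximization. The head below is a hypothetical sketch under that assumption; the name `ProposalHead`, the layer sizes, and the activations are not taken from the paper.

```python
import torch
import torch.nn as nn

class ProposalHead(nn.Module):
    """Hypothetical head that predicts Gaussian mixture parameters.

    The importance, centroid, and width of each of the K Gaussians are
    regressed from a fused video-query feature and trained end-to-end
    through the grounding losses, with no EM step in a feature space.
    """
    def __init__(self, feat_dim: int, num_gaussians: int = 3):
        super().__init__()
        self.num_gaussians = num_gaussians
        # One scalar per (Gaussian, parameter): importance, centroid, width.
        self.fc = nn.Linear(feat_dim, num_gaussians * 3)

    def forward(self, fused_feat):                     # fused_feat: (B, feat_dim)
        params = self.fc(fused_feat)                   # (B, 3K)
        params = params.view(-1, self.num_gaussians, 3)
        importance = params[..., 0]                    # mixing weights (pre-softmax)
        centroid = torch.sigmoid(params[..., 1])       # in [0, 1], normalized time
        width = torch.sigmoid(params[..., 2]) * 0.5    # keep widths in a usable range
        return importance, centroid, width
```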

To learn a moderately coupled Gaussian mixture that captures diverse events, we introduce a pull-push learning scheme. The scheme uses a pulling loss and a pushing loss that play opposite roles, balancing how strongly the Gaussians in the mixture are coupled so that together they can cover diverse events. Incorporating these losses into training enhances the expressiveness of the proposal model.
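The sketch below shows one plausible way such opposing losses could act on the per-Gaussian temporal masks: a pulling term that keeps each Gaussian close to the overall mixture, and a pushing term that penalizes overlap between different Gaussians. The cosine-similarity formulation and the loss weights are assumptions for illustration only, not the losses actually used in the paper.

```python
import torch

def pull_push_losses(gaussians, pull_weight=1.0, push_weight=1.0):
    """Illustrative pulling and pushing losses over per-Gaussian masks.

    gaussians: (K, T) tensor holding each Gaussian's mask over time.
    The pulling term rewards overlap between each Gaussian and the
    combined mixture (keeping the Gaussians moderately coupled), while
    the pushing term penalizes pairwise overlap between different
    Gaussians (keeping them from collapsing onto the same event).
    """
    K = gaussians.shape[0]
    mixture = gaussians.mean(dim=0, keepdim=True)                     # (1, T)
    # Pulling: each Gaussian should stay close to the overall mixture.
    pull = (1 - torch.cosine_similarity(gaussians, mixture, dim=1)).mean()
    # Pushing: distinct Gaussians should not overlap too much.
    sims = torch.cosine_similarity(
        gaussians.unsqueeze(1), gaussians.unsqueeze(0), dim=2)        # (K, K)
    push = sims[~torch.eye(K, dtype=torch.bool)].mean()
    return pull_weight * pull + push_weight * push
```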

We validate the effectiveness of our scheme through extensive ablation studies that examine the impact of each component of the proposed method and provide in-depth insight into its contribution to overall performance.

The results of our experiments demonstrate that our overall scheme achieves state-of-the-art performance in weakly supervised temporal video grounding. This not only highlights the effectiveness of our approach but also showcases the potential for further advancements in this field.

To facilitate reproducibility and further research, we provide our code on GitHub at https://github.com/sunoh-kim/pps. Researchers and practitioners interested in weakly supervised temporal video grounding can access our codebase to gain a deeper understanding of our methodology.
