arXiv:2405.14040v1 Announce Type: new
Abstract: Video storytelling is engaging multimedia content that utilizes video and its accompanying narration to attract the audience, where a key challenge is creating narrations for recorded visual scenes. Previous studies on dense video captioning and video story generation have made some progress. However, in practical applications, we typically require synchronized narrations for ongoing visual scenes. In this work, we introduce a new task of Synchronized Video Storytelling, which aims to generate synchronous and informative narrations for videos. These narrations, associated with each video clip, should relate to the visual content, integrate relevant knowledge, and have an appropriate word count corresponding to the clip’s duration. Specifically, a structured storyline is beneficial to guide the generation process, ensuring coherence and integrity. To support the exploration of this task, we introduce a new benchmark dataset E-SyncVidStory with rich annotations. Since existing Multimodal LLMs are not effective in addressing this task in one-shot or few-shot settings, we propose a framework named VideoNarrator that can generate a storyline for input videos and simultaneously generate narrations with the guidance of the generated or predefined storyline. We further introduce a set of evaluation metrics to thoroughly assess the generation. Both automatic and human evaluations validate the effectiveness of our approach. Our dataset, codes, and evaluations will be released.
Synchronized Video Storytelling: Generating Informative Narrations for Videos
Video storytelling is a form of multimedia content that combines visual scenes with narration to engage the audience. A key challenge is creating narrations for recorded visual scenes. Previous studies have made progress on dense video captioning and video story generation, but practical applications typically require narrations that are synchronized with the ongoing visual scenes, which those methods do not necessarily provide.
In this work, the researchers introduce a new task called Synchronized Video Storytelling, which aims to generate synchronous and informative narrations for videos. The narration for each video clip should relate to the visual content, integrate relevant knowledge, and have a word count appropriate to the clip's duration. To ensure coherence and integrity, a structured storyline is used to guide the generation process.
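The word-count requirement can be made concrete with a small Python sketch. It is only an illustration: the speaking rate of 2.5 words per second, the 20% tolerance, and the names Clip, word_budget, and fits_clip are assumptions made here, not values or identifiers from the paper. Counting whitespace-separated words also assumes English-like text; other languages would need a different length measure.

```python
from dataclasses import dataclass

# Assumed speaking rate in words per second; the source does not state one.
WORDS_PER_SECOND = 2.5


@dataclass
class Clip:
    """A video clip identified by its start and end time in seconds."""
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def word_budget(clip: Clip, rate: float = WORDS_PER_SECOND,
                tolerance: float = 0.2) -> tuple[int, int]:
    """Return (min_words, max_words) that could plausibly be spoken in the clip."""
    target = clip.duration * rate
    low = max(1, round(target * (1 - tolerance)))
    high = max(low, round(target * (1 + tolerance)))
    return low, high


def fits_clip(narration: str, clip: Clip, rate: float = WORDS_PER_SECOND,
              tolerance: float = 0.2) -> bool:
    """Check whether the narration's word count falls inside the clip's budget."""
    low, high = word_budget(clip, rate, tolerance)
    return low <= len(narration.split()) <= high
```

Under these assumptions, a 4-second clip gets a budget of 8 to 12 words, so a three-word narration would be flagged as too short to fill the clip.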
To enable the exploration of this novel task, the researchers also introduce the E-SyncVidStory dataset, which comes with rich annotations. This dataset will serve as a benchmark for future research in the field of synchronized video storytelling.
The authors note that existing Multimodal LLMs (multimodal large language models) are not effective at this task in one-shot or few-shot settings. To overcome this limitation, they propose a framework called VideoNarrator. It generates a storyline for the input video and, at the same time, generates narrations guided by either the generated storyline or a predefined one.
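The data flow described above can be sketched in Python as follows. This is one possible reading of the description, not the authors' implementation: the function narrate_video and the two callables predict_label and narrate are hypothetical placeholders for model calls, and clips are represented as plain strings for simplicity.

```python
from typing import Callable, Optional, Sequence

# Hypothetical interfaces; each callable stands in for a model call.
# predict_label(clip, previous_narrations) -> storyline label for the clip
PredictLabel = Callable[[str, list[str]], str]
# narrate(clip, storyline_label, previous_narrations) -> narration text
Narrate = Callable[[str, str, list[str]], str]


def narrate_video(clips: Sequence[str],
                  predict_label: PredictLabel,
                  narrate: Narrate,
                  storyline: Optional[Sequence[str]] = None) -> list[tuple[str, str]]:
    """Produce a (storyline label, narration) pair for every clip, in order.

    If a predefined storyline is given, its labels guide the narration;
    otherwise a label is predicted for each clip before narrating it.
    """
    if storyline is not None and len(storyline) != len(clips):
        raise ValueError("storyline must have one label per clip")

    history: list[str] = []              # narrations generated so far
    results: list[tuple[str, str]] = []
    for i, clip in enumerate(clips):
        label = storyline[i] if storyline is not None else predict_label(clip, history)
        text = narrate(clip, label, history)
        history.append(text)             # later clips can condition on earlier text
        results.append((label, text))
    return results
```

Passing the earlier narrations to each call is a design choice made in this sketch so that later clips can stay coherent with what has already been said; the source does not specify how context is carried between clips.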
Additionally, a comprehensive set of evaluation metrics is introduced to assess the effectiveness of the generation process. Automatic and human evaluations are conducted, both of which validate the efficacy of the proposed approach.
Overall, this research contributes to the field of multimedia information systems, most directly to video storytelling, with possible relevance to related areas such as animation, augmented reality, and virtual reality. The task is multi-disciplinary in that it requires both understanding visual content and generating language. The proposed framework, VideoNarrator, can serve as a foundation for further work on generating informative narrations for videos. The authors state that the E-SyncVidStory dataset, code, and evaluations will be released, which should facilitate future research in this area.