arXiv:2412.16495v1 Announce Type: cross
Abstract: Text-editable and pose-controllable character video generation is a challenging but prevailing topic with practical applications. However, existing approaches mainly focus on single-object video generation with pose guidance, ignoring the realistic situation in which multiple characters appear concurrently in a scene. To tackle this, we propose a novel multi-character video generation framework that works in a tuning-free manner and is based on separated text and pose guidance. Specifically, we first extract character masks from the pose sequence to identify the spatial position of each generated character, and then obtain individual prompts for each character with LLMs for precise text guidance. Moreover, spatial-aligned cross attention and a multi-branch control module are proposed to generate fine-grained, controllable multi-character video. Visualizations of the generated videos demonstrate the precise controllability of our method for multi-character generation. We also verify the generality of our method by applying it to various personalized T2I models. Moreover, the quantitative results show that our approach achieves superior performance compared with previous works.

Multi-Character Video Generation: A Novel Approach for Realistic Scenarios

In the field of multimedia information systems, the generation of text-editable and pose-controllable character videos is a challenging but important topic. With practical applications in areas such as virtual reality and augmented reality, the ability to generate dynamic and realistic multi-character videos can greatly enhance user experiences. However, existing approaches have mainly focused on single-object video generation with pose guidance, overlooking the realistic scenario where multiple characters appear concurrently.

To address this limitation, the authors propose a novel multi-character video generation framework that supports the simultaneous generation of multiple characters in a tuning-free manner. The framework is based on the separation of text and pose guidance, enabling precise control over each character’s appearance and movements. Its key contributions lie in the extraction of character masks from pose sequences to identify each character’s spatial position, the use of large language models (LLMs) to derive precise per-character text guidance, and the introduction of spatial-aligned cross attention and a multi-branch control module to generate fine-grained, controllable multi-character videos (see the sketch below).
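To make these components concrete, here is a minimal, hedged sketch of how such a pipeline might fit together, assuming each character is given as 2D pose keypoints. The padded bounding-box masks, the per-character masked cross attention, and the mask-gated merge of control-branch residuals are illustrative stand-ins, not the paper's actual implementation, whose details are not given in the abstract.

```python
import numpy as np
import torch


def character_masks_from_poses(keypoints, h, w, pad=0.1):
    """Derive a coarse binary mask per character from 2D pose keypoints.

    keypoints: list of (num_joints, 2) arrays in pixel coordinates, one per
    character. A padded bounding box stands in for whatever mask-extraction
    step the paper actually uses. Returns a (num_chars, h, w) float array.
    """
    masks = np.zeros((len(keypoints), h, w), dtype=np.float32)
    for i, kps in enumerate(keypoints):
        (x0, y0), (x1, y1) = kps.min(axis=0), kps.max(axis=0)
        dx, dy = pad * (x1 - x0), pad * (y1 - y0)
        xa, xb = int(max(x0 - dx, 0)), int(min(x1 + dx, w))
        ya, yb = int(max(y0 - dy, 0)), int(min(y1 + dy, h))
        masks[i, ya:yb, xa:xb] = 1.0
    return masks


def spatially_masked_cross_attention(q, char_kv, masks):
    """Cross attention in which each character's text tokens are written
    only to the latent positions inside that character's mask.

    q:       (hw, d) image-latent queries for one frame.
    char_kv: list of (k, v) pairs, one per character, each (n_tokens, d),
             encoded from that character's individual prompt.
    masks:   (num_chars, hw) flattened binary masks.
    """
    out = torch.zeros_like(q)
    d = q.shape[-1]
    for (k, v), m in zip(char_kv, masks):
        attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)  # (hw, n_tokens)
        out += m.unsqueeze(-1) * (attn @ v)  # write only inside the mask
    return out


def merge_control_branches(branch_residuals, masks):
    """Mask-gated merge of per-character control branches: each character's
    pose sequence drives its own branch, and the masks keep each branch's
    residual from leaking into another character's region.

    branch_residuals: (num_chars, c, h, w); masks: (num_chars, h, w).
    """
    return (branch_residuals * masks.unsqueeze(1)).sum(dim=0)


# Toy usage at a 32x32 latent resolution with two characters.
h, w, d = 32, 32, 64
keypoints = [np.array([[4.0, 6.0], [10.0, 26.0]]),
             np.array([[18.0, 8.0], [28.0, 24.0]])]
masks = torch.from_numpy(character_masks_from_poses(keypoints, h, w))
q = torch.randn(h * w, d)
char_kv = [(torch.randn(8, d), torch.randn(8, d)) for _ in range(2)]
fused = spatially_masked_cross_attention(q, char_kv, masks.reshape(2, -1))
residual = merge_control_branches(torch.randn(2, 4, h, w), masks)
```

The intuition behind such a design is that the masks serve double duty: they restrict each character's text tokens to that character's spatial region during cross attention, and they keep each pose-driven control branch from leaking into another character's area.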

The interdisciplinary nature of this research is evident as it combines concepts from various fields such as computer vision, natural language processing, and graphics. By integrating these different disciplines, the framework is able to generate highly realistic multi-character videos that can be tailored to specific scenarios and personalized preferences.

In the wider field of multimedia information systems, this research contributes to the advancement of animation techniques, artificial reality, augmented reality, and virtual reality. The ability to generate multi-character videos with precise controllability opens up new possibilities for immersive storytelling, virtual training environments, and interactive applications. This research also aligns with the growing demand for dynamic and realistic multimedia content in entertainment, education, and virtual simulations.

The visual results of the proposed approach are impressive, showcasing the precise controllability and realism of the generated multi-character videos. Additionally, the quantitative results demonstrate that this approach outperforms previous works, indicating the effectiveness and generalizability of the proposed framework.

In conclusion, the proposed multi-character video generation framework represents a significant advancement in the field of multimedia information systems. By addressing the challenge of generating realistic multi-character videos, this research opens up new possibilities for immersive and interactive multimedia experiences in various domains. The interdisciplinary nature of the concepts involved further highlights the importance of integrating different fields to achieve groundbreaking results. Moving forward, further research can explore the application of this framework in real-world scenarios and investigate its potential in areas such as gaming, virtual reality storytelling, and virtual training simulations.

Read the original article