arXiv:2407.12899v1 Abstract: Story visualization aims to create visually compelling images or videos corresponding to textual narratives. Despite recent advances in diffusion models yielding promising results, existing methods still struggle to create a coherent sequence of subject-consistent frames based solely on a story. To this end, we propose DreamStory, an automatic open-domain story visualization framework that leverages LLMs and a novel multi-subject consistent diffusion model. DreamStory consists of (1) an LLM acting as a story director and (2) an innovative Multi-Subject consistent Diffusion model (MSD) for generating consistent multiple subjects across images. First, DreamStory employs the LLM to generate descriptive prompts for subjects and scenes aligned with the story, annotating each scene’s subjects for subsequent subject-consistent generation. Second, DreamStory utilizes these detailed subject descriptions to create portraits of the subjects, with these portraits and their corresponding textual information serving as multimodal anchors (guidance). Finally, the MSD uses these multimodal anchors to generate story scenes with consistent multiple subjects. Specifically, the MSD includes Masked Mutual Self-Attention (MMSA) and Masked Mutual Cross-Attention (MMCA) modules, which ensure appearance and semantic consistency with reference images and text, respectively. Both modules employ masking mechanisms to prevent subject blending. To validate our approach and promote progress in story visualization, we established a benchmark, DS-500, which can assess the overall performance of the story visualization framework, subject-identification accuracy, and the consistency of the generation model. Extensive experiments validate the effectiveness of DreamStory in both subjective and objective evaluations. Please visit our project homepage at https://dream-xyz.github.io/dreamstory.
The paper “DreamStory: An Automatic Open-Domain Story Visualization Framework” introduces DreamStory, a framework that creates visually compelling images or videos from textual narratives. While existing methods have made progress, they still struggle to generate a coherent sequence of subject-consistent frames from a story alone. DreamStory addresses this challenge by combining a large language model (LLM) with a Multi-Subject consistent Diffusion model (MSD). The LLM acts as a story director: it generates descriptive prompts for subjects and scenes and annotates which subjects appear in each scene for subsequent subject-consistent generation. DreamStory then uses these detailed subject descriptions to create portraits of the subjects, which, together with their textual descriptions, serve as multimodal anchors. Guided by these anchors, the MSD generates scenes in which multiple subjects remain consistent, employing Masked Mutual Self-Attention and Masked Mutual Cross-Attention modules to ensure appearance and semantic consistency with reference images and text, respectively. Experiments validate the effectiveness of DreamStory, and a new benchmark, DS-500, is established to assess the overall performance of the framework.

Exploring DreamStory: A New Approach to Story Visualization

The field of story visualization has made significant progress in recent years, with researchers striving to create visually striking images and videos that accurately represent textual narratives. While diffusion models have shown promise, existing methods still struggle to produce a coherent sequence of subject-consistent frames from a story alone. DreamStory responds to this challenge with an automatic open-domain story visualization framework that pairs large language models (LLMs) with a novel multi-subject consistent diffusion model.

The Components of DreamStory

DreamStory comprises two key components:

  1. LLM as the Story Director: The LLM generates descriptive prompts for subjects and scenes that stay aligned with the narrative, and annotates which subjects appear in each scene so that later stages can keep those subjects consistent.
  2. Multi-Subject Consistent Diffusion Model (MSD): The MSD uses the detailed subject descriptions produced by the LLM to generate a portrait of each subject. These portraits, together with their corresponding textual descriptions, serve as multimodal anchors (guidance), which the MSD then uses to render story scenes in which multiple subjects remain consistent. (A minimal sketch of this two-stage pipeline follows the list.)
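
To make the pipeline concrete, here is a minimal sketch in Python. The data model, the director prompt, and the `llm`, `generate_portrait`, and `msd_generate` callables are hypothetical stand-ins for illustration, not the authors’ actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subject:
    name: str
    description: str          # detailed appearance prompt produced by the LLM
    portrait_path: str = ""   # filled in after portrait generation

@dataclass
class Scene:
    prompt: str               # scene description produced by the LLM
    subject_names: list[str]  # the LLM's per-scene subject annotations

DIRECTOR_PROMPT = (
    "You are a story director. Given the story below, output (1) a detailed "
    "appearance description for every recurring subject and (2) a prompt for "
    "each scene, annotated with the subjects that appear in it.\n\nStory:\n{story}"
)

def visualize(story: str,
              llm: Callable[[str], tuple[list[Subject], list[Scene]]],
              generate_portrait: Callable[[str], str],
              msd_generate: Callable[[Scene, list[Subject]], str]) -> list[str]:
    # Stage 1: the LLM acts as story director.
    subjects, scenes = llm(DIRECTOR_PROMPT.format(story=story))
    # Stage 2a: render one portrait per subject; portrait + description
    # together form that subject's multimodal anchor.
    for s in subjects:
        s.portrait_path = generate_portrait(s.description)
    # Stage 2b: the MSD renders each scene, anchored on the subjects
    # the LLM annotated for that scene.
    return [msd_generate(scene,
                         [s for s in subjects if s.name in scene.subject_names])
            for scene in scenes]
```

The key point is the data flow: the LLM fixes both the per-subject descriptions and the per-scene subject annotations before any image is generated, so every scene receives the same anchors for a given subject.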

Incorporating this multi-subject consistency is crucial in story visualization, as it enhances the overall coherence and immersion experienced by the viewer. Without subject consistency, the visual representation may become disjointed or confusing, hindering the storytelling aspect of the visualization.

Making Use of Masked Mutual Self-Attention and Cross-Attention

The MSD component of DreamStory incorporates Masked Mutual Self-Attention (MMSA) and Masked Mutual Cross-Attention (MMCA) modules to ensure appearance and semantic consistency in the generated visuals.

The MMSA module maintains appearance consistency by leveraging reference images: it ensures that the generated visuals resemble the reference portraits in appearance, so a subject keeps the same look from frame to frame instead of drifting between generations.
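
Below is a hedged sketch of masked mutual self-attention in PyTorch. It assumes the common mutual self-attention recipe, in which the scene’s queries attend to keys and values drawn from both the scene itself and the subject’s reference portrait, with a binary mask limiting reference access to the subject’s own region; the paper’s exact formulation may differ.

```python
import math
import torch

def masked_mutual_self_attention(q, k, v, k_ref, v_ref, subject_mask):
    """q, k, v: (B, N, D) scene features, N = H*W latent tokens.
    k_ref, v_ref: (B, M, D) features of the subject's reference portrait.
    subject_mask: (B, N) bool, True where the subject appears in the scene.
    Scene tokens attend to both the scene and the reference, but only tokens
    inside the subject's mask may read from the reference, which ties the
    subject's appearance to its portrait without leaking into the rest of
    the frame."""
    B, N, D = q.shape
    M = k_ref.shape[1]
    k_all = torch.cat([k, k_ref], dim=1)                 # (B, N+M, D)
    v_all = torch.cat([v, v_ref], dim=1)                 # (B, N+M, D)
    logits = q @ k_all.transpose(1, 2) / math.sqrt(D)    # (B, N, N+M)
    # Forbid reference tokens for scene positions outside the subject mask.
    blocked = torch.zeros(B, N, N + M, dtype=torch.bool, device=q.device)
    blocked[:, :, N:] = ~subject_mask.unsqueeze(-1)      # broadcast over M
    logits = logits.masked_fill(blocked, float("-inf"))
    return logits.softmax(dim=-1) @ v_all                # (B, N, D)
```

With several subjects, the same routine can be applied once per subject with that subject’s own reference features and mask, which is what keeps each character tied to its own portrait.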

The MMCA module, on the other hand, targets semantic consistency with the reference texts. By incorporating the textual descriptions that accompany the subject portraits, it ensures the generated visuals match the intended semantics of each subject, so the imagery faithfully reflects the written descriptions and enriches the viewer’s understanding of the story.
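
A matching sketch for the cross-attention side, again hedged: the assumption here is that queries from a subject’s masked region may only read that subject’s text tokens, so each region is conditioned on its own description. Shapes and the routing rule are illustrative rather than the paper’s exact design.

```python
import math
import torch

def masked_mutual_cross_attention(q, text_k, text_v, subject_masks):
    """q: (B, N, D) scene queries over N latent tokens.
    text_k, text_v: (B, S, T, D) text tokens, T per subject, S subjects.
    subject_masks: (B, S, N) bool, True where subject s occupies the scene.
    Each scene token may only read the text tokens of the subject whose mask
    covers it, so neighbouring subjects' semantics cannot blend."""
    B, N, D = q.shape
    S, T = text_k.shape[1], text_k.shape[2]
    k = text_k.reshape(B, S * T, D)
    v = text_v.reshape(B, S * T, D)
    logits = q @ k.transpose(1, 2) / math.sqrt(D)        # (B, N, S*T)
    allow = subject_masks.permute(0, 2, 1)               # (B, N, S)
    allow = allow.unsqueeze(-1).expand(B, N, S, T).reshape(B, N, S * T)
    logits = logits.masked_fill(~allow, float("-inf"))
    out = logits.softmax(dim=-1) @ v                     # (B, N, D)
    # Tokens covered by no subject would softmax over all -inf (NaN); zero
    # them here -- in a full model they would attend to the scene prompt.
    covered = subject_masks.any(dim=1).unsqueeze(-1)     # (B, N, 1)
    return torch.where(covered, out, torch.zeros_like(out))
```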

Both modules employ masking mechanisms that prevent subject blending: each subject’s features are confined to its own region, so one character’s attributes never bleed into another’s, yielding a visually pleasing and coherent composition.
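
Where the masks themselves come from is an implementation detail this summary does not spell out; one simple stand-in, assumed here purely for illustration, is to rasterize per-subject bounding boxes on the latent grid and keep them disjoint:

```python
import torch

def boxes_to_masks(boxes, height, width):
    """boxes: list of (x0, y0, x1, y1) per subject, in latent-grid coords.
    Returns (S, H*W) bool masks; a later box never overwrites an earlier
    one, so every latent token belongs to at most one subject and the
    attention masks above can never blend two subjects."""
    masks = torch.zeros(len(boxes), height, width, dtype=torch.bool)
    claimed = torch.zeros(height, width, dtype=torch.bool)
    for s, (x0, y0, x1, y1) in enumerate(boxes):
        region = torch.zeros(height, width, dtype=torch.bool)
        region[y0:y1, x0:x1] = True
        masks[s] = region & ~claimed   # keep subject regions disjoint
        claimed |= masks[s]
    return masks.flatten(1)            # (S, H*W): one row per subject
```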

Evaluating DreamStory’s Performance

To validate DreamStory’s effectiveness and encourage further advances in story visualization, the authors established a benchmark called DS-500. It assesses the overall performance of the story visualization framework, subject-identification accuracy, and the consistency of the generation model.
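
This summary does not detail DS-500’s exact metrics, but a common automatic proxy for subject consistency, sketched below under the assumption of a CLIP image encoder from the Hugging Face transformers library, is the mean pairwise similarity between a subject’s crops across generated frames:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def subject_consistency(crop_paths: list[str]) -> float:
    """Mean pairwise cosine similarity between CLIP embeddings of the same
    subject cropped from different frames; higher means more consistent."""
    images = [Image.open(p).convert("RGB") for p in crop_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sim = feats @ feats.T                        # (n, n) cosine similarities
    n = sim.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()  # drop the self-similarities
    return (off_diag / (n * (n - 1))).item()
```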

Extensive experiments have been conducted to evaluate DreamStory using both subjective and objective measures. The results have demonstrated the efficacy of DreamStory in creating visually engaging and subject-consistent story visualizations. These findings contribute to the ongoing progress in the field and pave the way for future innovations in story visualization techniques.

Overall, this paper presents a promising approach to open-domain story visualization by combining language models with a multi-subject consistent diffusion model. The use of multimodal anchors and the masking mechanisms in the MSD contribute to generating visually coherent, subject-consistent story scenes, and the DS-500 benchmark is a valuable contribution that lets researchers compare and improve upon existing methods. This work opens up possibilities for further advances in story visualization and has the potential to enhance applications such as movie production, video game design, and virtual reality experiences.

If you are interested in learning more about DreamStory and exploring its capabilities, visit the project homepage at https://dream-xyz.github.io/dreamstory.
Read the original article