arXiv:2406.18583v1 Abstract: Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges, including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduce a sigmoid time discretization schedule to reduce sampling steps in solving the Flow ODE and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting the overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all code and model weights, we aim to advance the development of next-generation generative AI capable of universal modeling.
The paper “Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT” introduces Lumina-Next, an improved version of Lumina-T2X, a family of Flow-based Large Diffusion Transformers. Lumina-Next addresses challenges faced by Lumina-T2X, such as training instability, slow inference, and extrapolation artifacts. The authors present the Next-DiT architecture with 3D RoPE and sandwich normalizations as an improved version of the Flag-DiT architecture. They also propose Frequency- and Time-Aware Scaled RoPE for better resolution extrapolation in text-to-image generation. The paper further introduces a sigmoid time discretization schedule to reduce sampling steps and the Context Drop method for faster network evaluation. Lumina-Next not only enhances the quality and efficiency of text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation. The authors validate Lumina-Next by applying it to various tasks, including visual recognition, multi-view, audio, music, and point cloud generation, showcasing its strong performance across domains. The release of all code and model weights aims to advance the development of next-generation generative AI.
Lumina-Next: Advancements in Transforming Noise into Various Modalities
The nascent family of Flow-based Large Diffusion Transformers, known as Lumina-T2X, has shown great potential in transforming noise into different modalities conditioned on text instructions. However, it still faces challenges in terms of training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X that overcomes these challenges and offers enhanced generation performance with improved training and inference efficiency.
The Flag-DiT Architecture: Analyzing Suboptimal Components
As a starting point, we conducted a comprehensive analysis of the Flag-DiT architecture utilized in Lumina-T2X. Through this analysis, we identified several suboptimal components that were holding back the model's performance. To address these issues, we introduced the Next-DiT architecture, which incorporates modifications such as 3D RoPE (Rotary Position Embedding extended to the temporal and two spatial axes) and sandwich normalizations.
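To make the sandwich-normalization idea concrete, here is a minimal PyTorch sketch of a transformer block normalized on both sides of each sub-layer. It is illustrative only: Next-DiT's actual block uses RMSNorm and timestep-conditioned modulation, and the layer choices below are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SandwichBlock(nn.Module):
    """Transformer block with sandwich normalization: each sub-layer is
    normalized both before (pre-norm) and after (post-norm) its residual
    branch, which bounds activation growth and stabilizes training.
    Sketch only; Next-DiT uses RMSNorm and timestep modulation."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.SiLU(),
            nn.Linear(dim * mlp_ratio, dim),
        )
        # The "sandwich": a norm on each side of every sub-layer.
        self.pre_attn_norm, self.post_attn_norm = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.pre_mlp_norm, self.post_mlp_norm = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.pre_attn_norm(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.post_attn_norm(h)  # post-norm caps the residual update
        x = x + self.post_mlp_norm(self.mlp(self.pre_mlp_norm(x)))
        return x
```

The post-norms are what distinguish this from the standard pre-norm block: they keep the magnitude of each residual update bounded, which is the stability property sandwich normalization targets.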
Better Resolution Extrapolation with Frequency- and Time-Aware Scaled RoPE
A key challenge in text-to-image generation is resolution extrapolation: sampling images at resolutions higher than those seen during training without introducing artifacts. To tackle this challenge, we compared different context extrapolation methods applied to text-to-image generation with 3D RoPE. Based on these comparisons, we proposed Frequency- and Time-Aware Scaled RoPE, a novel approach tailored for diffusion transformers. This method significantly enhances resolution extrapolation, allowing for more detailed and realistic image generation at larger resolutions. A sketch of the frequency-scaling idea follows.
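As a rough illustration of frequency-aware scaling, the sketch below applies an NTK-style adjustment to the rotary base so that low-frequency components stretch more than high-frequency ones as the target resolution grows. The exact scaling rule, and the time-aware component of the paper's method, differ; this function and its `scale` parameter are assumptions for illustration only.

```python
import torch

def scaled_rope_freqs(head_dim: int, n_positions: int, scale: float,
                      base: float = 10000.0) -> torch.Tensor:
    """Illustrative frequency-aware RoPE scaling (NTK-style): instead of
    linearly compressing positions, the rotary base is enlarged so that
    low-frequency components stretch while high-frequency components,
    which carry fine local detail, are nearly preserved.
    `scale` = target_resolution / training_resolution.
    Not the paper's exact Frequency- and Time-Aware rule."""
    half = head_dim // 2
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (ntk_base ** (torch.arange(half, dtype=torch.float32) / half))
    pos = torch.arange(n_positions, dtype=torch.float32)
    return torch.outer(pos, inv_freq)  # (n_positions, head_dim // 2) angles
```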
Improving Training Stability and Inference Speed
In addition to enhancing generation performance, Lumina-Next addresses the issues of training instability and slow inference. We introduced a sigmoid time discretization schedule that reduces the number of sampling steps needed to solve the Flow ODE, resulting in faster sampling at comparable quality. Furthermore, we implemented the Context Drop method to merge redundant visual tokens, leading to faster network evaluation and improved overall sampling speed.
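Here is a minimal sketch of the sampling side, assuming a sigmoid-spaced timestep schedule plugged into a plain Euler solver for the flow ODE dx/dt = v(x, t). The `scale` parameter and the Euler solver are illustrative choices, not the paper's exact configuration.

```python
import torch

def sigmoid_time_schedule(num_steps: int, scale: float = 6.0) -> torch.Tensor:
    """Sigmoid-spaced timesteps on [0, 1]: uniformly spaced points in logit
    space are pushed through a sigmoid, which clusters steps near t=0 and
    t=1 where the flow changes fastest. `scale` controls the clustering
    strength and is an illustrative value, not the paper's tuned one."""
    u = torch.linspace(-scale, scale, num_steps + 1)
    t = torch.sigmoid(u)
    # Renormalize so the schedule exactly spans [0, 1].
    return (t - t[0]) / (t[-1] - t[0])

@torch.no_grad()
def euler_flow_sample(velocity_model, x: torch.Tensor, num_steps: int) -> torch.Tensor:
    """Plain Euler solver for the flow ODE dx/dt = v(x, t): start from
    Gaussian noise x at t=0 and integrate toward the data at t=1."""
    ts = sigmoid_time_schedule(num_steps)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        t_batch = torch.full((x.shape[0],), float(t_cur))
        v = velocity_model(x, t_batch)
        x = x + (t_next - t_cur) * v
    return x
```

The intuition is that with a fixed step budget, spending proportionally more steps where the velocity field changes quickly lets a small `num_steps` match the quality of a much longer uniform schedule.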
Beyond Text-to-Image Generation: Lumina-Next’s Versatility and Performance
Lumina-Next is not limited to basic text-to-image generation. We demonstrate its versatility and strong performance in various domains, including visual recognition, multi-view generation, audio generation, music generation, and point cloud generation. By applying Lumina-Next to these tasks, we showcase its capabilities and solidify its position as a versatile and powerful generative framework.
Advancing the Development of Next-Generation Generative AI
We believe in the importance of collaboration and knowledge sharing in the field of AI development. As a result, we are releasing all code and model weights related to Lumina-Next, aiming to contribute to the advancement of next-generation generative AI and universal modeling. By providing access to these resources, we hope to inspire further innovation and exploration in the field.
In conclusion, Lumina-Next represents a significant step forward in the transformation of noise into various modalities. Its improvements in generation performance, training efficiency, inference speed, and versatility make it a promising framework for generative AI. We invite researchers and developers to explore Lumina-Next and contribute to the ongoing progress in this field.
The paper introduces Lumina-Next, an improved version of the Lumina-T2X model. Lumina-T2X is a family of Flow-based Large Diffusion Transformers that can transform noise into different modalities, such as images and videos, conditioned on text instructions. Although Lumina-T2X shows promising capabilities, it faces challenges like training instability, slow inference, and extrapolation artifacts.
To address these challenges, the authors propose Lumina-Next, which exhibits stronger generation performance while improving training and inference efficiency. They conduct a comprehensive analysis of the Flag-DiT architecture used in Lumina-T2X and identify suboptimal components. They introduce the Next-DiT architecture, which incorporates 3D RoPE (Rotary Position Embedding, extended to three axes) and sandwich normalizations to address these shortcomings.
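For intuition, the following sketch computes 3D RoPE rotation angles by splitting the attention-head channels into three groups, one rotated by each axis (time, height, width) of a token's position in the video grid. The even channel split and the base value are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def rope_3d_angles(t: int, h: int, w: int, head_dim: int,
                   base: float = 10000.0) -> torch.Tensor:
    """Illustrative 3D RoPE: head channels are split into three groups, each
    rotated according to one axis (time, height, width) of a token's
    position in the 3D grid. Returns per-token rotation angles of shape
    (t*h*w, 3 * (d_axis // 2)), where d_axis is the even per-axis width."""
    d_axis = head_dim // 3 // 2 * 2  # even number of channels per axis
    half = d_axis // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    # Integer (t, h, w) coordinates for every token in the 3D grid.
    coords = torch.stack(torch.meshgrid(
        torch.arange(t), torch.arange(h), torch.arange(w),
        indexing="ij"), dim=-1).reshape(-1, 3).float()
    # One block of angles per axis, concatenated along the channel dim.
    angles = [torch.outer(coords[:, a], inv_freq) for a in range(3)]
    return torch.cat(angles, dim=-1)
```

Because each axis gets its own rotary frequencies, relative offsets in time, height, and width are encoded independently, which is what lets the same positional scheme cover images (a single frame) and videos.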
Furthermore, the authors focus on enhancing resolution extrapolation, the ability to generate images at resolutions higher than those seen during training. They compare different context extrapolation methods applied to text-to-image generation with 3D RoPE and propose Frequency- and Time-Aware Scaled RoPE, tailored for diffusion transformers, to enable better resolution extrapolation.
Additionally, the authors introduce a sigmoid time discretization schedule to reduce the number of sampling steps required to solve the Flow ODE (Ordinary Differential Equation). They also propose the Context Drop method, which merges redundant visual tokens, leading to faster network evaluation and an overall boost in sampling speed.
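Below is a simplified sketch of the token-merging idea behind Context Drop, assuming a cosine-similarity criterion: tokens whose nearest neighbor is very similar are averaged into that neighbor, shrinking the sequence the attention layers must process. The paper's actual redundancy criterion and merge rule may differ.

```python
import torch
import torch.nn.functional as F

def merge_redundant_tokens(x: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    """Illustrative token merging in the spirit of Context Drop.
    x: (n_tokens, dim); returns (n_keep, dim). The most redundant tokens
    (those with a highly similar neighbor) are averaged into their nearest
    kept token. A simplified sketch, not the paper's exact algorithm."""
    n, _ = x.shape
    n_keep = max(1, int(n * keep_ratio))
    xn = F.normalize(x, dim=-1)
    sim = xn @ xn.t()
    sim.fill_diagonal_(-1.0)  # ignore self-similarity
    # Redundancy score: similarity to the nearest other token.
    redundancy = sim.max(dim=-1).values
    order = redundancy.argsort()           # most distinctive first
    keep, drop = order[:n_keep], order[n_keep:]
    kept = x[keep].clone()
    counts = torch.ones(n_keep, 1)
    # Merge each dropped token into its most similar kept token (mean).
    nearest = sim[drop][:, keep].argmax(dim=-1)
    for dst, src in zip(nearest.tolist(), drop.tolist()):
        kept[dst] += x[src]
        counts[dst] += 1
    return kept / counts
```

Since self-attention cost grows quadratically with sequence length, even a modest `keep_ratio` translates into a noticeably cheaper network evaluation.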
The improvements made in Lumina-Next not only enhance the quality and efficiency of basic text-to-image generation but also demonstrate superior resolution extrapolation capabilities and multilingual generation. The authors achieve multilingual generation by using decoder-based large language models (LLMs) as the text encoder, enabling Lumina-Next to generate images from text instructions in multiple languages, all in a zero-shot manner.
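A minimal sketch of using a decoder-only LLM as the text encoder, assuming the Hugging Face transformers API: the LLM's last-layer hidden states serve as per-token text features. The checkpoint name is a placeholder, and how Lumina-Next injects these features into the diffusion transformer is not shown here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the paper uses a decoder-only LLM as the text
# encoder, but this specific model name is illustrative.
MODEL_NAME = "google/gemma-2b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
llm = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

@torch.no_grad()
def encode_prompt(prompt: str) -> torch.Tensor:
    """Return per-token hidden states from the LLM's last layer, to be fed
    to the diffusion transformer as text conditioning."""
    inputs = tokenizer(prompt, return_tensors="pt")
    out = llm(**inputs, output_hidden_states=True)
    return out.hidden_states[-1]  # (1, seq_len, hidden_dim)

# Because the LLM is multilingual, prompts in different languages map into
# a shared representation space, enabling zero-shot multilingual generation.
text_embedding = encode_prompt("一只戴着草帽的柴犬")  # "a shiba inu in a straw hat"
```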
To showcase the versatility of Lumina-Next as a generative framework, the authors instantiate it on diverse tasks, including visual recognition, multi-view generation, audio generation, music generation, and point cloud generation. The results across these domains demonstrate strong performance, highlighting the broad applicability of Lumina-Next.
In an effort to advance the development of next-generation generative AI, the authors release all code and model weights associated with Lumina-Next. This open-source approach aims to foster collaboration and further advances in universal modeling.
Overall, Lumina-Next presents significant advancements over its predecessor, addressing key challenges and improving the quality, efficiency, and versatility of generative AI. Its improved generation performance, resolution extrapolation capabilities, and multilingual generation make it a promising framework for various applications, while the release of codes and model weights encourages further research and development in the field.
Read the original article