arXiv:2402.12760v1
Abstract: Well-designed prompts have demonstrated the potential to guide text-to-image models in generating amazing images. Although existing prompt engineering methods can provide high-level guidance, it is challenging for novice users to achieve the desired results by manually entering prompts due to a discrepancy between novice-user-input prompts and the model-preferred prompts. To bridge the distribution gap between user input behavior and model training datasets, we first construct a novel Coarse-Fine Granularity Prompts dataset (CFP) and propose a novel User-Friendly Fine-Grained Text Generation framework (UF-FGTG) for automated prompt optimization. For CFP, we construct a novel dataset for text-to-image tasks that combines coarse and fine-grained prompts to facilitate the development of automated prompt generation methods. For UF-FGTG, we propose a novel framework that automatically translates user-input prompts into model-preferred prompts. Specifically, we propose a prompt refiner that continually rewrites prompts to empower users to select results that align with their unique needs. Meanwhile, we integrate image-related loss functions from the text-to-image model into the training process of text generation to generate model-preferred prompts. Additionally, we propose an adaptive feature extraction module to ensure diversity in the generated results. Experiments demonstrate that our approach is capable of generating more visually appealing and diverse images than previous state-of-the-art methods, achieving an average improvement of 5% across six quality and aesthetic metrics.

Automated Prompt Optimization for Text-to-Image Models

In this article, the authors propose User-Friendly Fine-Grained Text Generation (UF-FGTG), a novel framework for automated prompt optimization in text-to-image models. It addresses the difficulty novice users face in achieving desired results with manually entered prompts by bridging the gap between user input behavior and the distribution of the models' training data.

The research takes a multi-disciplinary view, drawing on concepts from multimedia information systems, animation, artificial reality, augmented reality, and virtual reality. By leveraging these perspectives, the authors aim to improve the generation of visually appealing and diverse images.

Constructing the Coarse-Fine Granularity Prompts Dataset

The authors first construct a novel dataset called Coarse-Fine Granularity Prompts (CFP) specifically for text-to-image tasks. The dataset pairs coarse-grained prompts with fine-grained ones to facilitate the development of automated prompt generation methods, allowing high-level guidance while keeping user preferences in view.
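To make the pairing concrete, here is a minimal Python sketch of what a CFP-style training example might look like. The field names and the two sample pairs are hypothetical illustrations; the actual CFP entries and schema may differ.

```python
from dataclasses import dataclass

@dataclass
class PromptPair:
    """One CFP-style example pairing a short user prompt
    with a detailed, model-preferred rewrite."""
    coarse: str  # terse prompt a novice might type
    fine: str    # enriched prompt with style, lighting, and quality cues

# Hypothetical examples illustrating the coarse-to-fine pairing.
cfp_samples = [
    PromptPair(
        coarse="a cat on a sofa",
        fine=("a fluffy tabby cat curled up on a velvet sofa, warm "
              "afternoon light, shallow depth of field, highly detailed"),
    ),
    PromptPair(
        coarse="mountain lake",
        fine=("a crystal-clear alpine lake surrounded by snow-capped "
              "peaks at sunrise, mist over the water, cinematic lighting"),
    ),
]

for pair in cfp_samples:
    print(f"{pair.coarse!r} -> {pair.fine!r}")
```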

User-Friendly Fine-Grained Text Generation Framework

The proposed UF-FGTG framework automatically translates user-input prompts into model-preferred prompts. At its core is a prompt refiner that continually rewrites prompts, empowering users to select the results that best match their needs.
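The interaction loop can be approximated with off-the-shelf components. The sketch below uses a generic instruction-tuned model (google/flan-t5-small) as a stand-in for the trained prompt refiner, which is an assumption on my part; UF-FGTG trains its own rewriter on CFP. The loop samples several candidate rewrites per round and feeds the chosen one back in, mimicking the user-in-the-loop selection the authors describe.

```python
from transformers import pipeline

# Stand-in refiner: flan-t5-small is an assumption, not the authors' model.
refiner = pipeline("text2text-generation", model="google/flan-t5-small")

def refine_once(prompt: str, n: int = 3) -> list[str]:
    """Propose n candidate rewrites of a coarse prompt."""
    instruction = f"Rewrite this image prompt with more visual detail: {prompt}"
    outputs = refiner(instruction, num_return_sequences=n,
                      do_sample=True, max_new_tokens=64)
    return [o["generated_text"] for o in outputs]

prompt = "a cat on a sofa"
for round_idx in range(2):  # two refinement rounds for illustration
    candidates = refine_once(prompt)
    for i, cand in enumerate(candidates):
        print(f"[round {round_idx}] option {i}: {cand}")
    prompt = candidates[0]  # in practice, the user picks a candidate
```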

To steer generation toward prompts the model prefers, the authors integrate image-related loss functions from the text-to-image model into the training of the text generator. As a result, the generated prompts are optimized for the quality of the images they produce, not merely for textual fluency.
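A minimal sketch of such a joint objective follows: a standard language-modeling loss on the rewritten prompt plus a weighted image-side term (here, a negated aesthetic score) back-propagated into the text generator. The stand-in modules, tensor shapes, and weighting coefficient are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

vocab, dim = 1000, 64
prompt_generator = nn.Linear(dim, vocab)   # stand-in for the refiner's LM head
image_branch = nn.Linear(dim, dim)         # stand-in differentiable image pathway
aesthetic_head = nn.Linear(dim, 1)         # predicts an aesthetic score

lm_loss_fn = nn.CrossEntropyLoss()
lambda_img = 0.1  # text/image loss weighting (assumed value)

hidden = torch.randn(8, dim, requires_grad=True)  # fake decoder states
targets = torch.randint(0, vocab, (8,))           # fake next-token targets

lm_loss = lm_loss_fn(prompt_generator(hidden), targets)
image_feat = image_branch(hidden)
img_loss = -aesthetic_head(image_feat).mean()  # higher score is better, so negate

total_loss = lm_loss + lambda_img * img_loss
total_loss.backward()
print(f"lm={lm_loss.item():.3f} img={img_loss.item():.3f} total={total_loss.item():.3f}")
```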

Furthermore, an adaptive feature extraction module ensures diversity in the generated results, enhancing visual appeal and preventing the framework from collapsing to repetitive or near-identical images.
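The article does not detail the module itself, but one simple way to encourage diversity is to penalize candidates whose features are too similar to one another. The sketch below computes the mean off-diagonal cosine similarity among candidate embeddings as a diversity penalty; it only illustrates the idea of spreading generated results apart in feature space, not the authors' adaptive feature extraction module.

```python
import torch
import torch.nn.functional as F

def diversity_penalty(embeddings: torch.Tensor) -> torch.Tensor:
    """embeddings: (n_candidates, dim). Returns the mean off-diagonal
    cosine similarity; lower values mean more diverse candidates."""
    normed = F.normalize(embeddings, dim=-1)
    sim = normed @ normed.T              # (n, n) pairwise cosine similarities
    n = sim.size(0)
    off_diag = sim - torch.eye(n)        # zero out self-similarity (diagonal is 1)
    return off_diag.sum() / (n * (n - 1))

candidates = torch.randn(4, 64)  # fake embeddings of 4 candidate prompts
print(f"diversity penalty: {diversity_penalty(candidates).item():.3f}")
```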

Impact and Implications

This research has significant implications for multimedia information systems. By automating prompt optimization for text-to-image models, it streamlines the production of visually appealing and diverse images, with applications in graphic design, advertising, and entertainment, where high-quality visuals are crucial.

The concepts of animation, artificial reality, augmented reality, and virtual reality are closely related to this work. Animation and virtual reality demand realistic, visually engaging imagery, which improved text-to-image generation can help supply. Artificial and augmented reality likewise benefit from more diverse and appealing images, enhancing user experience in these simulated environments.

In conclusion, the authors’ UF-FGTG framework presents a promising solution for automated prompt optimization in text-to-image models. By leveraging multi-disciplinary concepts and constructing the CFP dataset, this research contributes to the wider field of multimedia information systems and has implications for any domain that relies on high-quality visuals.

Read the original article