Abstract: Diffusion models have shown impressive performance in many domains, including image generation, time series prediction, and reinforcement learning, demonstrating superior performance over traditional GAN- and transformer-based methods. However, their capability to follow natural language instructions (e.g., respecting spatial relationships between objects, generating complex scenes) remains unsatisfactory, and enhancing it has become an important research direction. Prior works adopt reinforcement learning to adjust the behavior of diffusion models, but RL methods not only require careful reward design and complex hyperparameter tuning, they also fail to incorporate rich natural language feedback. In this work, we propose iterative prompt relabeling (IP-RLDF), a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling. IP-RLDF first samples a batch of images conditioned on the text, then relabels the text prompts of unmatched text-image pairs with classifier feedback. We conduct thorough experiments on three different models, SDv2, GLIGEN, and SDXL, testing their capability to generate images following instructions. With IP-RLDF, performance improves by up to 15.22% (absolute) on the challenging spatial relation VISOR benchmark, outperforming previous RL methods.
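To make the sampling-and-relabeling loop concrete, here is a minimal Python sketch of one plausible reading of the procedure described in the abstract. The helper callables `generate_images`, `classifier_feedback`, and `finetune`, along with the loop hyperparameters, are illustrative assumptions introduced for this sketch, not the paper's actual implementation.

```python
from typing import Callable, List, Tuple

def iterative_prompt_relabeling(
    model,
    prompts: List[str],
    generate_images: Callable,      # assumed: (model, prompt, n) -> list of images
    classifier_feedback: Callable,  # assumed: (image, prompt) -> (matches: bool, observed_prompt: str)
    finetune: Callable,             # assumed: (model, [(image, prompt), ...]) -> updated model
    num_iters: int = 3,
    batch_size: int = 8,
):
    """One plausible reading of the IP-RLDF loop sketched in the abstract."""
    for _ in range(num_iters):
        training_pairs: List[Tuple[object, str]] = []
        for prompt in prompts:
            # Sample a batch of images conditioned on the text prompt.
            for image in generate_images(model, prompt, batch_size):
                # Ask a classifier (e.g., an object/relation detector) whether
                # the image matches the prompt; if not, relabel the pair with
                # a prompt describing what the image actually shows.
                matches, observed_prompt = classifier_feedback(image, prompt)
                training_pairs.append((image, prompt if matches else observed_prompt))
        # Fine-tune on the (re)labeled pairs, then sample again with the
        # updated model in the next iteration.
        model = finetune(model, training_pairs)
    return model
```

Under this reading, the key design point is that unmatched samples are not discarded: relabeling turns every generated image into a correctly captioned training example, which is the sense in which the method incorporates natural language feedback without designing a scalar reward.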