CAC: Localized Text-to-Image Generation For Free via Cross Attention Control

Carnegie Mellon University

Abstract

Despite the tremendous success of text-to-image generative models, localized text-to-image generation (that is, generating objects or features at specific locations in an image while keeping the overall generation consistent) still requires either explicit training or substantial additional inference time. In this work, we show that localized generation can be achieved simply by controlling cross attention maps during inference. With no additional training, model architecture modification, or inference time, our proposed cross attention control (CAC) provides new open-vocabulary localization abilities to standard text-to-image models. When applied at inference time, CAC also enhances models that are already trained for localized generation. Furthermore, to assess localized text-to-image generation performance automatically, we develop a standardized suite of evaluations using large pretrained recognition models. Our experiments show that CAC improves localized generation performance with various types of location information, ranging from bounding boxes to semantic segmentation maps, and enhances the compositional capability of state-of-the-art text-to-image generative models.

@article{he2023localized,
  title={Localized Text-to-Image Generation for Free via Cross Attention Control},
  author={Yutong He and Ruslan Salakhutdinov and J. Zico Kolter},
  journal={arXiv preprint arXiv:2306.14636},
  year={2023}
}

CAC as a plugin to existing methods for localized text-to-image generation. CAC improves upon diverse types of localization (bounding boxes, semantic segmentation maps and localized styles) with different base models (Stable Diffusion and GLIGEN).

Proposed Method

Illustration of CAC for localized generation. CAC uses localized text descriptions and their spatial constraints to manipulate the cross attention maps during inference.
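This page does not spell out the exact formulation, so the following is only a minimal NumPy sketch of the general idea of masking cross attention maps with spatial constraints. The function names (`cross_attention`, `cac_layer`), the single-head layout, and the choice to average the caption output with the masked localized outputs are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v, mask=None):
    """Single-head cross attention.

    q: (n_pixels, d) image queries; k, v: (n_tokens, d) text keys/values.
    mask: optional (n_pixels,) array of 0/1 marking where the text applies.
    Returns the attended output and the (possibly masked) attention map.
    """
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    if mask is not None:
        # Zero the attention map at every pixel outside the region.
        attn = attn * mask[:, None]
    return attn @ v, attn

def cac_layer(q, caption_kv, local_kvs, local_masks):
    """Hypothetical combination rule: average the unmasked caption output
    with the region-masked output of each localized description."""
    outputs = [cross_attention(q, *caption_kv)[0]]
    for (k, v), m in zip(local_kvs, local_masks):
        outputs.append(cross_attention(q, k, v, mask=m)[0])
    return np.mean(outputs, axis=0)

# Toy example: a 4x4 feature map, one localized prompt on the left half.
rng = np.random.default_rng(0)
h, w, d = 4, 4, 8
q = rng.normal(size=(h * w, d))
caption_kv = (rng.normal(size=(5, d)), rng.normal(size=(5, d)))
local_kv = (rng.normal(size=(3, d)), rng.normal(size=(3, d)))
mask = np.zeros((h, w))
mask[:, :2] = 1.0
out = cac_layer(q, caption_kv, [local_kv], [mask.reshape(-1)])
print(out.shape)  # (16, 8)
```

Because the mask only rescales attention maps that the model already computes, a step like this adds no trainable parameters, which is in keeping with the training-free claim in the abstract.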

Demo

Examples of generated images with a variety of different types of user inputs and applications.

CAC is versatile and can be applied to various application scenarios.


Generating based on bounding box information.

Examples and baseline comparison of generated images based on bounding box information.
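To make the bounding box input concrete, here is a small sketch of how a box might be rasterized into a binary mask at the resolution of an attention map. The normalized `(x0, y0, x1, y1)` convention and the helper name `box_to_mask` are assumptions for illustration; the page does not state the coordinate format used.

```python
import numpy as np

def box_to_mask(box, height, width):
    """Rasterize a normalized box (x0, y0, x1, y1) in [0, 1] into a
    (height, width) mask with ones inside the box and zeros elsewhere."""
    x0, y0, x1, y1 = np.clip(box, 0.0, 1.0)
    # Round outward so that thin boxes still cover at least one cell.
    c0, c1 = int(np.floor(x0 * width)), int(np.ceil(x1 * width))
    r0, r1 = int(np.floor(y0 * height)), int(np.ceil(y1 * height))
    mask = np.zeros((height, width), dtype=np.float32)
    mask[r0:r1, c0:c1] = 1.0
    return mask

# A box covering the left half of the image, from 25% height to the bottom.
mask = box_to_mask((0.0, 0.25, 0.5, 1.0), height=8, width=8)
print(int(mask.sum()))  # 24 cells: 6 rows x 4 columns
```

Attention maps in diffusion models typically appear at several resolutions, so a helper like this would be called once per resolution rather than resizing a single mask.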

Generating based on compositional prompts.

Examples and baseline comparison of compositional generation.

Generating based on semantic segmentation map information.

Examples and baseline comparison of generated images based on semantic segmentation map information.
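A semantic segmentation map can be viewed as a set of localized descriptions, one per class, each paired with the region that class occupies. The sketch below splits an integer label map into per-class binary masks keyed by a text label. The helper name `seg_to_masks` and the class names are hypothetical and serve only as an example.

```python
import numpy as np

def seg_to_masks(seg, class_names):
    """Split an (H, W) integer label map into one binary mask per class.

    class_names maps a class id to the text description for that region.
    Class ids without a description are skipped.
    """
    masks = {}
    for class_id in np.unique(seg):
        class_id = int(class_id)
        if class_id in class_names:
            masks[class_names[class_id]] = (seg == class_id).astype(np.float32)
    return masks

# Toy 2x3 label map with three hypothetical classes.
seg = np.array([[0, 0, 1],
                [0, 2, 1]])
masks = seg_to_masks(seg, {0: "sky", 1: "tree", 2: "dog"})
print(sorted(masks))  # ['dog', 'sky', 'tree']
```

Each resulting pair of text label and mask plays the same role as a bounding box with its description, only with a free-form region instead of a rectangle.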

Ablation Study: The Fidelity-Controllability Trade-offs

Comparison of the fidelity-controllability trade-offs with and without CAC. With CAC applied, the model is able to reach a sweet spot where the generated images appear more realistic while maintaining better consistency with the bounding box information.

BibTeX