Make different images of the same subject without a custom training step
Paper: Training-Free Consistent Text-to-Image Generation
Summary by Adrian Wilkins-Caruana
Generative AI models like ChatGPT and Stable Diffusion can be difficult to control. For example, multiple inferences that use the same input prompt might result in vastly different chat responses or generated images. This lack of consistency limits the applicability of these models for doing useful work. Imagine you were a graphic designer using Stable Diffusion: It would be frustrating if the model altered aspects of a generated image in undesirable ways as you tried to iterate on it.
Some researchers have developed methods to make text-to-image models more consistent, but these methods come at a significant computational cost. However, today’s paper introduces a new method that’s both fast and effective at making text-to-image generation more consistent, as in the figure below, where the man, the kid, and the girl each look the same across the generated images.
Unlike other consistent text-to-image methods, which might encode an image into a special token S* or fine-tune generation to match a reference image, the proposed method, ConsiStory, is a training-free approach: no per-subject training or optimization is needed to achieve subject consistency. Instead, it works by sharing some of the internal model activations among several simultaneously generated images, which means that all the images, and their corresponding text prompts, are processed together in a batch. Several techniques work together to promote subject consistency while still allowing the environments to vary.
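To make the batching idea concrete, here’s a minimal, hypothetical sketch (illustrative names, not the authors’ code) of what joint generation might look like, assuming a diffusers-style UNet and scheduler interface; `denoise_batch` and the prompt templates are assumptions for illustration.

```python
import torch

# The subject phrase is shared across prompts, only the setting changes, and all
# images are denoised together so their attention layers can exchange activations.
subject = "an old man wearing a hat"
settings = ["walking in the park", "writing numbers on a blackboard"]
prompts = [f"A photo of {subject} {s}" for s in settings]

def denoise_batch(unet, scheduler, text_embs, latent_shape=(4, 64, 64)):
    """Jointly denoise one latent per prompt; assumes a diffusers-style UNet and
    scheduler. The UNet's self-attention layers are assumed to be patched so that
    queries in one image can attend to subject keys in the other images (see the
    shared-attention sketch below)."""
    latents = torch.randn(len(text_embs), *latent_shape)
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=text_embs).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```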
The first technique the authors use is called a subject-driven shared attention block. This is the principal mechanism that lets the trained diffusion model generate the same subject across multiple images. The “shared attention block” extends the vanilla attention block so that the queries from one image can attend not only to the keys of that same image (as in vanilla attention) but also to the keys of the other images being generated. The “subject-driven” part is a mask that only lets those queries access keys belonging to the subject in the other images. This means that, for instance, in “A photo of an old man wearing a hat [walking in the park][writing numbers on a blackboard],” the model activations relating to the “park” and the “blackboard” are kept isolated, while the activations for the “old man wearing a hat” are shared. This process is shown below, where the unmasked (white) patches in the masks (M1, M2, M3) mark the subject.
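In code, a simplified sketch of this masked, cross-image attention might look like the following. It illustrates the masking idea only, not the paper’s implementation: it omits multi-head splitting, positional details, and how the subject masks are actually obtained (in the paper, from cross-attention maps).

```python
import torch
import torch.nn.functional as F

def shared_self_attention(q, k, v, subject_mask):
    """Subject-driven shared attention, simplified sketch (not the authors' code).

    q, k, v:        (B, N, D) self-attention projections, one row per image in the batch.
    subject_mask:   (B, N) boolean, True for patches belonging to the subject.

    Each image's queries attend to all of its own keys plus the *subject*
    keys of every other image in the batch.
    """
    B, N, D = q.shape
    # Flatten all images' keys/values into one shared pool of B*N tokens.
    k_all = k.reshape(1, B * N, D).expand(B, -1, -1)
    v_all = v.reshape(1, B * N, D).expand(B, -1, -1)

    # Build an attention mask: within-image tokens are always visible;
    # tokens from other images are visible only if they lie on the subject.
    visible = subject_mask.reshape(1, B * N).expand(B, -1).clone()   # (B, B*N)
    for i in range(B):
        visible[i, i * N:(i + 1) * N] = True                         # own image: no mask

    scores = q @ k_all.transpose(1, 2) / D ** 0.5                    # (B, N, B*N)
    scores = scores.masked_fill(~visible[:, None, :], float("-inf"))
    return F.softmax(scores, dim=-1) @ v_all
```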
One issue with subject-driven shared attention is that it works a bit too well: the subject can look too similar across the different images, leaving no room for the natural variation you’d expect from differences in lighting or perspective. To fix this, the researchers use two main approaches. The first, self-attention dropout, is shown in the figure above: at each diffusion denoising step, it randomly masks out some of the subject patches.
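A minimal sketch of that dropout idea, assuming the same `subject_mask` layout as in the sketch above; the dropout probability `p` is an illustrative value, not necessarily the paper’s.

```python
import torch

def dropout_subject_mask(subject_mask, p=0.5, generator=None):
    """Randomly hide a fraction `p` of each image's subject patches before they
    are shared with the other images (a sketch of the dropout idea).

    subject_mask: (B, N) boolean, True on subject patches.
    """
    keep = torch.rand(subject_mask.shape, generator=generator) >= p
    return subject_mask & keep
```

The thinned mask is what would then be passed as `subject_mask` to the shared-attention sketch above, re-drawn at every denoising step.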
The second method is inspired by a finding from related work on how self-attention works: diffusion models can combine the structure of one image with the appearance of another by pairing the self-attention keys and values from the appearance image with the queries from the structure image. So, to increase the pose diversity of the generated images, the queries used during generation are blended with the queries from a separate, vanilla diffusion inference pass, one that doesn’t include the attention-sharing modifications introduced above. The blend is weighted heavily toward these vanilla queries at the beginning of sampling, since the early steps influence structure more than appearance or detail, and shifts toward the shared queries as sampling progresses.
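Here’s a hedged sketch of that blending schedule; the linear decay is an assumption for illustration, not necessarily the paper’s exact weighting.

```python
def blend_queries(q_shared, q_vanilla, step, num_steps):
    """Blend queries from the shared-attention pass with queries from an
    unmodified ("vanilla") pass of the same prompt and noise. Early steps lean
    on the vanilla queries, which mostly set layout and pose; later steps lean
    on the shared queries.
    """
    nu = 1.0 - step / max(num_steps - 1, 1)     # 1 -> 0 over the course of sampling
    return nu * q_vanilla + (1.0 - nu) * q_shared
```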
At this point the subjects are shared and there’s good pose diversity across the images, but the authors noticed that this still isn’t quite enough: finer, pixel-level details (e.g., eye color) still vary from image to image. This is where a novel method called cross-image feature injection comes in. The first step is to figure out where, say, the left eye is in each image in the batch. This is done by comparing every patch in one image to every patch in the other images: an existing method called diffusion features (DIFT) scores how similar each of those patches is to a given “current patch,” and the most similar patch is then linearly blended with the current patch. Note that feature injection is only applied to patches within the subject mask, not to background patches. The figure below shows a three-image batch (left), its subject masks (center), and the patchwise mapping determined by the most similar DIFT features (right).
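Below is a simplified sketch of how such a DIFT-based patch matching and blending step could work; the cosine-similarity matching and the `alpha` blending weight are illustrative assumptions, not the paper’s exact procedure.

```python
import torch
import torch.nn.functional as F

def cross_image_feature_injection(features, dift, subject_mask, alpha=0.5):
    """Sketch of cross-image feature injection (simplified, not the authors' code).

    features:     (B, N, D)  features to be harmonized across images.
    dift:         (B, N, C)  DIFT-style correspondence features per patch.
    subject_mask: (B, N)     boolean, True on subject patches.

    For every subject patch, find the most similar subject patch (by cosine
    similarity of DIFT features) in the *other* images and blend its features in.
    """
    B, N, D = features.shape
    f = F.normalize(dift, dim=-1)                        # cosine similarity via dot product
    sim = f.reshape(B * N, -1) @ f.reshape(B * N, -1).T  # (B*N, B*N)

    # Disallow matches within the same image and matches to non-subject patches.
    same_image = torch.arange(B).repeat_interleave(N)
    sim[same_image[:, None] == same_image[None, :]] = float("-inf")
    sim[:, ~subject_mask.reshape(-1)] = float("-inf")

    match = sim.argmax(dim=-1)                           # best cross-image patch per patch
    flat = features.reshape(B * N, D)
    blended = (1 - alpha) * flat + alpha * flat[match]

    # Only subject patches get the injected features; background patches are untouched.
    out = torch.where(subject_mask.reshape(-1, 1), blended, flat)
    return out.reshape(B, N, D)
```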
The graph below shows how effective ConsiStory is at generating consistent images. It plots subject consistency against text similarity for several models; scores toward the upper right are better. ConsiStory performs best, even with slightly different self-attention dropout values (d). The authors also conducted a user study in which participants were shown five ConsiStory images and five images from a competing baseline, then asked which model they preferred for subject (visual) and textual consistency. Across 3k responses, ConsiStory was preferred between 56% and 91% of the time, depending on the metric and baseline.
There’s a lot more I could talk about, such as how ConsiStory’s inference-only approach yields 8–25x faster image generation than the other methods, how it can use a reference image to essentially superimpose a specific subject into a new context, or how it can handle multiple subjects, like the girl, cat, and headphones in the first image above. If this sounds interesting to you, I encourage you to check out the paper’s project page, which shows many more consistent text-to-image examples.