Generating better images without prompt engineering
Paper: Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
Summary by Adrian Wilkins-Caruana
When I use text-to-image models, I notice something peculiar about the images they generate. It’s hard to describe, but the images seem like a mishmash of all the images on the internet that match the text prompt. For example, if I were to generate “A visually appealing glass of orange juice,” the model would consolidate its entire understanding of the prompt into images that sit somewhere “in the middle” of its knowledge. I’d describe the images as “bland” or “uninteresting.” What do you think of the images below, which were generated by Stable Diffusion?
Today’s paper explores Emu, a model developed by researchers from Meta that’s fine-tuned to create visually appealing images from text. Unlike models that produce images that are technically correct but lack aesthetic value, Emu was fine-tuned on a small set of high-quality images to make sure it delivers not just on accuracy but also on visual appeal. The result is a model that can generate images like the orange juice below, which is bound to make you salivate.
The key to Emu’s image-generating ability is quality over quantity. Emu was first pre-trained on an internet-scale dataset of 1.1b image-text pairs; this approach isn’t novel, and the images the pre-trained model generated weren’t especially remarkable either. But the authors paid careful attention to the “quality-tuning” stage: They curated a dataset of just 2k particularly beautiful image-text pairs. For example, the images below compare a rowboat generated by the pre-trained model (left) and by the final fine-tuned model, Emu (right). Model fine-tuning isn’t a new technique either, but the authors claim the careful curation of the fine-tuning dataset is what gives Emu its high-fidelity generation abilities.
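To make the two-stage recipe concrete, here’s a minimal PyTorch-style sketch of the idea — not the authors’ actual training code. The same loop is run twice: first over a huge, uncurated pre-training set, then briefly over a tiny curated set with a small batch size, a low learning rate, and few steps. The toy model, random data, and hyperparameters below are all illustrative assumptions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for an image generator; the real Emu is a latent diffusion model.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))

def train(model, dataset, lr, batch_size, max_steps):
    """Generic training loop shared by both stages."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    step = 0
    while step < max_steps:
        for x, y in loader:
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()
            step += 1
            if step >= max_steps:
                break

# Stage 1: "pre-training" on a large, uncurated dataset (random tensors here).
big_data = TensorDataset(torch.randn(10_000, 64), torch.randn(10_000, 64))
train(model, big_data, lr=1e-4, batch_size=256, max_steps=500)

# Stage 2: "quality-tuning" on a tiny, hand-curated dataset — small batches,
# lower learning rate, and very few steps so the model adapts without
# overfitting or forgetting what it learned in stage 1.
curated_data = TensorDataset(torch.randn(100, 64), torch.randn(100, 64))
train(model, curated_data, lr=1e-5, batch_size=8, max_steps=50)
```

The point of the sketch is only the shape of the recipe: nothing changes between the two calls except the data and the scale of the hyperparameters.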
It’s worthwhile to look at exactly how the authors built this fine-tuning dataset. First, they filtered the 1.1b images down to a few hundred million by removing images with text in them and pairs whose captions don’t align well with the image. Also, being Meta, they further filtered these images down to around 200k using proprietary signals, like the number of likes. Next, a set of human annotators filtered the set down to 20k images, with the goal of removing poor-quality images that made it through the earlier filtering stages. Finally, a set of specialist annotators whittled the set down to 2k images, with the goal of selecting only the very best images according to their composition, lighting, color and contrast, and subject and background. The authors then composed high-quality annotations for each of these final 2k images. To put it in perspective, the fine-tuning dataset is about 0.0002% of the size of the pre-training dataset!
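As a rough sketch of what this funnel looks like as a data pipeline — every filter function, field name, and threshold below is hypothetical, since the paper’s actual filters (text detection, image-text alignment scoring, engagement signals, and two rounds of human review) are proprietary or manual:

```python
def automatic_filters(pairs):
    # Drop images containing rendered text and pairs whose caption doesn't
    # match the image well (e.g. via a CLIP-style alignment score).
    return [p for p in pairs if not p["has_text"] and p["alignment_score"] > 0.3]

def engagement_filter(pairs, keep=200_000):
    # Keep the images with the strongest proprietary engagement signals
    # (e.g. number of likes).
    return sorted(pairs, key=lambda p: p["likes"], reverse=True)[:keep]

def human_filter(pairs, keep):
    # Stand-in for a round of human annotation; here we just keep the
    # top-rated items. In the paper this is done twice: a generalist pass
    # and a specialist pass.
    return sorted(pairs, key=lambda p: p["annotator_score"], reverse=True)[:keep]

def build_quality_tuning_set(pairs):
    pairs = automatic_filters(pairs)          # ~1.1b -> a few hundred million
    pairs = engagement_filter(pairs)          # -> ~200k
    pairs = human_filter(pairs, keep=20_000)  # generalist annotators -> 20k
    pairs = human_filter(pairs, keep=2_000)   # specialist annotators -> 2k
    return pairs

# Sanity check on the scale: 2,000 / 1.1e9 ≈ 0.00018%, i.e. roughly 0.0002%.
```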
The authors then assessed Emu on the visual appeal and text faithfulness of its generated images, as judged by human annotators. For each assessment, annotators chose which image in a pair they preferred, if any. They preferred the final, fine-tuned Emu over the pre-trained (not yet fine-tuned) version 87% of the time for visual appeal, and images from the fine-tuned model tied or improved on text faithfulness 80% of the time. Also, when compared against another popular model, SDXLv1.0, annotators preferred Emu 60–80% of the time on visual appeal, depending on which prompt set was used for testing.
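For a sense of how such numbers are tallied, here’s a small sketch of turning pairwise preferences into a win rate and a tie-or-win rate. The judgment labels and data below are made up for illustration; they aren’t the paper’s actual annotations.

```python
from collections import Counter

# Each judgment records which image in a pair the annotator preferred:
# "A", "B", or "tie". These labels are invented for the example.
judgments = ["A", "A", "tie", "B", "A", "A", "tie", "A", "B", "A"]

counts = Counter(judgments)
n = len(judgments)

win_rate_a = counts["A"] / n                      # A preferred outright
tie_or_win_a = (counts["A"] + counts["tie"]) / n  # A at least as good as B

print(f"A wins {win_rate_a:.0%} of comparisons")        # 60%
print(f"A ties or wins {tie_or_win_a:.0%} of comparisons")  # 80%
```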
Those numbers give us a quantitative measure of improvement. But this is a model where seeing is believing — we can directly see the quality in these Emu-generated images:
It’s already impressive that Emu can be so good at generating visually appealing images with merely 2k examples, but the authors show that they could get away with fewer! While the pre-trained (not yet fine-tuned) Emu model only wins against SDXLv1.0 24% of the time, an Emu model fine-tuned with just 100 high-fidelity images wins 60% of the time. This result highlights a new avenue for model improvement and opens up possibilities for targeted generative models, such as for advertising, education, or design.