Text-to-image generators like DALL·E 2, Stable Diffusion, and Adobe’s new Generative Fill feature can revise images in a targeted way — say, change the fruit in a bowl from oranges to bananas — if you enter a few words that describe the change plus an indication of the areas to be changed. Others require a revised version of the prompt that produced (or could produce) the original image. A new approach performs such revisions based solely on a brief text command.
What's new: Tim Brooks and colleagues at UC Berkeley built InstructPix2Pix, a method that fine-tunes a pretrained text-to-image model to revise images via simple instructions like “swap oranges with bananas” without selecting the area that contained oranges. InstructPix2Pix works with traditional artwork (for which there is no initial prompt) as well as generated images.
Key insight: If you feed an image plus an edit instruction into a typical pretrained image generator, the output may contain the elements you desire but it’s likely to look very different. However, you can fine-tune a pretrained image generator to respond coherently to instructions using a dataset that includes a prompt, an image generated from that prompt, a revised version of the prompt, a corresponding revised version of the image, and an instruction that describes the revision. Annotating hundreds of thousands of images in this way could be expensive, but it’s possible to synthesize such a dataset: (i) Start with a corpus of images and captions, which stand in for prompts. (ii) Use a pretrained large language model to generate revised prompts and instructions. (iii) Then use a pretrained image generator to produce revised images from the revised prompts.
How it works: The authors fine-tuned Stable Diffusion, given an input image and an instruction, to revise the image accordingly. They built the fine-tuning dataset using the GPT-3 language model, Stable Diffusion text-to-image generator, and Prompt-to-Prompt, an image generator that revises generated images based on a revised version of the initial prompt (no masking required). Images and captions (used as prompts) came from LAION-Aesthetics V2 6.5+.
- The authors sampled 700 captions (for example, “a girl riding a horse”). They manually added 700 instructions (“have her ride a dragon”) and revised prompts (“a photograph of a girl riding a dragon”). Using this data, they fine-tuned GPT-3 to take a caption and generate a revised prompt and corresponding instruction.
- The authors selected around 455,000 LAION captions outside of the initial 700 and used them to prompt Stable Diffusion to produce an initial image. They also fed the prompts to GPT-3, which generated revised prompts and corresponding instructions. Given the initial images and revised prompts, Prompt-to-Prompt generated revised images.
- They generated 100 variations of each revised image and kept the one that best reflected the initial image and the instruction according to a similarity metric based on CLIP, which maps corresponding text-image pairs to the same representations. The metric compares the vector difference between CLIP’s representations of the initial and revised prompts to the vector difference between CLIP’s representations of the initial and revised images. The two vectors should point in the same direction. This process yielded a fine-tuning set of around 455,000 sets of initial images, revised images, and instructions.
- The dataset enabled the authors to fine-tune Stable Diffusion to produce an edited image from an initial image and instruction.
Results: Qualitatively, InstructPix2Pix revised the initial images appropriately with respect to subject, background, and style. The authors compared InstructPix2Pix to SDEdit, which revises images based on detailed prompts, according to the vector-difference method they used to choose revised images for the fine-tuning set. Revising an undisclosed set of images, InstructPix2Pix achieved a higher similarity of ~0.15, while SDEdit achieved ~0.1. (The score represents similarity between the difference in the initial and revised prompts and the difference in the initial and revised images.)
Why it matters: This work simplifies — and provides more coherent results when — revising both generated and human-made images. Clever use of pre-existing models enabled the authors to train their model on a new task using a relatively small number of human-labeled examples.
We're thinking: Training text generators to follow instructions improved their output substantially. Does training an image generator to follow instructions have a similar impact?