Interpreting Image Edit Instructions: Meta’s Emu Edit improves text-to-image generation with task classification.

May 22, 2024

The latest text-to-image generators can alter images in response to a text prompt, but their outputs often don’t accurately reflect the text. They do better if, in addition to a prompt, they’re told the general type of alteration they’re expected to make.

What’s new: Developed by Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar and colleagues at Meta, Emu Edit enriches prompts with task classifications that help the model interpret instructions for altering images.

Key insight: Typical training datasets for image-editing models present, for each example, an initial image, an instruction for altering it, and a target image. To train a model to interpret instructions in light of the type of task they describe, the authors further labeled each example with a task. These labels included categories for regional alterations such as adding or removing an object or changing the background, global alterations such as changing an image’s style, and computer-vision tasks such as detecting or segmenting objects.
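
As a rough illustration of this labeling idea, a training example could be stored as the usual triplet plus a task field. The sketch below is hypothetical: the class name, field names, task names, and file names are assumptions made for illustration, not the authors' schema, and only the task types mentioned above are listed.

```python
from dataclasses import dataclass
from enum import Enum


class TaskGroup(Enum):
    """Broad families of edits described in the article."""
    REGIONAL = "regional"
    GLOBAL = "global"
    VISION = "computer_vision"


# Hypothetical task names based on the examples in the text.
# The article mentions 16 task designations; they are not all listed here.
TASK_GROUPS = {
    "add_object": TaskGroup.REGIONAL,
    "remove_object": TaskGroup.REGIONAL,
    "change_background": TaskGroup.REGIONAL,
    "change_style": TaskGroup.GLOBAL,
    "detect_objects": TaskGroup.VISION,
    "segment_objects": TaskGroup.VISION,
}


@dataclass(frozen=True)
class EditExample:
    """One training example: the usual triplet plus a task label."""
    initial_image_path: str
    instruction: str
    target_image_path: str
    task: str

    def __post_init__(self):
        # Reject labels outside the known task set.
        if self.task not in TASK_GROUPS:
            raise ValueError(f"unknown task label: {self.task!r}")

    @property
    def group(self) -> TaskGroup:
        return TASK_GROUPS[self.task]


example = EditExample(
    initial_image_path="cat.png",
    instruction="include a hat",
    target_image_path="cat_with_hat.png",
    task="add_object",
)
print(example.task, example.group.value)
```

Validating the label at construction time is one possible design choice; it keeps mislabeled examples from reaching training unnoticed.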

How it works: Emu Edit comprises a pretrained Emu latent diffusion image generator and a Flan-T5 large language model that was pretrained and then fine-tuned. The system generates a novel image given an image, a text instruction, and one of 16 task designations. The authors generated the training set through a series of steps and fine-tuned the models on it.

  • The authors prompted a Llama 2 large language model, given an image caption from an unspecified dataset, to generate (i) an instruction to alter the image, (ii) a list of objects to be changed or added, and (iii) a caption for the altered image. For example, given a caption such as “Beautiful cat with mojito sitting in a cafe on the street,” Llama 2 might generate {"edit": "include a hat", "edited object": "hat", "output": "Beautiful cat wearing a hat with mojito sitting in a cafe on the street"}.
  • Given Llama 2’s output, the Prompt-to-Prompt image generator produced initial and target images. 
  • The authors modified Prompt-to-Prompt with enhancements specific to each task. For instance, to alter only parts of an image, Prompt-to-Prompt usually computes a mask and applies it to the initial image while generating the target image. The authors noted that the masks tend to be imprecise if the original and target captions differ by more than simple word substitutions. To address this, they modified the method for computing masks. In the change-an-object task, a multi-step procedure involving SAM and Grounding DINO (a transformer trained for object detection, unrelated to DINO, the vision transformer from Meta) generated a mask covering the listed objects to be changed.
  • Following the typical diffusion process for generating images, Emu learned to remove noise from noisy versions of the target images, given the initial image, the instruction, and the task label. 
  • The authors fine-tuned Flan-T5. Given a generated instruction, Flan-T5 learned to classify the task. At inference, given the instruction, Flan-T5 provided the task to Emu Edit.
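
The data-generation and inference steps above can be sketched loosely as follows. This is a minimal sketch under stated assumptions: the function names and record layout are hypothetical, `classify_task` is a keyword stub standing in for the fine-tuned Flan-T5 classifier, and no image is actually generated. Only the JSON string is taken from the example quoted in the article.

```python
import json

# The example Llama 2 output quoted in the article.
LLAMA_OUTPUT = (
    '{"edit": "include a hat", "edited object": "hat", '
    '"output": "Beautiful cat wearing a hat with mojito sitting in a cafe on the street"}'
)


def build_record(input_caption: str, llama_json: str, task: str) -> dict:
    """Combine a source caption with the generated edit into one record (hypothetical layout)."""
    generated = json.loads(llama_json)
    return {
        "input_caption": input_caption,
        "instruction": generated["edit"],
        # The quoted example holds a single object; wrap it so the field is always a list.
        "edited_objects": [generated["edited object"]],
        "output_caption": generated["output"],
        "task": task,
    }


def classify_task(instruction: str) -> str:
    """Keyword stub standing in for the fine-tuned Flan-T5 task classifier."""
    words = set(instruction.lower().split())
    if words & {"add", "include"}:
        return "add_object"
    if words & {"remove", "delete"}:
        return "remove_object"
    if "style" in words:
        return "change_style"
    return "unknown"


def editor_inputs(image_path: str, instruction: str) -> dict:
    """Assemble the three inputs the editor is conditioned on at inference."""
    return {
        "image": image_path,
        "instruction": instruction,
        "task": classify_task(instruction),
    }


record = build_record(
    "Beautiful cat with mojito sitting in a cafe on the street",
    LLAMA_OUTPUT,
    task="add_object",
)
print(record["instruction"], "->", editor_inputs("cat.png", record["instruction"])["task"])
```

In the system described, a learned classifier predicts the task from the instruction; the keyword rules here exist only to make the data flow concrete.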

Results: Human judges compared altered images produced by Emu Edit, InstructPix2Pix, and MagicBrush using the MagicBrush test set. Evaluating how well the generated images aligned with the instruction, the judges preferred Emu Edit over InstructPix2Pix 71.8 percent of the time and over MagicBrush 59.5 percent of the time. Evaluating how well the generated images preserved elements from the input images, they preferred Emu Edit over InstructPix2Pix 71.6 percent of the time and over MagicBrush 60.4 percent of the time.
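
For readers unfamiliar with pairwise preference scores, a figure like those above is simply the share of head-to-head comparisons a method wins. The sketch below assumes one vote per comparison; the function name and the toy votes are made up for illustration and are not data from the study.

```python
def preference_rate(votes: list[str], method: str) -> float:
    """Percentage of pairwise votes won by `method`."""
    if not votes:
        raise ValueError("no votes to aggregate")
    return 100.0 * sum(vote == method for vote in votes) / len(votes)


# Toy votes for illustration only (not results from the paper).
toy_votes = ["emu_edit", "baseline", "emu_edit", "emu_edit"]
print(preference_rate(toy_votes, "emu_edit"))  # 75.0
```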

Why it matters: Richer data improves machine learning results. Labeling examples with tasks and generating images tailored to each task gave Emu Edit better training data than earlier work used, enabling it to achieve better results.

We’re thinking: Text-to-image generators are amazing and fun to use, but their output can be frustratingly unpredictable. It’s great to see innovations that make them more controllable.

