Pictures From Words and Gestures AI model generates captions as users mouse over images.

Published

Mar 24, 2021

Reading time

2 min read

A new system combines verbal descriptions and crude lines to visualize complex scenes.

What’s new: Google researchers led by Jing Yu Koh proposed Tag-Retrieve-Compose-Synthesize (TReCS), a system that generates photorealistic images by describing what they want to see while mousing around on a blank screen.

Key insight: Earlier work proposed a similar system to showcase a dataset, Localized Narratives, that comprises synchronized descriptions and mouse traces captured as people described an image while moving a cursor over it. That method occasionally produced blank spots. The authors addressed that shortcoming by translating descriptive words into object labels (rather than simply matching words with labels) and distinguishing foreground from background.

How it works: The Local Narratives dataset provides an inherent correspondence between every word in a description and a mouse trace over an image. TReCS uses this correspondence to translate words into labels of objects that populate a scene. The authors trained the system on the portion of Localized Narratives that used images in COCO and tested it on the portion that used Open Images.

Given a description, a BERT model assigned an object label to each word in the description. The authors obtained ground-truth labels by matching the mouse traces for each word to object segmentation masks (silhouettes) for the images described. Then they fine-tuned the pretrained BERT to, say, attach the label “snow” to each of the words in “skiing on the snow.”
For each label assigned by BERT, the system chose a mask from a similar image (say, a photo taken in a snowy setting). The authors trained a cross-modal dual encoder to maximize the similarity between a description and the associated image, and to minimize the similarity between that description and other images. On inference, given a description, the authors used the resulting vectors to select the five most similar training images.
The system used these five images differently for foreground and background classes (an attribute noted in the mask dataset). For foreground classes such as “person,” it retrieved the masks with the same label and chose the one whose shape best matched the label’s corresponding mouse trace. For background classes such as “snow,” it chose all of the masks from the image whose masks best matched the labels and combined shape of the corresponding mouse traces.
The authors arranged the masks on a blank canvas in the locations indicated by the mouse traces. They positioned first background and then foreground masks, reversing the order in which they were described. This put the first-mentioned object in front.
A generative adversarial network learned to generate realistic images from the assembled masks.

Results: Five judges compared TReCS’ output with that of AttnGAN, a state-of-the-art, text-to-image generator that did not have access to mouse traces. The judges preferred TReCS’ image quality 77.2 percent to 22.8 percent. They also preferred the alignment of TReCS output with the description, 45.8 percent to 40.5 percent. They rated both images well aligned 8.9 percent of the time and neither image 4.8 percent of the time.

Why it matters: The authors took advantage of familiar techniques and datasets to extract high-level visual concepts and fill in the details in a convincing way. Their method uncannily synthesized complex scenes from verbal descriptions (though the featured example, a skier standing on a snowfield with trees in the background, lacks the railing and mountain mentioned in the description).

We’re thinking: Stock photo companies may want to invest in systems like this. Customers could compose photos via self-service rather than having to choose from limited options. To provide the best service, they would still need to hire photographers to produce raw material.

Subscribe to The Batch