Text or Images, Input or Output GILL, an innovative approach to multimodal model training

Published

Jan 10, 2024

Reading time

3 min read

GPT-4V introduced a large multimodal model that generates text from images and, with help from DALL-E 3, generates images from text. However, OpenAI hasn’t fully explained how it built the system. A separate group of researchers described their own method.

What's new: Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov at Carnegie Mellon University proposed Generating Images with Large Language Models (GILL), a training method that enables a large language model and a text-to-image generator to use both text and images as either input or output. Given text and/or image input, it decides whether to retrieve existing images or generate new ones.

Key insight: Models like CLIP and ImageBind map text and image inputs to a similar embedding space, so closely related text and images have similar embeddings. This approach enables a large multimodal model to process both data types. Text outputs, too, can be mapped to the same embedding space, so an image decoder, such as a diffusion model, can use them to produce images or an image retriever to retrieve images.

How it works: The authors used a pretrained OPT large language model, ViT-L image encoder (taken from CLIP), and pretrained Stable Diffusion text-to-image generator. The authors trained ViT-L to map its embeddings to those produced by OPT. They trained OPT to recognize prompts that request an image and enabled the system to either generate or retrieve images. Finally, a separate linear classifier learned whether to retrieve or generate images.

The authors froze the ViT-L, added a linear layer, and trained it as follows: Given an image, the ViT-L-plus-linear-layer produced an image embedding, as usual. Given the image embedding and the first part of the corresponding caption, OPT iteratively tried to predict the next word. The linear layer learned how to modify the embedding so OPT could complete the caption. This enabled OPT to take images as input.
They added 8 tokens to OPT’s vocabulary and trained the model to emit them at the end of every image caption — a signal that an image should be either retrieved or generated. (Typically a single token is sufficient to denote the end of a caption. However, these tokens corresponded to embeddings that, later, would be used to generate an image, and the authors found that a single token was not sufficiently expressive.)
Then they enabled Stable Diffusion to produce an image when OPT generated the 8 new tokens. They trained a separate transformer to map OPT’s embeddings associated with the 8 tokens (that is, embeddings produced by the layer before the one that generated the tokens) to those produced by Stable Diffusion’s text encoder.
Next they enabled the system to retrieve images when OPT generated the 8 tokens. They added linear layers to ViT-L and OPT and trained them to map the ViT-L’s embeddings to the OPT embedding associated with the first token. Specifically, the linear layers learned to minimize the difference between their outputs.
The authors trained a linear classifier, given the 8 OPT embeddings associated with the tokens, to decide whether to retrieve or generate an image. To build the classifier’s training set, they selected captions from a collection of diverse human-written prompts and, for each one, both generated an image and retrieved the most similar image from CC3M. 5 human judges selected the image that best matched the prompt. This process yielded 900 examples annotated according to whether the image was retrieved or generated.
At inference, OPT generated tokens and fed the associated embeddings directly to the classifier, which activated the pipeline for either the generation or retrieval.

Results: VIST is a dataset of 20,000 visual stories, each of which comprises five captioned images. The authors evaluated GILL’s and Stable Diffusion’s abilities, given the final caption or all five captions, to generate the final image in each story based on CLIP similarity scores between generated and ground-truth images. Given one caption, GILL achieved 0.581 similarity and Stable Diffusion achieved 0.592 similarity. Given five captions, GILL achieved 0.612 similarity and Stable Diffusion scored 0.598 similarity, highlighting GILL’s ability to use the context afforded by more extensive input. It did even better (0.641 similarity) given both captions and images, which Stable Diffusion couldn’t handle. The authors also evaluated how well their system retrieved the correct last image from VIST given the 5 captions and the first 4 images. GILL retrieved the correct image 20.3 percent of the time, while their own FROMAGe retrieved the correct image 18.2 percent of the time. In comparison, CLIP, given the 5 captions (without the images), retrieved the correct image 8.8 percent of the time.

Why it matters: Models that wed text and images are advancing rapidly. GILL and other recent models extend single-image input and/or output to any combination of images and text. This capability — which GILL achieves by mapping embeddings of image and text to one another — gives the models more context to generate more appropriate output.

We’re thinking: The authors add an interesting twist: Rather than generating images, the system can choose to retrieve them. Sometimes an existing image will do.

Subscribe to The Batch