Better Images, Less Training: Würstchen, a speedy, high-quality image generator

Published

Feb 14, 2024

The longer text-to-image models train, the better their output, but training is costly. Researchers built a system that matched or exceeded Stable Diffusion 2.1 in human preference tests after one-eighth of the training.

What's new: Independent researcher Pablo Pernías and colleagues at Technische Hochschule Ingolstadt, Université de Montréal, and Polytechnique Montréal built Würstchen, a system that divided the task of image generation between two diffusion models.

Diffusion model basics: During training, a text-to-image generator based on diffusion takes a noisy image and a text embedding. The model learns to use the embedding to remove the noise in successive steps. At inference, it produces an image by starting with pure noise and a text embedding, and removing noise iteratively according to the text embedding. A variant known as a latent diffusion model uses less processing power by removing noise from a noisy image embedding instead of a noisy image.
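
The noise-and-denoise loop described above can be illustrated with a toy sketch. This is not the method of any real model: `add_noise`, `toy_denoiser`, and `generate` are hypothetical names, the "image" is a short list of numbers, and the denoiser cheats by treating the text embedding as the clean target, which a trained network would have to learn to predict.

```python
import random


def add_noise(clean, noise, alpha):
    """Training side: blend a clean signal with noise.

    alpha=1.0 keeps the signal clean; alpha=0.0 yields pure noise.
    """
    return [alpha * c + (1.0 - alpha) * n for c, n in zip(clean, noise)]


def toy_denoiser(noisy, text_embedding):
    """Hypothetical stand-in for a trained network.

    Assumption for illustration only: the text embedding IS the clean
    target, so the 'prediction' is just a copy of it.
    """
    return list(text_embedding)


def generate(text_embedding, steps=10, seed=0):
    """Inference side: start from pure noise and remove it step by step."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in text_embedding]  # pure noise
    for step in range(1, steps + 1):
        predicted_clean = toy_denoiser(sample, text_embedding)
        keep = 1.0 - step / steps  # share of the noisy sample retained
        sample = [
            keep * s + (1.0 - keep) * p
            for s, p in zip(sample, predicted_clean)
        ]
    return sample
```

Calling `generate([0.5, -1.0, 2.0])` walks the sample from random noise to the target over ten steps. A latent diffusion model runs the same kind of loop on an image embedding rather than on pixels.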

Key insight: A latent diffusion model typically learns to remove noise from an embedding of an input image based solely on a text prompt. It can learn much more quickly if, in addition to the text prompt, a separate diffusion model supplies a smaller, noise-free version of the image embedding. The two models can be trained separately, enabling them to learn their tasks in a fraction of the usual time. At inference, the models can work efficiently as a stack: one to generate smaller embeddings and the other to generate larger embeddings based on the smaller ones.

How it works: Würstchen involves three components that required training: the encoder-decoder from VQGAN, a latent diffusion model based on U-Net, and another latent diffusion model based on ConvNeXt. The authors trained the models separately on subsets of LAION-5B, which contains matched images and text descriptions scraped from the web.

The authors trained the VQGAN encoder-decoder to reproduce input images. The encoder produced embeddings, to which the authors added noise.
To train U-Net, the authors used EfficientNetV2 (a convolutional neural network pretrained on ImageNet) to produce embeddings around 1/30 the size of the VQGAN embeddings (16x24x24 versus 4x256x256). Given this smaller embedding, a noisy VQGAN embedding, and a text description, U-Net learned to remove noise from the VQGAN embedding.
To train ConvNeXt, EfficientNetV2 once again produced small embeddings from input images, to which the authors added noise. Given a noisy EfficientNetV2 embedding and a text description, ConvNeXt learned to remove the noise.
At inference, the components worked in opposite order of training: (i) Given noise and a text prompt, ConvNeXt produced a small EfficientNetV2-sized embedding. (ii) Given that embedding, noise, and the same text prompt, U-Net produced a larger VQGAN-sized embedding. (iii) Given the larger embedding, VQGAN produced an image.
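
The inference order and the embedding sizes above can be traced with a shape-only sketch. The stage functions below are hypothetical placeholders that pass shapes along rather than running any network. Only the two shapes, read here as (channels, height, width), come from the article.

```python
from math import prod

# Embedding shapes quoted in the article, read as (channels, height, width).
SMALL_SHAPE = (16, 24, 24)    # EfficientNetV2-sized embedding
LARGE_SHAPE = (4, 256, 256)   # VQGAN-sized embedding


def convnext_stage(prompt):
    """(i) Placeholder: turn noise plus a prompt into a small embedding."""
    return {"shape": SMALL_SHAPE, "prompt": prompt}


def unet_stage(small, prompt):
    """(ii) Placeholder: turn noise, the small embedding, and the same
    prompt into a large embedding."""
    assert small["shape"] == SMALL_SHAPE
    return {"shape": LARGE_SHAPE, "prompt": prompt,
            "conditioned_on": small["shape"]}


def vqgan_decode(large):
    """(iii) Placeholder: decode the large embedding into an image."""
    assert large["shape"] == LARGE_SHAPE
    return {"kind": "image", "decoded_from": large["shape"]}


def generate(prompt):
    """Run the three stages in inference order."""
    small = convnext_stage(prompt)
    large = unet_stage(small, prompt)
    return vqgan_decode(large)


# Size ratio behind the "around 1/30" figure.
small_values = prod(SMALL_SHAPE)   # 9,216 values
large_values = prod(LARGE_SHAPE)   # 262,144 values
compression = large_values / small_values
```

Under these shapes the small embedding holds 9,216 values against 262,144 for the large one, a ratio of roughly 28 to 1. That matches the "around 1/30" figure and shows why the text-conditioned ConvNeXt stage has far less to denoise.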

Results: The authors compared Würstchen to Stable Diffusion 2.1. Both were trained on subsets of LAION-5B, but Würstchen trained for 25,000 GPU hours while Stable Diffusion took 200,000 GPU hours. The authors generated images based on captions from MS COCO and Parti-prompts. They asked 90 people which output they preferred. The judges expressed little preference regarding renderings of MS COCO captions: They chose Würstchen 41.3 percent of the time, Stable Diffusion 40.6 percent of the time, and neither 18.1 percent of the time. However, presented with the results of Parti-prompts, they preferred Würstchen 49.5 percent of the time, Stable Diffusion 32.8 percent of the time, and neither 17.7 percent of the time.
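
As a quick check on these figures, the sketch below recomputes the training-cost ratio and the preference margins from the numbers quoted above. The variable names are illustrative, and no data beyond the article's is used.

```python
# Figures quoted in the article.
gpu_hours = {"Würstchen": 25_000, "Stable Diffusion 2.1": 200_000}
preferences = {
    "MS COCO": {"Würstchen": 41.3, "Stable Diffusion 2.1": 40.6, "neither": 18.1},
    "Parti-prompts": {"Würstchen": 49.5, "Stable Diffusion 2.1": 32.8, "neither": 17.7},
}

# How many times more GPU hours Stable Diffusion 2.1 used.
training_ratio = gpu_hours["Stable Diffusion 2.1"] / gpu_hours["Würstchen"]

# Würstchen's lead in percentage points on each prompt set.
margins = {
    name: round(votes["Würstchen"] - votes["Stable Diffusion 2.1"], 1)
    for name, votes in preferences.items()
}

# Each row of preferences should account for all judgments.
totals = {name: round(sum(votes.values()), 1) for name, votes in preferences.items()}

print(f"training ratio: {training_ratio:.0f}x")
for name in preferences:
    print(f"{name}: margin {margins[name]} points, total {totals[name]} percent")
```

The margin on MS COCO is under one percentage point, while on Parti-prompts it is nearly 17 points. Both rows sum to 100 percent.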

Why it matters: Training a latent diffusion model to denoise smaller embeddings accelerates training, but this tends to produce lower-quality images. Stacking two diffusion models enabled Würstchen to match or exceed the output quality of models with large embeddings while achieving the training speed of models with small embeddings.

We're thinking: 25,000 GPU hours is a big reduction from 200,000! Given the cost of GPU hours, an eightfold saving is a big deal.
