For Faster Diffusion, Think a GAN

Adversarial Diffusion Distillation, a method to accelerate diffusion models

Published Jun 19, 2024


Generative adversarial networks (GANs) produce images quickly, but the images tend to be of relatively low quality. Diffusion image generators typically take more time, but they produce higher-quality output. Researchers aimed to achieve the best of both worlds.

What's new: Axel Sauer and colleagues at Stability AI accelerated a diffusion model using a method called adversarial diffusion distillation (ADD). As the name implies, ADD combines diffusion with techniques borrowed from GANs and teacher-student distillation.

Key insight: GANs are fast because they produce images in a single step. Diffusion models are slower because they remove noise from a noisy image over many steps. A diffusion model can learn to generate images in a single denoising step if, like a GAN, it learns to fool a discriminator, while the discriminator learns to identify generated output. The resulting one-step output doesn’t match the quality of multi-step diffusion, but distillation can improve it: While learning to fool the discriminator, the diffusion model (the student) can simultaneously learn to emulate the output of a different pretrained diffusion model (the teacher).
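The two training signals can be written as a single objective. The following is a sketch under assumed notation, not a formula quoted from this article: θ denotes the student's weights, φ the discriminator's weights, ψ the frozen teacher's weights, x̂_θ the student's one-step output, and λ a hypothetical weight that balances the two terms.

```latex
\mathcal{L}_{\text{student}}(\theta)
  = \mathcal{L}_{\text{adv}}\big(\hat{x}_\theta, \phi\big)
  + \lambda \, \mathcal{L}_{\text{distill}}\big(\hat{x}_\theta, \psi\big)
```

The first term is small when the discriminator mistakes the student's output for a real image. The second is small when the student's output matches the teacher's.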

How it works: The authors paired a pretrained Stable Diffusion XL (SDXL) generator (the student) with a discriminator based on a pretrained DINOv2 vision transformer. The teacher was another pretrained SDXL with frozen weights. The authors didn’t specify the training dataset.

The researchers added noise to images in the training dataset. Given a noisy image and the corresponding caption, the student model removed noise in a single step.
Given the student’s output, the discriminator learned to distinguish it from the images in the dataset.
Given the student’s output with added noise plus the caption, the teacher removed the noise from the image in a single step.
The student’s loss function encouraged the model to produce images that the discriminator could not distinguish from images in the dataset and to minimize the difference between the student’s and teacher’s output.
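The steps above can be sketched as one training iteration. This is a toy illustration, not the authors' code: images are flat lists of numbers, and `student`, `teacher`, and `discriminator` are hypothetical stand-ins passed in as plain functions. The hinge-style adversarial loss, the mean-squared-error distillation loss, the noise levels, and the `distill_weight` parameter are all assumptions made for the example.

```python
import random
from typing import Callable, List, Tuple

Image = List[float]                      # toy stand-in for an image tensor
Denoiser = Callable[[Image, str], Image]  # (noisy image, caption) -> denoised image


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def add_noise(image: Image, alpha: float, sigma: float, rng: random.Random) -> Image:
    # Simplified forward diffusion: scale the image and add Gaussian noise.
    return [alpha * pixel + sigma * rng.gauss(0.0, 1.0) for pixel in image]


def add_training_losses(
    image: Image,
    caption: str,
    student: Denoiser,
    teacher: Denoiser,
    discriminator: Callable[[Image], float],
    rng: random.Random,
    distill_weight: float = 1.0,  # hypothetical balance between the two terms
) -> Tuple[float, float]:
    # 1. Add noise to a training image; the student denoises it in one step.
    noisy = add_noise(image, alpha=0.1, sigma=0.9, rng=rng)
    student_out = student(noisy, caption)

    # 2. The discriminator scores the real image and the student's output
    #    (higher score means "looks real").
    real_score = discriminator(image)
    fake_score = discriminator(student_out)

    # 3. Add noise to the student's output; the frozen teacher denoises it in one step.
    renoised = add_noise(student_out, alpha=0.7, sigma=0.7, rng=rng)
    teacher_out = teacher(renoised, caption)

    # 4. Student loss: fool the discriminator and stay close to the teacher.
    distill = mean([(s - t) ** 2 for s, t in zip(student_out, teacher_out)])
    student_loss = -fake_score + distill_weight * distill

    # Discriminator loss (assumed hinge form): score real high, generated low.
    disc_loss = max(0.0, 1.0 - real_score) + max(0.0, 1.0 + fake_score)
    return student_loss, disc_loss
```

In a real system, each loss would drive a gradient update: the student's weights would follow `student_loss`, the discriminator's trainable weights would follow `disc_loss`, and the teacher would stay fixed.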

Results: The authors tested their method using 100 prompts from PartiPrompts. They compared the student’s output after either one or four denoising steps to the output of a pretrained SDXL after 50 denoising steps. Human judges were asked which they preferred with respect to (i) image quality and (ii) alignment with the prompt. The judges preferred the student’s four-step images about 57 percent of the time for image quality and about 55 percent of the time for alignment with the prompt. However, they preferred SDXL to the student’s one-step images around 58 percent of the time for image quality and 52 percent of the time for alignment with the prompt.
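The four-step figures are stated as the student's share of votes, while the one-step figures are stated as SDXL's share. A small sketch puts all four on the same footing. It assumes every comparison produced a winner (no ties), which the text above does not confirm, and the helper name is made up for this example.

```python
def student_win_rate(percent_preferring_sdxl: float) -> float:
    # Assumption: no ties, so the two shares sum to 100 percent.
    return 100.0 - percent_preferring_sdxl


# Approximate share of judgments favoring the student over 50-step SDXL.
student_share = {
    "four-step, image quality": 57.0,
    "four-step, prompt alignment": 55.0,
    "one-step, image quality": student_win_rate(58.0),
    "one-step, prompt alignment": student_win_rate(52.0),
}

for setting, share in student_share.items():
    print(f"{setting}: student preferred about {share:.0f} percent of the time")
```

Read this way, the student beats the 50-step baseline with four steps and falls somewhat short of it with one.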

Why it matters: In this work, the key steps (having a student model learn from a teacher model, and training a generator against a discriminator) are established techniques in their own right. Combining them gave the student model the advantages of both.

We're thinking: With the growing popularity of diffusion models, how to reduce the number of steps they take while maintaining their performance is a hot topic. We look forward to future advances.
