Research shows that diffusion image generators can train somewhat faster if they learn to reconstruct embeddings from a pretrained encoder built for vision tasks like classification, segmentation, and retrieval rather than image generation. Recent work shows they can train dramatically faster if the diffusion model learns to reconstruct a smaller version of these embeddings.
What’s new: Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu at Apple proposed Feature Auto-Encoder (FAE), a diffusion image generator that learned to reconstruct embeddings produced by the vision encoder DINOv2. Where earlier work bogged down because DINOv2’s embeddings are so large, FAE learned to shrink them before reconstructing them.
Key insight: Latent diffusion models, which remove noise from embeddings that a decoder then turns into images, produce better images from larger embeddings. Embeddings produced by vision encoders such as DINOv2 and SigLIP fit the bill: They’re large and semantically rich. This approach not only improves the output but also potentially accelerates training, because the image generator takes advantage of the vision encoder’s pretraining. On the other hand, processing larger embeddings requires a larger architecture and significantly more training, which counteracts much of the speed-up. A solution is to shrink the vision encoder’s embeddings using a second, smaller encoder. The diffusion model can learn to remove noise from this smaller embedding, which requires less training. Then decoders can expand the embedding to its original embedding space, restoring the benefits of the larger embedding, and ultimately produce an image.
How it works: At inference, FAE works like this: Given noise and an ImageNet class or text embedding, SiT (a latent diffusion model) generates a shrunken image embedding. Given that embedding, an embedding decoder produces a full-size embedding, from which an image decoder produces an image. The authors trained two separate FAE systems: one that generated images from ImageNet class labels (at 256x256-pixel resolution) and one that generated images from text descriptions (trained on CC12M).
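The inference pipeline above can be sketched as follows. This is a toy illustration, not the authors’ implementation: every function, dimension, and weight here is a hypothetical stand-in (untrained random projections in place of SiT and the two decoders), and the sizes are shrunk to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration only; the real model's sizes differ.
SMALL_DIM, FULL_DIM, IMG_SHAPE = 32, 768, (8, 8, 3)

def sit_denoise(noise, condition):
    """Stand-in for SiT: maps noise plus a class/text embedding to a
    shrunken image embedding (here, a fixed random projection)."""
    w = rng.standard_normal((noise.size + condition.size, SMALL_DIM))
    return np.concatenate([noise, condition]) @ w

def embedding_decoder(small_emb):
    """Stand-in for the embedding decoder: expands the shrunken
    embedding back to the full-size (DINOv2-like) embedding space."""
    w = rng.standard_normal((SMALL_DIM, FULL_DIM))
    return small_emb @ w

def image_decoder(full_emb):
    """Stand-in for the image decoder: maps a full-size embedding to pixels."""
    w = rng.standard_normal((FULL_DIM, int(np.prod(IMG_SHAPE))))
    return (full_emb @ w).reshape(IMG_SHAPE)

# Inference: noise + condition -> shrunken embedding -> full-size embedding -> image.
noise = rng.standard_normal(SMALL_DIM)
condition = rng.standard_normal(16)  # e.g., an ImageNet class embedding
small = sit_denoise(noise, condition)
full = embedding_decoder(small)
image = image_decoder(full)
```

The key structural point is the two-stage decoding: the diffusion model only ever operates in the small space, and the expansion back to the rich embedding space happens after denoising.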
- During training, FAE needed targets to learn to generate the full-size and shrunken embeddings. Given an image, DINOv2 produced a full-size embedding. Given the full-size embedding, a small encoder (a single attention layer) produced the shrunken embedding. Given the shrunken embedding, an embedding decoder expanded it to full size.
- Given the expanded embedding, an image decoder learned to produce an image. It used three loss terms: (i) a reconstruction loss that minimized the difference between predicted and ground-truth pixels; (ii) a perceptual loss in which an unidentified model embedded both a predicted image and ground truth, and the loss term minimized the distance between them; and (iii) an adversarial loss in which an unidentified discriminator classified predicted images and ground truth, and the loss term trained the image decoder to fool it.
- Given a noisy version of a shrunken embedding plus an ImageNet class embedding or a text embedding produced by a pretrained SigLIP 2, SiT learned to remove the noise.
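The image decoder’s three loss terms can be sketched as below. This is a schematic under stated assumptions, not the paper’s implementation: the embedder and discriminator are crude hypothetical stand-ins for the unidentified pretrained model and discriminator, and the specific loss forms (MSE for reconstruction and perceptual terms, a non-saturating GAN loss for the adversarial term) are common choices assumed here, not confirmed by the source.

```python
import numpy as np

def reconstruction_loss(pred, target):
    # (i) Pixel-space loss: mean squared error between predicted
    # and ground-truth pixels (MSE is an assumption; L1 is also common).
    return np.mean((pred - target) ** 2)

def perceptual_loss(pred, target, embed):
    # (ii) Distance between embeddings of the predicted and ground-truth
    # images; `embed` stands in for the unidentified pretrained model.
    return np.mean((embed(pred) - embed(target)) ** 2)

def adversarial_loss(pred, discriminator):
    # (iii) Non-saturating GAN loss: the decoder is rewarded when the
    # discriminator scores its output as likely real (probability near 1).
    return -np.log(discriminator(pred) + 1e-8)

# Toy stand-ins: a crude "feature extractor" and a bounded "discriminator".
embed = lambda img: img.mean(axis=(0, 1))
discriminator = lambda img: float(np.clip(img.mean(), 0.01, 0.99))

rng = np.random.default_rng(0)
target = rng.random((8, 8, 3))
pred = target + 0.1 * rng.standard_normal(target.shape)

total = (reconstruction_loss(pred, target)
         + perceptual_loss(pred, target, embed)
         + adversarial_loss(pred, discriminator))
```

In practice the three terms are weighted, and the discriminator is trained in alternation with the decoder; both details are omitted here for brevity.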
Results: FAE performed comparably to state-of-the-art diffusion models while training faster.
- Generating ImageNet images from labels, FAE (675 million parameters) achieved 1.29 FID (a measure of the difference between the distributions of Inception-V3’s embeddings of real and generated images; lower is better) after training for 800 epochs. This was better than RAE (676 million parameters), a diffusion image generator that learned to reconstruct DINOv2 embeddings without shrinking them, which achieved 1.41 FID after 800 epochs. FAE matched that 1.41 FID roughly seven times faster, after only 110 epochs.
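For reference, FID is the standard Fréchet Inception Distance: it models the Inception-V3 embeddings of the real and generated image sets as Gaussians with means and covariances (μ_r, Σ_r) and (μ_g, Σ_g) and measures the Fréchet distance between the two:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\bigl( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \bigr)
```

It is zero only when the two embedding distributions match in mean and covariance, which is why lower is better.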
- Generating images from text descriptions, FAE (1.1 billion parameters) achieved 6.9 FID on MS COCO. This performance was similar to Re-Imagen (3.2 billion parameters), which reached 6.88 FID after training on roughly four times as much data.
Why it matters: Encoders trained for vision tasks have different knowledge than encoders trained for image generation. That knowledge can help to generate better images, but it resides in bigger embeddings that require more processing power to handle. Shrinking them makes this knowledge available to image generators in a more practical way, making it possible to generate high-quality images in much less time.
We’re thinking: This work produces better images in less training time not by getting better at removing noise, but by removing noise from a richer embedding space.