Images From Noise An upgrade for score-based generative AI models.

Published

Jan 27, 2021

Reading time

3 min read

Generative adversarial networks and souped-up language models aren’t the only image generators around. Researchers recently upgraded an alternative known as score-based generative models.

What’s new: Yang Song and Stefano Ermon at Stanford derived a procedure for selecting hyperparameter values for their earlier score-based generator, which produces images from noise. Finding good hyperparameters enabled the authors to generate better images at higher resolution.

Key insight: Score-based image generation uses a model that learns how to change images corrupted by additive noise to reproduce the original pictures, and an algorithm that executes the changes to produce fresh images. The earlier work relied on manual tuning to find good values for hyperparameters such as how much noise to add to training images. Real-world data distributions are hard to analyze mathematically, so, in the new work, the authors approximated them with simplified distributions. Given the simpler scenario, they could analyze how each hyperparameter would affect both training and inference, enabling them to derive methods to compute hyperparameter values.

About score-based generation: The process starts with producing many versions of each training dataset by adding various magnitudes of noise. A modified RefineNet is trained to minimize the difference between its prediction of the way, and the actual way, to change a noisy example into a clean one. RefineNet learns a vector field: Given a point in space that corresponds to an image, it returns a vector that represents the direction toward a more realistic image. Then an algorithm based on Langevin dynamics (a set of equations developed to model the way molecules interact in a physical system) moves the point in that direction. The process of finding a vector and moving in that direction repeats for a finite number of steps.

How it works: The authors followed their score-based procedure but used new methods to compute hyperparameters that governed the noise added to the training dataset and the size and number of steps computed by Langevin dynamics. We’ll focus on the noise hyperparameters in this summary.

Very noisy datasets are needed to train the network to produce an image from noise. However, too many highly noisy training examples make it hard for the network to learn. So it’s necessary to balance noisy and less-noisy examples carefully.
What’s the greatest amount of noise to add? To train the network to generate images that reflect the entire training data distribution, Langevin dynamics must be able to transition between any two training examples. So the greatest noise, as measured by the Euclidean distance between noisy and noise-free examples, should be equal to the maximum distance between any pair of noise-free training examples.
What should be the difference between the greatest and next-greatest amounts of noise added, and how many increments should there be? The authors examined a scenario with only one training example. For RefineNet to supply good directions, it must learn to chart a path from any point in the vector field to that example. To do that, the added noise must leave no areas where, randomly, noisy data doesn’t occur. Based on that principle, they derived an equation to determine how many noisy datasets to produce and what increments of noise to apply.

Results: The authors evaluated their new model’s output using Frechet Inception Distance (FID), a measure of how well a generated data distribution resembles the original distribution, where lower is better. Trained on 32×32 images in CIFAR-10, the model achieved 10.87 FID, a significant improvement over the earlier model’s 25.32 FID. It also beat SNGAN, which achieved 21.7 FID. The paper doesn’t compare competing FID scores at resolutions above 32×32 and omits FID scores altogether at resolutions higher than 64×64. It presents uncurated samples up to 256×256.

Why it matters: GANs often don’t learn to produce good images because the objectives of their generator and discriminator are at odds. Score-based generative models optimize for only one objective, which eliminates this risk. That said, they may fail to converge for other reasons.

We’re thinking: We love the idea of using mathematical reasoning to derive optimal hyperparameter values: More time to develop good models!

Subscribe to The Batch