OpenAI’s DALL·E got an upgrade that takes in text descriptions and produces images in styles from hand-drawn to photorealistic. The new version is a rewrite from the ground up. It uses the earlier CLIP zero-shot image classifier to represent text descriptions. To generate images, it uses a method first described in a recent paper.
Imagination engine: Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, and colleagues at OpenAI published GLIDE, a diffusion model that produces and edits images in response to text input.
Diffusion model basics: During training, this generative approach takes noisy images and learns to remove the noise. At inference, it starts with pure noise and generates an image.
Key insight: Previous work showed that, given a class label in addition to an image, a diffusion model can generate new images of that class. Likewise, given a representation of text as an additional input, it should produce output that reflects the representation.
How it works: GLIDE used a transformer and ADM, a convolutional neural network outfitted with attention. Like DALL·E, the system was trained on 250 million image-text pairs collected from the internet. Unlike DALL·E, the authors added noise to each image incrementally to produce 150 increasingly noisy examples per original.
- During training, the transformer learned to create representations of input text.
- Given the representations and a noisy example, ADM learned to determine the noise that, when added to the previous image in the series, resulted in the current example. In this way, the system learned to remove the noise that had been added at each step.
- At inference, given a text description and noise, GLIDE determined and removed noise 150 times, producing an image.
- The authors boosted the influence of the text using classifier-free guidance. The model first determined the noise while ignoring the text representation and did it again while using the text representation. It scaled up the difference between the two noises and used the result to generate the noise to be removed.
- To edit images according to text descriptions, the authors replaced image regions with noise. The system then modified the noise iteratively while leaving the rest of the image intact.
Results: Human evaluators rated GLIDE’s output more photorealistic than DALL·E’s in 91 percent of 1,000 comparisons. They ranked GLIDE’s images more similar to the input text than DALL·E’s 83 percent of the time. The authors reported only qualitative results for the model’s ability to edit existing images, finding that it introduced objects in an appropriate style with good approximations of illumination, shadows, and reflections.
Yes, but: GLIDE’s photorealistic output comes at a cost of inference time. It took 15 seconds — far longer than GAN-based text-to-image generators, which generally take a fraction of a second.
Why it matters: Generative models typically are hard to control in an intuitive way. Enabling users to direct photorealistic image generation via natural language opens the door to broader and more widespread uses.
We’re thinking: Diffusion models are emerging as an exciting alternative among generative architectures. GLIDE’s 3.5 billion-parameter implementation (which, while very large, is roughly a quarter the size of DALL·E) is further evidence.