Researchers struggle to build models that can generate a three-dimensional scene from a text prompt largely because they lack sufficient paired text-3D training examples. A new approach works without any 3D data whatsoever.
What's new: Ben Poole and colleagues at Google and UC Berkeley built DreamFusion to produce 3D scenes from text prompts. Rather than training on text-3D pairs, the authors used a pretrained text-to-image diffusion model to guide the training of a separate model that learned to represent a 3D scene.
Key insight: A neural radiance field (NeRF) learns to represent a 3D scene from 2D images of that scene. Is it possible to replace the 2D images with a text prompt? Not directly, but a pretrained text-to-image diffusion model, which generates images by starting with noise and removing the noise in several steps, can take a text prompt and generate 2D images for NeRF to learn from. The NeRF image (with added noise) conditions the diffusion model, and the diffusion model’s output provides ground truth for the NeRF.
How it works: NeRF generated a 2D image, and the authors added noise. Given the noisy NeRF image and a text prompt, a 64x64 pixel version of Google's Imagen text-to-image diffusion model removed the noise to produce a picture that reflected the prompt. By repeating these steps, NeRF gradually narrowed the difference between its output and Imagen’s.
- Given a camera position, angle, and focal length as well as a light position, NeRF (which started out randomly initialized) rendered an image of the scene. The authors applied a random degree of noise to the image.
- Given the noisy image, a text prompt, and a simple text description of the camera angle (“overhead view,” “front view,” “back view,” or “side view”), Imagen removed the noise, generating a more coherent image that better reflected the prompt.
- The authors trained NeRF to minimize the difference between its own image and Imagen’s. They repeated the cycle 15,000 times using the same prompt, a different camera angle, and a different light position each time.
- The following technique kept NeRF from interpreting the prompt on a flat surface (painting, say, a peacock on a surfboard on a flat surface rather than modeling those elements in 3D): At random, NeRF rendered the scene either (i) without colors but with shading (the pattern of light and dark formed by light reflecting off 3D objects), (ii) with colors but without shading, or (iii) with both colors and shading.
- Having trained NeRF, the authors extracted a 3D mesh using the marching cubes algorithm.
Results: The authors compared DreamFusion images to 2D renders of output from CLIP-Mesh, which deforms a 3D mesh to fit a text description. They evaluated the systems according to CLIP R-Precision, a metric that measures the similarity between an image and a text description. For each system, they compared the percentage of images that were more similar to the prompt than to 153 other text descriptions. DreamFusion achieved 77.5 percent while CLIP-Mesh achieved 75.8 percent. (The authors note that DreamFusion’s advantage is all the more impressive considering an overlap between the test procedure and CLIP-Mesh’s training).
Why it matters: While text-3D data is rare, text-image data is plentiful. This enabled the authors to devise a clever twist on supervised learning: To train NeRF to transform text into 3D, they used Imagen’s text-to-image output as a supervisory signal.