A 3D Mesh From One 2D Image The combination of video diffusion and Neural Radiance Field (NeRF) can produce a 3D mesh from a single image

Published

Apr 24, 2024

Reading time

2 min read

Video diffusion provides a new basis for generating 3D meshes.

What's new: Vikram Voleti, Chun-Han Yao, Mark Boss, Varun Jampani, and colleagues at Stability AI produced a method that generates a 3D mesh from a single image based on Stability’s video diffusion model. You can see its output here.

Key insight: The approach known as a Neural Radiance Field (NeRF) learns to create a 3D mesh from images of the same object shot at various angles. Given a single image of an object, a video diffusion model can learn to generate videos that orbit around it. The frames from such orbital videos give NeRF the information it needs to produce a 3D model.

How it works: To generate a 3D mesh, the authors took one step before and two steps during inference. Before inference: Train a video diffusion model to generate an orbital video. During inference: (i) Train a NeRF model on an orbital video. (ii) Improve the 3D mesh using diffusion following DreamFusion.

The authors fine-tuned a pretrained Stable Video Diffusion, given an image of an object, to generate an orbital video. They fine-tuned the model on orbital views of synthetic objects in the Objaverse dataset, first without and then with information about the camera’s orbit. They called the fine-tuned model Stable Video 3D (SV3D).
At inference, SV3D generated an orbital video from an image, where the orbit periodically went up and down to ensure the top and bottom of the object were visible. From these images, the authors trained an Instant-NGP NeRF model, which learned to represent the object as a 3D mesh and generate pictures from new camera angles based on different views of the same object.
To improve the 3D mesh, the authors first represented it using DMTet instead of Instant-NGP. DMTet is a system of networks built to refine 3D shapes from rough point clouds or low-resolution 3D models. The authors rendered images of DMTet’s 3D model along random camera orbits. For each image, the authors added noise to the image’s representation and removed it using SV3D. DMTet learned to update its 3D model to minimize the difference between the rendered image and the updated version from SV3D.

Results: The authors produced 3D meshes from images of 50 objects in GSO, a 3D object dataset of scanned household items. They compared their 3D meshes to those produced by other methods including EscherNet, a method that uses an image diffusion model to generate images of an object from different angles that are used to train a pair of vanilla neural networks to produce a 3D mesh. Evaluated according to Chamfer distance, a measure of the distance between the points on the ground truth and generated 3D models (lower is better), their method achieved .024, while EscherNet achieved .042.

Why it matters: Video diffusion models must generate different views of the same object, so they require a greater understanding of 3D objects than image diffusion models, which need to generate only one view at a time. Upgrading from an image diffusion model to a video diffusion model makes for better 3D object generation.

We’re thinking: Building 3D meshes used to be difficult, but with models like this, it's becoming less of a mesh.

Subscribe to The Batch