Text-to-video generation is so 2022! A new system takes in text and generates an animated 3D scene that can be viewed or rendered from any angle.
What’s new: Uriel Singer and colleagues at Meta AI proposed Make-A-Video3D (MAV3D). Lacking a corpus of matched text and animated 3D scenes, the authors used a pretrained text-to-video diffusion model to guide the training of a neural radiance field (NeRF) model that learned how to represent a 3D scene with moving elements.
Key insight: Earlier work known as DreamFusion learned to produce a 3D scene from text by setting up a feedback loop between a pretrained diffusion text-to-image generator, which creates 2D images according to a text prompt, and a NeRF, which takes embeddings of points in space and learns to produce a 3D scene (mesh, point colors, and point transparencies) to match the 2D images shot from various angles. (NeRF can also generate images of the scene.) Basically, (i) the NeRF generated 2D images of a random 3D scene; (ii) the images — with added noise — were given as input to the diffusion text-to-image generator, which sharpened them according to the text prompt; and (iii) the NeRF used the sharpened images to sharpen the 3D scene, repeating the cycle. MAV3D worked the same way but (a) used a more computationally efficient embedding method called HexPlane, (b) swapped the pretrained text-to-image generator for a pretrained text-to-video generator, and (c) modified the NeRF to generate sequences of video frames. The resulting system takes a text prompt and learns to generate a matching 3D scene that changes over time.
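The render, add-noise, denoise, update cycle can be illustrated with a toy Python sketch. Everything here is a stand-in chosen for illustration, not the authors' code: `render` simply returns the scene parameters instead of running a NeRF, and `toy_denoiser` pulls its input toward a fixed `target` array that plays the role of "what the prompt describes," where the real system would call a pretrained diffusion model. The step count, learning rate, and noise scale are arbitrary.

```python
import numpy as np

def render(scene):
    # Stand-in for NeRF rendering (assumption): the "image" is the scene itself.
    return scene.copy()

def toy_denoiser(noisy, target, strength=0.5):
    # Stand-in for a pretrained diffusion model (assumption): it nudges the
    # noisy input toward a fixed target that represents the text prompt.
    return noisy + strength * (target - noisy)

def distill(target, steps=200, lr=0.1, noise_scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    scene = rng.normal(size=target.shape)  # start from a random scene
    for _ in range(steps):
        image = render(scene)                                       # (i) render
        noisy = image + noise_scale * rng.normal(size=image.shape)  # (ii) add noise
        denoised = toy_denoiser(noisy, target)                      # (ii) denoise
        # (iii) gradient step on 0.5 * ||image - denoised||^2, treating the
        # denoised output as a constant, then repeat the cycle.
        scene -= lr * (image - denoised)
    return scene

target = np.linspace(-1.0, 1.0, 16)
scene = distill(target)
print(np.abs(scene - target).max())  # small: the scene drifts toward the target
```

The point of the sketch is only the shape of the loop: the generator never sees ground-truth 3D data, and the scene improves solely because a frozen denoiser keeps correcting its renderings.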
How it works: MAV3D is an animated version of the earlier DreamFusion, as described above. It includes the following models: HexPlane (which efficiently represents an animated 3D scene), Make-A-Video (a text-to-video generator pretrained on LAION-5B text/image pairs and fine-tuned on 20 million videos), and a NeRF modified for video/animation.
- HexPlane learned an embedding for each point on each of six 2D planes (xy, xz, xt, yz, yt, and zt) that together span an animated 3D scene over 16 video frames. Given a point (three spatial dimensions plus time), the model projected it onto each plane, retrieved the corresponding embeddings, and concatenated them to produce a point embedding.
- Given the embeddings and a random camera position per frame, NeRF produced a video.
- The system added noise to the NeRF video and fed it to Make-A-Video. Given a text prompt, Make-A-Video estimated what the video would look like without the noise.
- The loss function minimized the difference between the NeRF video and Make-A-Video’s denoised version to update HexPlane and NeRF.
- The system cycled through this process 12,000 times using a different random camera trajectory each time, which enabled it to evaluate every point from multiple angles.
- The authors extracted from NeRF a 64-frame animated 3D scene using the marching cubes algorithm.
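To make the HexPlane lookup in the first step concrete, here is a minimal Python sketch of projecting a space-time point onto six planes and concatenating the retrieved features. It rests on several assumptions that are not from the paper: the grid resolution and feature size are arbitrary, coordinates are normalized to [0, 1], the plane values are random rather than learned, and the lookup snaps to the nearest grid cell where a real implementation would likely interpolate. The names `make_planes` and `point_embedding` are hypothetical.

```python
import numpy as np

# Axis pairs for the six planes, indexing a point (x, y, z, t) as 0..3.
PLANES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]  # xy, xz, xt, yz, yt, zt

def make_planes(resolution=8, feature_dim=4, seed=0):
    # One feature grid per plane. Random values stand in for learned embeddings.
    rng = np.random.default_rng(seed)
    return [rng.normal(size=(resolution, resolution, feature_dim)) for _ in PLANES]

def point_embedding(planes, point):
    # point: (x, y, z, t), each coordinate assumed to lie in [0, 1].
    point = np.asarray(point, dtype=float)
    features = []
    for plane, (a, b) in zip(planes, PLANES):
        res = plane.shape[0]
        # Project onto the plane by keeping two coordinates, then snap to a cell.
        i = min(int(point[a] * res), res - 1)
        j = min(int(point[b] * res), res - 1)
        features.append(plane[i, j])
    return np.concatenate(features)  # six features joined into one embedding

planes = make_planes()
early = point_embedding(planes, (0.1, 0.5, 0.9, 0.0))
late = point_embedding(planes, (0.1, 0.5, 0.9, 1.0))
print(early.shape)  # (24,) = 6 planes x 4 features
```

In this sketch, the two points differ only in time, so their xy, xz, and yz features are identical while their xt, yt, and zt features differ. Storing six 2D grids rather than one dense 4D grid is what keeps the representation compact.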
Results: No other system generates animated 3D scenes from text, so the authors compared MAV3D with systems that solve two sub-tasks, generating 3D static scenes from text and generating videos from text. They used CLIP R-Precision, a metric that measures how often CLIP retrieves the correct text description for a given image (higher is better), averaged across a number of images taken from different angles (for 3D scenes) or frames over time (for videos). Against the text-to-3D baseline, MAV3D outperformed a Stable Diffusion implementation of DreamFusion (82.4 versus 66.1). Against the text-to-video baseline, it did worse than Make-A-Video (79.2 versus 86.6).
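For readers unfamiliar with the metric, the following Python sketch shows one common way to compute a retrieval-style R-Precision (with R = 1) from precomputed embeddings. It is an illustration under assumptions, not the authors' evaluation code: the embeddings would come from CLIP in practice but are small placeholder arrays here, row i of `image_emb` is assumed to correspond to row i of `text_emb`, and the result is a fraction rather than the percentages reported above.

```python
import numpy as np

def r_precision(image_emb, text_emb):
    # Normalize rows so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = image_emb @ text_emb.T          # sims[i, j]: image i versus prompt j
    retrieved = sims.argmax(axis=1)        # best-matching prompt for each image
    correct = retrieved == np.arange(len(image_emb))
    return float(correct.mean())           # fraction of images matched correctly

# Placeholder embeddings (assumption): each image sits closest to its own prompt.
text_emb = np.eye(4)
image_emb = np.eye(4) + 0.1
print(r_precision(image_emb, text_emb))             # 1.0
print(r_precision(image_emb[[1, 0, 2, 3]], text_emb))  # 0.5 after swapping two images
```

For a 3D scene or a video, the same score would be averaged over the rendered views or frames, as described above.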
Yes, but: Examples of MAV3D’s output include very short scenes of varying quality. The system allows only one color per point so, for instance, reflective surfaces look the same regardless of viewing angle. It’s also computationally demanding: It took 6.5 hours per scene using eight A100 GPUs.
Why it matters: Adapting NeRF for video/animation is exciting, but the larger lesson is that finding an efficient way to learn representations — HexPlane in this case — can make tasks feasible that otherwise would require impractical amounts of computation.
We’re thinking: While MAV3D’s rendering would be improved by variable colors to represent reflections, shadows, and dynamic lighting, its strong performance relative to DreamFusion suggests a way to improve text-to-3D: train on videos instead of images. Videos contain moving objects and sometimes changing camera positions, so they can depict more diverse 3D geometry than a set of static images. Learning from videos could help avoid generating 3D scenes that look right from only one angle.