Key insight: While billions of text-image pairs are available to train a text-to-image generator, text-video pairs are too scarce to train a video equivalent. A model can learn relationships between words and pictures via pretraining on text-image pairs. Then it can be adapted for video by adding further layers that process image patches across frames and — while keeping the pretrained layers fixed — fine-tuning the new layers on videos, which are plentiful. In this way, a system can generate videos using knowledge it learned from text-image pairs.
How it works: The authors pretrained a series of models (one transformer and four U-Net diffusion models) to generate images from text, generate in-between video frames, and boost image resolution. To pretrain the text-to-image models, they used 2.3 billion text-image pairs. After pretraining, they modified some of the models to process sequences of video frames: On top of each pretrained convolutional layer, the authors stacked a 1D convolutional layer that processed the grid of pixels across frames (that is, along the time axis); and on top of each pretrained attention layer, they stacked a 1D attention layer that, likewise, processed the grid of pixels across frames. To fine-tune or train the modified models on video, they used 20 million internet videos.
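To make the added layers concrete, here is a minimal pure-Python sketch of a 1D convolution applied along the time axis at every pixel position. It reads the added layers as operating across frames, in line with the key insight above. The function name `temporal_conv1d`, the toy frames, and both kernels are illustrative assumptions, not the authors' code. In particular, the identity kernel only shows one way a new layer can start out by passing the pretrained output through unchanged.

```python
from typing import List

Video = List[List[List[float]]]  # indexed as [frame][row][column]


def temporal_conv1d(video: Video, kernel: List[float]) -> Video:
    """Slide an odd-length kernel along the time axis at each pixel position.

    Frames outside the clip count as zeros, so the output keeps the input
    length. As in most deep learning code, the kernel is not flipped.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    n_frames = len(video)
    height, width = len(video[0]), len(video[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(n_frames)]
    for t in range(n_frames):
        for k, weight in enumerate(kernel):
            src = t + k - half  # index of the neighbouring frame
            if 0 <= src < n_frames:
                for y in range(height):
                    for x in range(width):
                        out[t][y][x] += weight * video[src][y][x]
    return out


# Three 1x2 frames, standing in for the output of a frozen spatial layer.
frames = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]

# Identity kernel (assumption): the new layer passes its input through unchanged.
identity = temporal_conv1d(frames, [0.0, 1.0, 0.0])

# Averaging kernel: each pixel now mixes values from the neighbouring frames.
smoothed = temporal_conv1d(frames, [1 / 3, 1 / 3, 1 / 3])
```

Only the kernel weights would be learned during fine-tuning on video. The spatial layer underneath stays fixed, which is how the knowledge gained from text-image pairs is preserved.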
- Given a piece of text, the pretrained transformer converted it into an embedding.
- The authors pretrained a diffusion model to take the embedding and generate a 64x64 image. Then they modified the model as described above and fine-tuned it to generate sequences of 16 frames of 64x64 resolution.
- They added a second diffusion model. Given a 76-frame video made up of 16 frames with four masked (blacked-out) frames inserted between each consecutive pair (16 + 15 x 4 = 76), it learned to regenerate the masked frames.
- They added a third diffusion model and pretrained it, given a 64x64 image, to increase the image’s resolution to 256x256. After modifying the model, they fine-tuned it to increase the resolution of 76 successive frames to 256x256.
- Given a 256x256 image, a fourth diffusion model learned to increase its resolution to 768x768. Due to memory restrictions, this model was not modified for video or further trained on videos. At inference, given the 76-frame video, it increased the resolution of each frame without reference to other frames.
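The stages above can be traced as a sequence of tensor shapes. The sketch below is bookkeeping only and runs no models. It assumes the 76-frame count comes from inserting four frames between each consecutive pair of the 16 generated frames (16 + 15 x 4 = 76). The function names and stage labels are hypothetical.

```python
def interpolated_frame_count(key_frames: int, masked_between: int) -> int:
    """Frames after inserting `masked_between` frames between each consecutive pair."""
    return key_frames + (key_frames - 1) * masked_between


def trace_shapes(key_frames: int = 16, masked_between: int = 4):
    """Return (stage, frames, height, width) for each stage in the list above."""
    total = interpolated_frame_count(key_frames, masked_between)
    return [
        ("text-to-video diffusion", key_frames, 64, 64),
        ("frame interpolation", total, 64, 64),
        ("super-resolution, video-aware", total, 256, 256),
        ("super-resolution, per frame", total, 768, 768),
    ]


for stage, frames, height, width in trace_shapes():
    print(f"{stage}: {frames} frames at {height}x{width}")
```

The last stage works on one frame at a time, so its frame count is inherited from the previous stage and not produced by the model itself.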
Results: The authors compared their system’s output to that of the previous state of the art, CogVideo, which takes a similar approach but requires training on text-video pairs. Crowdworkers supplied 300 prompts and judged the output of the authors’ system to be of higher quality 77.15 percent of the time and to better fit the text 71.19 percent of the time.
Why it matters: Text-to-image generators already transform text into high-quality images, so there’s no need to train a video generator to do the same thing. The authors’ approach enabled their system to learn about things in the world from text-image pairs, and then to learn how those things move from unlabeled videos.
We're thinking: The Ng family’s penchant for drawing pandas is about to undergo another revolution!