Long-Form Videos from Text Stories Google's Phenaki Generates Long-Form Video from Text

Published

Oct 12, 2022

Reading time

3 min read

Only a week ago, researchers unveiled a system that generates a few seconds of video based on a text prompt. New work enables a text-to-video system to produce an entire visual narrative from several sentences of text.

What’s new: Ruben Villegas and colleagues at Google developed Phenaki, a system that produces videos of arbitrary length from a story-like description. You can see examples here.

Key insight: The machine learning community lacks a large dataset of long-form videos and time-aligned captions, so it’s not obvious how to train a model to synthesize long videos from a narrative. But text-image pairs are plentiful. A system can be trained to generate short videos by treating images as single-frame videos and combining them with a relatively smaller dataset of short videos with captions. Then the video can be extended by feeding the system new text plus the last few generated frames. Repeating this process can generate long, complex videos even though the model was trained on short, simple ones.

How it works: Phenaki uses an encoder to produce video embeddings, a language model to produce text embeddings, a bidirectional transformer to take the text and video embeddings and synthesize new video embeddings, and a decoder to translate synthesized video embeddings into pixels.

Using a dataset of videos less than three seconds long, the authors pretrained a C-ViViT encoder/decoder (a variant of ViViT adapted for video) to compress frames into embeddings and decompress them into the original frames. The encoder divided frames into non-overlapping patches and learned to represent the patches as vectors. Transformer layers honed each patch’s embedding according to all patches within the same frame and all previous frames. The decoder learned to translate the embeddings into pixels.
Given a piece of text, t5x language model pretrained on web text produced a text embedding.
The authors pretrained a MaskGIT bidirectional transformer on embeddings produced by C-ViViT for 15 million proprietary text-video pairs (each video lasted 1.4 seconds at 8 frames per second), 50 million proprietary text-image pairs, and 400 million text-image pairs scraped from the web. They masked a fraction of the video embeddings and trained MaskGIT to reconstruct them.
At inference, MaskGIT took the text embeddings and a series of masked video embeddings (since no video had been generated yet), generated the masked embeddings, then re-masked a fraction of them to be generated in the next iterations. In 48 steps, MaskGIT generated all the masked embeddings.
The C-ViViT decoder took the predicted embeddings and rendered them as pixels.
The authors applied MaskGIT and C-ViViT iteratively to produce minutes-long videos. First they generated a short video from one sentence, then encoded the last k generated frames. They used the video embeddings and the next piece of text to generate further video frames.

Results: The full-size Phenaki comprised 1.8 billion parameters. In the only quantitative evaluation of the system’s text-to-video capability, the authors compared a 900 million-parameter version of Phenaki trained on half of their data to a 900 million-parameter NUWA pretrained on text-image pairs, text-video pairs, and three-second videos and fine-tuned on 10-second videos. (Phenaki was not fine-tuned.) The downsized Phenaki achieved 3.48 FID-Video compared to NUWA’s 7.05 FID-Video (a measure of similarity between generated and original videos, lower is better).

Why it matters: Last week’s Make-A-Video used a series of diffusion models that generate a short video from a text description and upscale its temporal and image resolution. Phenaki bootstrapped its own generated frames to extend the output’s length and narrative complexity. Together, they may point to a revolution in filmmaking.

We’re thinking: One challenge of the recent approaches is maintaining consistency across spans of frames. In the clip shown above, for example, the lion’s appearance at the beginning differs from its appearance at the end. We don’t regard this as a fundamental problem, though. It seems like only a matter of time before an enterprising developer devises an attention-based/transformer architecture that resolves the issue.

Subscribe to The Batch