Synthetic Videos on the Double VideoGPT is an efficient generative AI system for video.

Published

Jun 23, 2021

Reading time

2 min read

Using a neural network to generate realistic videos takes a lot of computation. New work performs the task efficiently enough to run on a beefy personal computer.

What’s new: Wilson Yan, Yunzhi Zhang, and colleagues at UC Berkeley developed VideoGPT, a system that combines image generation with image compression to produce novel videos.

Key insight: It takes less computation to learn from compressed image representations than full-fledged image representations.

How it works: VideoGPT comprises a VQ-VAE (a 3D convolutional neural network that consists of an encoder, an embedding, and a decoder) and an image generator based on iGPT. The authors trained the models sequentially on BAIR Robot Pushing (clips of a robot arm manipulating various objects) and other datasets.

VQ-VAE’s encoder learned to compress representations of the input video (16x64x64) into smaller representations (8x32x32) where each value is a vector. In the process, it learned an embedding whose vectors encoded information across multiple frames.
VQ-VAE replaced each vector in the smaller representations with the closest value in the learned embedding, and the decoder learned to reproduce the original frames from these modified representations.
After training VQ-VAE, the authors used the encoder to compress a video from the training set. They trained iGPT, given a flattened 1D sequence of representations, to generate the next representation by choosing vectors from the learned embedding.
To generate video, VideoGPT passed a random representation to iGPT, concatenated its output to the input, passed the result back to iGPT, and so on for a fixed number of iterations. VQ-VAE’s decoder converted the concatenated representations into a video.

Results: The authors evaluated VideoGPT’s performance using Frechet Video Distance (FVD), a measure of the distance between representations of generated output and training examples (lower is better). The system achieved 103.3 FVD after training on eight GPUs. The state-of-the-art Video Transformer achieved 94 FVD after training on 128 TPUs (roughly equivalent to several hundred GPUs).

Why it matters: Using VQ-VAE to compress and decompress video is not new, but this work shows how it can be used to cut the computation budget for computer vision tasks.

We’re thinking: Setting aside video generation, better video compression is potentially transformative given that most internet traffic is video. The compressed representations in this work, which are tuned to a specific, sometimes narrow training set, may be well suited to imagery from security or baby cams.

Subscribe to The Batch