Yale Song: Foundation models for vision

Published Dec 29, 2021

Large models pretrained on immense quantities of text have proven to be strong foundations for solving specialized language tasks. My biggest hope for AI in 2022 is to see the same thing happen in computer vision: foundation models pretrained on exabytes of unlabeled video. Such models, after fine-tuning, are likely to achieve strong performance while improving label efficiency and robustness across a wide range of vision problems.

Foundation models like GPT-3 by OpenAI and Gopher by DeepMind have shown a powerful ability to generalize across numerous natural language processing tasks. Vision models pretrained jointly on images and text, such as CLIP by OpenAI, Florence by Microsoft, and FLAVA by Facebook, have achieved state-of-the-art results on several vision-and-language understanding tasks. Given the large amount of video readily available, I think the most promising next step is to investigate how to take advantage of unlabeled video to train large-scale vision models that generalize well to challenging real-world scenarios.

Why video? Unlike static images, videos capture dynamic visual scenes with temporal and audio signals. Neighboring frames serve as a form of natural data augmentation, providing various object (pose, appearance), camera (geometry), and scene (illumination, object placements) configurations. They also capture the chronological order of actions and events critical for temporal reasoning. In these ways, the time dimension provides critical information that can improve the robustness of computer vision systems. Furthermore, the audio track in video can contain both natural sounds and spoken language that can be transcribed into text. These multimodal (sound and text) signals provide complementary information that can aid learning visual representations.
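To make the augmentation idea concrete, here is a minimal sketch, not taken from any particular system, of how nearby frames could be paired as positive examples for a self-supervised objective. The function name `sample_positive_pair` and the `max_gap` parameter are hypothetical, and frames are represented only by their indices within a clip.

```python
import random


def sample_positive_pair(num_frames, max_gap, rng=random):
    """Pick two distinct frame indices from one clip, at most max_gap apart.

    Hypothetical helper: the two frames act as naturally augmented views
    of the same scene (slightly different pose, camera, illumination).
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    if max_gap < 1:
        raise ValueError("max_gap must be at least 1")

    # Choose an anchor frame uniformly at random.
    anchor = rng.randrange(num_frames)

    # Restrict the partner to a temporal window around the anchor,
    # clipped to the clip boundaries and excluding the anchor itself.
    lo = max(0, anchor - max_gap)
    hi = min(num_frames - 1, anchor + max_gap)
    candidates = [i for i in range(lo, hi + 1) if i != anchor]

    positive = rng.choice(candidates)
    return anchor, positive


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(sample_positive_pair(num_frames=300, max_gap=8, rng=rng))
```

A contrastive or similar loss could then treat the two frames as views of the same scene. The choice of `max_gap` is a trade-off: a wider window gives more varied views but raises the chance of pairing frames across an abrupt scene change.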

Learning from large amounts of unlabeled video poses unique challenges that must be addressed by both fundamental AI research and strong engineering efforts:

What architectures are most appropriate for processing multimodal signals from video? Can they be handcrafted, or should we search for architectures that capture the inductive biases of multimodal data more effectively?
What are the most effective ways to use temporal and multimodal information in an unsupervised or self-supervised manner?
How should we deal with noise such as compression artifacts, visual effects added after recording, abrupt scene changes, and misalignment between imagery, soundtrack, and transcribed audio?
How can we design challenging video tasks that measure progress in a conclusive manner? Existing video benchmarks contain human actions that are short-term (e.g., run, push, pull), and some can be recognized from a single frame (e.g., playing a guitar). This makes it difficult to draw conclusive insights from benchmark results. What kinds of tasks would be compelling and comprehensive for video understanding?
Video processing is notoriously resource-heavy. How can we develop compute- and memory-efficient video models to speed up large-scale distributed training?

These are exciting research and engineering challenges for which I hope to see significant advances in 2022.

Yale Song is a researcher at Microsoft Research in Redmond, where he works on large-scale problems in computer vision and artificial intelligence.
