To understand a movie scene, viewers often must remember or infer previous events and extrapolate potential consequences. New work improved a model’s ability to do the same.
What's new: Rowan Zellers and colleagues at the University of Washington developed Multimodal Event Representation Learning Over Time (MERLOT), a pretraining method that concentrates knowledge gleaned from videos without requiring labeled data. The resulting representations helped fine-tuned models perform a variety of video-reasoning tasks with state-of-the-art accuracy.
Key insight: Earlier work generated representations of videos by learning either to match video frames with associated text or to re-order scrambled frames into their original sequence. Training on both tasks can enable a model to generate richer representations that integrate visual, linguistic, and temporal information.
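To make the frame-ordering idea concrete, here is a minimal sketch of how scrambled frames can supply their own training labels. It is an illustration only, not the authors' code: the function names `make_ordering_example` and `restore_order`, the pairwise before/after labeling, and the string stand-ins for frames are all assumptions made for this example.

```python
import random

def make_ordering_example(frames, seed=0):
    """Scramble frames and derive pairwise order labels (hypothetical helper)."""
    rng = random.Random(seed)
    # positions[k] is the original index of the frame now sitting in slot k.
    positions = list(range(len(frames)))
    rng.shuffle(positions)
    scrambled = [frames[p] for p in positions]
    # Label 1 if the frame in slot a originally came before the frame in slot b.
    labels = {
        (a, b): int(positions[a] < positions[b])
        for a in range(len(positions))
        for b in range(a + 1, len(positions))
    }
    return scrambled, positions, labels

def restore_order(scrambled, positions):
    """Undo the scramble using the original indices (hypothetical helper)."""
    return [frame for _, frame in sorted(zip(positions, scrambled))]

# Strings stand in for frame representations.
frames = ["f0", "f1", "f2", "f3"]
scrambled, positions, labels = make_ordering_example(frames)
restored = restore_order(scrambled, positions)
```

No human annotation is involved: the labels come entirely from the order in which the frames appeared in the video, which is what makes the task self-supervised.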
How it works: The authors divided six million YouTube videos into 180 million individual frames, each paired with corresponding text from a transcript.
- During pretraining, a ResNet-50 (the image encoder in the illustration above) generated an initial representation of each frame.
- A transformer (the language-only encoder) produced a representation of the associated text (taking into account the entire transcript up to that point).
- In contrastive fashion, the loss function encouraged matching frame and text representations to be similar and mismatches to be dissimilar.
- Another transformer received each frame representation and its corresponding text (not the text representation). It learned to guess masked words in the text as well as the proper order of the frames.
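The contrastive step described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the function name `contrastive_loss`, the temperature of 0.05, the use of cosine similarity, the symmetric averaging over both matching directions, and the three-dimensional toy vectors are all choices made for this example.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def contrastive_loss(frame_reps, text_reps, temperature=0.05):
    """Average cross-entropy that treats (frame i, text i) as the matching pair."""
    frames = [normalize(v) for v in frame_reps]
    texts = [normalize(v) for v in text_reps]
    # logits[i][j]: temperature-scaled similarity between frame i and text j.
    logits = [
        [sum(a * b for a, b in zip(f, t)) / temperature for t in texts]
        for f in frames
    ]

    def diagonal_cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    # Score both directions: frame-to-text and text-to-frame.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))

# Toy representations: each text vector points roughly the same way as its frame.
frame_reps = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_reps = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
matched = contrastive_loss(frame_reps, text_reps)
mismatched = contrastive_loss(frame_reps, text_reps[1:] + text_reps[:1])
```

When each frame lines up with its own text, the loss is small. Rotating the texts so that every frame is paired with the wrong one drives the loss up, which is the signal that pushes matching representations together and mismatches apart.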
Results: MERLOT set a new state of the art for 14 tasks that involved answering questions about individual frames, answering questions about sequences of frames, and ordering disordered frames. It did especially well on question-answering tasks designed to test spatial and temporal reasoning on GIFs from Tumblr. For instance, MERLOT answered multiple-choice questions about the action performed in a clip with 94.0 percent accuracy versus the previous best score of 82.8 percent accuracy. In other areas, the improvement was less dramatic. For example, on Drama-QA, it answered multiple-choice questions about the story in clips from a TV show with 81.4 percent accuracy versus the previous best score of 81.0 percent accuracy.
Why it matters: MERLOT learned to pack a range of essential information about video images, accompanying text, and frame order into the representations it generated. The world is swimming in unlabeled video-plus-audio, and self-supervised learning algorithms like this could unlock tremendous value from such data.
We're thinking: We’re glad the authors didn’t keep this work bottled up.