Multimodal Event Representation Learning Over Time (MERLOT)

Animation showing MERLOT matching contextualized captions with their corresponding video frames

Richer Video Representations: Pretraining Method Improves AI's Ability to Understand Video

To understand a movie scene, viewers often must remember or infer previous events and extrapolate potential consequences. New work improved a model’s ability to do the same.
