Multimodal Event Representation Learning Over Time (MERLOT)

Animation showing how MERLOT is able to match contextualized captions with their corresponding video frames

Richer Video Representations: Pretraining Method Improves AI's Ability to Understand Video

To understand a movie scene, viewers often must remember or infer previous events and extrapolate potential consequences. New work improved a model’s ability to do the same.
