What You See is What You Say

Example video-clip retrieval results on HowTo100M using the researchers' trained joint embedding

Teaching a computer meaningful associations between words and videos typically requires training on tens of thousands of videos painstakingly annotated by hand. That’s both labor-intensive and prone to inconsistency due to the often abstract relationship between video imagery and soundtrack. Researchers at the École Normale Supérieure and elsewhere devised an effective shortcut.

What’s new: A team led by Antoine Miech and Dimitri Zhukov assembled 136 million video clips from narrated instructional videos to produce the HowTo100M data set. In this large corpus, the spoken words correspond closely to the images, making it possible to train models that yield stellar results on a variety of video-language tasks.

Key insights: Previous research has taken advantage of the tight semantic correspondence between words and imagery in instructional videos. But that work has extracted only a small number of labels rather than analyzing complete transcriptions. Miech et al. found that:

  • Systematically pruning topics and auto-generating captions make it feasible to assemble large video-text datasets.
  • Large datasets are crucial to performance. Accuracy didn’t plateau when the researchers upped the clip count from 20,000 to over 1 million, so more training examples likely would boost performance further.

How it works: The researchers found instructional videos on YouTube. Their model consists of a collection of pretrained subnetworks that extract video and word features. It’s trained to maximize the similarity between video and word vectors that describe the same clip.

  • The researchers collected narrated instructional videos in domains associated with actions (like cooking) rather than ideas (like finance). They focused on visual tasks (baking a cake) rather than abstract tasks (choosing a gift).
  • They used an existing speech recognition model to generate video captions from the narration.
  • To extract video feature vectors, the researchers used pretrained 2D ResNet-152 and 3D ResNeXt-101 models.
  • To encode captions, they trained a 1D CNN that takes pretrained Word2Vec word embeddings as input.
  • The model is trained on an objective that maximizes the cosine similarity between corresponding word and video vectors while driving down the similarity of mismatched pairs, as in the sketch below.
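
To make the training recipe concrete, here is a minimal PyTorch sketch of such a joint embedding: a 1D CNN turns Word2Vec token embeddings into a sentence vector, precomputed clip features are projected into the same space, and a max-margin ranking loss (one common choice for this kind of objective) pushes matching clip-caption pairs to score higher than mismatched pairs drawn from the batch. The class names, dimensions, pooling, and in-batch negatives are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a joint text-video embedding (not the authors' exact code).
# Assumes precomputed clip features (e.g., concatenated 2D/3D CNN outputs) and
# Word2Vec vectors for each caption token; all dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """1D CNN over Word2Vec token embeddings, max-pooled into one sentence vector."""
    def __init__(self, word_dim=300, embed_dim=512):
        super().__init__()
        self.conv = nn.Conv1d(word_dim, embed_dim, kernel_size=3, padding=1)
        self.fc = nn.Linear(embed_dim, embed_dim)

    def forward(self, words):                          # words: (batch, num_tokens, word_dim)
        x = F.relu(self.conv(words.transpose(1, 2)))   # (batch, embed_dim, num_tokens)
        x = x.max(dim=2).values                        # max-pool over tokens
        return F.normalize(self.fc(x), dim=-1)         # unit-length sentence vectors

class VideoEncoder(nn.Module):
    """Linear projection of precomputed clip features into the shared space."""
    def __init__(self, feat_dim=4096, embed_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, embed_dim)

    def forward(self, feats):                          # feats: (batch, feat_dim)
        return F.normalize(self.fc(feats), dim=-1)     # unit-length clip vectors

def ranking_loss(video_emb, text_emb, margin=0.2):
    """Max-margin ranking loss: a matching clip-caption pair should score higher
    (by `margin`) than any mismatched pair within the batch."""
    sim = video_emb @ text_emb.t()                     # cosine similarities (unit vectors)
    pos = sim.diag().unsqueeze(1)                      # scores of matching pairs
    cost_caption = (margin + sim - pos).clamp(min=0)   # wrong caption for a clip
    cost_clip = (margin + sim - pos.t()).clamp(min=0)  # wrong clip for a caption
    mask = 1 - torch.eye(sim.size(0), device=sim.device)
    return ((cost_caption + cost_clip) * mask).mean()
```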

Results: The team bettered the previous state of the art by as much as 50 percent on several clip retrieval benchmarks. Models pretrained on HowTo100M and fine-tuned on other data showed significant improvements on target tasks compared to models trained from scratch.

Why it matters: At 136 million clips, HowTo100M is the largest public video-text data set, dwarfing previous efforts by orders of magnitude. The resulting video-text embeddings could dramatically improve the accuracy of neural networks devoted to tasks like video captioning, scene search, and summarizing clips.

Takeaway: HowTo100M widens the intersection of computer vision and natural language. It could lead to better search engines that retrieve relevant scenes described in natural language, or systems that generate artificial videos in response to natural-language queries. More broadly, it’s a step closer to machines that can talk about what they see.
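
As a toy illustration of such retrieval, reusing the hypothetical encoders from the sketch above (and assuming they have already been trained), a natural-language query can be embedded and clips ranked by cosine similarity. Here `clip_feats` and `query_vectors` are placeholders for precomputed clip features and Word2Vec-embedded query tokens.

```python
# Hypothetical text-to-clip retrieval with the trained encoders sketched earlier.
# clip_feats: (num_clips, feat_dim) precomputed clip features
# query_vectors: (num_tokens, word_dim) Word2Vec embeddings of the query words
clip_index = video_enc(clip_feats)                    # (num_clips, embed_dim)
query_emb = text_enc(query_vectors.unsqueeze(0))      # (1, embed_dim)
scores = (clip_index @ query_emb.t()).squeeze(1)      # cosine similarity per clip
top_clips = scores.topk(k=5).indices                  # indices of the best-matching clips
```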
