Video search engines are often evaluated based on how they rank a single video when presented with a brief description that accompanies that video in the test set. But this criterion may not reflect a system's utility in the real world, where numerous videos may be highly relevant to the search terms. New work aims to solve this problem.

What’s new: Researchers at the University of Bristol led by Michael Wray propose a new benchmark, Semantic Similarity Video Retrieval (SVR), that evaluates video retrieval systems by their ability to rank many similar videos. They also built a system that performed well on it.

Key insight: To evaluate a video retrieval system based on how similar the top-ranked videos are to an input description, the evaluation process needs a ground-truth measure of similarity between descriptions and videos. There isn’t an automatic way to compare a description to a video, but there are several ways to compare a description to other descriptions. The authors assessed the similarity between existing descriptions to approximate ground-truth similarity between descriptions and videos. This enabled them to train their system to rank the similarity of input text to a variety of videos, and to evaluate the quality of its search results.

How it works: The authors generated separate representations for descriptions and videos and trained the system to increase the similarity between the representations of matching descriptions and videos. Given a description, the system learned to rank clips whose video representation best matched that of the input (and vice versa). They trained and tested it on videos with descriptions from movies, news, how-tos, and other sources.

  • The authors calculated the similarity between each description and every other description using METEOR. If the similarity between two descriptions exceeded a threshold, they treated each description as a match for the video that carried the other description.
  • They used these matches to train a system that included a GPT-based language model, which generated representations of descriptions, and a combination of convolutional neural networks, which generated representations of videos. A triplet loss encouraged the system to produce similar representations of matched descriptions and videos and dissimilar representations of unmatched ones.
  • Given input text (for the purpose of evaluation, an existing description), they ranked the top-matching videos according to the cosine similarity between the representation of the text and the representations of the videos.
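
The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: `word_overlap` is a toy stand-in for METEOR, the threshold and margin values are arbitrary, the hand-written vectors stand in for learned representations, and the triplet loss is written in one common hinge form on cosine similarity. All function names here are hypothetical.

```python
import math

def word_overlap(a: str, b: str) -> float:
    """Toy stand-in for METEOR (assumption): Jaccard overlap of words, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def proxy_matches(descriptions, threshold=0.5):
    """Map each description index to the video indices whose own
    description is more similar than the (arbitrary) threshold."""
    return {
        i: [j for j, other in enumerate(descriptions)
            if word_overlap(desc, other) > threshold]
        for i, desc in enumerate(descriptions)
    }

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: zero once the matched pair is more
    similar than the unmatched pair by at least the margin."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

def rank_videos(text_vec, video_vecs):
    """Return video indices sorted from most to least similar to the text."""
    return sorted(range(len(video_vecs)),
                  key=lambda j: cosine(text_vec, video_vecs[j]),
                  reverse=True)

# Hypothetical descriptions; video i is the one that carries description i.
descriptions = [
    "a man slices an onion",
    "a man slices an onion quickly",
    "a dog runs on the beach",
]
matches = proxy_matches(descriptions)  # {0: [0, 1], 1: [0, 1], 2: [2]}

# Hand-written 2-D vectors standing in for learned representations.
text_vec = [1.0, 0.0]
video_vecs = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
ranking = rank_videos(text_vec, video_vecs)  # [0, 2, 1]
```

In this toy setup, the first two descriptions match each other's videos because they share most of their words, while the third matches only its own video.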

Results: The authors measured how well their system ranked each video with respect to every description (and vice versa) using normalized discounted cumulative gain (nDCG). This metric rewards high rankings of similar items (as measured by METEOR between their descriptions) and penalizes high rankings of dissimilar ones. The authors’ system scored 0.840 out of a perfect 1.0. A baseline system that used two vanilla neural networks to create video and description embeddings scored 0.833.
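
To make the metric concrete, here is a minimal sketch of nDCG in its standard linear-gain form. Whether the authors used exactly this variant is an assumption, and the relevance scores below are made up; in the benchmark they would come from the similarity between descriptions.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(ranked_relevances):
    """DCG of the system's ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Relevance of each retrieved video, listed in the order the system ranked it.
best_case = ndcg([1.0, 0.8, 0.1])   # most relevant first: scores 1.0
worst_case = ndcg([0.1, 0.8, 1.0])  # most relevant last: scores roughly 0.71
```

Because lower ranks are discounted, a system scores higher when it places several relevant videos near the top, not just the single video whose description was used as the query.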

Why it matters: Rather than designing a system to ace a common test, the authors devised a new test that better reflects what users expect from such systems. That approach should lead to more useful systems all around.

We’re thinking: The more machine learning improves, the more we need benchmarks that are capable of measuring the latest improvements.

