Misleading Metrics

Advances in metric learning may be illusions.

Published

Jun 24, 2020

A growing body of literature shows that some steps in AI’s forward march may actually move sideways. A new study questions advances in metric learning.

What’s new: Kevin Musgrave and researchers at Cornell Tech and Facebook AI re-examined hallmark results in models that learn to predict task-specific distance metrics, specifically networks designed to quantify the similarity between two images. They found little evidence of progress.

Key insight: Researchers have explored metric learning by experimenting with factors like architectures, hyperparameters, and optimizers. But when those factors aren’t held constant, comparisons with earlier approaches can’t be apples-to-apples. Improved results often reflect advances in the surrounding components (such as hyperparameter tuning), not in the metric learning algorithm itself.

How it works: Models that assess similarity between images typically extract features and predict a distance between them. The distances may be learned through a metric loss function, while features are often extracted from pre-trained networks. The authors reviewed 12 of the most popular papers on this topic. They point out common causes of invalid comparisons and present a new approach that levels the playing field.
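To make "metric loss function" concrete, here is a minimal sketch of one common form of contrastive loss, a pairwise loss that dates to 2006. It illustrates the general idea and is not the authors' code. The embeddings, the margin value, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, same_class, margin=1.0):
    """Contrastive loss for a single pair of embeddings.

    Same-class pairs are penalized for being far apart. Different-class
    pairs are penalized only while they sit closer than `margin`.
    """
    d = euclidean(a, b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Hypothetical 2-D embeddings, for illustration only.
anchor = [0.0, 0.0]
near = [0.3, 0.4]  # roughly 0.5 away from the anchor
far = [3.0, 4.0]   # 5.0 away from the anchor

loss_pos = contrastive_loss(anchor, near, same_class=True)
loss_neg_near = contrastive_loss(anchor, near, same_class=False)
loss_neg_far = contrastive_loss(anchor, far, same_class=False)
```

A different-class pair that already lies beyond the margin contributes zero loss, so training effort goes to pairs the model still confuses.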

Several papers compare a new method running on a ResNet50 against an earlier method running on a GoogLeNet. A ResNet50 is larger and outperforms a GoogLeNet on other image processing tasks, so it’s no surprise that the ResNet50 performs better at metric learning.
Many researchers don’t use a validation set but still tune hyperparameters. Presumably, they chose hyperparameter values based on the models’ performance on test sets — a big no-no.
These two flaws, and a list of smaller mistakes, inspired the authors to propose a consistent test bed for metric learning research. Their benchmark calls for BN-Inception networks, RMSprop optimizers, and cross-validation for hyperparameter search.
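The validation point can be sketched in a few lines: tune hyperparameters with cross-validation on the training portion only, and leave the test portion untouched until the final evaluation. This is an illustrative sketch, not the authors' benchmark code. The half-and-half split by class ID, the four folds, the candidate margins, and `toy_train_and_score` (a stand-in for real training) are all assumptions.

```python
def k_fold_splits(items, k):
    """Yield (train, val) lists; each item lands in exactly one validation fold."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, val


def select_hyperparameter(candidates, trainval, k, train_and_score):
    """Return the candidate with the best mean validation score."""
    best, best_score = None, float("-inf")
    for value in candidates:
        scores = [
            train_and_score(value, train, val)
            for train, val in k_fold_splits(trainval, k)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best, best_score = value, mean_score
    return best


def toy_train_and_score(margin, train, val):
    """Hypothetical stand-in for training: pretends a margin of 0.5 is ideal."""
    return -abs(margin - 0.5)


class_ids = list(range(100))
trainval, test = class_ids[:50], class_ids[50:]  # test classes are set aside

best_margin = select_hyperparameter(
    [0.1, 0.5, 1.0], trainval, k=4, train_and_score=toy_train_and_score
)
```

The search never reads `test`, so a score later reported on those classes is not inflated by hyperparameter selection.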

Results: The authors reproduced and benchmarked many past approaches on the CUB200, Cars196, and Stanford Online Products datasets. Controlling for confounding variables, their analysis shows that metric learning hasn’t improved since 2006 (as shown in the plots above).

Why it matters: Image similarity is a key component of many products (such as image-based search). Knowing what really works helps practitioners build useful applications and helps researchers decide where to direct further work.

We’re thinking: Metric learning still has a distance to go.