The same models trained on the same data may show the same performance in the lab, and yet respond very differently to data they haven’t seen before. New work finds this inconsistency to be pervasive.
What’s new: Researchers explored this largely unexamined phenomenon, which they call underspecification. The team, led by Alexander D’Amour, Katherine Heller, and Dan Moldovan, spanned Google, MIT, Stanford, University of California San Diego, U.S. Department of Veterans Affairs, Aravind Eye Hospital, and Shri Bhagwan Mahavir Vitreo-Retinal Services.
Key insight: A well specified model pipeline — a model architecture, hyperparameters, training and test sets, and training procedure — should produce models that behave consistently. In practice, though, the same pipeline can produce many distinct models that achieve near-optimal performance, only some of which generalize to real-world conditions. Building a plethora of models and testing each one is the only way to know which is which.
How it works: The authors built many models per pipeline across a range of machine learning applications. Then they compared their performance on an appropriate test set and alternative data. The tests fell into three categories:
- The authors probed whether models produced using the same pipeline performed equally well on particular subsets of a test set. For example, with vision models that were trained to recognize an eye disease, they compared performance on images taken by different cameras.
- They compared performance on an established test set and a similar one with a different distribution. For instance, they compared the performance of ImageNet-trained models on both ImageNet and ObjectNet, which depicts some ImageNet classes from different angles and against different backgrounds.
- They also compared performance on examples that were modified. For instance, using a model that was trained to evaluate similarity between two sentences, they switched genders, comparing the similarity of “a man is walking” and “a doctor is walking” versus “a woman is walking” and “a doctor is walking.”
Results: The authors found highly variable performance in models produced by identical model pipelines for several practical tasks in language, vision, and healthcare. For instance, they trained 50 ResNet-50 models on ImageNet using the same pipeline except for differing random seeds. On ImageNet’s test set, the standard deviation from top-1 accuracy was 0.001. On ImageNet-C, which comprises corrupted ImageNet examples that are still recognizable to humans, the standard deviation was 0.024. A given model’s performance on one dataset didn’t correlate with its performance on the other.
Why it matters: If our models are to be useful and trustworthy, they must deliver consistent results. Underspecification is a significant barrier to that goal.
We’re thinking: This work offers a helpful framework to evaluate the model performance on similar-but-different data. But how can we specify model pipelines to produce consistent models? We eagerly await further studies in this area.