Machine learning researchers report better and better results, but some of that progress may be illusory.
What’s new: Some models that appear to set a new state of the art haven’t been compared properly to their predecessors, Science News reports, drawing on several published surveys. Under more rigorous comparison, they sometimes perform no better than earlier work.
Questionable progress: Some apparent breakthroughs are driven by aggregations of performance-boosting tweaks rather than core innovations. Others are mirages caused by the difficulties of comparing systems built on disparate datasets, tuning methods, and performance baselines.
- A recent MIT study found that techniques for pruning neural networks haven’t improved in a decade, despite claims that newer systems are superior. Researchers haven’t used consistent metrics and benchmarks in their comparisons, the authors said.
- A meta-analysis published in July found that recent information-retrieval models were compared using weak baselines. On the TREC Robust04 benchmark, they didn’t outscore a bag-of-words query expansion model developed in 2009.
- A 2018 study determined that the original generative adversarial network performed as well as newer versions when it was allowed more computation to search over hyperparameters.
- The LSTM architecture that debuted in 1997 scored higher than upstart models on language modeling benchmarks once researchers adjusted its hyperparameters.
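A common thread in the GAN and LSTM findings above is that comparisons are only meaningful when every model gets the same hyperparameter-search budget. Below is a minimal sketch of that idea; the helper name `fair_compare`, the toy models, and the scoring functions are all hypothetical illustrations, not code from any of the cited studies.

```python
import random

def fair_compare(models, sample_hparams, evaluate, budget=100, seed=0):
    """Compare models under an identical random-search budget.

    models: dict mapping a model name to a callable (hypothetical stand-in
        for train-and-evaluate).
    sample_hparams: callable taking an rng and returning one random
        hyperparameter setting.
    evaluate: callable (model_fn, hparams) -> score, higher is better.
    Each model is searched with the same seed, so all models see the
    same candidate hyperparameter settings.
    """
    best = {}
    for name, model_fn in models.items():
        rng = random.Random(seed)  # reseed so every model gets identical trials
        trials = (sample_hparams(rng) for _ in range(budget))
        best[name] = max(evaluate(model_fn, hp) for hp in trials)
    return best

# Toy example: an "old" and a "new" model whose scores peak at different
# learning rates. Neither is a real system; they only illustrate that,
# given enough search budget, both reach roughly their best achievable score.
def old_model(hp):
    return 1.0 - abs(hp["lr"] - 0.1)

def new_model(hp):
    return 1.0 - abs(hp["lr"] - 0.3)

def sample(rng):
    return {"lr": rng.uniform(0.0, 0.5)}

scores = fair_compare(
    {"old": old_model, "new": new_model},
    sample,
    lambda model_fn, hp: model_fn(hp),
)
```

With a generous shared budget, both toy models approach their optima, mirroring the studies’ point: an older architecture can close the gap once it is tuned as thoroughly as its challenger.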
Why it matters: Machine learning can advance only through careful evaluation of earlier performance and clear measures of superiority. Erroneous claims of higher performance, even if they’re unintentional, impede real progress and erode the industry’s integrity.
We’re thinking: State-of-the-art approaches don’t necessarily lead to better results. How hyperparameters are tuned, how datasets are organized, how models are run, and how performance is measured are also critical.