Over the last several decades, driven by a multitude of benchmarks, supervised learning algorithms have become really good at achieving high accuracy on test datasets. As valuable as this is, unfortunately maximizing average test set accuracy isn’t always enough.
I’ve heard too many conversations like this:
Machine learning engineer: It did well on the test set!
Product manager: But it doesn’t work for my application.
Machine learning engineer: But . . . It did well on the test set!
What else is there?
Robustness and generalization: In a production deployment, performance can degrade due to concept drift (where the function mapping from x->y changes; say, the model predicts housing prices y and inflation causes prices to rise) and data drift (where the input distribution changes). One important subset of data drift relates to performance on classes that are rare in or absent from the training set. For example, a speech recognition system may achieve high average accuracy despite poor performance on speakers with a British accent, because the training and test sets included few examples of British speakers. If the product takes off in the U.K. and a lot more British speakers jump in, its accuracy will plummet. A more robust system would fare better.
Performance on relatively important examples: Some examples are more important than others, and even if average test set accuracy is high, a system that performs poorly on important examples may be unacceptable. For example, users might forgive a search engine that doesn’t always return the best results to informational and transactional queries like “apple pie recipe” or “wireless data plan.” But when they enter a navigational query such as “stanford,” “youtube,” or “reddit,” they have a specific website in mind, and the search engine had better return the right URL or risk losing the user’s trust. In theory, weighting test examples according to their importance can address this issue, but it doesn’t always work in practice.
Performance on key slices of data: Say a machine learning system predicts whether a prospective borrower will repay a loan, so as to decide whether to approve applications. Even if average accuracy is high, if the system is disproportionately inaccurate on applications by a specific minority group, we would be foolhardy to blindly deploy it. While the need to avoid bias toward particular groups of people is widely discussed, this issue applies in contexts beyond fairness to individuals. For example, if an ecommerce site recommends products, we wouldn’t want it to recommend products from large sellers exclusively and never products from small sellers. In this example, poor performance on important slices of the data — such as one ethnicity or one class of seller — can make a system unacceptable despite high average accuracy.
My advice: If a product manager tells us that our AI system doesn’t work in their application, let’s recognize that our job isn’t only to achieve high average test accuracy — our job is to solve the problem at hand. To achieve this, we may need visualizations, larger datasets, more robust algorithms, performance audits, deployment processes like human-in-the-loop, and other tools.