Beating human-level performance (HLP) has long been a goal of academic research in machine learning, on tasks ranging from speech recognition to X-ray diagnosis. When your model outperforms humans, you can argue that you’ve reached a significant milestone and publish a paper! But when building production systems, I’ve found that exceeding HLP isn’t always such a useful goal. I believe the time has come to rethink it.
Landing AI, where I’m CEO, has been automating visual inspection for manufacturers. We’ve built computer vision systems that can look at photos of products on an assembly line and classify defects such as scratches and dents. But we’ve run into an interesting challenge: Human experts don’t always agree on the appropriate label to describe the damage. “Is this really a scratch?” If even human experts disagree on a label, what is an AI system to do?
In the past, when I built speech recognition systems, I encountered a similar problem. In some audio clips, the person speaking mumbles, or noise in the background overwhelms their words. Despite several listens, no human can transcribe them with confidence. Even when the words spoken are clear, transcriptions can be inconsistent. Is the correct transcription, “Um, today’s weather,” or “Erm . . . today’s weather”? If humans transcribe the same speech in different ways, how is a speech recognition system supposed to choose among the options?
In academic research, we often test AI using a benchmark dataset with (noisy) labels. If a human achieves 90 percent accuracy measured against those labels and our model achieves 91 percent, we can celebrate beating HLP!
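To make the benchmark comparison concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the ten defect labels, the two examples flagged as ambiguous, and the human and model predictions are made-up toy data, not results from any real system. It only illustrates how a score measured against noisy labels can shift once the ambiguous examples are set aside.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def accuracy_on_clear(predictions, labels, ambiguous):
    """Accuracy restricted to examples that are not flagged as ambiguous."""
    kept = [(p, y) for p, y, a in zip(predictions, labels, ambiguous) if not a]
    return sum(p == y for p, y in kept) / len(kept)


# Hypothetical benchmark labels for ten product photos.
labels = ["scratch", "dent", "scratch", "scratch", "dent",
          "dent", "scratch", "dent", "scratch", "dent"]

# Assumption: experts disagree on examples 2 and 6, so the benchmark
# label there reflects one annotator's arbitrary choice.
ambiguous = [i in (2, 6) for i in range(len(labels))]

# Hypothetical human inspector: differs from the benchmark only on the
# two ambiguous examples.
human = list(labels)
human[2] = "dent"
human[6] = "dent"

# Hypothetical model: matches the benchmark on the ambiguous examples
# but gets one clear-cut example wrong.
model = list(labels)
model[1] = "scratch"

print(f"human, all labels:   {accuracy(human, labels):.3f}")
print(f"model, all labels:   {accuracy(model, labels):.3f}")
print(f"human, clear labels: {accuracy_on_clear(human, labels, ambiguous):.3f}")
print(f"model, clear labels: {accuracy_on_clear(model, labels, ambiguous):.3f}")
```

In this toy setup the model scores 0.900 against the human's 0.800 on the full benchmark, yet on the eight unambiguous examples the human scores 1.000 and the model 0.875. The numbers are contrived, but they show why a one-point win against noisy labels is weak evidence on its own.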
But when building commercial systems, I’ve found this concept to be only occasionally useful. For example, if an X-ray diagnosis system outperforms human radiologists, does that prove, via incontrovertible logic, that hospital administrators should use it? Hardly. In practice, hospital administrators care about more than beating HLP on test-set accuracy. They also care about safety, bias, performance on rare classes, and other factors on which beating HLP isn’t feasible. So even if you beat HLP on test-set accuracy, your system isn’t necessarily superior to what humans do in the real world.
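The point about rare classes can be sketched with a small Python example. The data is entirely hypothetical: twenty cases, two of which belong to a rare class, with made-up predictions for a model and a human reader. The class names `normal` and `rare` are placeholders chosen for illustration.

```python
def accuracy(predictions, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, truth)) / len(truth)


def recall(predictions, truth, cls):
    """Fraction of the true `cls` examples that were predicted as `cls`."""
    relevant = [p for p, y in zip(predictions, truth) if y == cls]
    if not relevant:
        raise ValueError(f"no examples of class {cls!r} in truth")
    return sum(p == cls for p in relevant) / len(relevant)


# Hypothetical test set: 18 common cases followed by 2 rare ones.
truth = ["normal"] * 18 + ["rare"] * 2

# Hypothetical model that always predicts the common class.
model_pred = ["normal"] * 20

# Hypothetical human reader: raises three false alarms on common cases
# but catches both rare cases.
human_pred = ["rare"] * 3 + ["normal"] * 15 + ["rare"] * 2

for name, pred in [("model", model_pred), ("human", human_pred)]:
    print(f"{name}: accuracy={accuracy(pred, truth):.3f}, "
          f"rare-class recall={recall(pred, truth, 'rare'):.3f}")
```

Here the model wins on overall accuracy (0.900 versus 0.850) while missing every rare case, and the human catches both. A single test-set accuracy number hides exactly the kind of difference an administrator would care about.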
I’ve found that there are better ways to use the concept of HLP. Briefly, our goal as machine learning engineers should be to raise, rather than beat, HLP. I’ll expand on that thought in a future letter.
Working on visual inspection, my team has developed a lot of insights into applications of AI in this domain. I’ll keep sharing insights that are generally useful for machine learning practitioners here and in DeepLearning.AI’s courses. But I would like to share manufacturing-specific insights with people who are involved in that field. If you work in ML or IT in manufacturing, please drop me a note at email@example.com. I’d like to find a way to share insights and perhaps organize a discussion group.