Building AI systems is hard. Despite all the hype, AI engineers struggle with difficult problems every day. For the next few weeks, I’ll explore some of the major challenges. Today’s topic: The challenge of building AI systems that are robust to real-world conditions.
The accuracy of supervised learning models has grown by leaps and bounds thanks to deep learning. But there’s still a huge gap between building a model in a Jupyter notebook and shipping a valuable product.
Multiple research groups, including mine, have published articles reporting deep learning’s ability to diagnose conditions from X-rays or other medical images at a level of accuracy comparable or superior to radiologists. Why aren’t these systems widely deployed?
I believe robustness is a major impediment. For example, if we collect data from a top research hospital that has well-trained X-ray technicians and high-quality X-ray machines, and we train and test a state-of-the-art model on data from this hospital, then we can show comparable or superior performance to a radiologist.
But if we ship this algorithm to an older hospital with less well-trained technicians or older machines that produce different-looking images, then the neural network likely will miss some medical conditions it spotted before and see others that aren’t really there. In contrast, any human radiologist could walk over to this older hospital and still diagnose well.
I have seen this sort of challenge in many applications:
- A speech recognition system was trained primarily on adult voices. After it shipped, the demographic of users started trending younger. The prevalence of youthful voices caused performance to degrade.
- A manufacturing visual inspection system was trained on images collected on-site over one month. Then the factory’s lighting changed. Performance degraded in turn.
- After engineers shipped a web page ranking system, language patterns evolved and new celebrities rose to fame. Search terms shifted, causing performance to degrade.
As a community, we are getting better at addressing robustness. Approaches include technical solutions like data augmentation, as well as post-deployment monitoring with alarms so that we can fix issues as they arise. There are also nascent attempts to specify operating conditions under which an algorithm is safe to use, and even more nascent attempts at formal verification. Robustness to adversarial attacks is another important consideration, but most practical robustness issues that I see involve non-adversarial changes in the data distribution.
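To make the monitoring idea concrete, here is a minimal sketch of one way an alarm could work. It is an illustration, not a description of any particular production system. It assumes we log a single numeric summary of each input (say, mean image brightness) and compare a recent window of live values against values seen at training time, using the two-sample Kolmogorov-Smirnov statistic. The function names and the 0.2 threshold are hypothetical choices made for this example.

```python
import bisect


def ks_statistic(reference, live):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical CDFs of the two samples:
    0.0 means the samples look identical, 1.0 means they do not overlap.
    """
    ref_sorted = sorted(reference)
    live_sorted = sorted(live)
    max_gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for value in ref_sorted + live_sorted:
        ref_cdf = bisect.bisect_right(ref_sorted, value) / len(ref_sorted)
        live_cdf = bisect.bisect_right(live_sorted, value) / len(live_sorted)
        max_gap = max(max_gap, abs(ref_cdf - live_cdf))
    return max_gap


def drift_alarm(reference, live, threshold=0.2):
    """Return True when the live window has drifted past the threshold.

    The threshold is an assumption for this sketch; in practice it would be
    tuned per feature and per application.
    """
    return ks_statistic(reference, live) > threshold


# Hypothetical data: brightness values logged during training, then a live
# window recorded after the lighting became dimmer.
training_brightness = list(range(100, 200))
dimmer_window = [b - 50 for b in training_brightness]

if drift_alarm(training_brightness, dimmer_window):
    print("Input distribution has shifted; review model performance.")
```

A check like this only detects that the inputs have changed. It says nothing about whether accuracy has dropped, so it complements periodic evaluation on freshly labeled data rather than replacing it.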
One of the challenges of robustness is that it is hard to study systematically. How do we benchmark how well an algorithm trained on one distribution performs on a different distribution? Performance on brand-new data seems to involve a huge component of luck. That’s why the amount of academic work on robustness is small relative to its practical importance. Better benchmarks will help drive academic research.
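One simple form such a benchmark can take is to train on one distribution, then report accuracy on a held-out set from the same distribution alongside accuracy on deliberately shifted copies of it. The sketch below is a toy illustration under strong assumptions: the "model" is a one-dimensional threshold classifier, the shift is a constant offset standing in for something like a lighting change, and all names and numbers are hypothetical.

```python
def accuracy(predict, examples):
    """Fraction of (feature, label) pairs the classifier gets right."""
    correct = sum(1 for x, y in examples if predict(x) == y)
    return correct / len(examples)


def fit_threshold(examples):
    """Toy 1-D classifier: threshold halfway between the two class means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0


def shift(examples, offset):
    """Simulate a distribution shift by adding a constant to every feature."""
    return [(x + offset, y) for x, y in examples]


def robustness_report(predict, in_dist_test, shifted_tests):
    """Accuracy on each test set, plus the drop relative to in-distribution."""
    baseline = accuracy(predict, in_dist_test)
    report = {"in_distribution": {"accuracy": baseline, "gap": 0.0}}
    for name, examples in shifted_tests.items():
        acc = accuracy(predict, examples)
        report[name] = {"accuracy": acc, "gap": baseline - acc}
    return report


# Hypothetical data: label 0 has low feature values, label 1 has high ones.
train = [(x, 0) for x in range(0, 40, 2)] + [(x, 1) for x in range(60, 100, 2)]
held_out = [(x, 0) for x in range(1, 40, 2)] + [(x, 1) for x in range(61, 100, 2)]

predict = fit_threshold(train)
report = robustness_report(
    predict,
    held_out,
    {"dimmer_by_20": shift(held_out, -20), "brighter_by_20": shift(held_out, 20)},
)
for name, result in report.items():
    print(name, result)
```

Synthetic shifts like these are easy to construct but only approximate what happens in deployment. The harder benchmarking problem described above is collecting test sets from genuinely different sources, such as a second hospital or a second factory.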
Many teams are still addressing robustness via intuition and experience. We, as a community, have to develop more systematic solutions.