In this series exploring why machine learning projects fail, let’s examine the challenge of “small data.”
Given 1 million labeled images, many teams can build a good classifier using open source. But say you are building a visual inspection system for a factory to detect scratches on smartphones. No smartphone manufacturer has made 1 million scratched phones (that would have to be thrown away), so a dataset of 1 million images of scratched phones does not exist. Getting good performance with 100 or even 10 images is needed for this application.
Deep learning has seen tremendous adoption in consumer internet companies with a huge number of users and thus big data, but for it to break into other industries where dataset sizes are smaller, we now need better techniques for small data.
In the manufacturing system described above, the absolute number of examples was small. But the problem of small data also arises when the dataset in aggregate is large, but the frequency of specific important classes is low.
Say you are building an X-ray diagnosis system trained on 100,000 total images. If there are few examples of hernia in the training set, then the algorithm can obtain high training- and test-set accuracy, but still do poorly on cases of hernia.
Small data (also called low data) problems are hard because most learning algorithms optimize a cost function that is an average over the training examples. As a result, the algorithm gives low aggregate weight to rare classes and under-performs on them. Giving 1,000 times higher weight to examples from very rare classes does not work, as it introduces excessive variance.
We see this in self-driving cars as well. We would like to detect pedestrians reliably even when their appearance (say, holding an umbrella while pushing a stroller) has low frequency in the training set. We have huge datasets for self-driving, but getting good performance on important but rare cases continues to be challenging.
How do we address small data? We are still in the early days of building small data algorithms, but some approaches include:
- Transfer learning, in which we learn from a related task and transfer knowledge over. This includes variations on self-supervised learning, in which the related tasks can be “made up” from cheap unlabeled data.
- One- or few-shot learning, in which we (meta-)learn from many related tasks with small training sets in the hope of doing well on the problem of interest. You can find an example of one-shot learning in the Deep Learning Specialization.
- Relying on hand-coded knowledge, for example through designing more complex ML pipelines. An AI system has two major sources of knowledge: (i) data and (ii) prior knowledge encoded by the engineering team. If we have small data, then we may need to encode more prior knowledge.
- Data augmentation and data synthesis.
Benchmarks help drive progress, so I urge the development of small data benchmarks in multiple domains. When the training set is small, ML performance is more variable, so such benchmarks must allow researchers to average over a large number of small datasets to obtain statistically meaningful measures of progress.
My teams are working on novel small data techniques, so I hope to have details to share in the future.