It’s time to move beyond the stereotype that machine learning systems need a lot of data. While having more data is helpful, large pretrained models make it practical to build viable systems using a very small labeled training set — perhaps just a handful of examples specific to your application.
About 10 years ago, with the rise of deep learning, I was one of the leading advocates for scaling up data and compute to drive progress. That recipe has carried us far, and it continues to drive progress in large language models, which are based on transformers. A similar recipe is emerging in computer vision based on large vision transformers.
But once those models are pretrained, it takes very little data to adapt them to a new task. With self-supervised learning, pretraining can happen on unlabeled data. So the model does still need a lot of data for training, but it's unlabeled, general-purpose text or image data. Then, even with only a small amount of labeled, task-specific data, you can get good performance.
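One common way to carry out this adaptation is a linear probe: freeze the pretrained model, use its embeddings as features, and fit a small classifier head on a handful of labeled examples. The sketch below uses hand-made 2-D vectors as stand-ins for real embeddings (which would come from a frozen pretrained encoder), so treat it as an illustration of the pattern, not a recipe:

```python
import math

# Stand-in "pretrained embeddings": in practice these would come from a
# frozen pretrained encoder; here they are hand-made 2-D toy vectors.
labeled_examples = [
    ([0.9, 0.1], 1),   # positive sentiment
    ([0.8, 0.2], 1),
    ([0.1, 0.9], 0),   # negative sentiment
    ([0.2, 0.8], 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Linear probe: train only a tiny classifier head; the "encoder" stays frozen.
w, b = [0.0, 0.0], 0.0
lr = 0.5
for _ in range(200):
    for x, y in labeled_examples:
        p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        err = p - y  # gradient of the logistic loss w.r.t. the logit
        w[0] -= lr * err * x[0]
        w[1] -= lr * err * x[1]
        b -= lr * err

def predict(x):
    """Return 1 (positive) or 0 (negative) for an embedding vector."""
    return int(sigmoid(w[0] * x[0] + w[1] * x[1] + b) >= 0.5)
```

With only four labeled points, the probe separates the two clusters; the same idea scales to real embeddings with a few dozen labels.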
For example, say you have a transformer trained on a massive amount of text, and you want it to perform sentiment classification on your own dataset. The most common techniques are:
- Fine-tuning the model on your dataset. Depending on your application, this can be done with dozens of examples or even fewer.
- Few-shot learning. In this approach, you create a prompt that includes a few examples (that is, you write a text prompt that lists a handful of pieces of text and their sentiment labels). This approach is also known as in-context learning.
- Zero-shot learning, in which you write a prompt that describes the task you want done.
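The two prompting approaches can be sketched as plain string construction. The review texts and label names below are made up for illustration; any real system would send the resulting prompt to a language model:

```python
# Hypothetical labeled examples for a few-shot sentiment prompt.
examples = [
    ("The food was wonderful.", "positive"),
    ("I waited an hour and left hungry.", "negative"),
]

def few_shot_prompt(examples, query):
    """Build an in-context learning prompt: a few labeled examples,
    then the new input, leaving the label for the model to fill in."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

def zero_shot_prompt(query):
    """Zero-shot: describe the task in the prompt with no labeled examples."""
    return ("Classify the sentiment of this review as positive or negative.\n"
            f"Review: {query}\nSentiment:")

prompt = few_shot_prompt(examples, "Great service and a friendly staff.")
```

The few-shot prompt ends with an unfinished `Sentiment:` line, so the model's most natural continuation is the label itself.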
These techniques work well. For example, customers of my team at Landing AI have been building vision systems with just dozens of labeled examples for years.
The 2010s were the decade of large supervised models; I think the 2020s are shaping up to be the decade of large pretrained models. However, there is one important caveat: This approach works well for unstructured data (text, vision, and audio) but not for structured data, and the majority of machine learning applications today are built on structured data.
Models that have been pretrained on diverse unstructured data found on the web generalize to a variety of unstructured data tasks of the same input modality. This is because text/images/audio on the web have many similarities to whatever specific text/image/audio task you might want to solve. But structured data such as tabular data is much more heterogeneous. For instance, the dataset of Titanic survivors probably has little in common with your company’s supply chain data.
Now that it's possible to build and deploy machine learning models with very few examples, it's also increasingly possible to build and launch products very quickly, perhaps without even collecting and using a test set. This is an exciting shift. I'm confident that it will lead to many more exciting applications, including ones where we don't have much labeled data.