Large language models, or LLMs, have transformed how we process text. Large vision models, or LVMs, are starting to change how we process images as well. But there is an important difference between LLMs and LVMs:
- Internet text is similar enough to companies' proprietary text that an LLM trained on internet text can usually understand your proprietary documents.
- But many practical vision applications use images that look nothing like internet images. In these settings, you might do much better with a domain-specific LVM that has been adapted to your particular application domain.
This week, Dan Maloney and I announced Landing AI's work on developing domain-specific LVMs. You can learn more about it in this short video (4 minutes).
The internet – especially sites like Instagram – has numerous pictures of people, pets, landmarks, and everyday objects. So a generic LVM (usually a large vision transformer trained using a self-supervised learning objective on unlabeled images scraped from the internet) learns to recognize salient features in such images.
But many industry-specific applications of computer vision involve images that look little like internet images. Pathology applications, for instance, process images of tissue samples captured using high-powered microscopes. Alternatively, manufacturing inspection applications might work with numerous images centered on a single object or part of an object, all of which were imaged under similar lighting and camera configurations.
While some pathology and some manufacturing images can be found on the internet, their relative scarcity means that most generic LVMs do poorly at recognizing the most important features in such images.
In experiments conducted by Landing AI's Mark Sabini, Abdelhamid Bouzid, and Bastian Renjifo, LVMs adapted to images of a particular domain, such as pathology or semiconductor wafer inspection, do much better at finding relevant features in images of that domain. Building these LVMs can be done with around 100,000 unlabeled images from that domain, and larger datasets likely would result in even better models.
Further, if you use a pretrained LVM together with a small labeled dataset to tackle a supervised learning task, a domain specific LVM needs significantly less (around 10 percent to 30 percent as much) labeled data to achieve performance comparable to using a generic LVM.
Consequently, I believe domain specific LVMs can help businesses with large, proprietary sets of images that look different from internet images unlock considerable value from their data.
Of course, LVMs are still young, and much innovation lies ahead. My team is continuing to experiment with different ways to train domain-specific LVMs, as well as exploring how to combine such models with text to form domain-specific large multimodal models. I'm confident that LVMs will achieve many more breakthroughs in the coming years.