My team at Landing AI just announced a new tool for quickly building computer vision models, using a technique we call Visual Prompting. It’s a lot of fun! I invite you to try it.
Visual Prompting takes ideas from text prompting, which has revolutionized natural language processing, and applies them to computer vision.
In the traditional machine learning workflow, building a text sentiment classifier means collecting and labeling a training set, training a model, and deploying it before you get any predictions. This process can take days or weeks.
In contrast, in the prompt-based machine learning workflow, you can write a text prompt and, by calling a large language model API, start making predictions in seconds or minutes.
- Traditional workflow: Collect and label -> Train -> Predict
- Prompt-based workflow: Prompt -> Predict
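To make the contrast concrete, here is a minimal Python sketch of both workflows for sentiment classification. Everything in it is an illustrative assumption: the word-count "model" is a toy stand-in for real training, and `fake_complete` is a hypothetical placeholder for a call to a large language model API, not any particular provider's interface.

```python
from collections import Counter
from typing import Callable


# --- Traditional workflow: collect and label -> train -> predict ---
def train_word_counts(labeled: list[tuple[str, str]]) -> dict[str, Counter]:
    """Toy 'training': count how often each word appears under each label."""
    counts: dict[str, Counter] = {}
    for text, label in labeled:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


def predict_from_counts(counts: dict[str, Counter], text: str) -> str:
    """Pick the label whose training words overlap most with the input."""
    words = text.lower().split()
    return max(counts, key=lambda label: sum(counts[label][w] for w in words))


# --- Prompt-based workflow: prompt -> predict ---
def build_prompt(text: str) -> str:
    """Describe the task in plain text instead of labeling a training set."""
    return (
        "Classify the sentiment of the review as positive or negative.\n"
        f"Review: {text}\n"
        "Sentiment:"
    )


def predict_from_prompt(complete: Callable[[str], str], text: str) -> str:
    """Send the prompt to a text-completion function and clean up the reply."""
    return complete(build_prompt(text)).strip().lower()


def fake_complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; always answers 'Positive'."""
    return " Positive"
```

In the first half, nothing can be predicted until a labeled set exists and `train_word_counts` has run. In the second half, the only thing you write is the prompt, and swapping `fake_complete` for a real model call is what turns it into a working classifier.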
To explain how these ideas apply to computer vision, consider the task of recognizing cell colonies (which look like white blobs) in a petri dish, as shown in the image below. In the traditional machine learning workflow, using object detection, you would have to label all the cell colonies, train a model, and deploy it. This works, but it’s slow and tedious.
In contrast, with Visual Prompting, you can create a “visual prompt” by painting over one or two cell colonies in the image and similarly painting over part of the background, and you get a working model. Creating the visual prompt and getting a result each take only a few seconds. If you’re not satisfied with the initial model, you can edit the prompt (perhaps by labeling a few more cell colonies), check the results, and keep iterating until you’re satisfied with the model’s performance.
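The shape of that interaction can be sketched in a few lines of Python. This is only a toy under loud assumptions: it labels pixels by raw brightness, whereas a real system would rely on features from a large pretrained vision model, and the function name `segment_from_strokes` and the tiny 4x4 "image" are invented for illustration.

```python
def segment_from_strokes(image, fg_strokes, bg_strokes):
    """Label each pixel 1 (colony) or 0 (background).

    A pixel gets the label of whichever painted group has the closer
    mean intensity. Strokes are lists of (row, col) coordinates.
    """
    fg_mean = sum(image[r][c] for r, c in fg_strokes) / len(fg_strokes)
    bg_mean = sum(image[r][c] for r, c in bg_strokes) / len(bg_strokes)
    return [
        [1 if abs(v - fg_mean) < abs(v - bg_mean) else 0 for v in row]
        for row in image
    ]


# Toy grayscale "petri dish": bright values stand in for white colonies.
image = [
    [10, 12, 11, 200],
    [9, 210, 205, 13],
    [11, 208, 10, 12],
    [190, 10, 11, 9],
]

# The visual prompt: one painted colony pixel and one painted background pixel.
mask = segment_from_strokes(image, fg_strokes=[(1, 1)], bg_strokes=[(0, 0)])
```

Editing the prompt corresponds to adding coordinates to `fg_strokes` or `bg_strokes` and calling the function again, which mirrors the label, check, and iterate loop described above.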
The resulting interaction feels like you’re having a conversation with the system. You’re guiding it by incrementally providing additional data in real time.
Since 2017, when the paper that introduced transformers was published, rapid innovation in text processing has transformed natural language models. The paper that introduced vision transformers arrived in 2020, and similarly it led to rapid innovation in vision. Large pretrained models based on vision transformers have reached a point where, given a simple visual prompt that only partially (but unambiguously) specifies a task, they can generalize well to new images.
We’re not the only ones exploring this theme. Exciting variations on Visual Prompting include Meta’s Segment Anything (SAM), which performs image segmentation, and approaches such as Generalist Painter, SegGPT, and prompting via inpainting.
Text prompting reached an inflection point in 2020, when GPT-3 made it easy for developers to write a prompt and build a natural language processing model. I don’t know if computer vision has reached its GPT-3 moment, but we’re getting close. I’m excited by the research that’s moving us toward that moment, and I think Visual Prompting will be one key to getting us there.