Few people have had a chance to try out OpenAI’s GPT-4 with Vision (GPT-4V), but many of those who have played with it expressed excitement.
What’s new: Users who had early access to the image-savvy update of GPT-4, which began a gradual rollout on September 24, flooded social media with initial experiments. Meanwhile, Microsoft researchers tested the model on a detailed taxonomy of language-vision tasks.
Fresh capabilities: Users on X (formerly Twitter) tried out the model in situations that required understanding an image's contents and contexts, reasoning over them, and generating appropriate responses.
- One user gave GPT-4V a photograph of a traffic pole festooned with several parking signs, entered the time and day, and asked, “Can I park here?” The model read the signs and correctly replied, “You can park here for one hour starting at 4PM.”
- Another built a “frontend engineer agent” that enabled the model to turn a screenshot of a webpage into code, then iteratively improve the program to eliminate coding and design errors.
- Shown a single frame from the 2000 Hollywood movie Gladiator, the model correctly identified Russell Crowe as the character Maximus Decimus Meridius and supplied Crowe’s dialogue (“are you not entertained?”).
- GPT-4V behaved like a personalized tutor when it was shown a diagram of a human cell and asked to describe its parts at a ninth-grade level.
Microsoft takes stock: Zhengyuan Yang and colleagues probed GPT-4V’s capabilities and evaluated prompting techniques in a wide variety of tasks that involve subtle interactions between images, words, and computer code. They reported only qualitative results — both positive and negative — leaving it to other researchers to compare the model’s performance with that of competitors like LLaVA.
- Researchers prompted the model visually. Highlighting areas of interest in an image with boxes or text labels further improved its performance.
- Presented with an out-of-order image sequence, GPT-4V identified which event came first and predicted what would happen next. Conversely, given an ordered sequence, it described the action.
- Given a photo of a coastal landscape and asked to reduce a viewer’s desire to visit, the model explained that the rocks were sharp and slippery and provided no place to swim.
- Given an MRI of a cranium and asked to write a report as an expert radiologist, it proposed the correct diagnosis, according to an “evaluation from professionals.”
- Image captions generated by GPT-4V contained more detail than ground-truth examples, leading the authors to conclude that existing benchmarks wouldn’t do justice to its ability to understand the contents of an image.
Yes, but: These qualitative examples are impressive, but they were cherry-picked to give only a glimpse of GPT-4V’s capabilities. Microsoft noted that the model’s behavior is inconsistent. It remains to be seen how reliably it can perform a given task.
Why it matters: GPT-4V is an early entry in a rising generation of large multimodal models that offer new ways to interact with text, images, and combinations of the two. It performs tasks that previously were the province of specialized systems, like object detection, face recognition, and optical character recognition. It can also adapt, alter, or translate images according to text or image prompts. The prospects for integration with image editors, design tools, coding tools, personal assistants, and a wide range of other applications are tantalizing.
We’re thinking: When the text-only version of GPT-4 became available, OpenAI didn’t report quantitative results for a couple of weeks (and it still hasn’t presented a detailed view of its architecture and training). We look forward to a clearer picture of what GPT-4V can do.