Models that interpret the interplay of words and images tend to be trained on richer bodies of text than images. Recent research worked toward giving such models a more balanced knowledge of the two domains.

What’s new: Pengchuan Zhang and Xiujun Li led a team at Microsoft and University of Washington raised the bar in several vision-and-language tasks. They call their system Oscar+, building on earlier work that used class names of objects in an image to improve matching of image and text representations.

Key insight: Recent progress in vision-and-language models has come mostly by combining learned image and text representations more effectively rather than improving the representations themselves, the authors observed. Honing these representations through additional pretraining ought to boost their performance.
How it works: The authors started with pretrained representations for images and text generated by separate models for vision (ResNeXt-152 C4  pretrained on ImageNet-5k) and language (pretrained BERT). They honed the image representations by further pretraining the vision model on new data. Then they generated image-and-text representations as they pretrained Oscar+ as a whole. Finally, they fine-tuned the system on specific vision-and-language tasks.

  • In the additional pretraining step, the authors pretrained the ResNeXt-152 C4 to detect 1,848 objects or attributes (such as labels describing colors or textures) in 2.49 million images in four object detection datasets.
  • A transformer fused image and text representations as the authors pretrained Oscar+ on 8.85 million examples from four image caption datasets with generated image tags, image tag datasets with generated captions, and visual question-and-answer datasets. At this stage, the system optimized two loss terms. One term encouraged the system to predict randomly hidden words in a caption or an image’s tags. The other term encouraged the system to match an image and its tags, or an answer with its question and its image.
  • They fine-tuned the system to perform seven specific tasks.

Results: Oscar+ achieved state-of-the-art results in all seven tasks, from matching images with captions (and vice-versa) to determining the truth of a statement about two images. The system boosted NoCaps accuracy (captioning images that contain objects not seen in training) to 92.5 percent from 86.6 percent — its biggest gain. To show that performance was substantially improved by separately pretraining the object detector on additional data, the authors compared performance with and without that step. That step boosted visual question-answering accuracy, for instance, to 74.90 percent from 71.34 percent.

Why it matters: Performance in multimodal tasks can improve with additional learning in just one of the modes involved.

We’re thinking: If Oscar is a grouch, is Oscar+ nicer — or even more grumpy?


Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox