Recent multimodal models process both text and images as sequences of tokens, but they learn to represent these distinct data types using separate loss functions. New work unifies the loss function as well.
What’s new: Wenhui Wang, Hangbo Bao, Li Dong, and colleagues at Microsoft introduced BEiT-3, a transformer pretrained on a large amount of image, text, and paired image-text data. The model set a new state of the art in several vision-language tasks. This work updates the earlier BEiT and BEiT v2.
Key insight: The MoME (mixture-of-modality-experts) transformer, which the authors call Multiway, processes images, text, and text-image pairs using different fully connected layers for different data types, but the same self-attention layers for all. The authors who proposed that architecture trained it using different tasks and loss functions for text and image data. However, pretraining it on a single task and loss function for all data types (specifically, generating masked portions of the data) enables the shared self-attention layers to learn common patterns across data types, creating similar embeddings for similar images and texts.
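The routing idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the class name `MultiwayBlock`, the two expert names, the random weights, and the tiny dimensions are all assumptions made for the example, and it omits multi-head attention and layer normalization.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class MultiwayBlock:
    """Hypothetical single-head block: shared attention, per-modality feed-forward."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # One set of attention weights, shared by every token regardless of modality.
        self.Wq = random_matrix(rng, dim, dim)
        self.Wk = random_matrix(rng, dim, dim)
        self.Wv = random_matrix(rng, dim, dim)
        # One feed-forward "expert" per modality (the names are assumptions).
        self.experts = {
            name: (random_matrix(rng, hidden, dim), random_matrix(rng, dim, hidden))
            for name in ("vision", "language")
        }

    def attention(self, xs):
        qs = [matvec(self.Wq, x) for x in xs]
        ks = [matvec(self.Wk, x) for x in xs]
        vs = [matvec(self.Wv, x) for x in xs]
        out = []
        for q in qs:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(self.dim) for k in ks]
            weights = softmax(scores)
            out.append([sum(w * v[i] for w, v in zip(weights, vs)) for i in range(self.dim)])
        return out

    def feed_forward(self, x, modality):
        W1, W2 = self.experts[modality]
        h = [max(0.0, z) for z in matvec(W1, x)]  # ReLU
        return matvec(W2, h)

    def forward(self, xs, modalities):
        # Shared self-attention mixes information across all tokens (residual added).
        attended = self.attention(xs)
        xs = [[a + b for a, b in zip(x, att)] for x, att in zip(xs, attended)]
        # Each token is then routed to the expert that matches its modality.
        return [
            [a + b for a, b in zip(x, self.feed_forward(x, m))]
            for x, m in zip(xs, modalities)
        ]


# Usage: three image-patch tokens followed by two text tokens in one sequence.
block = MultiwayBlock(dim=4, hidden=8)
data_rng = random.Random(1)
tokens = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
modalities = ["vision", "vision", "vision", "language", "language"]
outputs = block.forward(tokens, modalities)
print(len(outputs), len(outputs[0]))
```

Because the attention weights are shared, image and text tokens in a paired input attend to one another, while the feed-forward weights stay specific to each data type.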
How it works: BEiT-3 is a 1.9 billion-parameter MoME transformer.
- The authors pretrained the model to regenerate randomly masked input tokens in the 15 million images in ImageNet-21k, 160 gigabytes of internet text, and roughly 38 million image-text pairs (a combination of datasets including COCO).
- They fine-tuned it for five vision-language tasks, such as determining whether a description is true of a pair of images (NLVR2), and four vision tasks, such as ImageNet classification and COCO object detection and segmentation.
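The single pretraining objective described above, masking tokens and scoring the model only on what it regenerates at the masked positions, can be sketched as follows. This is an illustration under stated assumptions: the function names, the `[MASK]` placeholder, the toy token strings, and the mask ratio are hypothetical, and the "prediction" is a stand-in for a real model's output distribution.

```python
import math
import random

MASK = "[MASK]"  # hypothetical placeholder token


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns the corrupted sequence and a dict {position: original token}.
    """
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]
        masked[p] = MASK
    return masked, targets


def masked_cross_entropy(predicted_probs, targets):
    """Average negative log-likelihood of the original tokens, masked positions only.

    predicted_probs maps position -> {token: probability}.
    """
    total = sum(math.log(predicted_probs[p][tok]) for p, tok in targets.items())
    return -total / len(targets)


rng = random.Random(0)

# The same routine serves text tokens, discrete image tokens, and a concatenated pair.
text = ["a", "dog", "chases", "a", "red", "ball"]
image = ["v12", "v7", "v301", "v44", "v9", "v18", "v2", "v63"]  # made-up visual token ids
pair = image + text

for name, seq in [("text", text), ("image", image), ("pair", pair)]:
    masked, targets = mask_tokens(seq, mask_ratio=0.4, rng=rng)
    # Stand-in for a model: a "perfect" prediction puts probability 1.0 on each original token.
    perfect = {p: {tok: 1.0} for p, tok in targets.items()}
    print(name, masked, masked_cross_entropy(perfect, targets))
```

The point of the sketch is that nothing in the loss depends on the data type: one masking routine and one loss cover all three kinds of input.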
Results: BEiT-3 outperformed baseline models across all nine tasks. On ImageNet, it achieved top-1 accuracy of 89.6 percent, beating the previous state of the art, 89 percent, achieved by FD-CLIP. On NLVR2, its accuracy was 92.6 percent, while the next-best model, CoCa, achieved 87 percent.
Why it matters: Sometimes great performance lies in a combination of tried-and-true techniques. BEiT-3 takes advantage of (a) the MoME architecture, (b) masked pretraining (which has achieved excellent fine-tuned performance on text, images, and text-image pairs), and (c) a large quantity of data (which has been shown to yield high performance).
We’re thinking: If earlier vision-language models are obsolete, so BEiT!