Multimodal AI Takes Off Multimodal Models, such as CLIP and DALL·E, are taking over AI.

Published

Dec 22, 2021

Reading time

1 min read

While models like GPT-3 and EfficientNet, which work on text and images respectively, are responsible for some of deep learning’s highest-profile successes, approaches that find relationships between text and images made impressive strides.

What happened: OpenAI kicked off a big year in multimodal learning with CLIP, which matches images and text, and DALL·E, which generates images that correspond to input text. DeepMind’s Perceiver IO classifies text, image, video, and point clouds. Stanford’s ConVIRT added text labels to medical X-ray images.
Driving the story: While the latest multimodal systems were mostly experimental, a few real-world applications broke through.

The open source community combined CLIP with generative adversarial networks to produce striking works of digital art. Artist Martin O’Leary used Samuel Coleridge’s epic poem “Kubla Khan” as input to generate the psychedelic scrolling video interpretation, “Sinuous Rills.”
Facebook said its multimodal hate-speech detector flagged 97 percent of the abusive and harmful content it removed from the social network. The system classifies memes and other image-text pairings as benign or harmful based on 10 data types including text, images, and video.
Google said it would add multimodal (and multilingual) capabilities to its search engine. Its Muiltitask Unified Model returns links to text, audio, images, and videos in response to queries in any of 75 languages.

Behind the news: The year’s multimodal momentum built upon decades of research. In 1989, researchers at Johns Hopkins University and University of California San Diego developed a system that classified vowels based on audio and visual data of people speaking. Over the next two decades, various groups attempted multimodal applications like indexing digital video libraries and classifying human emotions based on audiovisual data.

Where things stand: Images and text are so complex that, in the past, researchers had their hands full focusing on one or the other. In doing so, they developed very different techniques. Over the past decade, though, computer vision and natural language processing have converged on neural networks, opening the door to unified models that merge the two modes. Look for models that integrate audio as well.

Subscribe to The Batch