The ability of OpenAI’s CLIP to produce similar embeddings of a text phrase and a matching image (such as “a photo of a cat” and a photo of a cat) opened up applications like classifying images according to labels that weren’t in the training set. A new model extends this capability to seven data types.
What’s new: Rohit Girdhar, Alaaeldin El-Nouby, Ishan Misra, and colleagues at Meta developed ImageBind, a system that produces similar embeddings of text phrases, audio clips, images, videos, thermal images, depth images, and Inertial Measurement Unit (IMU) readings (which include accelerometer and gyroscope measurements).
Key insight: One challenge to learning multimodal embeddings is access to training data that includes matched pairs of all data types involved. For instance, matched image-text pairs, image-depth pairs, and image-thermal pairs are readily available, but pairings of text-thermal, text-depth, and so on are not. Learning to produce similar embeddings given pairings of one media type (in this case images) with other media types will transfer to pairings of pairings of that type with further types. There’s no need for specific training for each pairing.
How it works: ImageBind uses a separate transformer to embed each media type with one exception: The transformer that processes images handles video as well by treating a video as a two-frame image (sampled from the video).
- The training data comprised matched pairs of video-audio clips from YouTube, image-depth scenes shot by a depth camera, image-thermal pictures of street scenes at night, and video-IMU shot from a first-person point of view.
- Instead of training image and text encoders from scratch, the authors adopted the encoders from OpenCLIP, which is pretrained on billions of image-text pairs.
- The transformers learned via a contrastive loss function. Given an image (or video) and its match in another data type, the loss encouraged them to produce similar embeddings. Given an image (or video) and an example that didn’t match, it encouraged them to produce dissimilar embeddings.
Results: The authors use a method similar to CLIP to classify data using ImageBind. For example, using the Clotho test set of roughly 1,000 audio and text descriptions, ImageBind compared the embedding of a description with the embedding of every audio clip and returned the most similar audio clip. ImageBind returned the correct audio clip 6 percent of the time, whereas AVFIC, which learned using pairs of audio and text, returned the correct audio clip 3 percent of the time. However, ImageBind did not match supervised learning. ARNLQ, a supervised model, returned the correct audio 12.6 percent of the time.
Why it matters: The authors’ approach acts as an upgrade for models that generate similar embeddings for examples that have similar meanings in different media: To enhance the model’s repertoire with a new data type (say, audio), simply fine-tune it on relevant paired data (such as image, audio).
We’re thinking: ImageBind shows that machine learning models don’t need to learn from all pairs of data types to produce similar embeddings among various data types. Still, we can’t help but wonder how much its performance would improve if it did learn from other pairings, like (text, audio).