Multitask Vision Transformer

Published

Dec 06, 2023

The original DINO showed that a vision transformer pretrained on unlabeled images could learn representations that were sufficient for classifying and segmenting images. In an update of that work, the model learned representations useful in a wider variety of tasks.

What’s new: Maxime Oquab, Timothée Darcet, Théo Moutakanni, and colleagues at Meta and France’s National Institute for Research in Digital Science and Technology released DINOv2, a vision transformer pretrained in a self-supervised manner that performs video classification, image retrieval, depth estimation and other vision tasks.

Key insight: Datasets of images scraped from the web can be very large, but they can also be surprisingly lacking in diversity (for example, mostly pictures of pets). Images from smaller datasets that are known to be diverse can be used to find similar images on the web. In this way, it’s possible to scrape a large, diverse image dataset to train vision models using self-supervised methods.

How it works: The authors gathered 142 million images with diversity similar to that of curated datasets. They pretrained DINOv2, a large vision transformer (ViT), to embed the images using two loss functions drawn from previous work.

The authors started with 1.2 billion uncurated images. They used smaller curated datasets such as ImageNet-22k (14.2 million images), ImageNet-1k (1.2 million), and Google Landmarks (4.1 million) to retrieve similar images from the uncurated pool. They measured the similarity of two images as the cosine similarity of their embeddings, computed by a ViT-H/16 pretrained on ImageNet-22k.
Following the original DINO, the authors compared DINOv2’s classification to a teacher model’s classification. Specifically, they added an extra vanilla neural network and pretrained DINOv2 to match its classification of a cropped image to the teacher’s classification of a different crop of the same image. The teacher’s weights were the exponential moving average (an average in which recent versions count exponentially more than older ones) of iterations of DINOv2 earlier in the training process.
Following iBOT, they added a second vanilla neural network and pretrained DINOv2 to match its embeddings of a masked image’s patches to the teacher’s embeddings of the unmasked image’s patches.
Training on such a large image dataset took a lot of time, so the authors devised 9 methods to accelerate pretraining. For instance, they trained DINOv2 on images at low resolution (224 by 224 pixels) for most of the process. To enable DINOv2 to learn image details, they increased the resolution to 518 by 518 during the last 10,000 training steps.
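
Two of the steps above, retrieving uncurated images by cosine similarity and maintaining the teacher as an exponential moving average of the student, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the neighbor count k, and the momentum value are hypothetical, the embeddings are assumed to be precomputed, and the real system operated on more than a billion images.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so that dot products equal cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_similar(curated_emb, uncurated_emb, k):
    """For each curated embedding, return the indices of the k uncurated
    embeddings with the highest cosine similarity (most similar first).

    k is a hypothetical parameter; the article does not say how many
    neighbors were kept per curated image.
    """
    curated = l2_normalize(np.asarray(curated_emb, dtype=float))
    uncurated = l2_normalize(np.asarray(uncurated_emb, dtype=float))
    sims = curated @ uncurated.T  # shape: (n_curated, n_uncurated)
    return np.argsort(-sims, axis=1)[:, :k]


def ema_update(teacher, student, momentum=0.99):
    """Return new teacher weights as an exponential moving average.

    teacher and student are dicts mapping parameter names to arrays.
    The momentum value here is an assumed placeholder.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }
```

Calling `ema_update` after every student optimization step makes the teacher a smoothed copy of the student: recent student weights dominate, and the influence of older ones decays geometrically with the momentum.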

Results: DINOv2 outperformed self-supervised vision transformers and weakly supervised vision transformers that use text annotations such as captions as labels (for example, CLIP and OpenCLIP). The authors compared the models on a variety of tasks including image classification, video classification, semantic segmentation, and depth estimation. In each case, they froze DINOv2 and fine-tuned a linear classification layer on top of it.

DINOv2 achieved 86.3 percent accuracy on ImageNet, while CLIP achieved 85.3 percent accuracy. A fine-tuned MAE achieved 85.9 percent accuracy. DINOv2 and CLIP had 300 million parameters. MAE had 632 million parameters.
Given 8 evenly spaced video frames, DINOv2 learned to classify videos into 101 categories of actions (such as ice dancing, surfing, and diving) with 91.2 percent accuracy. OpenCLIP achieved 90.7 percent accuracy, and DINO 85.0 percent accuracy. DINOv2 had around 1 billion parameters, OpenCLIP 1.8 billion, and DINO 87 million.
Performing semantic segmentation (in which a model predicts the object to which each pixel in an image belongs), DINOv2 fine-tuned on CityScapes achieved 71.3 mean IoU (intersection over union, the overlap between the predicted region and the ground-truth region; higher is better) over all object types in the CityScapes test set. DINO achieved 56.9 mean IoU, and OpenCLIP achieved 60.3 mean IoU. Parameter counts were the same as above.
Performing depth estimation, DINOv2 fine-tuned on KITTI achieved a 2.62 RMSE (root mean squared error, lower is better). DINO achieved 3.81 RMSE and OpenCLIP achieved 3.57 RMSE. Parameter counts were the same as above.
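
For readers unfamiliar with the two metrics above, here is a small NumPy sketch of how mean IoU and RMSE can be computed. It is an illustration only: the function names are hypothetical, and it scores a single pair of arrays, whereas benchmark evaluation scripts typically aggregate over a whole test set and may follow dataset-specific conventions. Note that `mean_iou` returns a fraction between 0 and 1, while the figures above are expressed on a 0-to-100 scale.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Average intersection over union across classes.

    pred and target are integer label maps of the same shape.
    Classes absent from both maps are skipped so they do not
    distort the average.
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class appears in neither map
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")


def rmse(pred, target):
    """Root mean squared error between predicted and true values."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))
```

In a 2-by-2 example with predictions `[[0, 0], [1, 1]]` and ground truth `[[0, 1], [1, 1]]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean IoU is 7/12, or about 0.583.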

Why it matters: Self-supervised training on massive, diverse datasets has proven potent in language models. Similarly, existing self-supervised methods can deliver great image embeddings when trained on sufficiently large and diverse image datasets.

We’re thinking: We’ve been impressed by emergent capabilities of language models. We’re keen to see what further capabilities emerge from vision transformers.
