Large models pretrained in an unsupervised fashion and then fine-tuned on a smaller corpus of labeled data have achieved spectacular results in natural language processing. New research pushes forward with a similar approach to computer vision.

What’s new: Ting Chen and colleagues at Google Brain developed SimCLRv2, a training method for image recognition that outperformed the state of the art in self-supervised learning and beat fully supervised models while using a small fraction of the labels. The new work extends their earlier SimCLR, which The Batch covered previously.

Key insight: Larger models have proven more effective in self-supervised pretraining. But enormous models can be hard to deploy and run efficiently. SimCLRv2 starts with a giant feature extractor, fine-tunes the resulting features, and shrinks the final model using knowledge distillation. The result is a model of more reasonable size that achieves high accuracy despite training on relatively few labeled examples.

How it works: The most novel aspect of the original SimCLR was its use of image augmentation to train a feature extractor via contrastive learning. SimCLRv2 follows that pattern, but it uses deeper models and adds a knowledge distillation step after training.

  • The authors started by pretraining a feature extractor to generate similar features from augmented versions of the same image, and dissimilar features from unrelated images.
  • Next, they fine-tuned the feature extractor using subsets of ImageNet. They ran experiments using either 1 percent or 10 percent of the labels.
  • The final step was knowledge distillation: A teacher model trained a student model to match its predictions on unlabeled data. The authors achieved equally good results from both self-distillation (in which the teacher and student share the same architecture) and conventional distillation (in which the student is a more compact model).
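The contrastive pretraining step above can be sketched with SimCLR's NT-Xent (normalized temperature-scaled cross-entropy) objective, which pulls together embeddings of two augmented views of the same image and pushes apart embeddings of unrelated images. This is a minimal NumPy sketch for illustration, not Google's implementation; the function name and batch layout are our own assumptions.

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent contrastive loss (a simplified sketch, not the official code).

    `z` holds 2N embeddings: rows i and i+N are assumed to be two
    augmented views of the same image (the positive pair)."""
    n = z.shape[0] // 2
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize embeddings
    sim = z @ z.T / temperature                        # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # a view is not its own positive
    # the positive partner of row i is row (i + n) mod 2n
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))        # normalizer over all candidates
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)   # cross-entropy per embedding
    return loss.mean()
```

Minimizing this loss makes matched views more similar than any mismatched pair, which is what lets the feature extractor learn without labels.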
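The distillation step can likewise be sketched as a cross-entropy between the teacher's softened predictions and the student's, computed on unlabeled inputs so no ground-truth labels are needed. Again this is a hedged NumPy sketch under common distillation conventions; the function names and temperature value are our assumptions, not the paper's exact recipe.

```python
import numpy as np

def softmax(logits, t=1.0):
    """Temperature-scaled softmax; higher t softens the distribution."""
    x = logits / t
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, t=2.0):
    """Cross-entropy from the teacher's soft predictions to the student's.

    Minimized when the student reproduces the teacher's distribution,
    so the student learns from unlabeled data alone."""
    p_teacher = softmax(teacher_logits, t)
    log_p_student = np.log(softmax(student_logits, t))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()
```

Because the loss is minimized when the two distributions match, the same objective supports both self-distillation (identical architectures) and distillation into a smaller student.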

Results: A ResNet-50 trained via SimCLRv2 on 10 percent of ImageNet labels outperformed a supervised ResNet-50 trained on all the labels: 77.5 percent top-1 accuracy versus the supervised model’s 76.6 percent, an 8.7 percent improvement over the previous state of the art under similar architecture and label constraints. A ResNet-152 (three times wider, with selective kernels) trained via SimCLRv2 on just 1 percent of ImageNet labels matched the supervised ResNet-50 at 76.6 percent top-1 accuracy, 13.6 percent better than the previous best model trained on the same number of labels.

Why it matters: Techniques that make it possible to train neural networks effectively on relatively few labeled images could have an impact on small data problems such as diagnosing medical images and detecting defects on a manufacturing line, where labeled examples are hard to come by. The progress from SimCLR to SimCLRv2 bodes well for further advances.

We’re thinking: Self-supervised models tend to be huge partly because it isn’t clear initially what they’ll be used for, so they must learn lots of general-purpose features. Knowledge distillation looks like a promising way to trim the extra features for specific purposes in which a smaller network may suffice.
