Training a model to separate the objects in a picture typically requires labeled images for best results. Recent work upped the ante for training without labels.
What’s new: Mark Hamilton and colleagues at Cornell, Google, and Massachusetts Institute of Technology developed Self-supervised Transformer with Energy-based Graph Optimization (STEGO), an architecture and training method for semantic segmentation that substantially improved the state of the art for unsupervised learning of this task.
Key insight: A computer vision model pretrained on images produces similar representations of pixels that belong to similar objects, such as patches of sky. By clustering those representations, a model can learn to identify groups of pixels that share a label without referring to the labels themselves. (If the feature extractor learns in a self-supervised way, it doesn’t need labels either.)
How it works: A feature extractor (the transformer DINO, which was pretrained in an unsupervised manner on ImageNet) generated features for each pixel of input images. A vanilla neural network trained on COCO-Stuff refined the features into a representation of each pixel.
- DINO received an image and produced features for each pixel. The features were stored.
- During training, the vanilla neural network received the features of three images: the target image, an image with similar features (according to k-nearest neighbors), and a randomly selected image. Its loss function compared the representations it produced with the stored features and encouraged the model to make its representations similar to features of the similar image and different from features of the randomly selected image. This pushed the representations of similar pixels into tight clusters that would be easy to separate.
- At inference, given an image, DINO created pixel-wise features and the vanilla neural network produced representations. The authors grouped the representations via k-means clustering. Based on the clusters, they produced a segmentation map that showed which pixels belong to which objects.
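The training signal and the inference-time clustering described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the loss is a simplified stand-in that rewards similar representations where the stored features are similar and penalizes them otherwise, the `shift` threshold is a hypothetical parameter, and the k-means routine uses a deterministic farthest-point initialization for reproducibility. All function names are invented for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a (N, D) and b (M, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def correspondence_loss(feats_a, feats_b, reps_a, reps_b, shift=0.5):
    """Illustrative loss (an assumption, not the paper's exact formula).

    Where stored feature similarity exceeds `shift`, similar representations
    lower the loss; where it falls below `shift`, they raise it.
    """
    feat_sim = cosine_similarity(feats_a, feats_b)
    # Clamp at zero so only positive representation similarity is scored.
    rep_sim = np.clip(cosine_similarity(reps_a, reps_b), 0.0, None)
    return float(-np.mean((feat_sim - shift) * rep_sim))

def kmeans_labels(points, k, iters=10):
    """Plain k-means with farthest-point initialization (a simplification)."""
    points = np.asarray(points, dtype=float)
    centers = [points[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center.
        nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[nearest.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def segment(reps, k):
    """Cluster per-pixel representations (H, W, D) into an (H, W) label map."""
    h, w, d = reps.shape
    return kmeans_labels(reps.reshape(h * w, d), k).reshape(h, w)
```

In a training loop along these lines, one might evaluate `correspondence_loss` once for the target image paired with its nearest neighbor and once for the target paired with the random image, then sum the two terms. At inference, `segment` assigns each pixel a cluster index, which carries no class name by itself.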
Results: To measure how well their model separated the objects in an image, the authors used a matching algorithm to pair each group of pixels with a ground-truth class (that is, they assigned labels to the clusters after the fact, for evaluation only). Their method achieved 28.2 percent mean intersection over union (for each class, the number of pixels in both the predicted and ground-truth regions divided by the number of pixels in either region, averaged over all classes) on the 27-class COCO-Stuff validation set. Its closest unsupervised rival, PiCIE+H, achieved 14.4 percent mean intersection over union. As for supervised approaches, the state of the art, ViT-Adapter-L, achieved 52.9 percent mean intersection over union.
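To make the evaluation concrete, here is a small sketch of how unlabeled clusters can be matched to ground-truth classes before computing mean intersection over union. It assumes the number of clusters equals the number of classes and, for brevity, finds the best one-to-one mapping by exhaustive search over permutations. That is a stand-in chosen for this example and is only practical for a handful of classes; a 27-class evaluation would need a polynomial-time assignment solver. The function names are hypothetical.

```python
import itertools
import numpy as np

def confusion_counts(pred, truth, n):
    """counts[i, j] = number of pixels in cluster i whose true class is j."""
    counts = np.zeros((n, n), dtype=np.int64)
    np.add.at(counts, (pred.ravel(), truth.ravel()), 1)
    return counts

def matched_mean_iou(pred, truth, n):
    """Match clusters to classes one-to-one, then average per-class IoU."""
    counts = confusion_counts(pred, truth, n)
    # Exhaustive search (assumption: n is small) for the mapping that
    # maximizes the number of correctly matched pixels.
    best = max(
        itertools.permutations(range(n)),
        key=lambda perm: sum(counts[i, perm[i]] for i in range(n)),
    )
    mapped = np.array(best)[pred]  # relabel each pixel's cluster as a class
    ious = []
    for c in range(n):
        intersection = np.sum((mapped == c) & (truth == c))
        union = np.sum((mapped == c) | (truth == c))
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection / union)
    return float(np.mean(ious))

truth = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1]])
pred = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 1]])  # cluster ids are arbitrary; one pixel is wrong
score = matched_mean_iou(pred, truth, n=2)
```

In this toy case the matching maps cluster 1 to class 0 and cluster 0 to class 1. Class 0 then scores 4/5 and class 1 scores 3/4, so `score` is 0.775.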
Why it matters: This system is designed to be easily upgraded as datasets and architectures improve. The authors didn’t fine-tune the feature extractor, so it could be swapped for a better one in the future. Upgrading would require retraining only the relatively small vanilla neural network, which is faster and simpler than training a typical semantic segmentation model.
We’re thinking: Since it didn’t learn from labels, the authors’ vanilla neural network can’t identify the objects it segments. Could it learn to do that, CLIP-style, from images with corresponding captions?