Learning From Metadata: Descriptive Text Improves Performance for AI Image Classification Systems

Published Jul 20, 2022

Images in the wild may not come with labels, but they often include metadata. A new training method takes advantage of this information to improve contrastive learning.

What’s new: Researchers at Carnegie Mellon University led by Yao-Hung Hubert Tsai and Tianqin Li developed a contrastive learning technique that trains image classifiers using image metadata (say, information associated with an image through web interactions or database entries rather than explicit annotations).

Key insight: In contrastive learning, a model learns to generate representations that place similar examples near one another in vector space and dissimilar examples far apart. If labels are available (that is, in a supervised setting), the model learns to cluster representations of examples that share a label and push apart those with different labels. If labels aren’t available (that is, in an unsupervised setting), it can learn to cluster representations of altered versions of the same example (say, flipped, rotated, or otherwise augmented versions of an image, à la SimCLR). And if unlabeled examples include metadata, the model can learn to cluster representations of examples associated with similar metadata. Combining these two label-free techniques should yield even better results.
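To make the clustering idea concrete, here is a minimal sketch of a contrastive loss in which examples that share a group id count as positives and all others as negatives. It uses a SupCon-style formulation in plain Python as a stand-in, not necessarily the authors' exact objective. The function name `group_contrastive_loss`, the `temperature` value, and the toy embeddings are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def group_contrastive_loss(embeddings, group_ids, temperature=0.1):
    """SupCon-style loss (illustrative): examples sharing an anchor's group id
    are positives; every other example is a negative."""
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and group_ids[j] == group_ids[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Scaled cosine similarity between the anchor and every other example.
        logits = {j: dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        # Numerically stable log of the softmax denominator.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits.values())
        )
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy embeddings: the first two point one way, the last two another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = group_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
loss_mismatched = group_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

The loss is lower when the grouping agrees with the geometry of the embeddings (`loss_matching`) than when it cuts across it (`loss_mismatched`), which is the pressure that pulls same-group representations together during training.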

How it works: The authors trained separate ResNets on three datasets: scenes of human activities whose metadata comprised 14 attributes such as gender, hairstyle, and clothing style; images of shoes whose metadata comprised seven attributes such as type, materials, and manufacturer; and images of birds whose metadata comprised 200 attributes describing beak shape and the colors of beaks, heads, wings, breasts, and so on.

Given a set of images and metadata, the authors divided the images roughly evenly into many groups with similar metadata.
To each group, they added augmented variants (combinations of cropping, resizing, recoloring, and blurring) of every image in the group.
The ResNet generated a representation of each image. The loss function encouraged the model to learn similar representations for images within a group and dissimilar representations for images in different groups.
After training the ResNet, they froze its weights. They appended a linear layer and fine-tuned it on the dataset’s labels.
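The first two steps above can be sketched roughly as follows. This is a simplified illustration, not the authors' procedure: it groups images by exact agreement on a few chosen attributes, which does not guarantee the roughly even group sizes described above. The helper names, the attribute keys, and the string-based `augment` stand-in are all hypothetical.

```python
def group_by_metadata(metadata, keys):
    """Assign a group id to each example; examples that agree on every
    attribute in `keys` share a group (a simplifying assumption)."""
    group_of_signature = {}
    group_ids = []
    for attrs in metadata:
        signature = tuple(attrs[k] for k in keys)
        if signature not in group_of_signature:
            group_of_signature[signature] = len(group_of_signature)
        group_ids.append(group_of_signature[signature])
    return group_ids

def add_augmented_views(images, group_ids, augment, views_per_image=2):
    """Keep each original image and add augmented variants that inherit
    the original's group id."""
    views, view_groups = [], []
    for image, gid in zip(images, group_ids):
        views.append(image)
        view_groups.append(gid)
        for _ in range(views_per_image):
            views.append(augment(image))
            view_groups.append(gid)
    return views, view_groups

# Hypothetical shoe metadata; real attribute names and values may differ.
shoe_metadata = [
    {"type": "boot", "material": "leather"},
    {"type": "boot", "material": "leather"},
    {"type": "sandal", "material": "rubber"},
]
shoe_groups = group_by_metadata(shoe_metadata, keys=["type", "material"])

# Strings stand in for images; a real pipeline would crop, resize,
# recolor, and blur actual image tensors here.
views, view_groups = add_augmented_views(
    ["img0", "img1", "img2"], shoe_groups, augment=lambda name: name + "_aug"
)
```

The resulting `view_groups` list is what a group-level contrastive loss would consume: every view, original or augmented, is treated as a positive for the other views in its group.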

Results: The authors compared their method to a self-supervised contrastive approach (SimCLR) and a weakly supervised contrastive approach (CMC). Their method achieved greater top-1 accuracy than ResNets trained via SimCLR in all three tasks. For instance, it classified shoes with 84.6 percent top-1 accuracy compared to SimCLR’s 77.8 percent. It achieved greater top-1 accuracy than ResNets trained via CMC in two of the three tasks. For example, it classified human scenes with 45.5 percent top-1 accuracy compared to CMC’s 34.1 percent.
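For readers unfamiliar with the metric, top-1 accuracy is the fraction of examples for which the highest-scoring class matches the true label. A minimal sketch follows; the function name and the toy scores are illustrative and are not taken from the paper.

```python
def top1_accuracy(scores, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)

# Toy classifier outputs for four examples and two classes.
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [1, 0, 0, 0]
toy_accuracy = top1_accuracy(toy_scores, toy_labels)
```

Here three of the four predictions match their labels, so `toy_accuracy` is 0.75, or 75 percent.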

Yes, but: The supervised contrastive learning method known as SupCon scored highest on all three tasks. For instance, SupCon classified shoes with 89 percent top-1 accuracy.

Why it matters: Self-supervised contrastive approaches use augmentation to improve image classification. A weakly supervised approach that takes advantage of metadata builds on such methods to help them produce even better-informed representations.

We’re thinking: The authors refer to bird attributes like beak shape as metadata. Others might call them noisy or weak labels. Terminology aside, these results point to a promising approach to self-supervised learning.
