When training an image recognition model, it’s common to augment the data by randomly cropping the original images. But if an image contains several objects, a crop may no longer match the image’s label. Researchers developed a way to make sure random crops are labeled properly.
What’s new: Led by Sangdoo Yun, a team at Naver AI Lab developed ReLabel, a technique that assigns an appropriate label to any random crop of an image. They showcased their method on ImageNet.
Key insight: Earlier work used knowledge distillation: Given a randomly cropped image, a so-called student model learned from labels predicted by a teacher model. That approach requires that the teacher predict a label for each of many cropped versions of a given example. In this work, an image was divided into a grid, and the teacher predicted a label for each grid square, creating a map of regions and their labels that was used to determine a label for any given portion of the image. This way, the teacher could examine each example only once, making the process much more efficient.
How it works: The teacher was an EfficientNet-L2 that had been pretrained on Google’s JFT-300M dataset of 300 million images. The student was a ResNet-50.
- The authors removed the teacher’s final pooling layer, so the network would predict a label for each region in a 15×15 grid instead of one label for the whole image. They used the teacher to predict such a “label map” for every image in ImageNet.
- The researchers trained the student using random crops of images in ImageNet and their corresponding label maps. Given a cropped image, they used RoIAlign to find the regions within the label map that aligned with the crop and pooled those regions into a single vector of class scores. Then they applied softmax to turn the vector into a probability distribution, which served as the crop’s label.
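To make the pooling step concrete, here is a minimal sketch in plain Python of how a stored label map might be turned into a soft label for a crop. It is an illustration under stated assumptions, not the authors’ implementation: the function name `crop_label`, the normalized box format, and the tiny 2×2 map are hypothetical, and an area-weighted average over grid cells stands in for RoIAlign, which samples the map with bilinear interpolation.

```python
import math


def crop_label(label_map, box):
    """Pool a label map over a crop and return a soft label.

    label_map: grid of per-cell class scores, indexed [row][col][class].
    box: (x0, y0, x1, y1) crop in normalized image coordinates (0 to 1).
    Note: area-weighted averaging is a simplified stand-in for RoIAlign.
    """
    rows, cols = len(label_map), len(label_map[0])
    num_classes = len(label_map[0][0])
    x0, y0, x1, y1 = box

    pooled = [0.0] * num_classes
    total_weight = 0.0
    for r in range(rows):
        # Vertical overlap between the crop and this grid row.
        overlap_y = min(y1, (r + 1) / rows) - max(y0, r / rows)
        if overlap_y <= 0:
            continue
        for c in range(cols):
            # Horizontal overlap between the crop and this grid column.
            overlap_x = min(x1, (c + 1) / cols) - max(x0, c / cols)
            if overlap_x <= 0:
                continue
            weight = overlap_x * overlap_y
            total_weight += weight
            for k in range(num_classes):
                pooled[k] += weight * label_map[r][c][k]

    if total_weight == 0:
        raise ValueError("crop does not overlap the label map")
    pooled = [p / total_weight for p in pooled]

    # Softmax (shifted by the max for numerical stability).
    peak = max(pooled)
    exps = [math.exp(p - peak) for p in pooled]
    norm = sum(exps)
    return [e / norm for e in exps]


# Hypothetical 2x2 label map with two classes: the left half of the image
# scores high for class 0, the right half for class 1.
label_map = [
    [[4.0, 0.0], [0.0, 4.0]],
    [[4.0, 0.0], [0.0, 4.0]],
]

left = crop_label(label_map, (0.0, 0.0, 0.5, 1.0))   # crop of the left half
right = crop_label(label_map, (0.5, 0.0, 1.0, 1.0))  # crop of the right half
whole = crop_label(label_map, (0.0, 0.0, 1.0, 1.0))  # the full image
```

In this toy case, the left crop’s label favors class 0, the right crop’s favors class 1, and the full image splits evenly between the two. The label map is computed once per image, so each new crop costs only a cheap pooling operation rather than another pass through the teacher.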
Results: The researchers compared a ResNet-50 trained on ImageNet using their labels to one trained using the standard labels. The new labels improved test classification accuracy from 77.5 percent to 78.9 percent.
Why it matters: Images on social and photo-sharing sites tend to be labeled with tags, but a tag that reads, say, “ox” indicates only that an ox appears somewhere in the image. This approach could enable vision models to take better advantage of data sources like this.
We’re thinking: A bounding box around every object of interest would ameliorate the cropping problem — but such labels aren’t always easy to get.