Vision Transformer (ViT) outperformed convolutional neural networks in image classification, but it required more training data. New work enabled ViT and its variants to outperform other architectures with less training data.
What’s new: Seung Hoon Lee, Seunghyun Lee, and Byung Cheol Song at Inha University proposed two tweaks to transformer-based vision architectures.
Key insight: ViT and its variants divide input images into smaller patches, generate a representation (that is, a token) of each patch, and apply self-attention to track the relationships between each pair of tokens. Dividing an image can obscure the relationships between its parts, so adding a margin of overlap around each patch can help the attention layers learn these relationships. Moreover, an attention layer may fail to distinguish sufficiently between strong and weak relationships among patches, which interferes with learning. For instance, it may weight the relationship between a background patch and a foreground patch only slightly lower than that between two foreground patches. Enabling the attention layers to learn to adjust such values should boost the trained model’s performance.
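To make the point about strong and weak relationships concrete, here is a minimal sketch of how a scaling factor applied before softmax changes attention weights. The scores and temperature values are made up for illustration, and treating the adjustment as a softmax temperature is an assumption of this sketch rather than a description of the authors' code.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax over a list of scores after dividing each by a temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarity scores from one foreground token to three others:
# two foreground patches and one background patch.
scores = [2.0, 1.8, 1.5]

# High temperature: the weights are nearly uniform, so weak and strong
# relationships look almost the same to the next layer.
flat = softmax(scores, temperature=4.0)

# Low temperature: the weights concentrate on the highest score.
sharp = softmax(scores, temperature=0.25)

print([round(w, 3) for w in flat])
print([round(w, 3) for w in sharp])
```

If the temperature is a trainable parameter, the network can choose how sharply to separate these cases instead of relying on a fixed scale.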
How it works: Starting with a collection of transformer-based image classifiers, the authors built modified versions that implemented two novel techniques. The models included ViT, T2T, CaiT, PiT, and Swin. They were trained on datasets of 50,000 to 100,000 images (CIFAR-10, CIFAR-100, Tiny-ImageNet, and SVHN) as well as the standard ImageNet training set of 1.28 million images.
- The first modification (shifted patch tokenization, or SPT) created overlap between adjacent patches. Given an image, the model produced four copies and shifted each copy diagonally in a different direction by half the length of a patch. It divided the original and shifted images into patches and concatenated the patches that occupied corresponding positions. Given each concatenated patch, it created a representation.
- The second modification (locality self-attention, or LSA) altered the self-attention mechanism. Given the matrix of dot products between patch tokens (typically the first step in self-attention), the model masked the diagonal. That is, it set to negative infinity every value that represented the relationship between a token and itself, causing the model to ignore self-relationships and attend to other tokens. It also rescaled the matrix using a learned parameter, so the model increased the weight of the strongest relationships while decreasing the others.
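The two modifications can be sketched in NumPy as follows. This is an illustrative approximation under stated assumptions, not the authors' implementation: the function names, image size, patch size, and random weights are hypothetical, the shifted copies are zero-padded at the border, the unshifted image is concatenated along with the four shifted copies, and normalization layers and multi-head attention are omitted.

```python
import numpy as np

def shift(img, dy, dx):
    """Shift an (H, W, C) image by (dy, dx) pixels, zero-padding the exposed border."""
    h, w, _ = img.shape
    out = np.zeros_like(img)
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def shifted_patch_tokens(img, patch):
    """SPT sketch: stack the image with four diagonal shifts, then cut flat patches."""
    s = patch // 2  # shift by half the patch length
    views = [img] + [shift(img, dy, dx) for dy in (-s, s) for dx in (-s, s)]
    stacked = np.concatenate(views, axis=-1)  # (H, W, 5C)
    h, w, c = stacked.shape
    grid = stacked.reshape(h // patch, patch, w // patch, patch, c)
    # Bring the two patch-grid axes to the front, then flatten each patch.
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def locality_self_attention(x, wq, wk, wv, temperature):
    """LSA sketch: diagonal masking plus a temperature that would be learned."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = (q @ k.T) / temperature   # in training, temperature is a trainable scalar
    np.fill_diagonal(scores, -np.inf)  # mask each token's relationship with itself
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical usage: a random 32x32 RGB image, 8x8 patches, 64-dimensional tokens.
rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
tokens = shifted_patch_tokens(img, patch=8)

dim = 64
embed = rng.normal(scale=0.02, size=(tokens.shape[1], dim))  # stands in for a learned projection
x = tokens @ embed
wq, wk, wv = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

out, weights = locality_self_attention(x, wq, wk, wv, temperature=np.sqrt(dim))
print(tokens.shape, out.shape)
```

Because each concatenated patch contains pixels from the shifted copies, every token sees a margin of its neighbors. After masking, each row of the attention weights places zero weight on the token itself and spreads the rest across the other tokens.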
Results: The alterations boosted the top-1 accuracy of all models on all datasets. They improved the accuracy of PiT and CaiT by 4.01 percent and 3.43 percent on CIFAR-100, and the accuracy of ViT and Swin by 4.00 percent and 4.08 percent on Tiny-ImageNet. They improved the ImageNet accuracy of ViT, PiT, and Swin by 1.60 percent, 1.44 percent, and 1.06 percent, respectively.
Yes, but: The authors also applied their alterations to the convolutional architectures ResNet and EfficientNet. Only CaiT and Swin surpassed them on CIFAR-100 and SVHN. Only CaiT beat them on Tiny-ImageNet. No transformer beat ResNet’s performance on CIFAR-10, though all the modified transformers except ViT beat EfficientNet on the same task.
Why it matters: Transformers have already revolutionized natural language processing, and now they are poised to do the same for computer vision. The authors’ approach makes transformers more practical for visual tasks in which training data is limited.
We’re thinking: Transformers are making great strides in computer vision. Will they supplant convolutional neural networks? Stay tuned!