Work on vision transformers exploded in 2022.
What happened: Researchers published an abundance of ViT papers during the year. A major theme: combining self-attention and convolution.
Driving the story: A team at Google Brain introduced vision transformers (ViTs) in 2020, and the architecture has undergone nonstop refinement since then. The latest efforts adapt ViTs to new tasks and address their shortcomings.
- ViTs learn best from immense quantities of data, so researchers at Meta and Sorbonne University concentrated on improving performance on datasets of (merely) millions of examples. They boosted performance using transformer-specific adaptations of established procedures such as data augmentation and model regularization.
- Researchers at Inha University modified two key components to make ViTs more like convolutional neural networks. First, they divided images into patches with more overlap. Second, they modified self-attention to focus on a patch's neighbors rather than on the patch itself, and enabled it to learn whether to weigh neighboring patches more evenly or more selectively. These modifications brought a significant boost in accuracy.
- Researchers at the Indian Institute of Technology Bombay outfitted ViTs with convolutional layers. Convolution brings benefits like local processing of pixels and smaller memory footprints due to weight sharing. With respect to accuracy and speed, their convolutional ViT outperformed the usual version as well as runtime optimizations of transformers such as Performer, Nyströformer, and Linear Transformer. Other teams took similar approaches.
Behind the news: While much ViT research aims to surpass and ultimately replace convolutional neural networks (CNNs), the more potent trend is to marry the two. The ViT’s strength lies in its ability to consider relationships between all pixels in an image at small and at large scales. One downside is that it needs additional training to learn in ways that are baked into the CNN architecture after random initialization. CNN’s local context window (within which only local pixels matter) and weight sharing (which enables it to process different image locations identically) help transformers to learn more from less data.
Where things stand: The past year expanded the Vision Transformer’s scope in a number of applications. ViTs generated plausible successive video frames, generated 3D scenes from 2D image sequences, and detected objects in point clouds. It's hard to imagine recent advances in text-to-image generators based on diffusion models without them.