Vision Transformer and models like it use a lot of computation and memory when processing images. New work modifies these architectures to run more efficiently while adopting helpful properties from convolutions.
What’s new: Pranav Jeevan P and Amit Sethi at the Indian Institute of Technology Bombay proposed Convolutional Xformers for Vision (CXV), a suite of revamped vision transformers.
Key insight: The amounts of computation and memory required by a transformer’s self-attention mechanism rise quadratically with the size of its input, while the amounts required by linear attention scale linearly. Using linear attention instead should boost efficiency. Furthermore, self-attention layers process input images globally, while convolutions work locally on groups of adjacent pixels. So adding convolutions should enable a transformer to generate representations that emphasize nearby pixels, which are likely to be closely related. Convolutions offer additional benefits, too, such as translation equivariance (that is, they generate the same representation of a pattern regardless of its location in an image).
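The efficiency argument can be sketched in a few lines: replacing the softmax with a positive feature map lets attention be computed as φ(Q)(φ(K)ᵀV), so the d×d summary φ(K)ᵀV is built once instead of an n×n score matrix. Here is a minimal NumPy sketch; the feature map (ReLU plus one) and the shapes are illustrative choices, not the authors’ exact settings:

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard self-attention: the n x n score matrix makes cost O(n^2 * d).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1.0):
    # Kernelized attention: a positive feature map phi replaces the softmax,
    # so phi(K)^T V is a (d, d) summary and total cost is O(n * d^2).
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                       # (d, d), independent of sequence length n
    norm = Qp @ Kp.sum(axis=0)          # per-query normalizer
    return (Qp @ kv) / norm[:, None]

n, d = 16, 8                            # toy sequence length and feature size
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)                        # (16, 8)
```

Both functions return one representation per input position; the difference is that the linear version never materializes the n×n attention matrix, which is what makes cost grow linearly with input size.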
How it works: In each of three transformers, the authors added convolutions and replaced self-attention with a different variety of linear attention. One used Performer’s variation on linear attention, another used Nyströmformer’s, and the third used Linear Transformer’s. The models were trained to classify images in CIFAR-10, CIFAR-100, and TinyImageNet.
- Given an image, the models divided it into patches and applied a stack of convolutional layers that learned to generate a representation of each pixel.
- They processed the representations through consecutive modified transformer layers, each containing a convolutional layer, linear attention layer, and fully connected layer.
- The convolutional layer produced a different representation if an input image were rearranged so identical patches arrived in a different order. This position sensitivity obviated the need for a transformer’s usual position embeddings — vectors that encode the order of input data.
- A fully connected layer performed classification.
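The layer structure described above can be sketched in PyTorch. This is an illustrative sketch under assumptions, not the authors’ implementation: the class names, dimensions, and residual wiring are hypothetical, and the linear attention here is a simple kernelized variant standing in for the Performer, Nyströmformer, or Linear Transformer mechanisms the paper actually uses.

```python
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Kernelized attention with cost linear in sequence length (a sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim)

    def forward(self, x):                                  # x: (batch, n, dim)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k = torch.relu(q) + 1.0, torch.relu(k) + 1.0    # positive feature map
        kv = torch.einsum("bnd,bne->bde", k, v)            # (dim, dim) summary
        norm = torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6
        return torch.einsum("bnd,bde->bne", q, kv) / norm.unsqueeze(-1)

class CXVBlock(nn.Module):
    """One modified transformer layer: convolution, then linear attention,
    then a fully connected sublayer (names and sizes are hypothetical)."""
    def __init__(self, dim, side):
        super().__init__()
        self.side = side                                   # patches per image side
        # Local mixing; also makes the block sensitive to patch order,
        # which removes the need for position embeddings.
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.attn = LinearAttention(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):                                  # x: (batch, n, dim)
        b, n, d = x.shape
        img = x.transpose(1, 2).reshape(b, d, self.side, self.side)
        x = x + self.conv(img).flatten(2).transpose(1, 2)  # back to (b, n, dim)
        x = x + self.attn(x)
        return x + self.ff(x)

block = CXVBlock(dim=32, side=8)                           # 8x8 = 64 patch tokens
out = block(torch.randn(2, 64, 32))
print(out.shape)                                           # torch.Size([2, 64, 32])
```

Stacking several such blocks after the initial convolutional stem, then averaging the tokens into a fully connected classification head, would complete the pipeline the bullets describe.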
Results: All three CXV models consistently outperformed not only Vision Transformer but also previous models of the same size that used linear attention mechanisms from Performer, Nyströmformer, and Linear Transformer models. They also outperformed ResNets an order of magnitude larger. For example, the CXV model (1.3 million parameters) outfitted with Performer’s linear attention achieved 91.42 percent accuracy on CIFAR-10 and required 3.2 GB of memory. A ResNet-18 (11.2 million parameters) achieved 86.29 percent, though it required only 0.6 GB of memory. Hybrid ViP-6/8 (1.3 million parameters), which also used Performer’s linear attention mechanism without convolutions, achieved 77.54 percent while using 5.9 GB of memory.
Yes, but: The authors experimented with low-resolution images (32x32 in CIFAR and 64x64 in TinyImageNet). Their results might have been even more dramatic had they used higher-resolution images.
Why it matters: Researchers have looked to linear attention to make vision transformers more efficient virtually since the original Vision Transformer was proposed. Adding convolutions can give these architectures even more capability and flexibility, as shown by this work as well as LeViT, CvT, and ConvMixer.
We’re thinking: To paraphrase the great author Mark Twain, reports of the convolution’s death are greatly exaggerated.