Transformers famously require quadratically more computation as input size increases, leading to a variety of methods to make them more efficient. A new approach alters the architecture’s self-attention mechanism to balance computational efficiency with performance on vision tasks.
What's new: Pale-Shaped self-Attention achieved strong vision results by applying self-attention to a grid-like pattern of rows and columns within an image. Sitong Wu led the work with colleagues at Baidu Research, the Chinese National Engineering Laboratory for Deep Learning Technology and Application, and the Chinese Academy of Sciences.
Key insight: Previous attempts to reduce the computational cost of self-attention include axial self-attention, in which a model divides an image into patches and applies self-attention to a single row or column at a time, and cross-shaped attention, which processes a combined row and column at a time. The pale-shaped version processes patches in a pattern of rows and columns (one meaning of “pale” is fence, evoking the lattice of horizontal rails and vertical pickets). This enables self-attention to extract large-scale features from a smaller portion of an image.
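To make the difference concrete, here is a minimal sketch that marks which patches of a 7x7 grid fall into a single attention group under each scheme. The grid size and the particular rows and columns chosen are illustrative, not taken from the paper:

```python
import numpy as np

H = W = 7  # a small illustrative grid of image patches

# Axial: one row (or one column) at a time.
axial = np.zeros((H, W), dtype=bool)
axial[3] = True

# Cross-shaped: one row plus one column together.
cross = axial.copy()
cross[:, 3] = True

# Pale-shaped: several rows AND several columns together,
# forming a fence-like lattice (here, 2 rows and 2 columns).
pale = np.zeros((H, W), dtype=bool)
pale[[1, 5], :] = True
pale[:, [1, 5]] = True

for name, mask in [("axial", axial), ("cross", cross), ("pale", pale)]:
    print(name, int(mask.sum()), "patches")
# axial covers 7 patches, cross 13, pale 24: each step widens the
# region a single self-attention operation can draw features from.
```

The counts show why the pale shape captures larger-scale context per attention operation than its predecessors at a cost that still grows far more slowly than full self-attention over all 49 patches.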
How it works: The authors implemented their pale-shaped scheme in Pale Transformer, which processed an image through alternating convolutional layers and groups of 2 to 16 transformer blocks. They trained it on ImageNet.
- The authors divided the input image into patches.
- The convolutional layers reduced the size of the image by a factor of 2 or 4.
- In each transformer block, the self-attention mechanism divided the input patches into interlaced sets, each containing 7 evenly spaced rows and 7 evenly spaced columns. It processed each set of rows and each set of columns separately, then concatenated the resulting representations and passed them along to the next convolutional layer or transformer block.
- The last transformer block fed a fully connected layer for classification.
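The attention step above can be sketched in a few lines of NumPy. This is a simplification for illustration: it uses a single head with identity Q/K/V projections (the real model learns these), and it sums the row-wise and column-wise results rather than splitting channels between the two branches as the paper does:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # tokens: (n, c). Scaled dot-product attention with identity
    # projections -- a stand-in for the learned Q, K, V of the model.
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

def pale_attention(x, p):
    # x: (H, W, C) grid of patch features; p: rows/columns per pale.
    # Each group holds p evenly spaced rows (or columns); attention
    # runs within each group, and the two branches are summed here
    # (the paper instead concatenates channel halves).
    H, W, C = x.shape
    out_rows, out_cols = np.zeros_like(x), np.zeros_like(x)
    n_groups_r, n_groups_c = H // p, W // p
    for g in range(n_groups_r):
        rows = np.arange(g, H, n_groups_r)          # p evenly spaced rows
        flat = x[rows].reshape(-1, C)               # (p*W, C) tokens
        out_rows[rows] = self_attention(flat).reshape(p, W, C)
    for g in range(n_groups_c):
        cols = np.arange(g, W, n_groups_c)          # p evenly spaced columns
        flat = x[:, cols].reshape(-1, C)            # (H*p, C) tokens
        out_cols[:, cols] = self_attention(flat).reshape(H, p, C)
    return out_rows + out_cols

# Usage: an 8x8 grid of 4-dimensional patch features, pales of 2 rows/columns.
x = np.random.default_rng(0).random((8, 8, 4))
y = pale_attention(x, p=2)
print(y.shape)  # the output grid keeps the input's shape, (8, 8, 4)
```

Each token attends to every token sharing its pale, so the cost scales with the pale size rather than with the full number of patches.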
Results: The authors tested three variants of Pale Transformer, each with a different number of parameters: Pale-T (Tiny, 22 million parameters), Pale-S (Small, 48 million parameters), and Pale-B (Base, 85 million parameters). Each achieved better top-1 classification accuracy on ImageNet than competing convolutional neural networks and transformers of similar size. For example, Pale-B achieved state-of-the-art accuracy of 85.8 percent while the best competing model, VOLO-D2 (59 million parameters), scored 85.2 percent. Pale-B required somewhat more computation (15.6 gigaflops) than VOLO-D2 (14.1 gigaflops), but both required far less than a vision transformer with 86 million parameters (55.4 gigaflops). The authors also compared Pale-T against axial and cross-shaped attention. Pale-T achieved 83.4 percent accuracy on ImageNet. The same model with axial attention achieved 82.4 percent and, with cross-shaped attention, achieved 82.8 percent.
Why it matters: This work suggests that there’s room to improve the transformer’s tradeoff between efficiency and performance by changing the way inputs are processed.
We’re thinking: Will this team’s next project be beyond the pale?