The paper that introduced the transformer famously declared, “Attention is all you need.” To the contrary, new work shows you may not need transformer-style attention at all.
What’s new: Hanxiao Liu and colleagues at Google Brain developed the gated multi-layer perceptron (gMLP), a simple architecture that performed some language and vision tasks as well as transformers.
Key insight: A transformer processes input sequences using both a vanilla neural network, often called a multi-layer perceptron, and a self-attention mechanism. The vanilla neural network works on relationships between each element within the vector representation of a given token — say, a word in text or pixel in an image — while self-attention learns the relationships between each token in a sequence. However, the vanilla neural network also can do this job if the sequence length is fixed. The authors reassigned attention’s role to the vanilla neural network by fixing the sequence length and adding a gating unit to filter out the least important parts of the sequence.
How it works: To evaluate gMLP in a language application, the authors pretrained it to predict missing words in the English version of the text database C4 and fine-tuned it to classify positive and negative sentiment expressed by excerpts from movie reviews in SST-2. For vision, they trained it on ImageNet using image patches as tokens.
- The model passed input sequences to a series of gMLP blocks, each of which contained a vanilla neural network, followed by a gating unit and another vanilla neural network.
- The vanilla neural networks processed a 768-element vector representation of each token individually to find relationships among the elements.
- The gating unit effectively zeroed out parts of the input to ensure they would have little effect on the output. It did this by multiplying the input by a learned vector such that, if values in the vector were near zero, the corresponding input values would be near zero.
- Different softmax layers learned to predict the next word in C4, classify sentiment in SST-2, and classify ImageNet.
Results: In tests, gMLP performed roughly as well as the popular transformer-based language model BERT. The authors compared the performance on C4 of comparably sized, pretrained (but not fine-tuned) models. gMLP achieved 4.28 perplexity, which measures a model’s ability to predict words in a test set (smaller is better), while BERT achieved 4.17 perplexity. On SST-2, gMLP achieved 94.2 percent accuracy, while BERT achieved 93.8 percent accuracy. The authors’ approach performed similarly well in image classification after training on ImageNet. gMLP achieved 81.6 percent accuracy compared to a DeiT-B’s 81.8 percent accuracy.
Why it matters: This model, along with other recent work from Google Brain, bolsters the idea that alternatives based on old-school architectures can approach or exceed the performance of newfangled techniques like self-attention.
We’re thinking: When someone invents a model that does away with attention, we pay attention!