As transformer networks move to the fore in applications from language to vision, the time it takes them to crunch longer sequences becomes a more pressing issue. A new method lightens the computational load using sparse attention.
What’s new: BigBird, an attention mechanism developed by a Google team led by Manzil Zaheer and Guru Guruganesh, enables transformers to process long sequences more efficiently. Their work follows a similar effort using an entirely different method, linear attention.
Key insight: Recent research showed that transformers are Turing-complete, meaning they can learn to compute any algorithm, and universal approximators, meaning they can learn nearly any sequence-to-sequence function. The authors focused on approaches to accelerating transformers that maintain these two theoretical properties.
How it works: The basic transformer’s multiheaded self-attention mechanism compares every pair of tokens in an input sequence, so the amount of computation required grows quadratically with sequence length. Where linear attention would shrink the computation budget by reformulating the problem using the kernel trick, BigBird combines three sparse attention mechanisms that keep the number of comparisons constant: window attention, global attention, and random attention.
- Window attention compares each token only with its nearby neighbors. This matters because adjacent tokens often influence one another most strongly.
- Global attention compares a constant number of tokens with every other token. Across multiple layers, this provides an indirect path between every pair of tokens, even though most pairs are never compared directly.
- Random attention compares each token with a fixed number of randomly selected tokens. Graph theory suggests that this keeps the network from missing important relationships that window and global attention don't cover.
- This combination makes BigBird Turing-complete and a universal approximator.
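The three mechanisms above can be pictured as a sparse boolean mask over token pairs. The sketch below is illustrative, not Google's implementation, and the parameter values (window size, counts of global and random tokens) are assumptions chosen for readability:

```python
import numpy as np

def bigbird_mask(seq_len, window=3, n_global=2, n_random=2, seed=0):
    """Build a BigBird-style sparse attention mask.

    mask[i, j] is True where token i attends to token j.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Window attention: each token attends to `window` neighbors on each side.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # Global attention: the first n_global tokens attend to everything,
    # and every token attends back to them.
    mask[:n_global, :] = True
    mask[:, :n_global] = True

    # Random attention: each token attends to n_random random tokens.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=n_random, replace=False)] = True

    return mask

mask = bigbird_mask(seq_len=64)
print(mask.sum(), "comparisons vs.", 64 * 64, "for full attention")
```

Because each row of the mask holds a roughly constant number of True entries regardless of sequence length, the comparison count grows linearly rather than quadratically as sequences get longer.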
Results: A model equipped with BigBird processed text sequences eight times longer than a RoBERTa baseline while using 16GB of memory. A Longformer model designed for long sequences required 48GB and half the batch size to process the same sequence length. Longer sequences enabled BigBird to achieve a masked language modeling (MLM) score, in which lower numbers indicate better prediction of words missing from text, of 1.274 compared with the RoBERTa baseline's 1.469. BigBird also outperformed RoBERTa on Natural Questions, HotpotQA, TriviaQA, and WikiHop.
Yes, but: To achieve such results, BigBird required more hyperparameter tuning and architecture search than typical self-attention.
Why it matters: The ability to process longer sequences efficiently points toward faster training, lower memory requirements, higher benchmark scores, and potentially new applications that require keeping track of book-length sequences. The benefits of Turing completeness and universal approximation are theoretical for now, but BigBird ensures that they won’t fall by the wayside.
We’re thinking: The paper is 50 pages long. Now maybe transformer models, at least, can read it in one sitting.