Transformer networks have revolutionized natural language processing, but they hog processor cycles and memory. New research demonstrates a more frugal variation.
What’s new: Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya at UC Berkeley and Google modified the transformer architecture to run faster while requiring orders of magnitude less memory during training. They call the new version Reformer.
Key insight: The transformer architecture is inherently inefficient: It tracks relationships among all input tokens, whether or not they matter to the output, and training requires a lot of memory. A few simple tweaks can rein in these excesses.
How it works: The researchers replaced the transformer’s standard residual layers with reversible residual layers, and they modified the attention mechanism to use locality-sensitive hashing (LSH).
- Typically, a transformer must keep the activations of every layer in memory during training so that backpropagation can use them. In Reformer, the inputs to each reversible residual layer can be recomputed from its outputs, so backpropagation can proceed one layer at a time without storing activations for the entire network. That way, activation memory stays at roughly one layer’s worth, however deep the network is.
- A transformer’s attention mechanism encodes relationships between the current token and previous tokens, but usually only a few are important. Locality-sensitive hashing sorts the previous tokens into buckets according to similarity. Then Reformer computes attention relationships only within buckets.
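The reversible-layer idea can be illustrated with a minimal sketch. It assumes the standard two-stream reversible residual formulation (y1 = x1 + F(x2), y2 = x2 + G(y1)); the functions `f` and `g` are hypothetical stand-ins for the attention and feed-forward sublayers, not the authors' code.

```python
import math

def f(v):
    # Hypothetical stand-in for one sublayer (e.g. attention).
    return [math.tanh(a) for a in v]

def g(v):
    # Hypothetical stand-in for the other sublayer (e.g. feed-forward).
    return [0.5 * a * a for a in v]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def reversible_forward(x1, x2):
    # The input is split into two streams; each stream is updated
    # using a function of the other.
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2

def reversible_inverse(y1, y2):
    # Undo the two updates in reverse order. Only the outputs are needed,
    # so the inputs never have to be kept in memory.
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2

x1, x2 = [0.1, -0.4, 0.7], [1.0, 0.25, -0.3]
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)

# Largest difference between the original and the recovered inputs.
max_error = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
print(max_error)
```

Because the inputs are recovered up to floating-point rounding, a training loop could discard them after the forward pass and rebuild them layer by layer during backpropagation, trading extra computation for memory.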
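The bucketing step can be sketched in the same spirit. This is a simplified illustration, assuming a random-projection hash in which a vector's bucket is the index of the largest entry of its projections concatenated with their negations; the article does not specify the hash function, and all names here (`make_projections`, `lsh_bucket`, and so on) are hypothetical. A full implementation would also sort, chunk, and hash over several rounds.

```python
import random

def make_projections(dim, n_buckets, seed=0):
    """Draw n_buckets // 2 random Gaussian directions (n_buckets must be even)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(n_buckets // 2)]

def lsh_bucket(vec, projections):
    """Hash a vector to the index of the largest entry of [scores; -scores]."""
    scores = [sum(a * b for a, b in zip(vec, p)) for p in projections]
    scores = scores + [-s for s in scores]
    return max(range(len(scores)), key=scores.__getitem__)

def bucket_tokens(vectors, projections):
    """Group token positions by bucket; similar directions tend to collide."""
    buckets = {}
    for position, vec in enumerate(vectors):
        buckets.setdefault(lsh_bucket(vec, projections), []).append(position)
    return buckets

def causal_pairs_within_buckets(buckets):
    """Count (query, key) pairs when each token attends only to itself
    and to earlier tokens in its own bucket."""
    return sum(len(m) * (len(m) + 1) // 2 for m in buckets.values())

# Toy data: 256 random 16-dimensional token vectors (made up for illustration).
rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(256)]
projections = make_projections(dim=16, n_buckets=8)
buckets = bucket_tokens(tokens, projections)

full_pairs = len(tokens) * (len(tokens) + 1) // 2  # ordinary causal attention
lsh_pairs = causal_pairs_within_buckets(buckets)
print(full_pairs, lsh_pairs)
```

Comparing `full_pairs` with `lsh_pairs` shows how restricting attention to buckets reduces the number of token pairs that must be scored. The exact saving depends on how evenly the tokens spread across buckets.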
Results: Using 16GB of memory, the authors ran experiments on Wikipedia text parceled into sequences of 64,000 tokens (more than double the number in the original transformer paper). Reformer achieved almost the same performance as a transformer with an identical number of parameters while consuming less memory. Furthermore, the time required to compute LSH attention grew more slowly as sequence length increased.
Why it matters: Researchers seeking better performance are pumping up transformer-based models to immense sizes — Microsoft’s latest language model has 17 billion parameters. Running such behemoths can be out of reach for all but the largest corporate research labs. Reformer offers a more efficient alternative.
We’re thinking: Reformer’s improvements equip the transformer architecture for reading and generating long sequences — not only text, but also long-form video and audio. This capability could lead to larger-scale benchmarks to propel transformers into new tasks.