The transformer architecture has inspired a plethora of variations. Yet researchers have used a patchwork of metrics to evaluate their performance, making them hard to compare. New work aims to level the playing field.
What’s new: Yi Tay and colleagues at Google developed Long-Range Arena, a suite of benchmarks designed to standardize comparisons between transformers. The term long-range refers to transformers’ ability to capture dependencies between tokens in an input sequence that are far apart.
Key insight: The power of the original transformer lies in its ability to track relationships between tokens anywhere in an input sequence. But that power comes at a cost: The model slows down and its memory requirement rises dramatically as the length of its input increases. Many variants were created to alleviate this issue. These models cry out for tests that challenge their ability over long sequences.
How it works: The suite comprises six tests designed to probe different aspects of transformer behavior.
- Long ListOps requires a model to calculate the numeric output from a long list of ordered operations; for instance, to determine that [MAX 4 3 [MIN 2 3 ] 1 0 [MEDIAN 1 5 8 9, 2]] equals 5. It investigates how well a model can parse long sequences.
- Character-Level Text Classification is a binary sentiment classification task based on movie reviews in the IMDb dataset. Its objective is to test a model’s accuracy when processing documents up to 4,000 characters long.
- Character-Level Document Retrieval evaluates similarity between two documents using the ACL Anthology Network dataset, which identifies when one paper cites another. It assesses a model’s ability to compress text inputs for comparison.
- Image Classification on Sequences of Pixels classifies objects in CIFAR-10 images that have been flattened into a one-dimensional sequence. This tests how well a model learns spatial relationships between pixels.
- Pathfinder and Pathfinder-X require a model to decide whether two circles are connected by a path that consists of dashes in a generated image. Pathfinder-X increases the difficulty by making the image area 16 times larger. Both tasks test how well a model learns long-range spatial relationships. Pathfinder-X also gauges how the results change if the sequence length is much longer.
Results: The authors tested 10 transformers. While some shined in one task or another, none was the clear front runner. Big Bird achieved the highest average accuracy – 55.01 percent across five tasks — but it didn’t achieve the top score in any single task. Performer, dominated character-level text classification, performing 5.7 times faster than a vanilla transformer. Linformer used the least memory, 9.58 times less than the vanilla transformer. All models failed Pathfinder-X: Their ability to classify the image was no better than random chance, showing that longer input sequences inhibited performance.
Why it matters: Standardized comparisons not only help application developers choose the right model for their needs, they also provide a deeper understanding of a model’s performance characteristics and spur researchers to improve the state of the art.
We’re thinking: You can get involved, too. The team open sourced its work and encourages others to contribute to the Long-Range Arena leaderboard.