The transformer architecture has shown an uncanny ability to model not only language but also images and proteins. New research found that it can apply what it learns from the first domain to the others.
What’s new: Kevin Lu and colleagues at UC Berkeley, Facebook, and Google devised Frozen Pretrained Transformer (FPT). After pretraining a transformer network on language data, they showed that it could perform vision, mathematical, and logical tasks without fine-tuning its core layers.
Key insight: Transformers pick up on patterns in an input sequence, be it words in a novel, pixels in an image, or amino acids in a protein. If different types of data share similar patterns, a transformer trained on one type can operate on another.
How it works: The researchers started with a 36-layer GPT-2 pretrained on WebText (text from web pages linked in Reddit posts). They froze its self-attention and feed-forward layers and, in separate copies, fine-tuned the peripheral layers on each of a wide range of tasks: Bit memory (memorizing strings of bits), Bit XOR (performing logical operations on pairs of strings of bits), ListOps (parsing and performing mathematical operations), MNIST and CIFAR-10 (classification of images), CIFAR-10 LRA (classification of flattened, greyscale images), and remote homology detection (predicting what kind of protein structure an amino acid is part of).
- The authors fine-tuned only an input layer, an output layer, layer norm parameters (which set the scale and offset of a layer’s normalized input), and positional embeddings (vectors that represent where items appear in an input sequence). Together, these make up less than 0.1 percent of the model’s parameters.
- To evaluate the impact of the language pretraining, the authors also built models whose core layers didn’t benefit from that training. They randomly initialized a GPT-2, froze its self-attention and feed-forward parameters, and then fine-tuned it in the same way as the others.
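The freezing rule described above can be sketched as a simple filter over parameter names. This is a minimal illustration, not the authors' code. The names are assumptions modeled on the Hugging Face GPT-2 naming style (`wpe` for positional embeddings, `ln_` for layer norm, `attn` and `mlp` for the core layers), and `input_layer` and `output_layer` are hypothetical names for the new task-specific layers.

```python
# Assumed naming convention (hypothetical, modeled on Hugging Face GPT-2):
#   wpe          -> positional embeddings
#   ln_1, ln_2,
#   ln_f         -> layer norm parameters
#   attn, mlp    -> self-attention and feed-forward (core) layers
#   input_layer,
#   output_layer -> illustrative names for the new task-specific layers
TRAINABLE_MARKERS = ("input_layer", "output_layer", "wpe", "ln_")


def is_trainable(name: str) -> bool:
    """Return True if a parameter would be fine-tuned under the rule above."""
    return any(marker in name for marker in TRAINABLE_MARKERS)


def split_parameters(names):
    """Split parameter names into (trainable, frozen) lists, preserving order."""
    trainable = [n for n in names if is_trainable(n)]
    frozen = [n for n in names if not is_trainable(n)]
    return trainable, frozen


# Parameter names for a model with a single transformer block.
example_names = [
    "input_layer.weight",
    "wpe.weight",
    "h.0.ln_1.weight",
    "h.0.attn.c_attn.weight",
    "h.0.attn.c_proj.weight",
    "h.0.ln_2.weight",
    "h.0.mlp.c_fc.weight",
    "h.0.mlp.c_proj.weight",
    "ln_f.weight",
    "output_layer.weight",
]

trainable, frozen = split_parameters(example_names)
print("fine-tuned:", trainable)
print("frozen:    ", frozen)
```

In a framework such as PyTorch, a rule like this would typically be applied by setting `requires_grad = False` on each frozen parameter, so that the optimizer updates only the peripheral layers. The randomly initialized baseline would use the same split, just without loading pretrained weights first.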
Results: The authors compared GPT-2 models trained using their method to GPT-2s that had been fully fine-tuned for the same tasks. Their approach performed nearly as well, sometimes better. For instance, on CIFAR-10, their approach achieved 72.1 percent accuracy versus the fully fine-tuned model’s 70.3 percent. On remote homology detection, their approach achieved 12.7 percent accuracy versus 9 percent. Language pretraining contributed to the improvement: For instance, on CIFAR-10, their model achieved 68.2 percent versus the randomized model’s 61.7 percent.
Why it matters: It appears that similar information structures — in the authors’ term, grammars — pervade the world. Applying representations learned in one domain to another domain may conserve training time and lead to better multimodal models.
We’re thinking: It’s surprising that cross-modal pretraining works this well! Are there underlying statistics, common to many types of sequences, that we don’t yet appreciate?