Transformers Take Over: Transformers Applied to Vision, Language, Video, and More

Published

Dec 22, 2021

The transformer architecture extended its reach to a variety of new domains.

What happened: Originally developed for natural language processing, transformers are becoming the Swiss Army knife of deep learning. In 2021, they were harnessed to discover drugs, recognize speech, paint pictures, and much more.

Driving the story: Transformers had already proven adept at vision tasks, predicting earthquakes, and classifying and generating proteins. Over the past year, researchers pushed them into expansive new territory.

TransGAN is a generative adversarial network that incorporates transformers to make sure each generated pixel is consistent with those generated before it. The work achieved state-of-the-art results in measurements of how closely generated images resembled the training data.
Facebook’s TimeSformer used the architecture to recognize actions in video clips. Rather than the usual sequence of words in text, it interprets the sequence of video frames. It outperformed convolutional neural networks, analyzing longer clips in less time and using less power.
Researchers at Facebook, Google, and the University of California, Berkeley trained GPT-2 on text and then froze its self-attention and feed-forward layers. By fine-tuning only the remaining parameters, they adapted it to a wide variety of domains including mathematics, logic problems, and computer vision.
DeepMind released an open-source version of AlphaFold 2, which uses transformers to find the 3D shapes of proteins based on their sequences of amino acids. The model has excited the medical community for its potential to fuel drug discovery and reveal biological insights.
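
The frozen-layer experiment above can be pictured as a filter over parameter names: anything belonging to self-attention or feed-forward blocks is left untouched, and only the rest is updated. The following is a toy sketch, not the researchers' code. The parameter names, the `FROZEN_MARKERS` tuple, and the helpers `is_frozen` and `sgd_step` are all hypothetical, loosely modeled on a GPT-2-style layout, and exactly which parameters stayed trainable is an assumption made for illustration.

```python
# Hypothetical parameter names, loosely modeled on a GPT-2-style layout.
PARAM_NAMES = [
    "wte.weight",              # token (input) embedding
    "wpe.weight",              # positional embedding
    "h.0.ln_1.weight",         # layer norm before attention
    "h.0.attn.c_attn.weight",  # self-attention (frozen)
    "h.0.ln_2.weight",         # layer norm before feed-forward
    "h.0.mlp.c_fc.weight",     # feed-forward (frozen)
    "ln_f.weight",             # final layer norm
    "head.weight",             # task-specific output layer
]

# Assumption: a name containing one of these markers belongs to a frozen block.
FROZEN_MARKERS = (".attn.", ".mlp.")


def is_frozen(name):
    """Return True if the parameter sits in a self-attention or feed-forward block."""
    return any(marker in name for marker in FROZEN_MARKERS)


def sgd_step(params, grads, lr=0.1):
    """Apply one gradient-descent step, skipping frozen parameters."""
    return {
        name: value if is_frozen(name) else value - lr * grads[name]
        for name, value in params.items()
    }


# Scalars stand in for weight tensors to keep the sketch dependency-free.
params = {name: 1.0 for name in PARAM_NAMES}
grads = {name: 0.5 for name in PARAM_NAMES}
updated = sgd_step(params, grads)

for name in PARAM_NAMES:
    status = "frozen" if is_frozen(name) else "trained"
    print(f"{name:26s} {status:8s} {updated[name]:.2f}")
```

In a deep learning framework, the same idea is usually expressed by disabling gradient computation for the frozen parameters rather than skipping them in the update loop.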

Behind the news: The transformer debuted in 2017 and quickly revolutionized language modeling. Its self-attention mechanism, which tracks how each element in a sequence relates to every other element, makes it suitable for analyzing sequences of not only words but also pixels, video frames, amino acids, seismic waves, and so on. Large language models based on transformers have taken center stage as examples of an emerging breed of foundation models: models pretrained on a large, unlabeled corpus that can be fine-tuned for specialized tasks using a limited number of labeled examples. The fact that transformers work well in a variety of domains may portend transformer-based foundation models beyond language.
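
To make "how each element in a sequence relates to every other element" concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification: real transformers first pass each element through learned query, key, and value projections and run several attention heads in parallel, all of which is omitted here. The function names and the toy input are illustrative assumptions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Scaled dot-product self-attention without learned projections.

    sequence: list of equal-length vectors (lists of floats).
    Returns (outputs, weights); weights[i][j] is how strongly
    element i attends to element j.
    """
    dim = len(sequence[0])
    outputs, weights = [], []
    for query in sequence:
        # Compare this element with every element, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        row = softmax(scores)
        weights.append(row)
        # The new representation is a weighted average of all elements.
        outputs.append(
            [sum(w * vec[d] for w, vec in zip(row, sequence)) for d in range(dim)]
        )
    return outputs, weights


# Three toy "tokens"; they could equally stand for pixels, frames, or amino acids.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Nothing in the computation depends on what the vectors represent, which is why the same mechanism applies to words, pixels, video frames, or amino acids once they are encoded as vectors.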

Where things stand: The history of deep learning has seen a few ideas that rapidly became pervasive: the ReLU activation function, Adam optimizer, attention mechanism, and now transformers. The past year’s developments suggest that this architecture is still coming into its own.
