One Model, Hundreds of Tasks: Multimodal Transformer Performs Over 600 Different Tasks

Published: May 18, 2022

Reading time: 2 min read

Researchers took a step toward achieving a longstanding goal: a single model that performs many very different tasks.

What's new: Scott Reed, Konrad Żołna, Emilio Parisotto and a team at DeepMind announced Gato, a model that performs over 600 diverse tasks including generating image captions, manipulating a physical robot arm, and playing Atari.

How it works: The authors trained the 1.2 billion-parameter transformer on seven vision-language tasks such as MS-COCO Captions, a dataset of images and joint angles recorded while a real robot stacked blocks, recordings of state-of-the-art agents performing 595 simulated tasks such as ALE Atari, and the language dataset MassiveText.

The authors tokenized the data before input, turning images, text, button presses, robot arm torques, and so on into a sequence of vectors. Custom tokenizers were designed for different input types. For the simulated tasks, they interleaved observation tokens and action tokens.
They trained the transformer to predict the next token in a sequence. Given an image, it predicted captions; given observations, it predicted actions; given text, it predicted the following text. However, it didn’t predict tokens that represented images or agent observations.
During training, to cue the model about which simulated task it should perform, they added a prompt to the beginning of the input sequence 25 percent of the time. Half of those prompts consisted of a randomly sampled segment of observations and actions of the task at hand. For the other half, the prompt consisted of observations and actions from the end of the sequence, which served the dual purpose of telling the model what the goal was. This way, during inference, the model could be prompted with an example segment and then emulate it.
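The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes observations and actions have already been turned into integer token IDs, it stands in for the model with a table of predicted probabilities, and the names interleave, masked_next_token_loss, sample_prompt, prompt_steps, and prompt_prob are hypothetical. It shows how observation and action tokens might be interleaved, how the loss might skip observation tokens, and how a prompt might be drawn 25 percent of the time, half from the end of an episode and half from a random segment.

```python
import math
import random


def interleave(observations, actions):
    """Flatten an episode into one token sequence: observation tokens, then
    action tokens, for each timestep in order.

    Returns (tokens, is_target). is_target[i] is True only for action tokens,
    since the model is not trained to predict observation tokens.
    """
    if len(observations) != len(actions):
        raise ValueError("need one list of action tokens per observation")
    tokens, is_target = [], []
    for obs, act in zip(observations, actions):
        tokens.extend(obs)
        is_target.extend([False] * len(obs))
        tokens.extend(act)
        is_target.extend([True] * len(act))
    return tokens, is_target


def masked_next_token_loss(probs, tokens, is_target):
    """Average negative log-likelihood over target positions only.

    probs[i] is a dict mapping token -> predicted probability for position i
    (a stand-in for a real model's output, assumed for illustration).
    """
    losses = [-math.log(p[t]) for p, t, m in zip(probs, tokens, is_target) if m]
    return sum(losses) / len(losses) if losses else 0.0


def sample_prompt(steps, prompt_steps, rng, prompt_prob=0.25):
    """Pick a prompt (a contiguous run of timesteps) or return [] for no prompt.

    With probability prompt_prob a prompt is used. Half of the prompts come
    from the end of the episode, which hints at the goal; the other half are
    a randomly placed segment.
    """
    if rng.random() >= prompt_prob:
        return []
    prompt_steps = min(prompt_steps, len(steps))
    if rng.random() < 0.5:
        start = len(steps) - prompt_steps  # segment from the end of the episode
    else:
        start = rng.randrange(len(steps) - prompt_steps + 1)  # random segment
    return steps[start:start + prompt_steps]


# Usage: two timesteps, each with two observation tokens and one action token.
tokens, is_target = interleave([[10, 11], [12, 13]], [[1], [2]])
prompt = sample_prompt(list(range(20)), prompt_steps=5, rng=random.Random(0))
```

In this sketch the mask, rather than a separate model head, is what keeps observation tokens out of the training objective, so the same next-token loss can serve text, captions, and control data.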

Results: In more than 450 of the simulated tasks, Gato achieved at least 50 percent of the score achieved in the recorded simulations. In ALE Atari, Gato matched or exceeded an average human score in 23 of 51 games, and it did at least twice as well in 11 of those 23. Gato successfully piloted a robot arm to stack a red block on top of a blue block (while ignoring a green block) in roughly 50 percent of the trials with previously unseen block shapes, comparable to a specialized baseline model, which achieved 49 percent.

What they’re saying: DeepMind’s research director, Nando de Freitas, used Gato’s achievements to argue that “it’s all about scale”: that larger models and better data are the keys to artificial general intelligence. New York University professor Gary Marcus rebutted this claim, pointing out that, alongside their increasingly brilliant results, large neural networks often generate baffling sentences, images, and behaviors.

Why it matters: This work is the latest, and most expansive, in a line of improvements in multimodal AI recently showcased by the impressive UniT from Facebook. Transformers are well suited to a variety of tasks partly because they find patterns in long input sequences and partly because a variety of data types lend themselves to being divided into sequences to feed them.

We're thinking: Gato is an impressive engineering feat. We don’t find it so interesting that a giant neural network can do what 600 distinct, smaller networks could do. But evidence that Gato might generalize across different tasks is fascinating. Specifically, the authors pretrained Gato, fine-tuned it on four new tasks, and showed that, in three cases, the fine-tuned model outperformed models trained specifically for those tasks. We look forward to more research that evaluates the extent to which such networks, beyond memorizing various unrelated tasks, generalize across tasks and to new tasks. In other words, further progress in the direction indicated by the paper’s title: A Generalist Agent.
