Reinforcement Learning Transformed

Google's Decision Transformer

Transformers have matched or exceeded earlier architectures in language modeling and image classification. New work shows they can achieve state-of-the-art results in some reinforcement learning tasks as well.

What’s new: Lili Chen and Kevin Lu, with colleagues at UC Berkeley, Facebook, and Google, developed Decision Transformer, which models decisions and their outcomes.

Key insight: A transformer learns from sequences, and a reinforcement learning task can be modeled as a repeating sequence of state, action, and reward. Given such a sequence, a transformer can learn to predict the next action (essentially recasting the reinforcement learning task as a supervised learning task). But this approach introduces a problem: If the transformer chooses the next action based on earlier rewards, it won’t learn to take actions that, though they may bring negligible rewards on their own, lay a foundation for winning higher rewards in the future. The solution is to tweak the reward part of the sequence. Instead of showing the model the reward for previous actions, the authors provided the sum of rewards remaining to be earned by completing the task. This way, the model took actions likely to reach that sum.
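The relabeling described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it converts a list of per-step rewards into "returns-to-go," the sum of rewards remaining to be earned at each timestep.

```python
def returns_to_go(rewards):
    """Replace each per-step reward with the sum of rewards remaining
    until the end of the episode (the 'return-to-go')."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):  # accumulate from the final step backward
        running += r
        rtg.append(running)
    return list(reversed(rtg))

# Example: an episode whose per-step rewards are 10, 0, and 90.
print(returns_to_go([10, 0, 90]))  # -> [100.0, 90.0, 90.0]
```

Conditioning on these sums rather than past rewards lets the model choose actions that earn nothing now but are needed to reach the target total later.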

How it works: The researchers trained a generative pretrained transformer (GPT) on recorded matches of three types of games: Atari games with a fixed set of actions, OpenAI Gym games that require continuous control, and Key-to-Door. Winning Key-to-Door requires learning to pick up a key, which brings no reward, and using it to open a door and receive a reward.

  • The transformer generated a representation of each input token using a convolutional layer for visual inputs (Key-to-Door and Atari screens) and a linear layer for other types of input (actions, rewards, and, in OpenAI games, state).
  • During training, it received tokens for up to 50 reward-state-action triplets. For instance, in the classic Atari game Pong, the sum of all rewards for completing the task might be 100. The first action might yield 10 points, so the sum in the next triplet would fall to 90; the state would be the screen image, and the action might describe moving the paddle to a new position. In Key-to-Door, the sum of all rewards for completing the task remained 1 throughout the game (the reward for unlocking the door at the very end); the state was the screen; and the action might be a move in a certain direction.
  • At inference, instead of receiving the sum of rewards remaining to be earned, the model received a total desired reward — the reward the authors wanted the model to receive by the end of the game. Given an initial total desired reward and the state of the game, the model generated the next action. Then the researchers reduced the total desired reward by the amount received for performing the action, and so on.
  • For all games except Key-to-Door, the total desired reward exceeded the greatest sum of rewards for that game in the training set. This encouraged the model to maximize the total reward.
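The inference procedure in the bullets above can be sketched as a simple loop. Here `model` and `env` are hypothetical stand-ins for the trained transformer and the game environment; the point is the bookkeeping, not the model itself.

```python
def rollout(model, env, desired_return, max_steps=1000):
    """Generate actions conditioned on the reward remaining to be earned,
    decrementing the target by each reward actually received."""
    state = env.reset()
    history = []  # (return-to-go, state, action) triplets fed back to the model
    total = 0.0
    for _ in range(max_steps):
        # The model sees the remaining desired reward, the history, and the state.
        action = model.predict(history, desired_return, state)
        state, reward, done = env.step(action)
        history.append((desired_return, state, action))
        desired_return -= reward  # subtract the reward just earned
        total += reward
        if done:
            break
    return total
```

Setting `desired_return` above the best total seen in training data is what nudges the model toward reward-maximizing behavior.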

Results: The authors compared Decision Transformer with the previous state-of-the-art method, Conservative Q-Learning (CQL). They normalized scores of Atari and OpenAI Gym games to make 0 on par with random actions and 100 on par with a human expert. In Atari games, the authors’ approach did worse, earning an average score of 98 versus CQL’s 107. However, it excelled in the more complex games. In OpenAI Gym, it averaged 75 versus CQL’s 64. In Key-to-Door, it succeeded 71.8 percent of the time versus CQL’s 13.1 percent.
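The normalization used for these comparisons is straightforward: a score of 0 corresponds to random play and 100 to a human expert. A one-line sketch:

```python
def normalized_score(score, random_score, expert_score):
    """Rescale a raw game score so random play maps to 0 and expert play to 100."""
    return 100.0 * (score - random_score) / (expert_score - random_score)

# Example with made-up reference scores: random play earns 2, an expert earns 18.
print(normalized_score(10.0, 2.0, 18.0))  # -> 50.0
```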

Why it matters: How to deal with actions that bring a low reward in the present but contribute to greater benefits in the future is a classic issue in reinforcement learning. Decision Transformer learned to solve that problem via self-attention during training.

We’re thinking: It’s hard to imagine using this approach for online reinforcement learning, as the sum of future rewards would be unknown during training. That said, it wouldn’t be difficult to run a few experiments, train offline, and repeat.
