Efficient Reinforcement Learning

IRIS used reinforcement learning to master Atari games with little gameplay.

Transformer-based system simulating the Atari game "Pong"

Both transformers and reinforcement learning models are notoriously data-hungry. They may be less so when they work together.

What's new: Vincent Micheli and colleagues at the University of Geneva trained a transformer-based system to simulate Atari games using a small amount of gameplay. Then they used the simulation to train a reinforcement learning agent, IRIS, to exceed human performance in several games.

Key insight: A transformer excels at predicting the next item in a sequence. Given the output of a video game, it can learn to estimate a reward for the player’s button press and predict tokens that represent the next video frame. Given these tokens, an autoencoder can learn to reconstruct the frame. Together, the transformer and autoencoder form a game simulator that can help a reinforcement learning agent learn how to play.
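The division of labor described above can be sketched in code. The class names, token counts, and internals below are illustrative placeholders, not the authors' implementation: the autoencoder maps frames to discrete tokens and back, and the transformer maps a token-and-action history to next-frame tokens, a reward estimate, and a termination flag.

```python
import random

# Hypothetical sketch of the two world-model components described above.
# A real system learns both; here the internals are stand-ins.

class Autoencoder:
    """Encodes a frame into discrete tokens and reconstructs it from them."""
    def __init__(self, vocab_size=16, tokens_per_frame=4):
        self.vocab_size = vocab_size
        self.tokens_per_frame = tokens_per_frame

    def encode(self, frame):
        # Stand-in: hash pixels into token ids (a trained encoder learns this).
        return [hash((frame, i)) % self.vocab_size
                for i in range(self.tokens_per_frame)]

    def decode(self, tokens):
        # Stand-in reconstruction; a trained decoder renders an image.
        return tuple(tokens)

class WorldModelTransformer:
    """Given token/action history, predicts next-frame tokens, reward, done."""
    def __init__(self, vocab_size=16, tokens_per_frame=4):
        self.vocab_size = vocab_size
        self.tokens_per_frame = tokens_per_frame

    def step(self, token_history, action):
        # Stand-in: sample tokens at random; a trained transformer would
        # predict them autoregressively from the history and action.
        next_tokens = [random.randrange(self.vocab_size)
                       for _ in range(self.tokens_per_frame)]
        reward = 0.0   # estimated reward for the last button press
        done = False   # estimated whether this frame ends the game
        return next_tokens, reward, done
```

Chained together — encode the current frame, step the transformer, decode the predicted tokens — these two pieces form the game simulator the agent trains against.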

How it works: For each of the 26 games in Atari 100k, in a repeating cycle, (i) a reinforcement learning agent played for a short time without learning, (ii) a system learned from the game frames and agent’s button presses to simulate the game, and (iii) the agent learned from the simulation. The total amount of gameplay lasted roughly two hours — 100,000 frames and associated button presses — per game.

  • The agent, which comprises a convolutional neural network followed by an LSTM, played the game for 200 frames. It received a frame and responded by pressing a button (randomly at first). It received no rewards and thus didn’t learn during gameplay.
  • Given a frame, an autoencoder learned to encode it into a set of tokens and reconstruct it from the tokens.
  • Given tokens that represented recent frames and button presses, a transformer learned to estimate the reward for the last button press and generate tokens that represented the next frame. The transformer also learned to estimate whether the current frame would end the game.
  • Given the tokens for the next frame, the autoencoder generated the image. Given the image, the agent learned to choose the button press that would maximize its reward.
  • The cycle repeated: The agent played the game, generating new frames and button presses to train the autoencoder and transformer. In turn, the autoencoder’s and transformer’s outputs trained the agent.
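The repeating three-phase cycle above can be sketched as a loop. Everything here is a toy placeholder (the environment, agent, autoencoder, and world model are stubs with made-up names), shown only to make the control flow concrete; IRIS trains neural networks at each step.

```python
import random

# Illustrative sketch of the three-phase cycle: (i) play without learning,
# (ii) fit the simulator on real gameplay, (iii) train the agent in simulation.
# All classes are stand-ins, not the authors' code.

class ToyEnv:
    def reset(self):
        return 0
    def step(self, action):
        return random.randrange(4)        # next "frame" id

class ToyAgent:
    def __init__(self):
        self.updates = 0
    def act(self, frame):
        return random.randrange(2)        # random button press at first
    def learn_in_imagination(self, world_model, autoencoder):
        self.updates += 1                 # stand-in for policy optimization

class ToyAutoencoder:
    def fit(self, frames):
        self.vocab = sorted(set(frames))  # stand-in for learning frame tokens

class ToyWorldModel:
    def fit(self, trajectory, autoencoder):
        self.n_seen = len(trajectory)     # stand-in for transformer training

def collect_experience(env, agent, num_frames=200):
    """Phase (i): the agent plays without learning, logging frames and actions."""
    trajectory, frame = [], env.reset()
    for _ in range(num_frames):
        action = agent.act(frame)
        trajectory.append((frame, action))
        frame = env.step(action)
    return trajectory

def training_cycle(env, agent, autoencoder, world_model, cycles=3):
    for _ in range(cycles):
        trajectory = collect_experience(env, agent)           # (i) play
        autoencoder.fit([f for f, _ in trajectory])           # (ii) learn tokens
        world_model.fit(trajectory, autoencoder)              # (ii) learn dynamics
        agent.learn_in_imagination(world_model, autoencoder)  # (iii) train agent
    return agent
```

The key design point the sketch preserves: real gameplay is used only to fit the simulator, while the agent's learning happens entirely inside it, which is what keeps the demand for real frames to roughly 100,000 per game.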

Results: The authors’ agent beat the average human score in 10 games including Pong. It also beat state-of-the-art approaches that include lookahead search (in which an agent chooses button presses based on predicted frames in addition to previous frames) in six games, and those without lookahead search in 13 games. It worked best on games that don’t involve sudden changes in the environment, such as when a player moves to a different level.

Why it matters: Transformers have been used in reinforcement learning, but as agents, not as world models. In this work, a transformer acted as a world model — it learned to simulate a game or environment — in a relatively sample-efficient way (100,000 examples). A similar approach could lead to high-performance, sample-efficient simulators.

We're thinking: The initial success of Atari-playing models was exciting partly because the reinforcement learning approach didn’t require building or using a model of the game. A model-based reinforcement learning approach to solving Atari is a surprising turn of events.

