Leveling the Playing Field

Published Sep 11, 2019

Deep reinforcement learning has given machines apparent hegemony in vintage Atari games, but their scores have been hard to compare — with one another or with human performance — because there are no rules governing what machines can and can’t do to win. Researchers aim to change that.

What’s new: Most papers that report superhuman performance in Atari games apply their own limits on gameplay, such as how frequently buttons can be pressed, and those limits vary widely. Researchers from MINES ParisTech and Valeo offer a standardized setup: Standardized Atari Benchmark for Reinforcement Learning (Saber). They use it to achieve a new state of the art in around 60 games from Pong to Montezuma’s Revenge.

Key insight: Marin Toromanoff, Emilie Wirbel, and Fabien Moutarde noticed that reported human world-record scores average 1,000 times higher than the “expert human player” scores given in the first major deep reinforcement learning paper, published in late 2013. Analyzing the settings used in deep learning publications since then, the team pinpointed seven potential causes for the reported variations in performance.

How it works: The authors propose a set of guidelines designed to match human capabilities. Their benchmark includes a new metric for evaluating models, since the previous human benchmark misrepresents human capabilities.

Saber removes the limit on playing time, which many researchers cap at a few minutes. After all, it takes time for human players to rack up a world record!
The benchmark specifies that models receive only the game screen as input, with no further information. For example, they aren’t told which buttons work in a given game, so they must handle all buttons even if some don’t function.
The benchmark ranks models on a normalized scale in which 0 represents a score obtained by pressing buttons randomly and 1 is the human world record.
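
The scale described above can be sketched in a few lines. This is a minimal illustration, assuming the rescaling is linear between the random-play score and the world record. The function name and the sample numbers are hypothetical and are not taken from the benchmark or from any real game.

```python
def normalized_score(agent_score: float, random_score: float, record_score: float) -> float:
    """Rescale a raw game score so random play maps to 0 and the human world record maps to 1.

    Assumes a linear rescaling between the two reference points.
    """
    if record_score == random_score:
        raise ValueError("record_score and random_score must differ")
    return (agent_score - random_score) / (record_score - random_score)


# Hypothetical numbers for illustration only.
print(normalized_score(agent_score=550.0, random_score=50.0, record_score=1050.0))  # 0.5
```

On this scale, a value above 1 means the model beat the world record, and a value below 0 means it did worse than random button presses.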

Results: The researchers tested a state-of-the-art model, Rainbow-IQN, which averaged only 31% of the best human scores. It beat the human world record in four of 58 games.
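
The two headline figures, an average across games and a count of games where the model beats the record, could be computed from per-game normalized scores roughly as follows. This is a sketch only: the function name, game labels, and values are hypothetical, and a score above 1.0 is assumed to mean the world record was exceeded.

```python
from statistics import mean
from typing import Mapping, Tuple


def summarize(normalized_scores: Mapping[str, float]) -> Tuple[float, int]:
    """Return (mean normalized score, number of games scoring above the world record)."""
    values = list(normalized_scores.values())
    average = mean(values)
    # A normalized score above 1.0 is treated as beating the human world record.
    superhuman_games = sum(1 for v in values if v > 1.0)
    return average, superhuman_games


# Hypothetical per-game scores for illustration only, not results from the paper.
example = {"game_a": 0.25, "game_b": 1.5, "game_c": 0.5, "game_d": 0.75}
average, superhuman_games = summarize(example)
print(average, superhuman_games)  # 0.75 1
```

Note that a mean can be pulled upward by a few very high scores, which is one reason to report the count of superhuman games alongside it.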

Why it matters: Training reinforcement learning models is so laborious that researchers often don’t bother to reproduce previous results to see how their own stack up. Saber finally provides a consistent basis for comparison.

We’re thinking: Deep reinforcement learning research is exciting, but a lack of standardized benchmarks has kept the state of the art in a state of ambiguity. Saber signals a new and promising maturity.
