Solve RL With This One Weird Trick

How to get better performance from reinforcement learning.

Published: Aug 18, 2021


The previous state-of-the-art model for playing vintage Atari games took advantage of a number of advances in reinforcement learning (RL). The new champion is a basic RL architecture plus a trick borrowed from image generation.

What’s new: Florin Gogianu, Tudor Berariu, and colleagues found that spectral normalization, a technique that limits the degree of variation between representations of similar inputs, improved an RL model’s performance more than several recent innovations combined. The team included researchers at Bitdefender, DeepMind, Imperial College London, Technical University of Cluj-Napoca, and University College London.

Key insight: In reinforcement learning, a model observes its environment (say, the Atari game Pong), chooses an action based on its observation (such as moving the paddle), and receives a reward for a desirable outcome (like scoring a point). Learning in this way can be difficult because, as a model selects different actions, its training data (observations and rewards) change. Mutable training data poses a similar problem for generative adversarial networks (GANs), where generator and discriminator networks influence each other even as they themselves change. Spectral normalization has been shown to help GANs learn by moderating these changes. It could also be beneficial in reinforcement learning.

How it works: The authors added spectral normalization to a C51, a convolutional neural network designed for distributional reinforcement learning. They trained their model on tasks in the Arcade Learning Environment, a selection of games in which the actions are valid Atari controller movements.

Given an observation, a C51 predicts, for each possible action, a distribution over the reward that action is likely to bring. Then it selects the action with the highest expected reward. During training, it refines its predictions by sampling and comparing predicted rewards to actual rewards.
Spectral normalization constrains parameter values in network layers, such that the distance between any two predictions is, at most, the distance between the inputs times a constant factor (chosen by the user). The smaller the factor, the more similar a network’s predictions must be. During training, spectral normalization limits the magnitude of a layer’s weights, measured by the spectral norm (the largest singular value of the layer’s weight matrix). If an update pushes the magnitude past that limit, it scales all of the layer’s weights down by the same factor so that their magnitude equals the limit.
The authors argue that limiting weight changes is akin to reducing the learning rate. They devised an optimization method that lowered the model’s learning rate in proportion to spectral normalization’s limit on the weights. Models trained either way performed nearly equally.
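
The weight-limiting step described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: it assumes a single fully connected layer whose weights form a small matrix, estimates the spectral norm by power iteration, and rescales the weights only when that norm exceeds the limit, as the description above has it. The function names (`spectral_norm`, `clip_spectral_norm`) and the example matrix are hypothetical, chosen for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def spectral_norm(weights, n_iters=100, seed=0):
    """Estimate the largest singular value of `weights` by power iteration."""
    rng = random.Random(seed)
    weights_t = transpose(weights)
    # Start from a random direction in input space.
    v = [rng.gauss(0.0, 1.0) for _ in weights[0]]
    for _ in range(n_iters):
        # One step of power iteration on W^T W.
        v = matvec(weights_t, matvec(weights, v))
        norm = l2_norm(v)
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in v]
    # v now approximates the top right-singular vector, so |W v| is sigma.
    return l2_norm(matvec(weights, v))


def clip_spectral_norm(weights, limit=1.0):
    """Rescale all weights by one factor if the spectral norm exceeds `limit`."""
    sigma = spectral_norm(weights)
    if sigma <= limit:
        return weights
    scale = limit / sigma
    return [[w * scale for w in row] for row in weights]


# Hypothetical layer: stretches inputs by up to a factor of 3.
W = [[3.0, 0.0], [0.0, 0.5]]
W_sn = clip_spectral_norm(W, limit=1.0)

# After rescaling, two outputs are never farther apart than their inputs.
x, y = [1.0, 2.0], [-0.5, 0.3]
input_gap = l2_norm([a - b for a, b in zip(x, y)])
output_gap = l2_norm([a - b for a, b in zip(matvec(W_sn, x), matvec(W_sn, y))])
print(spectral_norm(W), spectral_norm(W_sn), output_gap <= input_gap)
```

With the limit set to 1, the layer can no longer push two outputs farther apart than their inputs, which is the distance guarantee described above for a single linear layer. Convolutional layers and deep learning libraries typically handle the same idea with a few power-iteration steps per update rather than the 100 used here for clarity.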

Results: Using spectral normalization on every layer impeded performance, but using it on only the second-to-last layer led the model to achieve a higher median reward. The authors compared their C51 with spectral normalization on the second-to-last layer against Rainbow, the previous state of the art, which outfits a C51 with a variety of RL techniques. In 54 Atari games, the authors’ approach achieved a 248.45 median reward, outperforming Rainbow’s 227.05 median reward.

Why it matters: Applying techniques from one area of machine learning, such as GANs, to a superficially different area, such as RL, can be surprisingly fruitful! In this case, it opens the door to much simpler RL models and perhaps opportunities to improve existing techniques.

We’re thinking: People who have expertise in multiple disciplines can be exceptionally creative, spotting opportunities for cross-fertilization among disparate fields. AI is now big enough to offer a cornucopia of opportunities for such interdisciplinary insight.
