If the world changes every second and you take a picture every 10 seconds, you won’t have enough pictures to observe the changes clearly, and storing a series of pictures won’t help. On the other hand, if you take a picture every tenth of a second, then storing a history will help model the world. New research applies this principle to reinforcement learning.
What’s new: William Fedus and Prajit Ramachandran led researchers at Google Brain, MILA, University of Montreal, and DeepMind to refine experience replay, a fundamental technique in reinforcement learning. The outcome: a new hyperparameter.
Key insight: Experience replay enables an agent to store observations so it can apply past experiences to present conditions. However, the faster the environment changes, the less relevant past experiences become. The authors conclude that the ratio of stored observations to updates of the agent’s strategy is a previously unrecognized hyperparameter.
How it works: In reinforcement learning, an agent observes the environment at a given frame rate, chooses actions based on its observations, receives rewards for desirable actions, and learns to maximize the rewards.
- Experience replay retains a fixed number of the agent’s most recent observations in a buffer. The agent randomly samples observations and updates its strategy accordingly. This procedure enables the agent to learn from past experiences, so it doesn’t have to repeat painful lessons.
- The primary hyperparameter in experience replay is the number of observations the buffer holds, known as its capacity. The new hyperparameter, replay ratio, is a proxy for how fast the agent learns.
- If the ratio between buffer capacity and agent policy updates is too high, learning becomes dominated by outdated perspectives. If it’s too low, the limited selection of memories allows the agent to maintain outdated habits. Figure 1 above illustrates these relationships.
Results: The team tested the new hyperparameter using Atari games, a common RL benchmark. Increasing capacity to maintain a consistent ratio improved the agent’s performance. Reducing the ratio to focus the agent on more recent observations often helped as well (Figure 2).
Yes, but: If the ratio is too low, the agent may fall back into old habits or fail to discover the optimal strategy to achieve its goal.
Why it matters: Replay ratio wasn’t a focus of attention prior to this study. Now we know the ratio affects performance. That insight may add context to previous literature that considers only capacity.
We’re thinking: Like Goldilocks tasting porridge to find the bowl whose temperature is just right, it’s likely to take a bit of trial and error to find a given agent’s optimal replay ratio.