World models, which learn a compressed representation of a dynamic environment like, say, a video game, have delivered top results in reinforcement learning. A new method makes them much smaller.

What’s new: Jan Robine and colleagues at Heinrich Heine University Düsseldorf present Discrete Latent Space World Models. Their approach matches the performance of SimPLe, the state of the art in six Atari games, while using far fewer parameters.

Key insight: Researchers have devoted significant effort to making reinforcement learning algorithms efficient, but they’ve paid less attention to making the models themselves efficient. Using high-performance architectures for a world model’s various components ought to improve the system as a whole, in this case by reducing its size.

How it works: Following the typical world models approach, the authors trained separate neural networks to generate a representation of the environment (the representation model), predict how actions would affect the environment (the dynamics model), and choose the action that will bring the greatest reward (the policy model).

  • For the representation model, the authors used a vector quantized variational autoencoder (VQ-VAE) that’s smaller than the autoencoder in SimPLe. The VQ-VAE takes as input the pixels of a game’s most recent four frames. Its encoder generates a 6×6 matrix of indices, each pointing to a vector in an embedding that represents the environment, as in the first sketch after this list. (After training, the decoder is no longer needed.)
  • For the dynamics model, they used a convolutional LSTM that takes as input the encoder’s output. They trained it to predict the reward and features of the next four frames, as in the second sketch after this list. Errors backpropagate through to the embedding, so eventually it encodes information about predicted rewards and states. (After training, the dynamics model is no longer needed.)
  • For the policy model, they used a small convolutional neural network that also receives the encoder’s output. They trained it to choose an action using proximal policy optimization.
  • To train the system, the authors used the same iterative procedure as SimPLe: They let the system interact with the environment, trained the representation and dynamics models, trained the policy network, and repeated the cycle.
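For readers who want a concrete picture, here’s a minimal PyTorch-style sketch of the representation model’s quantization step: an encoder compresses four stacked frames into a 6×6 grid of codebook indices. The 96×96 frame size, layer widths, and codebook size are our own illustrative assumptions rather than the authors’ hyperparameters, and the decoder and the VQ-VAE training losses are omitted.

```python
# Sketch of a VQ-VAE-style encoder that maps four stacked game frames
# to a 6x6 grid of discrete codebook indices (illustrative sizes only).
import torch
import torch.nn as nn

class VQEncoder(nn.Module):
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        # Four stride-2 convolutions: 96x96 input -> 6x6 feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, code_dim, 3, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, frames):
        z_e = self.encoder(frames)                       # (B, code_dim, 6, 6)
        b, d, h, w = z_e.shape
        flat = z_e.permute(0, 2, 3, 1).reshape(-1, d)    # (B*36, code_dim)
        # Nearest codebook vector for every spatial position.
        dists = torch.cdist(flat, self.codebook.weight)  # (B*36, num_codes)
        indices = dists.argmin(dim=1).view(b, h, w)      # (B, 6, 6) discrete latents
        z_q = self.codebook(indices).permute(0, 3, 1, 2)
        # Straight-through estimator so gradients reach the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return indices, z_q

frames = torch.rand(1, 4, 96, 96)   # last four frames, stacked as channels
indices, z_q = VQEncoder()(frames)
print(indices.shape, z_q.shape)     # torch.Size([1, 6, 6]) torch.Size([1, 64, 6, 6])
```

In this sketch, the straight-through line is what would let the dynamics model’s prediction errors backpropagate into the embedding, as described in the second bullet above.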
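And here’s a similarly hedged sketch of the other two components: a convolutional LSTM cell that takes the quantized latents plus an action and predicts a reward and the next latent features, and a small convolutional policy network with the actor and critic heads that proximal policy optimization requires. Again, the action encoding, prediction heads, and layer sizes are guesses for illustration, not the authors’ design.

```python
# Sketch of a convolutional LSTM dynamics model and a small CNN policy
# that both consume the quantized 6x6 latents (illustrative sizes only).
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ConvLSTMDynamics(nn.Module):
    def __init__(self, code_dim=64, hidden=64, num_actions=6):
        super().__init__()
        # Gates computed from the latents, an action plane, and the hidden state.
        self.gates = nn.Conv2d(code_dim + num_actions + hidden, 4 * hidden, 3, padding=1)
        self.next_latent = nn.Conv2d(hidden, code_dim, 1)  # predicts next latent features
        self.reward = nn.Linear(hidden, 1)                 # predicts the reward

    def forward(self, z_q, action_onehot, h, c):
        b, _, height, width = z_q.shape
        action_plane = action_onehot.view(b, -1, 1, 1).expand(b, -1, height, width)
        i, f, o, g = self.gates(torch.cat([z_q, action_plane, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return self.next_latent(h), self.reward(h.mean(dim=(2, 3))), h, c

class Policy(nn.Module):
    def __init__(self, code_dim=64, num_actions=6):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(code_dim, 32, 3, padding=1), nn.ReLU(), nn.Flatten(),
            nn.Linear(32 * 6 * 6, 256), nn.ReLU(),
        )
        self.logits = nn.Linear(256, num_actions)  # PPO actor head
        self.value = nn.Linear(256, 1)             # PPO critic head

    def forward(self, z_q):
        features = self.trunk(z_q)
        return Categorical(logits=self.logits(features)), self.value(features)

z_q = torch.rand(1, 64, 6, 6)
h = c = torch.zeros(1, 64, 6, 6)
action = torch.zeros(1, 6); action[0, 0] = 1.0
next_z, reward, h, c = ConvLSTMDynamics()(z_q, action, h, c)
dist, value = Policy()(z_q)
print(next_z.shape, reward.shape, dist.sample().shape, value.shape)
```

At inference time, only the encoder and the policy network are needed, which is why the parameter count drops from 12 million during training to 3 million.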

Results: The authors compared their method to SimPLe in six Atari games. SimPLe uses 74 million parameters, while their method uses 12 million during training and 3 million during inference. Nonetheless, their method’s mean scores over five training runs beat SimPLe in five out of six games when given 100,000 observations.

Yes, but: Although the authors’ method beat SimPLe on average, SimPLe racked up higher scores in four out of six games.

Why it matters: Smaller models consume less energy, require less memory, and execute faster than larger ones, enabling machine learning engineers to perform more experiments in less time.

We’re thinking: World models are young enough that something as simple as changing the components used can make a big difference. This suggests that plenty of opportunity remains to improve existing models.
