The Trouble With Reinforcement Learning

[Image: Vehicle avoiding traffic cones]

Dear friends,

While working on Course 3 of the Machine Learning Specialization, which covers reinforcement learning, I was reflecting on how reinforcement learning algorithms are still quite finicky. They’re very sensitive to hyperparameter choices, and someone experienced at hyperparameter tuning might get 10x or 100x better performance than someone less experienced. Supervised deep learning was equally finicky a decade ago, but it has gradually become more robust as research has produced more systematic ways to build supervised models.

Will reinforcement learning (RL) algorithms also become more robust in the next decade? I hope so. However, RL faces a unique obstacle in the difficulty of establishing real-world (non-simulation) benchmarks.

When supervised deep learning was at an earlier stage of development, experienced hyperparameter tuners could get much better results than less-experienced ones. We had to pick the neural network architecture, regularization method, learning rate, schedule for decreasing the learning rate, mini-batch size, momentum, random weight initialization method, and so on. Picking well made a huge difference in the algorithm’s convergence speed and final performance.
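To make this concrete, here’s a rough sketch of the choices we had to juggle, using PyTorch purely for illustration. The architecture and the specific numbers are placeholders, not recommendations.

```python
# Illustrative only: each of these settings had to be picked by hand,
# and a poor combination could slow or stall convergence.
import torch
import torch.nn as nn

model = nn.Sequential(                        # neural network architecture
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
for m in model.modules():                     # random weight initialization method
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,                                   # learning rate
    momentum=0.9,                             # momentum
    weight_decay=1e-4,                        # regularization
)
scheduler = torch.optim.lr_scheduler.StepLR(  # schedule for decreasing the learning rate
    optimizer, step_size=30, gamma=0.1
)
batch_size = 128                              # mini-batch size
```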

Thanks to research progress over the past decade, we now have more robust optimization algorithms like Adam, better neural network architectures, and more systematic guidance for default choices of many other hyperparameters, making it easier to get good results. I suspect that scaling up neural networks — these days, I don’t hesitate to train a 20 million-plus parameter network (like ResNet-50) even if I have only 100 training examples — has also made them more robust. In contrast, if you’re training a 1,000-parameter network on 100 examples, every parameter matters much more, so tuning needs to be done much more carefully.
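As a rough sketch of what that looks like today — again assuming PyTorch and torchvision, with synthetic data standing in for a real 100-example dataset:

```python
# Modern defaults: Adam with its out-of-the-box settings and a 20M+ parameter
# ResNet-50, even though the training set here has only 100 examples.
import torch
import torchvision

model = torchvision.models.resnet50(num_classes=10)  # 20 million-plus parameters
optimizer = torch.optim.Adam(model.parameters())     # default lr, betas, eps

x = torch.randn(100, 3, 224, 224)                    # 100 synthetic training examples
y = torch.randint(0, 10, (100,))
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(5):
    for i in range(0, len(x), 20):                   # mini-batches of 20
        optimizer.zero_grad()
        loss = loss_fn(model(x[i:i + 20]), y[i:i + 20])
        loss.backward()
        optimizer.step()
```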

My collaborators and I have applied RL to cars, helicopters, quadrupeds, robot snakes, and many other applications. Yet today’s RL algorithms still feel finicky. Whereas poorly tuned hyperparameters in supervised deep learning might mean that your algorithm trains 3x or 10x more slowly (which is bad), in reinforcement learning, it feels like they might result in training 100x more slowly — if it converges at all! Similar to supervised learning a decade ago, numerous techniques have been developed to help RL algorithms converge (like double Q-learning, soft updates, experience replay, and epsilon-greedy exploration with slowly decreasing epsilon). They’re all clever, and I commend the researchers who developed them, but many of these techniques introduce additional hyperparameters that seem to me very hard to tune.
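To illustrate what I mean by additional hyperparameters, here’s a rough sketch of a DQN-style setup in PyTorch. The names and values are illustrative, not a working agent.

```python
# Each stabilizing trick introduces knobs of its own.
import copy
import random
from collections import deque
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)             # separate target network, updated slowly

replay_buffer = deque(maxlen=100_000)         # experience replay: buffer size is a new hyperparameter
batch_size = 64                               # ...and so is the replay batch size

epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.995  # epsilon-greedy: three more hyperparameters

def select_action(state):
    global epsilon
    epsilon = max(epsilon_min, epsilon * epsilon_decay)  # slowly decrease epsilon
    if random.random() < epsilon:
        return random.randrange(2)            # explore
    with torch.no_grad():
        return int(q_net(state).argmax())     # exploit

tau = 0.005                                   # soft updates: the interpolation factor tau

def soft_update():
    for p, p_target in zip(q_net.parameters(), target_net.parameters()):
        p_target.data.mul_(1 - tau).add_(tau * p.data)
```

Each of these numbers — the buffer size, the batch size, the epsilon schedule, tau — interacts with the others, which is part of what makes tuning so hard.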

Further research in RL may follow the path of supervised deep learning and give us more robust algorithms and systematic guidance for how to make these choices. One thing worries me, though. In supervised learning, benchmark datasets enable the global community of researchers to tune algorithms against the same dataset and build on each other’s work. In RL, the most commonly used benchmarks are simulated environments like OpenAI Gym. But getting an RL algorithm to work on a simulated robot is much easier than getting it to work on a physical robot.
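For reference, here’s what interacting with such a simulated benchmark looks like — a minimal sketch using Gymnasium, the maintained fork of OpenAI Gym, with a random policy standing in for a learned agent:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
for _ in range(1_000):
    action = env.action_space.sample()  # a real agent would choose actions here
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()         # cheap, perfectly repeatable reset
env.close()
```

A physical robot offers no such cheap, perfectly repeatable reset, and that gap matters.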

Many algorithms that work brilliantly in simulation struggle with physical robots. Even two copies of the same robot design will be different. Further, it’s infeasible to give every aspiring RL researcher their own copy of every robot. While researchers are making rapid progress on RL for simulated robots (and for playing video games), the bridge to application in non-simulated environments is often missing. Many excellent research labs are working on physical robots. But because each robot is unique, one lab’s results can be difficult for other labs to replicate, and this impedes the rate of progress.

I don’t have a solution to these knotty issues. But I hope that all of us in AI collectively will manage to make these algorithms more robust and more widely useful.

Keep learning!

Andrew
