Typical deep reinforcement learning requires millions of training iterations as the network stumbles around in search of the right solution. New research suggests that AI can learn faster by considering its mistakes.
What’s new: Previous work in reinforcement learning taught machines either by having them experiment until they got it right or by having them imitate human demonstrations. Allan Zhou and his team at Google Brain merge the two approaches in an algorithm they call Watch, Try, Learn (WTL).
Key insight: Unlike many RL models, WTL takes both human demonstrations and its own earlier attempts as explicit inputs when deciding what action to take. This lets it correct its behavior quickly and avoid repeating blunders.
How it works: WTL is a training algorithm rather than a specific model. It trains two separate policies (functions that choose an action given the current state, with the aim of maximizing reward) called trial and retrial. The trial policy generates exploratory attempts, while the retrial policy uses the outcomes of those attempts to choose the final actions. Both policies are trained across several tasks.
- WTL starts with a handful of human demonstrations. The trial policy is trained to imitate them directly. Its trial attempts are recorded alongside the demonstrations, forming a large dataset that includes both successes and failures.
- Then it acts according to the retrial policy. This policy takes as input the current state, a human demonstration, and the trial attempts at the task, and it is trained to imitate successful demonstrations and trials.
- During inference, even a few demonstrations can generate a large set of trial explorations that help the retrial policy to generate actions that successfully complete tasks previously unseen in training.
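The watch, try, retry flow described above can be illustrated with a toy sketch in Python. It is not the authors' implementation: the real policies are trained networks operating on robotic manipulation tasks, while here both policies are hand-written rules and every name (`ToyTask`, `make_demo`, `trial_policy`, `retrial_policy`, `evaluate`) is hypothetical. The assumption is a task solved by picking one hidden target action, with a demonstration that narrows the choice to two candidates without saying which is correct.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Episode:
    actions: List[int]  # actions taken (for a demo: the candidate actions shown)
    success: bool       # whether the episode solved the task


class ToyTask:
    """Hypothetical task: it is solved by choosing the hidden target action."""

    def __init__(self, target: int):
        self.target = target

    def attempt(self, action: int) -> Episode:
        return Episode([action], action == self.target)


def make_demo(task: ToyTask, n_actions: int, rng: random.Random) -> Episode:
    """Build an ambiguous demo: the target plus one distractor, in random order."""
    distractor = rng.choice([a for a in range(n_actions) if a != task.target])
    candidates = [task.target, distractor]
    rng.shuffle(candidates)
    return Episode(candidates, True)


def trial_policy(demo: Episode) -> int:
    """Stand-in for the trained trial policy: explore the first candidate."""
    return demo.actions[0]


def retrial_policy(demo: Episode, trial: Episode) -> int:
    """Stand-in for the trained retrial policy: condition on demo and trial."""
    if trial.success:
        return trial.actions[0]  # repeat what worked
    # The trial failed, so choose the candidate that has not been tried yet.
    return [a for a in demo.actions if a not in trial.actions][0]


def watch_try_retry(task: ToyTask, n_actions: int,
                    rng: random.Random) -> Tuple[Episode, Episode]:
    demo = make_demo(task, n_actions, rng)               # watch
    trial = task.attempt(trial_policy(demo))             # try
    retrial = task.attempt(retrial_policy(demo, trial))  # retry, using feedback
    return trial, retrial


def evaluate(n_tasks: int = 100, n_actions: int = 5,
             seed: int = 0) -> Tuple[float, float]:
    """Return (trial success rate, retrial success rate) over random tasks."""
    rng = random.Random(seed)
    trial_wins = retrial_wins = 0
    for _ in range(n_tasks):
        task = ToyTask(rng.randrange(n_actions))
        trial, retrial = watch_try_retry(task, n_actions, rng)
        trial_wins += trial.success
        retrial_wins += retrial.success
    return trial_wins / n_tasks, retrial_wins / n_tasks
```

Calling `evaluate()` returns the two success rates. In this toy setup the trial policy can only guess between the two candidates, while the retrial policy always succeeds because a failed trial rules out the wrong candidate. That is the intuition behind feeding trial outcomes back in as an explicit input, though the toy says nothing about how well the trained policies perform.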
Why it matters: WTL can learn a greater number and variety of robotic object manipulation tasks than previous RL models. Further, previous models are limited to a single task, while WTL learns multiple tasks concurrently and outperforms single-task models in every task. This allows WTL to master new abilities in few-shot settings and with less computation.
Takeaway: Who wants a robot helper that requires thousands of attempts to learn how to empty a dishwasher without breaking dishes? And, once it has learned, can’t do anything else? Zhou et al. raise the prospect that smarter, more flexible robots may be just around the corner — though they’re still bound to break a few dishes.