Reading time
3 min read
Sample-Efficient Training for Robots: Reinforcement learning from human feedback to train robots

Training an agent that controls a robot arm to perform a task — say, opening a door — that involves a sequence of motions (reach, grasp, turn, pull, release) can take from tens of thousands to millions of examples. A new approach pretrained an agent on many tasks for which lots of data was available, so it needed dramatically fewer examples to learn related tasks.

What’s new: Joey Hejna and Dorsa Sadigh at Stanford used a variation on reinforcement learning from human feedback (RLHF) to train an agent to perform a variety of tasks in simulation. The team didn’t handcraft the reward functions. Instead, neural networks learned them.

RLHF basics: A popular approach to tuning large language models, RLHF follows four steps: (1) Pretrain a generative model. (2) Use the model to generate data and have humans assign a score to each output. (3) Given the scored data, train a model — called the reward model — to mimic the way humans assigned scores. Higher scores are tantamount to higher rewards. (4) Use scores produced by the reward model to fine-tune the generative model, via reinforcement learning, to produce high-scoring outputs. In short, a generative model produces an example, a reward model scores it, and the generative model learns based on that score.

Key insight: Machine-generated data is cheap, while human-annotated data is expensive. So, if you’re building a neural network to estimate rewards for several tasks that involve similar sequences of motions, it makes sense to pretrain it for a set of tasks using a large quantity of machine-generated data, and then fine-tune a separate copy for each task to be performed using small amounts of human-annotated data.

  • The Meta-World benchmark provides machine-generated data for reinforcement learning (RL): It provides simulated environments for several tasks and trained models that execute the tasks. The models make it possible to record motion sequences along with a model’s estimate of its probability of success for each possible motion. Collecting high- and low-probability sequences provides a large dataset of good and bad motions that translate into high or low rewards.
  • Humans can annotate such sequences to create a smaller number of examples of motions and rewards. These examples can be curated to highlight cases that make for more efficient learning.

How it works: The authors trained an RL agent to perform 10 simulated tasks from Meta-World such as pushing a block, opening a door, and closing a drawer. For each task, they fine-tuned a separate pretrained vanilla neural network to calculate rewards used in training the agent.

  • The authors pretrained the reward model using a method designed to find weights that could be readily fine-tuned for a new task using a small number of examples. Given two motion sequences and their probabilities (generated by models included in Meta-World), the network was pretrained to decide which was worse or better for executing the task at hand.
  • For six new tasks, the authors generated a small number (between 6 and 20 depending on the task) of motion sequences using their agent. Human annotators labeled them better or worse for executing the task at hand. The authors fine-tuned the reward model on these examples.
  • Using a small number of motion sequences for the task at hand, the authors trained the agent to complete the task based on rewards calculated by the reward model.
  • The authors repeated the loop — fine-tuning the reward model and training the agent — fine-tuning the reward model on up to 100 total human-annotated motion sequences for a task. They stopped when the agent’s performance no longer improved.
  • The authors tried the same experiment substituting human annotations for Meta-World’s model-generated probabilities for the motion sequences. It took up to 2,500 total sequences for the agent to reach its optimal performance.

Results: Trained to open a window, the agent achieved 100 percent success after fine-tuning on 64 human-annotated motion sequences. Trained to close a door, it achieved 95 percent success with 100 human-annotated motion sequences. In contrast, using the same number of examples, PEBBLE, another RL method that involves human feedback, achieved 10 percent and 75 percent success respectively. Fed machine-generated examples rather than human feedback, the agent achieved 100 percent success on all Meta-World tasks except pressing a button after fine-tuning on 2,500 examples — 20 times fewer than PEBBLE required to achieve the same performance.

Why it matters: OpenAI famously fine-tuned ChatGPT using RLHF, which yielded higher-quality, safer output. Now this powerful technique can be applied to robotics.

We’re thinking: Pretraining followed by fine-tuning opens the door to building AI systems that can learn new tasks from very little data. It's exciting to see this idea applied to building more capable robots.


Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox