Robots trained via reinforcement learning usually study videos of robots performing the task at hand. A new approach used videos of humans to pretrain robotic arms.
What’s new: UC Berkeley researchers led by Tete Xiao and Ilija Radosavovic showed that real-world videos with patches missing were better than images of robot arms for training a robot to perform motor-control tasks. They call their method Masked Visual Pretraining (MVP). They also built a benchmark suite of tasks for robot arms.
Key insight: One way to train a robot arm involves two models: one that learns to produce representations of visual input and a much smaller one, the controller, that uses those representations to drive the arm. Typically, both models learn from images of a robotic arm. Surprisingly, pretraining the vision model on images of humans performing manual tasks not only results in better representations but also reduces the cost of adapting the system to new tasks. Instead of retraining the whole system on images of a new task, object, or environment, the controller alone can be fine-tuned.
How it works: The authors pretrained a visual model to reproduce images that had been partly masked by obscuring a rectangular portion at random. The pretraining set was drawn from three video datasets that include clips of humans performing manual actions such as manipulating a Rubik’s Cube. They used the resulting representations to fine-tune controllers that moved a robot arm in a simulation. They fine-tuned a separate controller for each of four tasks (opening a cabinet door as well as reaching, picking up, and relocating objects of different colors, shapes, and sizes) for each of two types of arm (one with a gripper, the other with four fingers).
- The authors pretrained the vision transformer — a masked autoencoder — to reconstruct video frames that were masked by as much as 75 percent.
- They passed representations from the transformer, along with the positions and angles of the robot arm joints, to the controllers. They used PPO to train the controllers to move the arms.
- Each controller used a different reward depending on the task. Reward functions varied depending on factors such as the distance between the robot hand or the object it was manipulating and a goal location.
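To make the masking and reward steps concrete, here is a minimal Python sketch. It is not the authors' code. The 14-by-14 patch grid, the function names `random_patch_mask` and `distance_reward`, and the plain negative-distance reward are assumptions for illustration. The article says only that frames were masked by as much as 75 percent and that rewards depended on factors such as distance to a goal location.

```python
import math
import random


def random_patch_mask(grid_h, grid_w, mask_ratio=0.75, seed=None):
    """Pick patches to hide from a frame split into grid_h x grid_w patches.

    Returns a nested list of booleans; True marks a hidden patch.
    The grid size and ratio are illustrative assumptions.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(num_patches * mask_ratio)
    # Sample distinct patch indices without replacement.
    masked = set(rng.sample(range(num_patches), num_masked))
    return [
        [(row * grid_w + col) in masked for col in range(grid_w)]
        for row in range(grid_h)
    ]


def distance_reward(position, goal):
    """Hypothetical shaped reward: negative Euclidean distance to the goal.

    The closer the hand (or object) is to the goal, the higher the reward.
    """
    return -math.dist(position, goal)


mask = random_patch_mask(14, 14, mask_ratio=0.75, seed=0)
hidden = sum(cell for row in mask for cell in row)
print(f"{hidden} of {14 * 14} patches hidden")  # 147 of 196 patches hidden

print(distance_reward((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))  # -5.0
```

In a setup like the one described, a masked autoencoder would be trained to reconstruct the hidden patches from the visible ones, and a reward of this general shape would be fed to PPO. Real reward functions for tasks such as opening a cabinet door would likely combine several terms.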
Results: In all eight tasks, the authors’ approach outperformed two state-of-the-art methods that train the visual and controller models on images of robots. The authors also compared their representations to those produced by a transformer trained on ImageNet in supervised fashion. In seven tasks, the controller that used their representations outperformed one that used the supervised transformer’s representations. In the eighth, it performed equally well. In tasks that required a four-fingered arm to pick up an object, the authors’ approach achieved a success rate of 80 percent versus 60 percent.
Yes, but: The authors didn’t compare masked pretraining on images of humans with masked pretraining on images of robots. Thus, it’s not clear whether their method outperformed the baseline due to their choice of training dataset or pretraining technique.
Why it matters: Learning from more varied data is a widely used approach to gaining skills that generalize across tasks. Masked pretraining of visual models has improved performance in video classification, image generation, and other tasks. The combination looks like a winner.
We’re thinking: Variety of data is important, but so is its relation to the task at hand. ImageNet probably is more varied than the authors’ training set of humans performing manual actions, but it’s unrelated to tasks performed by robot arms. So it stands to reason that the authors’ dataset was more effective.