To interact with the world, a robot needs to know which items to grab, which items to avoid, and which joints to move in the process. A new method aims to improve a machine’s ability to determine and locate points of interest.

What's new: Boyuan Chen and Pieter Abbeel at UC Berkeley with Deepak Pathak at Carnegie Mellon developed Keypoint3D, an unsupervised training method that enables a model to identify spatial coordinates, known as keypoints, that correspond to useful locations in the environment — including spots on its own body.

Key insight: Previous work trained an agent in a virtual world to find keypoints based on a single two-dimensional camera view, but it performed poorly if that view contained overlapping or occluded objects. A similar approach can take advantage of multiple camera views to locate objects in 3D space. Using inferred 3D keypoints to regenerate the original camera views can help the agent learn to track particular objects consistently across time and space.

How it works: Keypoint3D trained a system to choose the 32 keypoints most useful for completing a particular task and to find their locations based on three camera views. The system was trained and tested in a virtual environment, where it drove an agent to complete a set of basic robotic tasks (opening a door, closing a box, and so on), as well as draping a scarf over a mannequin (to demonstrate the system’s ability to manipulate flexible material) and walking on four legs (to demonstrate its ability to find the agent’s own joints). The authors trained a reinforcement learning model jointly with the keypoint detector to ensure that the choice of keypoints would be relevant to the task at hand.

  • Three convolutional encoder networks (one per camera view) learned to generate 32 two-dimensional heat maps, which indicated the probability that each pixel corresponded to a viable keypoint (such as the end of a door handle), along with matching depth maps. The system used the heat maps to compute expected 2D coordinates of high-probability pixels and the depth maps to estimate the third dimension, yielding three per-view estimates of the location of each of the 32 keypoints.
  • The system combined the three estimates in a weighted average to reach a final estimate of each keypoint’s location in 3D space. The weights came from the probabilities in the corresponding heat and depth maps.
  • The authors used the reinforcement learning algorithm proximal policy optimization (PPO) to train a vanilla neural network that, given the estimated coordinates, learned to complete a given task. For example, given the locations of a quadruped’s joints, the model determined how to move the joints to make it walk.
  • During training, the system used the coordinates to regenerate the three views via separate convolutional decoders. It computed three unsupervised loss terms that (a) encouraged each generated image to match the corresponding original, (b) encouraged the keypoint coordinates estimated from each view to agree, and (c) discouraged keypoints from bunching together. It combined these unsupervised losses in a weighted sum with the loss from the reinforcement learning policy.
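The per-view keypoint estimation and multi-view fusion in the first two steps can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a soft-argmax over a softmax-normalized heat map, a probability-weighted average over the depth map, and heat-map peak values as fusion weights; mapping per-view estimates into a shared world frame via camera parameters is omitted.

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Turn an HxW heat map into expected (x, y) pixel coordinates.

    Taking the expectation over a softmax-normalized map is differentiable,
    which is what lets keypoint locations be learned end to end.
    """
    h, w = heatmap.shape
    probs = np.exp(heatmap - heatmap.max())   # numerically stable softmax
    probs /= probs.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    xy = np.array([(probs * xs).sum(), (probs * ys).sum()])
    return xy, probs

def keypoint_3d_per_view(heatmap, depthmap):
    """One keypoint's (x, y, z) from a single view: soft-argmax for the 2D
    coordinates, probability-weighted average of the depth map for z."""
    xy, probs = soft_argmax_2d(heatmap)
    z = (probs * depthmap).sum()
    confidence = probs.max()                  # peakier map -> higher confidence
    return np.append(xy, z), confidence

def fuse_views(estimates, confidences):
    """Weighted average of per-view 3D estimates; weights normalized to sum to 1."""
    w = np.asarray(confidences, dtype=float)
    w /= w.sum()
    return (w[:, None] * np.asarray(estimates)).sum(axis=0)
```

In practice each encoder emits 32 such heat/depth map pairs per view, and the fusion step runs once per keypoint.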
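The combined training objective in the last step might look like the sketch below. The specific loss formulations, weights, and separation margin are illustrative assumptions, not the paper's values; only the overall structure (three unsupervised terms plus the RL policy loss, combined in a weighted sum) comes from the description above.

```python
import numpy as np

def keypoint3d_loss(recon_views, orig_views, keypoints_per_view, rl_loss,
                    w_recon=1.0, w_consist=1.0, w_sep=1.0, w_rl=1.0,
                    min_dist=0.05):
    """Weighted sum of the unsupervised keypoint losses and the RL policy loss.

    keypoints_per_view: (num_views, num_keypoints, 3) array of 3D estimates.
    All weights and the separation margin are hypothetical defaults.
    """
    # (a) reconstruction: each decoded view should match its original image
    recon = np.mean([(r - o) ** 2 for r, o in zip(recon_views, orig_views)])

    # (b) multi-view consistency: per-view estimates of each keypoint agree
    mean_kp = keypoints_per_view.mean(axis=0)             # (K, 3)
    consist = ((keypoints_per_view - mean_kp) ** 2).mean()

    # (c) separation: penalize keypoint pairs closer than a margin
    diffs = mean_kp[:, None, :] - mean_kp[None, :, :]     # (K, K, 3)
    dists = np.sqrt((diffs ** 2).sum(-1) + 1e-8)
    off_diag = ~np.eye(len(mean_kp), dtype=bool)
    sep = np.maximum(0.0, min_dist - dists[off_diag]).mean()

    return w_recon * recon + w_consist * consist + w_sep * sep + w_rl * rl_loss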

Why it matters: Other efforts to generate 3D keypoints from multiple views have been designed to locate static objects. This method accounts for changes over time to drive robots that can interact with dynamic environments.

We're thinking: It may seem odd to move a robot by guessing the locations of its joints in still images rather than knowing the joint positions in the first place. But this is how humans do it, too — try to bring together your left and right fingertips with your eyes closed. Giving robots this capability would enable them to locate and control objects with greater precision.

