A new method enables robots to respond helpfully to verbal commands by pairing a natural language model with a repertoire of existing skills.
What’s new: SayCan, a system developed by researchers at Google and its spinoff Everyday Robots, enabled a robot equipped with an arm, camera, and gripper to take a high-level command such as “I spilled my drink, can you help?” and choose low-level actions appropriate to a given environment such as “find a sponge” and “go to table.”
Key insight: A pretrained large language model can grasp verbal instructions well enough to propose a general response. But it can’t adapt that response to local conditions; for instance, an environment that includes a sponge but not a mop. Combining a large language model with a model that determines which actions are possible in the current environment makes for a system that can interpret instructions and respond according to the local context.
How it works: SayCan drew from over 550 kitchen-related actions that the authors had trained it to perform using a combination of image-based behavioral cloning and reinforcement learning. Actions included picking up, putting down, and rearranging objects; opening and closing drawers; and navigating to various locations.
- Given a command, PaLM, a large language model, considered each action in turn and calculated the probability that it would respond with the description of that action. For instance, if instructed to clean up a spill, PaLM calculated the probability that it would respond, “find a sponge.”
- A reinforcement learning model trained via temporal difference learning learned to estimate the likelihood that the robot would execute the action successfully, accounting for its surroundings. For instance, the robot could pick up a sponge if it saw one, but it couldn’t otherwise. Human judges determined whether the robot had completed a given skill in videos and applied a reward accordingly.
- SayCan multiplied the two probabilities into a single score to determine the most appropriate action. It used a set of convolutional neural networks to decide how to move the robot arm. These networks learned either by copying recorded actions or by reinforcement learning in a simulation.
- After the robot performed an action, SayCan appended the description to the initial PaLM query and repeated the process until it chose the “done” action.
Results: The authors tested the system by giving the robot 101 commands in a mock kitchen that contained 15 objects such as fruits, drinks, snacks, and a sponge. Human judges determined that the robot planned valid actions 84 percent of the time and carried them out 74 percent of the time. In a real-life kitchen, the robot achieved 81 percent success in planning and 61 percent success in execution.
Why it matters: The dream of a domestic robot has held the public imagination since the dawn of the industrial revolution. But robots favor controlled environments, while households are highly varied and variable. The team took on the challenge by devising a way to choose among 551 skills and 17 objects. These are large numbers, but they may not encompass mundane requests like “find granny’s glasses” and “discard the expired food in the fridge.”
We’re thinking: This system requires a well-staged environment with a small number of items. We imagine that it could execute the command, “get the chips from the drawer” if the drawer contained only a single bag of chips. But we wonder whether it would do well if the drawer were full and messy. Its success rate in completing tasks suggests that, as interesting as this approach is, we’re still a long way from building a viable robot household assistant.