Conversational Robots RFM-1, a model that enables robots to understand and act on human commands

Published

Mar 20, 2024

Reading time

2 min read

Robots equipped with large language models are asking their human overseers for help.

What's new: Andrew Sohn and colleagues at Covariant launched RFM-1, a model that enables robots to respond to instructions, answer questions about what they see, and request further instructions. The model is available to Covariant customers.

How it works: RFM-1 is a transformer that comprises 8 billion parameters. The team started with a pretrained large language model and further trained it, given text, images, videos, robot actions, and/or robot sensor readings, to predict the next token of any of those types. Images and videos are limited to 512x512 pixels and 5 frames per second.

Proprietary models embed non-language inputs.
RFM-1 responds conversationally to text and/or image inputs. Given an image of a bin filled with fruit and the question “Are there any fruits in the bin?” the model can respond yes or no. If yes, it can answer follow-up questions about the fruit’s type, color, and so on.
Given a robotic instruction, the model generates tokens that represent a combination of high-level actions and low-level commands. For example, asked to “pick all the red apples,” it generates the tokens required to pluck the apples from a bin.
If the robot is unable to fulfill an instruction, the model can ask for further direction. For instance, in one demonstration, it asks, “I cannot get a good grasp. Do you have any suggestions?” When the operator responds, “move 2 cm from the top of the object and knock it over gently,” the robot knocks over the item and automatically finds a new way to pick it up.
RFM-1 can predict future video frames. For example, if the model is instructed to remove a particular item from a bin, prior to removing the item, it can generate an image of the bin with the item missing.

Behind the news: Covariant’s announcement follows a wave of robotics research in recent years that enables robots to take action in response to text instructions.

Why it matters: Giving robots the ability to respond to natural language input not only makes them easier to control, it also enables them to interact with humans in new ways that are surprising and useful. In addition, operators can change how the robots work by issuing text instructions rather than programming new actions from scratch.

We're thinking: Many people fear that robots will make humans obsolete. Without downplaying such worries, Covariant’s conversational robot illustrates one way in which robots can work alongside humans without replacing them.

Subscribe to The Batch