When you’re training a robot via reinforcement learning, a handcrafted reward function is labor-intensive to build but often dispenses rewards more effectively than a general-purpose reward model based on a vision-language model. Researchers built reward models that narrowed the gap.
What’s new: Tony Lee, Andrew Wagenmaker, Karl Pertsch, and colleagues at Stanford University and UC Berkeley built RoboReward, a family of vision-language reward models in 4-billion-parameter and 8-billion-parameter sizes. These models assign rewards across a variety of tasks and robot types. The authors also provide a dataset and benchmark for training and evaluating vision-language reward models.
Key insight: Popular text-video datasets of robot actions mainly include examples of successful actions, which makes it hard for models to learn the difference between success and failure. But it’s possible to produce negative examples by relabeling positive examples (for example, given an example that includes a video of a robot putting a spoon in a pot, replace the command “put the spoon in the pot” with “put the spoon by the pot”). It’s also possible to produce incomplete attempts by trimming videos of successful actions.
How it works: The authors built a diverse robot-action dataset in which each example included a command, a video of a robot responding to the command, and a progress score from 1 (failed) to 5 (completed). They gathered videos from two datasets that depict single-arm, dual-arm, and humanoid robots. They standardized the task descriptions, augmented the data with negative examples, and assigned progress scores.
- To produce negative examples, the authors gave GPT-5 mini a video that depicted a robot successfully executing a command and had it describe (i) the scene, (ii) the robot’s actions, and (iii) the final state of the scene and robot. Given this analysis, Qwen3-4B-Instruct-2507 proposed alternative commands for which the video would merit a progress score less than 5. GPT-5 mini analyzed the alternative examples and discarded the ones with labels that didn’t match the corresponding videos.
- The authors truncated videos of successfully executed commands to create negative examples of partial progress.
- GPT-5 mini assigned each negative example a progress score from 1 to 4.
- The authors fine-tuned Qwen3-VL 4B and Qwen3-VL 8B, given a video and a command, to predict the progress score. The progress score served as a reward.
- They manually verified 2,831 examples to form a test dataset called RoboRewardBench.
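The two augmentation steps above, relabeling a successful video with a command it does not satisfy and trimming a successful video into an incomplete attempt, can be sketched as follows. This is a minimal illustration, not the authors' code: the `Example` class, the `relabel` and `truncate` helpers, and the use of strings as stand-ins for video frames are all assumptions. In the described pipeline, models propose and validate the alternative commands and assign the scores, so the sketch leaves the score unset after each change.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Example:
    """Hypothetical container for one training example."""
    command: str
    frames: List[str]      # stand-in for video frames
    score: Optional[int]   # 1 (failed) to 5 (completed); None means not yet scored


def relabel(example: Example, alternative_command: str) -> Example:
    """Pair the same video with a command it does not fully satisfy."""
    # The score is cleared because it must be reassigned for the new command.
    return replace(example, command=alternative_command, score=None)


def truncate(example: Example, keep_fraction: float) -> Example:
    """Keep only the start of a successful video to mimic an incomplete attempt."""
    if not 0.0 < keep_fraction < 1.0:
        raise ValueError("keep_fraction must be strictly between 0 and 1")
    n_keep = max(1, int(len(example.frames) * keep_fraction))
    return replace(example, frames=example.frames[:n_keep], score=None)


# Illustrative usage with made-up data.
success = Example("put the spoon in the pot", [f"frame{i}" for i in range(10)], 5)
mislabeled = relabel(success, "put the spoon by the pot")
partial = truncate(success, 0.5)
```

Both helpers return new examples and leave the original successful demonstration untouched, so it can stay in the dataset with its score of 5.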
Results: The RoboReward models estimated rewards for examples in RoboRewardBench more accurately than the robotics model Gemini Robotics-ER 1.5 and generalist models including GPT-5. In a real-world robot demonstration, training based on rewards from RoboReward models resulted in better performance than training via previous reward models, though not better than training via human-assigned rewards.
- Given a video and command from RoboRewardBench, the reward models calculated a reward, and the authors evaluated the results according to mean absolute error (lower is better). RoboReward 8B (0.665 mean absolute error) outperformed 21 other models including GPT-5 mini (0.691 mean absolute error), GPT-5 (0.811 mean absolute error), and Gemini Robotics-ER 1.5 (0.906 mean absolute error). RoboReward 4B (0.845 mean absolute error) achieved 4th place, outperforming 18 competitors including Gemini 3 Pro (0.851 mean absolute error).
- In the real-world demonstration, a diffusion transformer that was trained to manipulate a WidowX robot arm via rewards from RoboReward 8B outperformed one that was trained using rewards from Gemini Robotics-ER 1.5. Picking up a toy and placing it on a towel, the model trained via RoboReward 8B (50 percent) dramatically outperformed the model trained via Gemini Robotics-ER 1.5 (10 percent) but underperformed one trained via human-assigned rewards (75 percent). Similarly, opening a drawer, the model trained via RoboReward 8B (80 percent) substantially outperformed the model trained via Gemini Robotics-ER 1.5 (45 percent) but underperformed one trained via human-assigned rewards (90 percent).
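For readers unfamiliar with the benchmark metric, mean absolute error is the average distance between a model's predicted progress score and the verified score. The sketch below shows the calculation on made-up numbers. The function name and the sample scores are illustrative assumptions and are not drawn from RoboRewardBench.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], verified: Sequence[float]) -> float:
    """Average absolute difference between predicted and verified progress scores."""
    if len(predicted) != len(verified) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - v) for p, v in zip(predicted, verified)) / len(predicted)


# Made-up scores on the 1-to-5 progress scale.
predicted_scores = [5, 3, 1, 4]
verified_scores = [5, 4, 1, 2]
mae = mean_absolute_error(predicted_scores, verified_scores)  # (0 + 1 + 0 + 2) / 4 = 0.75
```

On a 1-to-5 scale, an error of 0.75 means predictions land within one score level of the verified label on average.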
Why it matters: Vision-language reward models have shown promise for training robots, and this approach makes them much more effective. By augmenting successful demonstrations with validated failures, the authors trained a general-purpose reward model that works across various types of robots and tasks, alleviating the need for task-specific engineering.
We’re thinking: Releasing a benchmark and pretrained reward models invites the community to improve reward functions directly rather than hoping they emerge as a side effect of better generalist models.