Neural networks can forget how to perform earlier tasks as they learn new ones. A simple recipe addresses this problem for vision-language-action models in robotics applications.
What’s new: Jiaheng Hu, Jay Shim, and colleagues at the University of Texas at Austin, University of California, Los Angeles, Nanyang Technological University, and Sony trained large vision-language-action models using a combination of reinforcement learning and low-rank adaptation (LoRA) to outperform established methods for robotics training in simulation. Their recipe reduced catastrophic forgetting, which can occur when models learn tasks sequentially.
Key insight: Together, large pretrained models, LoRA, and on-policy reinforcement learning limit how much a model’s weights change during fine-tuning, and thus how much knowledge it can forget.
- The trend toward large pretrained models limits how much models can forget during post-training. In a model with a huge number of parameters, small updates are unlikely to interfere with existing knowledge.
- LoRA, which adjusts model weights by adding to them the product of two small trainable matrices, limits how much weights can change during fine-tuning, and therefore how much the model can forget (a minimal sketch follows this list).
- On-policy reinforcement learning methods such as GRPO also limit updates, since they reward actions the model itself generated. In contrast, supervised fine-tuning and off-policy reinforcement learning, which rewards a model for actions chosen by a separate policy, can result in large updates if the model learns actions it would not have taken previously.
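To make the low-rank update concrete, here is a minimal PyTorch sketch (our illustration, not the authors’ implementation): a frozen pretrained linear layer whose output is adjusted by the product of two small trainable matrices, so only r * (d_in + d_out) parameters can change.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update.
    Effective weight: W + (alpha / r) * B @ A. Only A and B train,
    which bounds how far the model can drift from its pretrained weights."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```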
How it works: The authors fine-tuned a large pretrained vision-language-action (VLA) model (OpenVLA-OFT) on each of three task suites in the LIBERO benchmark, executed by a simulated robot arm. Each suite contained five tasks, such as opening a drawer or moving an object to a target location. Within each suite, the authors fine-tuned the model on the tasks sequentially.
- At each step, the model took as input an image and an instruction, and it predicted a sequence of continuous actions to control the robot arm and gripper.
- The authors fine-tuned the model using GRPO and LoRA, without reusing data from previous tasks when training on new ones. During GRPO, the model received a reward for completing each task (a simplified sketch of the update follows this list).
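The on-policy objective can be sketched in a few lines. This is a simplified, generic version of GRPO’s group-relative update, not the authors’ code, and the reward values are made up: rewards from a group of rollouts of the same task are normalized into advantages, then used in a clipped policy-gradient loss.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize each rollout's reward against the group mean and std.
    Rollouts come from the current policy, so updates stay close to
    behavior the model already exhibits."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              advantages: torch.Tensor, clip: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective over the group (PPO-style clipping, as
    GRPO uses). logp_* are summed log-probabilities of each rollout's actions."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Hypothetical group of 8 rollouts with binary task-completion rewards.
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
advantages = grpo_advantages(rewards)
```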
Results: The authors’ method matched or outperformed earlier methods for sequentially learning robotics tasks, which the authors combined with GRPO and LoRA for fair comparison. It resulted in very little forgetting as well as slight improvement on tasks the model had not encountered during fine-tuning. Removing any individual component (the large pretrained model, LoRA, or on-policy reinforcement learning) caused performance to collapse and led to severe forgetting.
- On the LIBERO-Spatial tasks, the authors’ method reached an 81.2 percent average success rate. This result exceeded Dark Experience Replay (73.4), an approach that reuses data from earlier tasks; SLCA (69.9), which uses higher learning rates in output layers and lower learning rates in earlier layers, so early layers change less during training; and Elastic Weight Consolidation (66.1), which aims to preserve knowledge by penalizing changes to weights that were important for previous tasks.
- The authors’ method showed near-zero forgetting (a 0.3 percentage point average drop in success rate on previously learned tasks) on LIBERO-Spatial, lower than Elastic Weight Consolidation (0.7) and Dark Experience Replay (4.7), and comparable to SLCA (-0.6, meaning performance on earlier tasks improved slightly). A sketch of this metric follows this list.
- On five additional LIBERO-Spatial tasks that the model did not encounter during training, the authors’ method reached a 57.1 percent average success rate, outperforming Elastic Weight Consolidation (52.6) and Dark Experience Replay (55.2).
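The forgetting numbers above are averages of per-task drops. Assuming the standard continual-learning definition (the formula isn’t spelled out here), they can be computed as follows; the sample success rates are made up.

```python
def average_forgetting(peak: list[float], final: list[float]) -> float:
    """Average drop in success rate on previously learned tasks:
    mean over tasks of (success right after learning the task) minus
    (success after all training). Negative values mean earlier tasks
    improved, as with SLCA's -0.6 points above."""
    return sum(p - f for p, f in zip(peak, final)) / len(peak)

# Hypothetical success rates (percent) for four earlier tasks.
peak = [85.0, 80.0, 78.0, 82.0]
final = [84.5, 80.2, 77.6, 81.5]
print(average_forgetting(peak, final))  # 0.3 points: near-zero forgetting
```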
Yes, but: In their comparisons, the authors augmented the earlier methods with LoRA and GRPO using the LIBERO dataset. But the earlier methods weren’t designed to combine with those techniques or use that data, and it’s not clear how they would have compared had they been applied strictly as intended. For instance, while fine-tuning a model on a new task, Dark Experience Replay aims to avoid forgetting by re-introducing examples that were used in fine-tuning for earlier tasks (a generic sketch appears below). Adding LoRA may affect how it learns new tasks.
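For reference, Dark Experience Replay’s core mechanism in miniature. This is a generic sketch under assumed interfaces (the `replay_batch` argument is our own construct), not the setup the authors tested: alongside the new task’s loss, the model is penalized for drifting from the outputs it produced when earlier tasks’ examples were stored.

```python
import torch
import torch.nn.functional as F

def der_loss(model, new_batch, replay_batch, alpha: float = 0.5) -> torch.Tensor:
    """Dark Experience Replay in miniature. new_batch: (inputs, labels) for
    the current task. replay_batch: (inputs, stored_logits) sampled from a
    buffer filled while learning earlier tasks, or None early in training."""
    x_new, y_new = new_batch
    loss = F.cross_entropy(model(x_new), y_new)  # learn the new task
    if replay_batch is not None:
        x_old, logits_old = replay_batch
        # Penalize drift from the outputs recorded on earlier tasks' data.
        loss = loss + alpha * F.mse_loss(model(x_old), logits_old)
    return loss
```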
Why it matters: Training a robot on all tasks at once can be effective, but it requires that all tasks be mapped out ahead of time. If tasks change, it becomes helpful to train on one task at a time, and in many cases it’s valuable to retain earlier training. Relative to prior methods, the authors’ sequential fine-tuning approach is simpler, easier to understand, and more effective under the conditions they tested. (The authors didn’t explore whether it would be effective beyond robotics.)
We’re thinking: Robots are rapidly entering new environments and situations. Nimble operations will benefit from robots that adapt to new tasks on the fly.