Fine-tuning large language models via reinforcement learning is computationally expensive, but researchers found a way to streamline the process.
What’s new: Qinsi Wang and colleagues at UC Berkeley and Duke University developed GAIN-RL, a method that accelerates reinforcement learning fine-tuning by selecting training examples automatically based on the model’s own internal signals, specifically the angles between vector representations of tokens. The code is available on GitHub.
Key insight: The angles between a model’s vector representations of input tokens govern the magnitude of gradient updates during training. Specifically, the sum of cosine similarities among the token representations that enter a model’s classification layer, called the angle concentration, determines how large the update is: Examples with higher angle concentration produce larger gradient updates. The magnitude of a gradient update in turn determines the effectiveness of a given training example: The larger the update, the more the model learns. Prioritizing the most-effective examples before transitioning to less-effective ones enhances training efficiency while adding little preprocessing overhead.
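A minimal sketch of this quantity, assuming the per-example score is the sum of pairwise cosine similarities among the final-layer token representations that feed the classification head (the function name and exact normalization below are our assumptions, not the paper’s notation):

```python
import torch

def angle_concentration(hidden_states: torch.Tensor) -> float:
    """hidden_states: (num_tokens, hidden_dim) token representations entering the classification layer."""
    normed = torch.nn.functional.normalize(hidden_states, dim=-1)  # unit-length token vectors
    cos_sim = normed @ normed.T    # pairwise cosine similarities between tokens
    return cos_sim.sum().item()    # higher total -> larger expected gradient update
```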
How it works: The authors separately fine-tuned Qwen 2.5 1.5B, Qwen 2.5 7B, and Llama 3.2 3B using the GRPO reinforcement learning algorithm with examples ordered according to their angle concentration. The datasets included math problems in GSM8K and AMC 23, and coding problems in LiveCodeBench and HumanEval+.
- Given a training set, the authors calculated the angle concentration of each example with a single forward pass over the entire dataset, then sorted examples from highest to lowest angle concentration.
- They fine-tuned the models, focusing first on examples with the highest angle concentrations and shifting toward lower angle concentrations as training progressed. They tracked the models’ learning according to accuracy and angle concentration on each batch of data, shifting the focus toward less-effective examples more quickly as the model learned and more slowly when it struggled (see the sketch after this list).
- They continued training for 200 epochs.
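The overall loop might look like the following sketch, which is an illustration rather than the released implementation. It reuses `angle_concentration` from the sketch above; `hidden_for` and `grpo_update` are hypothetical stand-ins for extracting final-layer token representations and performing one GRPO update that reports batch accuracy, and the simple linear schedule stands in for the authors’ accuracy- and angle-driven sampling.

```python
import torch

def rank_by_angle_concentration(model, dataset):
    """Score every example with a single forward pass, then sort high-to-low."""
    scores = []
    with torch.no_grad():
        for example in dataset:
            h = hidden_for(model, example)          # (num_tokens, hidden_dim); hypothetical helper
            scores.append(angle_concentration(h))   # from the sketch above
    order = sorted(range(len(dataset)), key=lambda i: -scores[i])
    return [dataset[i] for i in order]              # highest angle concentration first

def train(model, dataset, num_steps=200, batch_size=8):
    ranked = rank_by_angle_concentration(model, dataset)
    focus = 0.0                                      # 0 = highest-concentration end of the ranking
    for step in range(num_steps):
        start = int(focus * (len(ranked) - batch_size))
        batch = ranked[start:start + batch_size]     # one illustrative batch per step
        accuracy = grpo_update(model, batch)         # hypothetical GRPO step; returns batch accuracy
        # Shift toward lower-concentration examples faster when the model is learning,
        # more slowly when it struggles on the current batch.
        focus = min(1.0, focus + 0.01 * accuracy)
```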
Results: The authors compared models fine-tuned using GAIN-RL with counterparts fine-tuned via GRPO on randomly ordered examples. GAIN-RL generally accelerated learning by a factor of 2.5.
- Whether the task involved math or coding, GAIN-RL took 70 to 80 training epochs to match the performance of fine-tuning using typical GRPO for 200 epochs.
- For instance, on GSM8K, Qwen 2.5 Math Instruct 7B achieved 92.0 percent accuracy after 70 epochs of GAIN-RL fine-tuning. The version fine-tuned via typical GRPO needed 200 epochs to reach the same performance.
Why it matters: Many strategies for ordering training examples rely on external, often expensive heuristics based on their difficulty, for example judgments by human annotators or a proprietary LLM. By using a simple signal generated by the model itself, this method provides a direct and efficient way to identify the most effective examples, making reinforcement learning much faster.
We’re thinking: Ordering training examples is a much older idea than fine-tuning large language models via reinforcement learning. Applying earlier methods to newer approaches continues to yield advances in machine learning!