Reinforcement Learning With Hints: Privileged On-Policy Exploration (POPE) trains models to expand on partial solutions

Published

Jun 19, 2026


Reinforcement learning can’t train a model to solve a difficult problem if the model doesn’t discover all the right steps. But giving the model the first few steps can make all the difference.

What’s new: Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov and Aviral Kumar from Carnegie Mellon University introduced Privileged On-Policy Exploration (POPE), a training method for large language models that pairs the reinforcement learning algorithm GRPO with custom-built datasets. When training on a problem that LLMs frequently fail to solve, such as a difficult math problem, POPE gives the model not only the problem but also the beginning of a solution.

Key insight: In supervised fine-tuning, given a problem and a solution, a model can learn to generate the solution. But it may learn that specific solution rather than general problem-solving skills that would lead to solutions that weren’t in the training data. In reinforcement learning, the beginning of a solution can serve as a hint that helps the model discover the rest on its own. For example, along with an instruction to “Solve this geometry problem,” the model may also receive the first few steps, such as “Draw the auxiliary triangle and apply the Pythagorean theorem…,” and continue from there. Given both hinted and unhinted versions of the same problems during training, the model also learns to find the early steps without hints.

How it works: The authors built a customized dataset and used it to fine-tune a pretrained Qwen3-4B-Instruct-2507 via GRPO.

Starting with three math datasets of problems with known solutions, the authors selected examples that the pretrained model failed to solve correctly in 128 attempts, generating up to 32 thousand tokens per attempt.
For each example, the authors extracted the beginning of the solution, or prefix. They fed Qwen3-4B-Instruct-2507 progressively longer prefixes, up to a quarter of the solution’s length, until it correctly completed the solution.
They appended this prefix to the corresponding example along with an instruction to continue solving the task from that point onward.
During GRPO, they showed the model each problem many times, with and without its prefix in equal proportion. If the model solved the problem, GRPO adjusted the model’s weights to increase the probability that it would generate the same tokens, making similar solutions more likely. If it failed, GRPO adjusted the model’s weights to decrease that probability.
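
The data-preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the authors’ code. The callback `solves` is a hypothetical stand-in for sampling the model and grading its answer, the instruction wording and the number of prefix increments are assumptions, and prefix length is measured in whitespace-separated words rather than tokens to keep the example simple.

```python
from typing import Callable, List, Optional

# Hypothetical instruction text; the actual wording is not given in the article.
CONTINUE_INSTRUCTION = "Here is the start of a solution. Continue solving from this point."


def shortest_working_prefix(
    problem: str,
    solution: str,
    solves: Callable[[str, str], bool],
    max_fraction: float = 0.25,  # prefixes grow up to a quarter of the solution
    steps: int = 5,              # assumed number of increments
) -> Optional[str]:
    """Return the shortest tested prefix from which `solves` reports success."""
    words = solution.split()
    limit = int(len(words) * max_fraction)
    for i in range(1, steps + 1):
        n = limit * i // steps
        if n == 0:
            continue  # solution too short for this increment
        prefix = " ".join(words[:n])
        if solves(problem, prefix):
            return prefix
    return None  # no prefix within the length budget helped


def build_training_prompts(problem: str, prefix: str) -> List[str]:
    """Return the unhinted and hinted prompts, one of each (an equal ratio)."""
    hinted = f"{problem}\n\n{CONTINUE_INSTRUCTION}\n\n{prefix}"
    return [problem, hinted]
```

In practice, `solves` would wrap model sampling plus an answer checker, and the two prompts would both go into the GRPO training set.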

Results: The authors compared Qwen3-4B-Instruct-2507 fine-tuned via POPE with the same model fine-tuned via typical GRPO and via supervised fine-tuning. They measured success in one try (pass@1) and within 16 tries (pass@16). The POPE version consistently outperformed both alternatives, and it outperformed supervised fine-tuning by a large margin.

On the AIME 2025 dataset of competition math problems, POPE (53.1 percent pass@1, 82.6 percent pass@16) outperformed typical GRPO (49.6 percent pass@1, 81.4 percent pass@16).
On HMMT 2025, which also consists of competition math problems, POPE (37.8 percent pass@1, 67.5 percent pass@16) exceeded typical GRPO (31.0 percent pass@1, 63.8 percent pass@16).
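
For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled attempts is correct. It is commonly estimated from n ≥ k attempts, of which c are correct, as 1 − C(n−c, k)/C(n, k). The sketch below implements that widely used estimator. The article does not say how the authors computed their figures, so this illustrates the metric rather than their evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no correct attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to the plain fraction of correct attempts, c / n.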

Yes, but: POPE requires problems with known solutions. In domains where such solutions are expensive to obtain, it inherits that cost.

Why it matters: This work attacked one of the biggest bottlenecks in reinforcement learning: exploration. Current reinforcement learning methods work best with problems that models already can partly solve. When problems are hard, reinforcement learning burns large amounts of computation in exploration, which algorithmically boils down to “keep trying and hope to stumble onto a successful solution.” POPE leads the model onto the right track, after which reinforcement learning can be more effective.
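
The exploration bottleneck can be made concrete with GRPO’s group-relative advantage, which in its standard formulation scores each sampled response by its reward minus the group mean, divided by the group standard deviation. A minimal sketch, assuming binary rewards (1 for a correct answer, 0 otherwise) and a small epsilon for numerical stability. The function name is hypothetical, and the authors’ exact variant may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every attempt fails: all advantages are zero, so there is nothing to learn from.
all_failed = group_advantages([0, 0, 0, 0])

# One attempt succeeds: it gets a positive advantage, the failures a negative one.
one_solved = group_advantages([0, 0, 0, 1])
```

When no attempt in a group succeeds, every advantage is zero and the update carries no learning signal. A hint that lets even one attempt succeed restores it.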

We’re thinking: This approach breaks up learning hard problems into two steps: (i) finding a good state from which to solve the problem and (ii) solving the problem. Instead of trying to do both at once, an LLM starts with (ii), and once it learns that, it’s easier to learn how to do (i).
