One Weird Trick for Better Reasoning

Researchers fine-tune an LLM for reasoning with only 1,000 examples

[Chart: LLM accuracy increases with the number of reasoning tokens across math and science benchmarks such as AIME24 and GPQA.]

Researchers showed that supervised fine-tuning on as few as 1,000 examples can enable a pretrained large language model to reason — and a clever gambit can boost its performance to rival that of top reasoning models.

What’s new: Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, and colleagues at Stanford, the University of Washington, the Allen Institute for AI, and Contextual AI developed s1, a reasoning model that achieves higher performance by producing more reasoning tokens. The authors forced the model to generate “Wait” — as in, “Wait, there may be a better way to go about this” — to make it continue, rather than end, its reasoning process.

Key insight: The sequence of reasoning tokens generated by a reasoning model like DeepSeek-R1 is delimited by special tokens. In training on human data, a model learns to keep generating reasoning tokens until it produces the special token that ends the sequence. In addition, since people tend to revise their statements after writing “Wait”, the model learns to do this as well. Thus, the reasoning process can be extended by suppressing the end-of-reasoning token and appending the token for “Wait” in its place. When the output-in-progress is fed back to generate the next token, the model continues to reason over the prompt. Such extended reasoning can improve the final output by inducing the model to double-check its response so far and refine earlier reasoning steps.
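
This trick is simple to sketch at decode time. Below is a minimal sketch using Hugging Face Transformers, assuming a model fine-tuned to close its reasoning with a </think> delimiter; the model name, delimiter string, and number of forced continuations are illustrative assumptions, not the authors’ exact setup.

```python
# Sketch of extending reasoning by swapping the end-of-reasoning delimiter
# for "Wait". Model name and the "</think>" delimiter are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # stand-in; s1 fine-tunes Qwen2.5-32B
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

prompt = "How many positive divisors does 2024 have? Think step by step."
text = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False, add_generation_prompt=True,
)

for _ in range(2):  # force the model to continue reasoning twice
    ids = tok(text, return_tensors="pt", add_special_tokens=False).to(model.device)
    out = model.generate(
        **ids, max_new_tokens=1024,
        stop_strings=["</think>"], tokenizer=tok,  # halt at the delimiter
    )
    text = tok.decode(out[0], skip_special_tokens=False)
    # Strip the end-of-reasoning delimiter and append "Wait" so the model,
    # reading its own output back, revises and extends its reasoning.
    text = text.rsplit("</think>", 1)[0] + " Wait"

ids = tok(text, return_tensors="pt", add_special_tokens=False).to(model.device)
final = model.generate(**ids, max_new_tokens=1024)  # now let it finish
print(tok.decode(final[0], skip_special_tokens=True))
```

Each pass halts generation at the delimiter, strips it, and substitutes “Wait”, so when the text is fed back, the model second-guesses itself rather than stopping.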

How it works: The authors fine-tuned a pretrained Qwen2.5-32B, which does not produce reasoning tokens, on around 1,000 examples of chain-of-thought reasoning.

  • To build a fine-tuning dataset, the authors gathered roughly 59,000 questions and answers from 16 sources. The sources included math problems from NuminaMath and AIME and questions from OlympicArena on astronomy, biology, chemistry, computer science, geography, mathematics, and physics. They also included standardized test questions from the SAT and LSAT via AGIEval.
  • They removed examples with formatting issues (such as references to missing images) and questions that Qwen2.5-7B or Qwen2.5-32B could already solve. Then Gemini Flash Thinking generated a chain of thought for each remaining example. Finally, they selected 1,000 examples that covered all subjects equally and had the longest chains of thought (a sketch of this pipeline appears after this list).
  • They fine-tuned the model on these examples using the standard next-token prediction objective.
  • To control the number of reasoning tokens generated at inference, the authors forced the model either to stop reasoning or to continue by replacing the end-of-reasoning token with one for “Wait”, after which the model kept reasoning.
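
Here is a minimal sketch of the curation pipeline described in the list above; the function signature, field names, and per-subject quota are illustrative assumptions rather than the authors’ code.

```python
# Illustrative curation pipeline: filter, generate chains of thought, then
# pick ~1,000 examples balanced across subjects with the longest chains.
def curate(examples, is_well_formatted, already_solved, write_cot, n=1000):
    # Keep well-formatted questions that the base models can't already solve
    # (the paper checks Qwen2.5-7B and Qwen2.5-32B).
    pool = [ex for ex in examples
            if is_well_formatted(ex) and not already_solved(ex)]
    # Have a reasoning model (Gemini Flash Thinking in the paper) write a
    # chain of thought for each surviving example.
    for ex in pool:
        ex["cot"] = write_cot(ex["question"])
    # Group by subject, then take the longest-chain examples from each group
    # so the final set covers all subjects about equally.
    by_subject = {}
    for ex in pool:
        by_subject.setdefault(ex["subject"], []).append(ex)
    if not by_subject:
        return []
    quota = max(1, n // len(by_subject))
    selected = []
    for group in by_subject.values():
        group.sort(key=lambda ex: len(ex["cot"]), reverse=True)
        selected.extend(group[:quota])
    return selected[:n]
```

The callables (is_well_formatted, already_solved, write_cot) stand in for the formatting checks, the two Qwen models, and Gemini Flash Thinking described above.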

Results: s1’s performance improved as the number of reasoning tokens it generated increased. Ultimately, it performed comparably to OpenAI o1-preview but fell short of o1.

  • On AIME 2024, s1 achieved 50.0 percent accuracy without being forced to continue reasoning. Forced to continue twice, it rose to 53.3 percent accuracy. Forced four times, it reached 56.7 percent accuracy, between o1-preview (44.6 percent) and o1 (74.4 percent).
  • On MATH 500, s1 started at 92.6 percent accuracy. Forced to continue once, it reached 92.8 percent accuracy. Forced twice, it reached 93.0 percent, higher than o1-preview (85.5 percent) but lower than o1 (94.8 percent). Forced four times, its accuracy fell to 92.2 percent. The authors don’t offer a hypothesis to explain the decline.

Why it matters: A conventional pretrained LLM can learn to reason after supervised fine-tuning on as few as 1,000 curated examples — no reinforcement learning necessary. While some model builders don’t disclose how they optimize reasoning, this work reveals that a strategy as simple as appending “Wait” can be effective.

We’re thinking: Wait, how can we apply this to our projects?
