Reasoning Models With Recipes
Microsoft unveils training details for Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning

[Table comparing AI model accuracy on math and reasoning benchmarks including AIME, HMMT, OmniMath, GPQA-D, and Codeforces.]

Microsoft published its latest recipe for training reasoning models, substantially expanding what is still a fairly small base of public knowledge.

What’s new: Microsoft released Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning along with lessons learned in building the models.

  • Input/output: text in (up to 32,000 tokens for Phi-4-reasoning and Phi-4-reasoning-plus, up to 128,000 tokens for Phi-4-mini-reasoning), text out
  • Architecture: Transformer (Phi-4-reasoning and Phi-4-reasoning-plus 14 billion parameters each, Phi-4-mini-reasoning 3.8 billion parameters)
  • Features: Reasoning
  • Performance: Phi-4-reasoning-plus and Phi-4-mini-reasoning perform well on math problems
  • Availability: Weights free to download for noncommercial and commercial uses under an MIT license

How it works: All three models are fine-tuned versions of pretrained models.

  • Phi-4-reasoning: The authors fine-tuned Phi-4 to match curated outputs from OpenAI o3-mini on Q&A, math, science, and coding examples.
  • Phi-4-reasoning-plus: They further fine-tuned Phi-4-reasoning via reinforcement learning to correctly answer math problems.
  • Phi-4-mini-reasoning: They fine-tuned Phi-4-mini in stages to reason over math problems. Stages included (i) supervised fine-tuning to match correct output from DeepSeek-R1, (ii) direct preference optimization to train the model to prefer correct responses over incorrect ones from DeepSeek-R1, and (iii) reinforcement learning to further reward correct solutions to math problems.
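
The direct preference optimization step in stage (ii) optimizes a well-known pairwise objective. Below is a minimal PyTorch sketch of that loss, assuming each training example pairs a correct (chosen) DeepSeek-R1 response with an incorrect (rejected) one and that per-token log-probabilities have already been summed per response; the function and variable names are illustrative, not taken from Microsoft's code.

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Implicit rewards: how far the policy has shifted probability mass
        # toward each response relative to the frozen reference model.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Maximize the margin between the correct and incorrect response.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()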

Smaller model lessons learned: During reinforcement learning, Phi-4-mini-reasoning exhibited instability, such as output batches that varied greatly in length or received mostly negative rewards, apparently depending on the training data and the model’s outputs. The authors suspect that the model’s small size caused these issues. Among the lessons learned:

  • Supervised fine-tuning on existing reasoning datasets like S1K can decrease performance. This phenomenon suggests a need either for larger, high-quality, supervised fine-tuning datasets or for fine-tuning via both supervised learning and reinforcement learning.
  • To minimize discrepancies in output length, the authors tested multiple prompts and chose those that resulted in the most uniform output lengths.
  • To address the output batches that received mostly negative rewards, they sampled lots of responses, retained those that received a positive reward, sampled an equal number of those that received a negative reward, and discarded the rest before adjusting the model’s weights.
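
A rough sketch of the batch-balancing step in the final bullet above, assuming each sampled response carries a scalar verifier reward that is positive when the final answer is correct; the names here are illustrative rather than drawn from Microsoft's implementation.

    import random

    def balance_rollouts(responses, rewards):
        # Keep every positively rewarded rollout and an equal-sized random
        # subset of negatively rewarded ones; discard the remainder before
        # computing gradients.
        positives = [r for r, s in zip(responses, rewards) if s > 0]
        negatives = [r for r, s in zip(responses, rewards) if s <= 0]
        kept = random.sample(negatives, k=min(len(positives), len(negatives)))
        return positives + kept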

Larger model lessons learned: Phi-4-reasoning and Phi-4-reasoning-plus didn’t present the same issues. However, the authors did make significant choices during reinforcement learning:

  • The authors fine-tuned Phi-4-reasoning on both math and code data, but during reinforcement learning, they fine-tuned it only on math data to simplify the training process. The authors attribute the model’s relatively lackluster performance on code benchmarks to this choice.
  • They crafted the reward function to give lower rewards to correct responses longer than 25,600 tokens than to shorter ones, which encouraged the model to finish reasoning within the input length. It also penalized incorrect responses shorter than 3,702 tokens more heavily than longer ones, which encouraged the model to produce more reasoning tokens when solving hard problems.
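
The sketch below illustrates such a length-aware reward. The token thresholds come from the description above, but the reward magnitudes are placeholders; the exact function in Microsoft's report may differ.

    def length_aware_reward(is_correct, num_tokens):
        # Correct answers earn the full reward only if they finish before
        # 25,600 tokens, so the model learns to stop reasoning in time.
        if is_correct:
            return 1.0 if num_tokens <= 25_600 else 0.5
        # Incorrect answers are penalized more heavily when they are short,
        # nudging the model to reason longer on hard problems.
        return -1.0 if num_tokens < 3_702 else -0.5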

Results: Overall, Phi-4-reasoning-plus and Phi-4-mini-reasoning outperform similarly sized (and larger) open-weights models on math problems, while Phi-4-reasoning generally outperformed DeepSeek-R1-Distill-Llama-70B but underperformed Alibaba’s QwQ-32B. All three models fall in the middle of the pack relative to proprietary models and, in domains outside math, relative to larger open-weights models.

  • On math problems in AIME 2024, Phi-4-reasoning-plus (81.3 percent accuracy) outperformed the next-best open-weights model, QwQ-32B (79.5 percent accuracy). In comparison, Phi-4-reasoning (74.6 percent accuracy) underperformed the proprietary Gemini 2.5 Pro (92 percent accuracy).
  • On AIME 2024, Phi-4-mini-reasoning (57.5 percent accuracy) outperformed the next-best open-weights model of similar size, DeepSeek-R1-Distill-Qwen-7B (53.3 percent accuracy). In comparison, o1-mini achieved 63.6 percent accuracy.

Why it matters: While reasoning models can outperform their non-reasoning counterparts, the best ways to train them aren’t widely known. Sharing recipes and lessons learned enables others to further iterate and improve the recipes, ultimately increasing model performance even more.
