Direct Preference Optimization (DPO)
Reasoning Models With Recipes: Microsoft unveils training details for Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning
Microsoft published its latest recipe for training reasoning models, adding substantially to the still-small body of public knowledge about how such models are trained.