Thinking Models Solve Bigger Problems

Reasoning models, beginning with OpenAI’s o1 and DeepSeek’s R1, transformed the industry

Think step by step. Explain your reasoning. Work backwards from the answer. As 2025 began, models executed these reasoning strategies only when prompted. Now most new large language models do it as a matter of course, improving performance across a wide range of tasks.

What happened: Late last year, OpenAI introduced the first reasoning, or “thinking,” model, o1, which baked in an agentic reasoning workflow. In January, DeepSeek-R1 showed the rest of the world how to build such capabilities. The result: immediate improvements in math and coding performance, more accurate answers to questions, more capable robots, and rapid progress in AI agents.

Driving the story: An early form of reasoning took off with “Large Language Models Are Zero-Shot Reasoners,” the paper that introduced the prompt addendum, “let’s think step by step.” The authors found that manually adding these words to a prompt improved a model’s output. Researchers soon realized they could train this capability into models so they would employ this and other reasoning strategies without explicit prompting. The key: fine-tuning via reinforcement learning (RL). Rewarding a pretrained LLM for producing correct final answers trained it to “think” things through, generating intermediate reasoning tokens before its final output.
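The core of this recipe fits in a few lines of code. The sketch below is purely illustrative: sample_completion stands in for sampling from a real LLM, the arithmetic prompt is a toy example, and the group-relative scoring only gestures at RL algorithms such as PPO or GRPO rather than reproducing any lab’s training code.

```python
import random

# Minimal sketch of RL with a verifiable reward. All names are illustrative
# stand-ins, not any lab's actual training code or API.

def sample_completion(prompt: str) -> dict:
    """Stand-in for sampling a chain of thought plus a final answer from an LLM."""
    answer = random.choice(["6", "8", "12"])
    return {"reasoning": f"Let's think step by step about: {prompt}", "answer": answer}

def reward(completion: dict, reference_answer: str) -> float:
    """Verifiable reward: 1.0 if the final answer is correct, else 0.0.
    For code generation, this check would instead run unit tests."""
    return 1.0 if completion["answer"].strip() == reference_answer else 0.0

def relative_advantages(rewards: list[float]) -> list[float]:
    """Score each sample relative to the group mean (a GRPO-style baseline)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# One training step: sample several completions per prompt, score them,
# and favor the ones whose reasoning led to a correct answer.
prompt, reference = "What is 2 + 2 * 3?", "8"
group = [sample_completion(prompt) for _ in range(4)]
rewards = [reward(c, reference) for c in group]
advantages = relative_advantages(rewards)
for completion, adv in zip(group, advantages):
    # In real training, these advantages would weight a policy-gradient update
    # to the model; here we only report which samples would be reinforced.
    print(f"answer={completion['answer']}  advantage={adv:+.2f}")
```

In practice, the reward may come from an exact-match checker, unit tests, or a learned verifier, and the update step adjusts the model’s token probabilities rather than printing a report.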

  • The first few reasoning models were trained via RL specifically to solve math problems correctly, answer science questions accurately, and/or generate code that passed unit tests. This enabled o1-preview, for instance, to outperform its non-reasoning predecessor GPT-4o by 43 percentage points on AIME 2024 (competition math problems) and 22 percentage points on GPQA Diamond (PhD-level science questions), while it completed Codeforces’ coding problems in the 62nd percentile relative to competitive human coders, compared to GPT-4o’s 11th percentile.
  • Reasoning models performed even better when they learned to use tools like calculators, search engines, or bash terminals (a minimal sketch of such a tool-use loop appears after this list). For example, on a challenging test of multimodal understanding and technical expertise in 100 domains, OpenAI o4-mini with tools achieved 17.7 percent accuracy, more than 3 points higher than it managed without tools.
  • Robotic action models have been trained to reason via RL. For example, rewarding ThinkAct for reaching a goal position yielded roughly an 8 percent performance improvement on robotics tasks compared to non-thinking models like OpenVLA.
  • Reasoning models also help agents to tackle difficult problems. For instance, AlphaEvolve used Google Gemini to repeatedly generate, evaluate, and change code, ultimately producing faster algorithms for real-world problems. Similarly, AI Co-Scientist used Gemini to generate scientific research proposals and then review, rank, and improve them. Among other results, it proposed a hypothesis to answer a longstanding question about microbial resistance to antibiotics. Human scientists independently proposed and validated the same hypothesis at about the same time.
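To make the tool-use pattern concrete, here is a minimal sketch of an agent loop in which a model alternates between requesting a calculator call and stating a final answer. The scripted_model stub, the TOOL_CALL/FINAL_ANSWER message format, and the calculator helper are hypothetical stand-ins, not any provider’s actual API.

```python
import ast
import operator

# Minimal sketch of a tool-augmented reasoning loop. The scripted "model" below
# is a stand-in for a real reasoning LLM that decides when to call a tool.

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression: str) -> str:
    """The 'tool': safely evaluate a basic arithmetic expression."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))

def scripted_model(transcript: list[str]) -> str:
    """Stand-in for an LLM: first requests a tool call, then answers."""
    if not any(line.startswith("TOOL_RESULT") for line in transcript):
        return "TOOL_CALL calculator: 37 * 49"
    return "FINAL_ANSWER: 37 * 49 = " + transcript[-1].split()[-1]

# The agent loop: let the model call tools until it produces a final answer.
transcript = ["QUESTION: What is 37 * 49?"]
while True:
    step = scripted_model(transcript)
    transcript.append(step)
    if step.startswith("FINAL_ANSWER"):
        break
    expression = step.split(":", 1)[1].strip()
    transcript.append(f"TOOL_RESULT {calculator(expression)}")

print("\n".join(transcript))
```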

Yes, but: Reasoning models may not be as rational as they seem. 

  • In a controversial paper, Apple concluded that reasoning models couldn’t solve puzzles beyond a certain level of complexity, even when the models were given algorithms that solved them. The models’ inability to apply the algorithms calls into question apparent similarities between machine and human reasoning.
  • Anthropic found that, while a model’s reasoning steps can help to explain how it reached a conclusion, they may also omit crucial information that contributed to the conclusion. For instance, reasoning models can be led to produce a particular output by including a hint in the prompt, but their reasoning steps may fail to mention the hint.

Where things stand: Reasoning dramatically improves LLM performance. However, better output comes at a cost. Gemini 3 Flash with reasoning enabled used 160 million tokens to run the benchmarks in Artificial Analysis’ Intelligence Index (and achieved a score of 71), while Gemini 3 Flash without reasoning used 7.4 million tokens (achieving a much lower score of 55). Moreover, generating reasoning tokens can delay output, adding to pressure on LLM inference providers to serve tokens faster. But researchers are finding ways to make the process more efficient. Claude Opus 4.5 and GPT-5.1 set to high reasoning achieve the same Intelligence Index score, but the former uses 48 million tokens, while the latter uses 81 million.
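A quick back-of-the-envelope calculation restates these figures as ratios (the numbers are simply those reported above):

```python
# Token costs versus Intelligence Index scores, as reported in this article.
gemini_reasoning = {"tokens": 160e6, "score": 71}
gemini_no_reasoning = {"tokens": 7.4e6, "score": 55}

token_ratio = gemini_reasoning["tokens"] / gemini_no_reasoning["tokens"]
score_gain = gemini_reasoning["score"] - gemini_no_reasoning["score"]
print(f"Reasoning used ~{token_ratio:.0f}x the tokens for a {score_gain}-point higher score.")

# At the same score, efficiency can differ sharply between models.
claude_tokens, gpt_tokens = 48e6, 81e6
print(f"Claude Opus 4.5 used ~{claude_tokens / gpt_tokens:.0%} of GPT-5.1's tokens "
      "to reach the same Intelligence Index score.")
```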
