More Reasoning for Harder Problems OpenAI debuts o3-pro, an updated reasoning model that applies more tokens at inference

OpenAI o3-pro outperforms o3 and o1-pro on math, science, and coding benchmarks, but responds much more slowly.

OpenAI launched o3-pro, a more capable version of its most advanced reasoning vision-language model.

What’s new: o3-pro is designed to respond to difficult challenges involving science, mathematics, and coding. But its reasoning firepower dramatically slows response times.

  • Input/output: Text and images in (up to 200,000 tokens), text out (up to 100,000 tokens, 20.7 tokens per second, 129.2 seconds to first token)
  • Knowledge cutoff: June 1, 2024
  • Features: Function calling including web search, structured output
  • Availability/price: Available to ChatGPT Pro and Team users and via the OpenAI API, coming soon to Enterprise and Edu users, for $20/$80 per 1 million tokens of input/output
  • Undisclosed: Details about architecture, training data, and training methods
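The throughput figures above imply long waits for substantial replies. A back-of-the-envelope estimate (the function name and the 2,000-token example are illustrative, not from OpenAI):

```python
# Rough latency estimate from the figures quoted above:
# 129.2 seconds to first token, then 20.7 tokens per second of streaming.
def response_time_s(output_tokens, ttft_s=129.2, tokens_per_s=20.7):
    """Total seconds: time to first token plus streaming time."""
    return ttft_s + output_tokens / tokens_per_s

print(round(response_time_s(2_000), 1))  # a 2,000-token reply takes roughly 225.8 seconds
```

At these rates, even a moderate-length answer takes several minutes end to end.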

Performance: o3-pro outperformed OpenAI’s own o3 (set to medium effort) and o1-pro in tests performed by OpenAI.

  • Solving AIME 2024’s advanced high-school math competition problems on the first try, o3-pro (93 percent) bested o3 (90 percent) and o1-pro (86 percent).
  • Answering GPQA Diamond’s graduate-level science questions on the first try, o3-pro (85 percent) outperformed o3 (81 percent) and o1-pro (79 percent).
  • Completing Codeforces competition-coding problems in one pass, o3-pro (2748 CodeElo) surpassed o3 (2517 CodeElo) and o1-pro (1707 CodeElo).
  • In qualitative tests, human reviewers consistently preferred o3-pro over o3 for queries related to scientific analysis (64.9 percent), personal writing (66.7 percent), computer programming (62.7 percent), and data analysis (64.3 percent).

What they’re saying: Reviews of o3-pro so far have been generally positive, but the model has drawn criticism for its slow response times. Box CEO Aaron Levie commented that o3-pro is “crazy good at math and logic.” However, entrepreneur Yuchen Jin noted that it’s the “slowest and most overthinking model.”

Behind the news: OpenAI rolled out o3-pro at a lower price, $20/$80 per 1 million input/output tokens, than o1-pro (which was priced at $150/$600 per 1 million input/output tokens but was deprecated in favor of the new model). Simultaneously, OpenAI cut the price of o3 by 80 percent to $2/$8 per 1 million input/output tokens. These moves continue a year-long plunge in the price of inference. For comparison, DeepSeek-R1 offers performance that approaches that of top models for $0.55/$2.19 per 1 million input/output tokens.
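The per-million-token rates above make cost comparisons straightforward. A minimal sketch (the helper function and the example token counts are hypothetical):

```python
# Compute request cost from per-million-token rates (USD).
def cost_usd(input_tokens, output_tokens, in_rate, out_rate):
    """in_rate/out_rate are USD per 1 million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example request: 10,000 input tokens, 5,000 output tokens.
print(cost_usd(10_000, 5_000, 20, 80))    # o3-pro: 0.6
print(cost_usd(10_000, 5_000, 150, 600))  # o1-pro: 4.5
print(cost_usd(10_000, 5_000, 2, 8))      # o3:     0.06
```

At these rates, the same request costs 7.5 times less on o3-pro than it did on o1-pro, and 10 times less again on o3.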

Why it matters: OpenAI is pushing the limits of current approaches to reasoning, and the results are promising if incremental. o3-pro’s extensive reasoning may appeal to developers who are working on multi-step scientific problems. For many uses, though, the high price and slow speed may be a dealbreaker.

We’re thinking: Letting developers choose between o3 and o3-pro lets them calibrate their computational budget to the difficulty of the task at hand. What if we want to do the same with a trained, open-weights large language model? Forcing an LLM to generate “Wait” in its output causes it to keep thinking, and that can improve its output significantly.
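The “Wait” trick can be sketched as a thin wrapper around any decoding loop: each time the model tries to emit its end-of-thinking token, suppress it and append “Wait” instead. In this sketch, `generate_until` is a hypothetical callable standing in for an open-weights LLM’s decoder, and the token strings are illustrative:

```python
# Budget forcing: extend an LLM's reasoning by suppressing its
# end-of-thinking token and appending "Wait" a fixed number of times.
def force_budget(generate_until, prompt, min_continuations=2,
                 end_of_thinking="</think>"):
    text = prompt
    for _ in range(min_continuations):
        # Generate until the model wants to stop thinking...
        chunk = generate_until(text, stop=end_of_thinking)
        text += chunk
        # ...then drop the stop token and nudge it to keep going.
        text += " Wait,"
    # Final pass: let the model finish its answer normally.
    text += generate_until(text, stop=None)
    return text
```

The same loop could instead cap reasoning by truncating at a maximum token budget; either way, the compute spent per query becomes a knob the developer controls rather than a fixed property of the model.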
