OpenAI Launches Cost-Effective Alternatives
OpenAI replaces GPT-4.5 with the GPT-4.1 family, plus o3 and o4-mini, new models focused on reasoning and coding

Comparison chart of GPT-4.1, o3, and o4-mini with other models on coding, math, tool use, and multimodal reasoning benchmarks.

OpenAI refreshed its roster of models and scheduled the largest, most costly one for removal.

What’s new: OpenAI introduced five new models that accept text and image inputs and generate text output. Their parameter counts, architectures, training datasets, and training methods are undisclosed. The general-purpose GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano are available via API only. The reasoning models o3 and o4-mini are available via API to qualified developers as well as users of ChatGPT Plus, Pro, and Team, and soon ChatGPT Enterprise and ChatGPT Education. The company will terminate GPT-4.5 — which it introduced as a research preview in late February — in July.

GPT-4.1 family: In an odd turn of version numbers, the GPT-4.1 models are intended to be cost-effective equivalents to GPT-4.5 and updates to GPT-4o. They accept inputs of up to 1 million tokens (compared to GPT-4.5’s and GPT-4o’s 128,000 tokens).

  • Prices: GPT-4.1 costs $2/$8 per million input/output tokens. GPT-4.1 mini costs $0.40/$1.60 per million input/output tokens. GPT-4.1 nano costs $0.10/$0.40 per million input/output tokens. A 75 percent discount applies to cached input tokens (see the cost sketch after this list).
  • GPT-4.1 performance: GPT-4.1 surpassed GPT-4o on most benchmarks tested by OpenAI, with notable improvement on coding tasks. It significantly outperformed GPT-4o, o1, and o3-mini on SWE-bench Verified (real-world coding skills), MultiChallenge⁠ (following instructions in multi-turn conversations), MMMU (multimodal reasoning), and Video-MME (long-context understanding).
  • GPT-4.1 mini performance: The smaller GPT-4.1 mini generally surpassed GPT-4o mini on benchmarks tested by OpenAI. On MultiChallenge and MMMU, GPT-4.1 mini outperformed the full-size GPT-4o.
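For a sense of what these rates mean in practice, here is a rough back-of-the-envelope calculation in Python using the published per-token prices. The token counts are hypothetical, and the cached-input discount is applied only to tokens the API reports as cached.

```python
# Rough cost estimate for a single GPT-4.1-family call using the published rates.
# Token counts below are made-up examples; actual billing depends on real usage.

RATES = {  # dollars per million input/output tokens
    "gpt-4.1":      {"input": 2.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
}
CACHED_INPUT_DISCOUNT = 0.75  # cached input tokens cost 25% of the normal input rate

def call_cost(model, input_tokens, output_tokens, cached_input_tokens=0):
    r = RATES[model]
    fresh = input_tokens - cached_input_tokens
    input_cost = (fresh * r["input"]
                  + cached_input_tokens * r["input"] * (1 - CACHED_INPUT_DISCOUNT)) / 1_000_000
    output_cost = output_tokens * r["output"] / 1_000_000
    return input_cost + output_cost

# Example: 50,000 input tokens (20,000 of them cached) and 2,000 output tokens.
print(f"${call_cost('gpt-4.1', 50_000, 2_000, cached_input_tokens=20_000):.4f}")  # $0.0860
```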

o3 and o4-mini: These models update o1 and o3-mini, respectively. They have input limits of 200,000 tokens and can be set to low-, medium-, or high-effort modes to process varying numbers of reasoning tokens, which are hidden from users. Unlike their predecessors, they were fine-tuned to decide when and how to use tools, including web search, code generation and execution, and image editing.
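As a sketch of how a developer might choose an effort level, the snippet below calls o4-mini through the OpenAI Python SDK using the reasoning_effort setting. The prompt is a placeholder, and exact parameter support may vary by model and SDK version.

```python
# Minimal sketch (not an official example): requesting a high-effort answer from o4-mini.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o4-mini",
    reasoning_effort="high",  # "low", "medium", or "high"; controls hidden reasoning tokens
    messages=[
        {"role": "user", "content": "Plan a test suite for a rate limiter."}  # placeholder prompt
    ],
)
print(response.choices[0].message.content)
```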

  • Prices: API access to o3 costs $10/$40 per million input/output tokens. o4-mini costs $1.10/$4.40 per million input/output tokens. Both offer a 75 percent discount for cached input tokens.
  • Access limits: Developers whose usage puts them in rate-limit tiers 1 through 3 must verify their identities to use o3 via the API (higher-usage tiers 4 and 5 are exempt). OpenAI says this limitation is intended to prevent abuse.
  • Image processing: o3 and o4-mini can apply chains of thought to images — a first for OpenAI’s reasoning models. For example, users can upload a diagram with instructions to interpret it, and the models will use chains of thought and tools to process the diagram (see the sketch after this list).
  • o3 performance: o3 set the state of the art on several benchmarks including MultiChallenge, MMMU, MathVista, and Humanity’s Last Exam (HLE). It generally outperformed o1 in tests performed by OpenAI. OpenAI didn’t document o3’s long-context performance, but in independent tests by Fiction.Live, it achieved nearly perfect accuracy with contexts up to 120,000 tokens.
  • o4-mini performance: o4-mini generally outperformed o3-mini in tests performed by OpenAI. It outperformed most competing models in Fiction.Live’s tests of long-context performance.
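To illustrate the image-processing bullet above, here is a minimal sketch that passes a diagram to o3 as an image URL via the chat completions API. The URL and prompt are placeholders, and availability of o3 via API depends on developer verification as noted above.

```python
# Minimal sketch: sending a diagram to o3 as part of a chat message.
# The image URL and prompt are placeholders; assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Interpret this circuit diagram and list its components."},
            {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```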

Behind the news: Late last year, OpenAI introduced o1, the first commercial model trained via reinforcement learning to generate chains of thought. Within a few months, DeepSeek, Google, and Anthropic launched their respective reasoning models DeepSeek-R1, Gemini 2.5 Pro, and Claude 3.7 Sonnet. OpenAI has promised to integrate its general-purpose GPT-series models and o-series reasoning models, but they remain separate for the time being.

Why it matters: GPT-4.5 was an exercise in scale, and it showed that continuing to increase parameter counts and training data would yield ongoing performance gains. But it wasn’t widely practical on a cost-per-token basis. The new models, including those that use chains of thought and tools, deliver high performance at lower prices.

We’re thinking: Anthropic is one of OpenAI’s key competitors, and a large fraction of the tokens Anthropic generates via API are for writing code, a skill at which its models are particularly strong. OpenAI’s emphasis on models that are good at coding could intensify competition in this area!
