OpenAI refreshed its roster of models and scheduled the largest, most costly one for removal.
What’s new: OpenAI introduced five new models that accept text and image inputs and generate text output. Their parameter counts, architectures, training datasets, and training methods are undisclosed. The general-purpose GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano are available via API only. The reasoning models o3 and o4-mini are available via API to qualified developers as well as users of ChatGPT Plus, Pro, and Team, and soon ChatGPT Enterprise and ChatGPT Edu. The company will retire GPT-4.5, which it introduced as a research preview in late February, in July.
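For developers, the new models plug into OpenAI’s existing chat API. Here’s a minimal sketch of a mixed text-and-image request using OpenAI’s Python SDK; the image URL is a placeholder, and we’re assuming the same message format the SDK documents for earlier multimodal models:

```python
# Minimal sketch: text + image in, text out, via OpenAI's Python SDK
# (pip install openai). Assumes OPENAI_API_KEY is set in the environment;
# the image URL is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this chart in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)  # the models return text only
```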
GPT-4.1 family: In an odd turn of version numbers, the GPT-4.1 models are intended to be cost-effective equivalents to GPT-4.5 and updates to GPT-4o. They accept inputs of up to 1 million tokens (compared to GPT-4.5’s and GPT-4o’s 128,000 tokens).
- Prices: GPT-4.1 costs $2/$8 per million input/output tokens. GPT-4.1 mini costs $0.40/$1.60 per million input/output tokens. GPT-4.1 nano costs $0.10/$0.40 per million input/output tokens. A 75 percent discount applies to cached input tokens (see the worked cost estimate after this list).
- GPT-4.1 performance: GPT-4.1 surpassed GPT-4o on most benchmarks tested by OpenAI, with notable improvement on coding tasks. It significantly outperformed GPT-4o, o1, and o3-mini on SWE-bench Verified (real-world coding skills), MultiChallenge (following instructions in multi-turn conversations), MMMU (multimodal reasoning), and Video-MME (long-context understanding).
- GPT-4.1 mini performance: The smaller GPT-4.1 mini generally surpassed GPT-4o mini on benchmarks tested by OpenAI. On MultiChallenge and MMMU, GPT-4.1 mini outperformed the full-size GPT-4o.
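To make those rates concrete, here’s a back-of-the-envelope cost calculation. The per-million rates are the ones listed above; the request sizes and cache split are invented for illustration:

```python
# Cost estimate from the listed prices ($ per million tokens).
# Cached input is billed at a 75 percent discount to fresh input.
PRICES = {  # model: (input, cached input, output)
    "gpt-4.1":      (2.00, 0.50, 8.00),
    "gpt-4.1-mini": (0.40, 0.10, 1.60),
    "gpt-4.1-nano": (0.10, 0.025, 0.40),
}

def request_cost(model, input_tokens, cached_tokens, output_tokens):
    """Dollar cost of one request; cached_tokens is the cached subset of input."""
    p_in, p_cached, p_out = PRICES[model]
    fresh = input_tokens - cached_tokens
    return (fresh * p_in + cached_tokens * p_cached + output_tokens * p_out) / 1e6

# Example: a 100,000-token prompt, half of it cached, with a 2,000-token reply
print(f"${request_cost('gpt-4.1', 100_000, 50_000, 2_000):.4f}")  # $0.1410
```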
o3 and o4-mini: These models update o1 and o3-mini, respectively. They have input limits of 200,000 tokens and can be set to low-, medium-, or high-effort modes to process varying numbers of reasoning tokens, which are hidden from users. Unlike their predecessors, they were fine-tuned to decide when and how to use tools, including web search, code generation and execution, and image editing.
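Developers choose the effort level per request. A minimal sketch, assuming the reasoning_effort parameter works here as it does for earlier o-series models in OpenAI’s Python SDK:

```python
# Sketch: picking a reasoning-effort level for an o-series model.
# Higher effort spends more hidden reasoning tokens (and more latency/cost).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o4-mini",
    reasoning_effort="high",  # "low", "medium", or "high"
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(response.choices[0].message.content)  # reasoning tokens stay hidden
```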
- Prices: API access to o3 costs $10/$40 per million input/output tokens. o4-mini costs $1.10/$4.40 per million input/output tokens. Both offer a 75 percent discount for cached input tokens.
- Access limits: Developers whose usage puts them in rate-limit tiers 1 through 3 must verify their identities to use o3 via the API (higher-usage tiers 4 and 5 are exempt). OpenAI says this limitation is intended to prevent abuse.
- Image processing: o3 and o4-mini can apply chains of thought to images, a first for OpenAI’s reasoning models. For example, users can upload a diagram with instructions to interpret it, and the models will use chains of thought and tools to process the diagram (a sketch of this flow follows the list below).
- o3 performance: o3 set the state of the art on several benchmarks including MultiChallenge, MMMU, MathVista (mathematical reasoning over visual inputs), and HLE (Humanity’s Last Exam). It generally outperformed o1 in tests performed by OpenAI. OpenAI didn’t document o3’s long-context performance, but in independent tests by Fiction.Live, it achieved nearly perfect accuracy with contexts up to 120,000 tokens.
- o4-mini performance: o4-mini generally outperformed o3-mini in tests performed by OpenAI. It outperformed most competing models in Fiction.Live’s tests of long-context performance.
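Here’s a sketch of the diagram-interpretation flow mentioned above, assuming the same multimodal chat format as OpenAI’s other models; the file name is hypothetical, and any tool use happens on OpenAI’s side when the model decides it’s needed:

```python
# Sketch: send a local diagram to o3 as a base64 data URL and let the
# model reason over it. The file name is a placeholder.
import base64

from openai import OpenAI

client = OpenAI()

with open("circuit_diagram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="o3",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Explain what this circuit does."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```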
Behind the news: Late last year, OpenAI introduced o1, the first commercial model trained via reinforcement learning to generate chains of thought. Within a few months, DeepSeek, Google, and Anthropic launched their respective reasoning models DeepSeek-R1, Gemini 2.5 Pro, and Claude 3.7 Sonnet. OpenAI has promised to integrate its general-purpose GPT-series models and o-series reasoning models, but they remain separate for the time being.
Why it matters: GPT-4.5 was an exercise in scale, and it showed that continuing to increase parameter counts and training data would yield ongoing performance gains. But it wasn’t widely practical on a cost-per-token basis. The new models, including those that use chains of thought and tools, deliver high performance at lower prices.
We’re thinking: Anthropic is one of OpenAI’s key competitors, and a large fraction of the tokens its models generate (via API) go to writing code, a skill at which they are particularly strong. OpenAI’s emphasis on models that are good at coding could intensify competition in this area!