Claude Opus 4.5, the latest version of Anthropic’s flagship model, extends the earlier version’s strengths in coding, computer use, and agentic workflows while generating fewer tokens.
What’s new: Claude Opus 4.5 outperforms its immediate predecessor at one-third the price per token.
- Input/output: Text and images in (up to 200,000 tokens), text out (up to 64,000 tokens)
- Features: Adjustable effort (low, medium, high) that governs token generation across responses, tool calls, and reasoning; extended thinking that raises the budget for reasoning tokens; tool use including web search and computer use (see the API sketch after this list)
- Availability/price: Included with Claude apps (Pro, Max, Team, and Enterprise subscriptions); API $5.00/$0.50/$25.00 per million input/cached-input/output tokens (plus cache storage costs) via Anthropic, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry
- Undisclosed: Parameter count, architecture, training details
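The effort and extended-thinking options are request-level settings rather than separate models. The sketch below shows roughly how extended thinking can be enabled through the Anthropic Python SDK; it is a minimal sketch, the model ID is an assumption, and the low/medium/high effort control is referenced only in a comment because its exact parameter name isn't specified here.

```python
# Minimal sketch: calling Claude Opus 4.5 with extended thinking via the
# Anthropic Python SDK. The model ID is an assumption; check Anthropic's
# documentation for the exact identifier.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-5",      # assumed model ID
    max_tokens=16000,             # must exceed the thinking budget below
    thinking={                    # extended thinking: reserves a budget of reasoning tokens
        "type": "enabled",
        "budget_tokens": 8000,
    },
    # The low/medium/high effort setting described above is a separate request
    # option; its exact parameter name is not shown here.
    messages=[{"role": "user", "content": "Refactor this function to remove the race condition: ..."}],
)

# The response interleaves thinking blocks with ordinary text blocks;
# print only the final answer text.
for block in response.content:
    if block.type == "text":
        print(block.text)
```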
How it works: Anthropic describes Claude Opus 4.5 as a hybrid reasoning model. Like Claude models since Claude Sonnet 3.7, it responds rapidly in its default mode or generates reasoning tokens before answering when extended thinking is enabled.
- Anthropic trained the model on public data scraped from the web and non-public data from third parties, paid contractors, Anthropic users who didn’t opt out, and Anthropic's internal operations. The team fine-tuned the model to be helpful using reinforcement learning from human and AI feedback.
- Claude’s consumer apps now automatically summarize earlier portions of conversations, enabling arbitrarily long interactions.
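Anthropic hasn't detailed how the apps compact conversations, but the general pattern is straightforward: once the transcript nears a token budget, older turns are replaced with a model-written summary. The sketch below is purely illustrative, not Anthropic's implementation; every name, threshold, and helper in it is hypothetical.

```python
# Illustrative sketch of conversation compaction (not Anthropic's implementation).
# When the transcript grows past a token budget, older turns are replaced with a
# model-generated summary so the conversation can continue indefinitely.

TOKEN_BUDGET = 150_000   # hypothetical ceiling, below the 200,000-token context window
KEEP_RECENT = 20         # number of most recent messages kept verbatim


def count_tokens(messages):
    # Placeholder: a real system would use the model's tokenizer or a token-counting API.
    return sum(len(m["content"]) // 4 for m in messages)


def summarize(messages):
    # Placeholder: a real system would ask the model to summarize these turns.
    return "Summary of earlier conversation: ..."


def compact(messages):
    """Return a transcript that fits the budget, summarizing older turns if needed."""
    if count_tokens(messages) <= TOKEN_BUDGET:
        return messages
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary_turn = {"role": "user", "content": summarize(older)}
    return [summary_turn] + recent
```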
Performance: In independent tests performed by Artificial Analysis, Claude Opus 4.5 excelled at coding tasks and performed near the top in other areas. In Anthropic’s tests, it attained high performance while using tokens efficiently.
- On the Artificial Analysis Intelligence Index, a weighted average of 10 benchmarks, Claude Opus 4.5 (70) tied OpenAI's GPT-5.1 for second place, trailing Google's Gemini 3 Pro (73). In non-reasoning mode, it scored 60, the highest among non-reasoning models tested. On the AA-Omniscience Index, which measures factual knowledge and the tendency to fabricate information (higher is better), Claude Opus 4.5 (10) outperformed GPT-5.1 (2) but lagged behind Gemini 3 Pro Preview (13).
- On Terminal-Bench Hard (command-line tasks), Claude Opus 4.5 (44 percent) outperformed all other models tested by Artificial Analysis.
- According to Anthropic, at medium effort, Claude Opus 4.5 matched Claude Sonnet 4.5's performance on SWE-bench Verified while using 76 percent fewer output tokens. At high effort, it exceeded Sonnet 4.5's score by 4.3 percentage points while using 48 percent fewer tokens.
- Using “parallel test-time compute” with a 64,000-token thinking budget and high effort, Claude Opus 4.5 scored higher than every human candidate who has taken the two-hour engineering exam Anthropic uses to evaluate prospective hires.
Behind the news: Generally, Claude Opus 4.5 generates fewer output tokens than competing models to achieve comparable results. Running the tests in the Artificial Analysis Intelligence Index, Claude Opus 4.5 used 48 million tokens, roughly half as many as Gemini 3 Pro set to high reasoning (92 million) and about 40 percent fewer than GPT-5.1 set to high reasoning (81 million). However, its higher per-token price adds up to a higher overall cost: running the benchmarks cost $1,498 for Claude Opus 4.5, compared to $1,201 for Gemini 3 Pro and $859 for GPT-5.1.
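A back-of-the-envelope calculation shows how fewer tokens can still cost more. Multiplying the output-token count above by the $25-per-million output price accounts for roughly $1,200 of Claude Opus 4.5's $1,498 total; the remainder presumably comes from input-token and caching charges, which the sketch below omits.

```python
# Back-of-the-envelope check on the cost figures above (output tokens only;
# input and caching charges, which make up the remainder, are not included).

def output_cost(output_tokens: int, price_per_million: float) -> float:
    """Cost in dollars for a given number of output tokens."""
    return output_tokens / 1_000_000 * price_per_million

# Claude Opus 4.5: 48 million output tokens at $25 per million.
opus_cost = output_cost(48_000_000, 25.00)
print(f"Claude Opus 4.5 output-token cost: ${opus_cost:,.0f}")  # ≈ $1,200 of the reported $1,498
```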
Why it matters: Claude Opus 4.5 arrives after a period in which Anthropic's mid-tier Claude Sonnet 4.5 matched or outperformed the older, more expensive Claude Opus 4.1 on many benchmarks. For instance, on the Artificial Analysis Intelligence Index, Claude Sonnet 4.5 (63) exceeded Claude Opus 4.1 (59). For a while, that disparity gave users little reason to pay premium rates for Opus. Claude Opus 4.5 restores a clear hierarchy within the Claude family, with the top-tier model now 7 points ahead of its mid-tier sibling on that index.
We’re thinking: The performance gap among frontier models is shrinking. According to Stanford’s latest AI Index, the gap between the top-ranked and 10th-ranked models on LM Arena, measured by Elo rating, fell from 11.9 percent to 5.4 percent between 2024 and 2025, and the gap between the top two models shrank to 0.7 percent. As this trend continues, leaderboard differences matter less and less for many applications.