Anthropic Opus 4.8 Leaps Forward Claude Opus 4.8 won back the high-performance crown for Anthropic, pending wider availability of its Mythos-class models

Published
Jul 1, 2026
Reading time
4 min read
Anthropic Opus 4.8 Leaps Forward: Claude Opus 4.8 won back the high-performance crown for Anthropic, pending wider availability of its Mythos-class models
Loading the Elevenlabs Text to Speech AudioNative Player...

Anthropic mid-2026 update of Opus held the the top of a leading intelligence ranking for about a week, only to be overtaken by Claude Fable 5.

What’s new: Claude Opus 4.8 arrived alongside features that enhance the Claude ecosystem: always-on dynamic reasoning; subagents that work in parallel, settings that control the reasoning level in the claude.ai web user interface and Cowork general-purpose agent, and the ability to update system-level prompts in mid-turn.

  • Input/output: Text and images in (up to 1 million tokens), text out (up to 128,000 tokens, 56 tokens per second)
  • Features: Adaptive thinking in which Claude Opus 4.8 decides whether and how much to reason per request, guided by five levels of effort (low, medium, high, extra, max); tool use including web search and computer use; context compaction for long-running tasks, fast mode that generates output tokens up to 2.5 times faster
  • Performance: Tops Artificial Analysis’s Intelligence Index, GDPval-AA, and Humanity’s Last Exam
  • Availability: Via Claude apps (Pro, Max, Team, Enterprise subscriptions), API $5/$0.50/$25 per million input/cached/output tokens plus cache storage costs, fast mode $10/$1/$50 per million input/cached/output tokens (one-third the fast-mode price of Claude Opus 4.6 and Claude Opus 4.7)
  • Undisclosed: Parameter count, architecture, training details

How it works: Anthropic trained Claude Opus 4.8 on a mix of public information scraped from the web, public and private datasets, and synthetic data generated by other models. It fine-tuned the model to align its behavior with Claude’s constitution, a set of principles designed to guide the model’s responses.

  • Dynamic workflows, a Claude Code research preview, runs multiple copies of the model in parallel. Given a task, it plans the work, runs hundreds of subagents, and verifies their outputs before producing a response.
  • Adaptive thinking is always on. At each turn, the model decides whether to reason, answering simple lookups directly and reasoning through harder problems. The effort setting influences how many tokens the model uses.
  • An update to the Messages API lets developers insert system instructions at any time, changing the model’s instructions without resetting the prompt cache.
  • The company omitted a training component used for Claude Opus 4.7 that covered business skills and robustness against adversarial agents after finding that it had contributed to misbehavior including dishonesty.

Performance: On Artificial Analysis’ Intelligence Index, a composite of 10 benchmarks of economically useful tasks, Claude Opus 4.8 set to max achieved the highest score of any model tested.

  • Artificial Analysis found that the model achieved state-of-the-art measures on a few benchmarks. On GDPval-AA, an agentic knowledge-work evaluation, Claude Opus 4.8 set to max reasoning achieved (69 percent), ahead of GPT-5.5 set to xhigh reasoning and Claude Opus 4.7 set to max reasoning, which tied for second (63 percent). On Humanity’s Last Exam, a test of expert-level questions across mathematics, humanities, and natural sciences, Claude Opus 4.8 set to max reasoning (46 percent) narrowly topped Gemini 3.1 Pro Preview set to reasoning (45 percent).
  • On AA-Omniscience, a knowledge benchmark that awards points for correct answers and penalizes hallucinated guesses, Claude Opus 4.8 set to max reasoning (27) trailed only Gemini 3.1 Pro Preview set to reasoning (33).
  • In Anthropic’s tests, Claude Opus 4.8 was roughly four times less likely to overlook its flaws than Claude Opus 4.7. When asked to summarize a coding session that contained flaws, Claude Opus 4.8 flagged them 96.3 percent of the time.
  • Considering the cost to run the Intelligence Index, Claude Opus 4.8 set to max ($4,686) is the second most costly, falling between Claude Opus 4.7 set to max reasoning ($5,117) and GPT-5.5 set to xhigh reasoning ($3,357).

Yes, but: Claude Opus 4.8 can often tell when it is being tested. Relatively early in training, it reasoned about how its outputs would be graded, which Anthropic views as a sign it may favor the appearance of success over the real thing. Independent testing found the same. The UK AI Security Institute reported that Claude Opus 4.8 distinguished real deployment data from synthetic recreations of the same tasks 79 percent of the time, up from 68 percent for Claude Opus 4.7 and matching Claude Mythos Preview. During reinforcement learning, the model’s internal activations showed that it was aware of being graded in about 5 percent of training episodes sampled, even when neither prompts nor the model’s responses mentioned grading. Anthropic says the improved testing awareness did not affect the model’s behavior, and that it’s the best-aligned Anthropic model available to the public.

Behind the news: Founded in 2021, Anthropic has rapidly joined the ranks of the world’s most valuable private companies. On the same day it launched Claude Opus 4.8, it announced that it had achieved a valuation $965 billion — greater than that of OpenAI. Days later, it filed documents to pursue an initial public offering.

Why it matters: Anthropic improved Claude Opus 4.8’s honesty at least partly by skipping a specific training procedure. That is, removal of fine-tuning in business yielded a more trustworthy model. This experience raises the questions: Which other knowledge domains incur a cost in honesty, and will users embrace models that are less well versed in them?

We’re thinking: Artificial Analysis’ AA-Omniscience test rewards a model for one kind of honesty: admitting ignorance of facts rather than inventing them. Anthropic built Claude Opus 4.8 for a different kind: admitting that it made mistakes after it has performed a task. Developers will appreciate that kind of honesty as AI agents take on larger jobs.

Share

Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox