Anthropic continued its tradition of building AI models that raise the bar in coding tasks.
What’s new: Anthropic launched Claude Sonnet 4 and Claude Opus 4, the latest medium- and largest-size members of its family of general-purpose large language models. Both models offer an optional reasoning mode and can use multiple tools in parallel while reasoning. In addition, the company made Claude Code, a coding agent previously offered as a research preview, generally available along with a Claude Code software development kit.
- Input/output: Text, images, PDF files in (up to 200,000 tokens); text out (Claude Sonnet 4 up to 64,000 tokens, Claude Opus 4 up to 32,000 tokens)
- Features: Parallel tool use including computer use, selectable reasoning mode with visible reasoning tokens (see the API sketch after this list), multilingual (15 languages)
- Performance: Ranked Number One in LMSys WebDev Arena, state-of-the-art on SWE-bench and Terminal-bench
- Availability/price: Anthropic API, Amazon Bedrock, Google Cloud Vertex AI. Claude Sonnet 4 $3/$15 per million input/output tokens, Claude Opus 4 $15/$75 per million input/output tokens
- Undisclosed: Parameter counts, specific training methods and datasets
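For developers, the selectable reasoning mode is exposed through the API’s extended-thinking setting. The sketch below, written against Anthropic’s Python SDK, shows one way to enable it; the model ID, token budgets, and prompt are illustrative assumptions rather than details from the announcement.

```python
# Minimal sketch: calling Claude Sonnet 4 with the reasoning mode (extended thinking) enabled.
# Assumption: the model ID and token budgets below are illustrative; check Anthropic's
# documentation for current identifiers and limits.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",                      # assumed model ID
    max_tokens=16000,                                      # well under the 64,000-token output cap
    thinking={"type": "enabled", "budget_tokens": 8000},   # turn on reasoning mode
    messages=[{
        "role": "user",
        "content": "Refactor this function to remove its global state: ...",
    }],
)

# The response interleaves "thinking" blocks (visible reasoning tokens, which may be
# summarized for very long chains of thought) with ordinary "text" blocks.
for block in response.content:
    if block.type == "thinking":
        print("[reasoning]", block.thinking[:200])
    elif block.type == "text":
        print(block.text)
```

Note that the thinking budget must fit within the max_tokens limit, so leaving ample headroom for the final answer is a reasonable starting point.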
How it works: The team trained the Claude 4 models on a mix of publicly available web data, purchased proprietary data, data from Claude users who opted to share their inputs and outputs, and generated data. They fine-tuned the models to be helpful, honest, and harmless using human and AI feedback.
- The models make reasoning tokens visible within limits. For especially lengthy chains of thought, an unspecified smaller model summarizes reasoning tokens.
- Given local file access, Claude Opus 4 can create and manipulate files to store information. For instance, prompted to maintain a knowledge base while playing a Pokémon video game, the model produced a guide to the game that offered advice such as, “If stuck, try OPPOSITE approach” and “Change Y-coordinate when horizontal movement fails.”
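The memory behavior described above is not a special API feature; it emerges when a developer exposes ordinary file tools and lets the model decide when to call them. The sketch below shows how such tools might be declared with the Messages API; the tool names, schemas, and prompt are assumptions for illustration, and the caller still has to implement the actual file reads and writes.

```python
# Sketch: exposing local-file tools so the model can keep external "memory".
# Assumption: the tool names and schemas are illustrative, not Anthropic's actual setup;
# the harness that executes the reads and writes must be supplied by the caller.
import anthropic

client = anthropic.Anthropic()

file_tools = [
    {
        "name": "read_file",
        "description": "Read a UTF-8 text file from the working directory.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "write_file",
        "description": "Create or overwrite a UTF-8 text file in the working directory.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}, "content": {"type": "string"}},
            "required": ["path", "content"],
        },
    },
]

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model ID
    max_tokens=4096,
    tools=file_tools,
    messages=[{
        "role": "user",
        "content": "Keep a running notes.md of strategies that work as you play.",
    }],
)

# When the model decides to take notes, it returns a tool_use block; the caller
# executes the request and feeds the result back in a follow-up message.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```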
Results: Both Claude 4 models tied Google Gemini 2.5 Pro at the top of the LMSys WebDev Arena and achieved top marks on coding and agentic computer-use benchmarks in Anthropic’s tests.
- On SWE-bench Verified, which tests the model’s ability to solve software issues from GitHub, Claude Opus 4 succeeded 72.5 percent of the time, and Claude Sonnet 4 succeeded 72.7 percent of the time. The next best model, OpenAI o3, succeeded 70.3 percent of the time.
- Terminal-bench evaluates how well models work with the benchmark’s built-in agentic framework to perform tasks on a computer terminal. Claude Opus 4 succeeded 39.2 percent of the time and Claude Sonnet 4 succeeded 33.5 percent of the time, whereas the closest competitor, OpenAI GPT-4.1, succeeded 30.3 percent of the time. Using Claude Code as the agentic framework, Claude Opus 4 succeeded 43.2 percent of the time and Claude Sonnet 4 succeeded 35.5 percent of the time.
Why it matters: The new models extend LLM technology with parallel tool use, external files that serve as a form of memory, and the ability to stay on task for unusually long periods. Early users have reported many impressive projects, including a Tetris clone built in one shot and a seven-hour stint refactoring Rakuten’s open-source code base.
We’re thinking: Prompting expert @elder_plinius published a text file purported to be Claude 4’s system prompt, which includes material that does not appear in Anthropic’s own publication of its prompts. It is instructive to see how the prompt conditions the model for tool use, agentic behavior, and reasoning.