The latest update of OpenAI’s flagship model sets new state-of-the-art marks on important benchmarks but has difficulty distinguishing between what it does and doesn't know.
What’s new: GPT-5.5 is a closed vision-language model built for agentic coding, computer use, and knowledge work. GPT-5.5 Pro is the same model but processes reasoning tokens in parallel during inference. OpenAI set the API prices at roughly double the per-token rates of GPT-5.4.
- Input/output: Text and images in (up to 1 million tokens via API, 400,000 tokens in Codex), text out (up to 128,000 tokens)
- Features: Five levels of reasoning (xhigh, high, medium, low, none), tool use, web search, structured outputs, tool search (API only, loads tools on demand rather than all at once), Fast mode (Codex only, generates tokens 1.5 times faster at 2.5 times the price)
- Performance: Tops Artificial Analysis Intelligence Index and ARC-AGI-2
- Availability/price: GPT-5.5 available in ChatGPT with Plus, Pro, Business, or Enterprise subscriptions and in Codex for those tiers plus Edu and Go; GPT-5.5 Pro available in ChatGPT with Pro, Business, or Enterprise subscriptions; GPT-5.5 API $5/$0.50/$30 per million tokens of input/cached input/output, GPT-5.5 Pro API $30/$180 per million tokens of input/output with no cached-input discount (see the cost sketch after this list)
- Undisclosed: Architecture, parameter count, training data and methods
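The published rates translate directly into per-request costs. The sketch below estimates the dollar cost of a single GPT-5.5 API call from the prices listed above; the helper function and token counts are hypothetical illustrations, not part of any OpenAI SDK.

```python
# Illustrative cost estimate based on the published GPT-5.5 API rates:
# $5 (input), $0.50 (cached input), $30 (output) per million tokens.
# Token counts in the example are hypothetical, not measured values.

RATES_PER_MILLION = {"input": 5.00, "cached": 0.50, "output": 30.00}

def estimate_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost of one GPT-5.5 request."""
    return (
        input_tokens * RATES_PER_MILLION["input"]
        + cached_tokens * RATES_PER_MILLION["cached"]
        + output_tokens * RATES_PER_MILLION["output"]
    ) / 1_000_000

# Example: 100,000 fresh input tokens, 20,000 cached, 5,000 output.
print(f"${estimate_cost(100_000, 20_000, 5_000):.2f}")  # $0.66
```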
How it works: OpenAI disclosed few details about how it built GPT-5.5. As is typical of high-performance models, the training data was a mix of publicly available data scraped from the web, licensed from partners, and collected from users and human trainers. The model was trained via reinforcement learning to reason before responding.
Performance: GPT-5.5 generally delivers top performance in objective benchmarks, especially in tests of knowledge, agentic tasks, and abstract visual reasoning. However, it falls behind competitors on subjective evaluations. It’s also more likely than its competitors to confidently deliver incorrect output.
- GPT-5.5 set to xhigh reasoning tops the independent Artificial Analysis Intelligence Index, a composite of 10 tests of economically useful tasks, with a score of 60 points. Claude Opus 4.7 set to max reasoning and Gemini 3.1 Pro Preview set to reasoning are tied at 57 points.
- On ARC-AGI-2, visual puzzles that test abstract reasoning, GPT-5.5 set to xhigh (85.0 percent at $1.87 per task) displaced the previous leader Gemini 3 Deep Think (84.6 percent at $13.62 per task) at a substantially lower cost per task.
- In OpenAI’s tests, GPT-5.5 set state-of-the-art scores on Terminal-Bench 2.0 (command-line workflows that require planning and tool use), OSWorld-Verified (autonomous operation of real computer interfaces), and Tau2-bench Telecom (multi-turn customer-service workflows).
- On AA-Omniscience Accuracy, a knowledge benchmark that rewards factual recall, GPT-5.5 set to xhigh reasoning posted the highest accuracy at 57 percent. However, on the AA-Omniscience Index, which rewards models for answering correctly and acknowledging ignorance but penalizes them for confidently making mistakes (see the scoring sketch after this list), GPT-5.5 set to xhigh reasoning (20 points) ranked third, behind Gemini 3.1 Pro Preview (33 points) and Claude Opus 4.7 set to max reasoning (26 points).
- On LMArena’s leaderboards, which rank models by blind head-to-head comparisons, GPT-5.5 falls well behind competitors. Claude Opus models occupy the top spots across most categories. For instance, as of April 27, GPT-5.5-high ranked seventh in Text Arena and ninth in Code Arena WebDev.
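Artificial Analysis doesn’t spell out the index’s exact weighting here, but the gap between accuracy rank and index rank is easy to see with a simple scheme that credits correct answers, tolerates abstentions, and penalizes wrong ones. The +1/0/−1 weights and counts below are illustrative assumptions, not the benchmark’s actual formula.

```python
# Illustrative scoring in the spirit of the AA-Omniscience Index:
# reward correct answers, tolerate abstentions, penalize confident mistakes.
# The +1 / 0 / -1 weights and the counts below are assumptions for illustration.

def omniscience_style_index(correct: int, wrong: int, abstained: int) -> float:
    total = correct + wrong + abstained
    return 100 * (correct - wrong) / total

# Model A answers more questions correctly but rarely abstains...
print(omniscience_style_index(correct=60, wrong=35, abstained=5))   # 25.0
# ...so a model that knows less but admits ignorance can still rank higher.
print(omniscience_style_index(correct=50, wrong=15, abstained=35))  # 35.0
```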
Yes, but: GPT-5.5 knows more than its peers, but it answers incorrectly more often and acknowledges ignorance less often. The AA-Omniscience benchmark poses 6,000 expert-level questions across business, law, health, humanities, science/engineering, and software engineering. It includes a “hallucination rate” that is the ratio of wrong answers to the sum of wrong answers, partially wrong answers, and abstentions. By this measure, GPT-5.5 set to high reasoning hit 85.53 percent, notably worse than Claude Opus 4.7 set to max reasoning (36.18 percent) and Gemini 3.1 Pro Preview (49.87 percent). Apollo Research separately found that GPT-5.5 lied about completing an impossible programming task in 29 percent of samples, a significant jump from GPT-5.4’s 7 percent. OpenAI’s internal monitoring of coding-agent traffic showed a similar pattern.
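That ratio rewards abstaining when unsure. The sketch below computes the hallucination rate exactly as the benchmark defines it; the answer counts are hypothetical and serve only to show how the metric behaves.

```python
# Hallucination rate as AA-Omniscience defines it: wrong answers divided by
# the sum of wrong, partially wrong, and abstained answers.
# Counts below are hypothetical, not the actual benchmark tallies.

def hallucination_rate(wrong: int, partially_wrong: int, abstained: int) -> float:
    return 100 * wrong / (wrong + partially_wrong + abstained)

# A model that rarely abstains scores poorly on this metric...
print(round(hallucination_rate(wrong=340, partially_wrong=40, abstained=20), 1))   # 85.0
# ...while one that often admits ignorance scores far better,
# even if it gives just as many wrong answers overall.
print(round(hallucination_rate(wrong=340, partially_wrong=40, abstained=560), 1))  # 36.2
```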
Security implications: OpenAI released results of VulnLMP, an internal evaluation that tests whether a model can develop exploits against widely deployed software. GPT-5.5 undertook multi-day research campaigns and identified potential memory-related vulnerabilities in a variety of targets, but it did not produce an exploit that was confirmed by OpenAI’s evaluation harness. Under OpenAI’s Preparedness Framework, this evidence places GPT-5.5 in the “high” tier of cybersecurity threats, short of the “critical” tier, which describes models that independently produce working exploits against real targets.
Why it matters: Evaluations of objective performance and human preferences are telling different stories about GPT-5.5. OpenAI regained the lead on the Artificial Analysis Intelligence Index, but the picture flips in subjective, head-to-head comparisons. Claude Opus models occupy the top spots in LMArena’s Text, Vision, Document, Search, and Code rankings, while GPT-5.5 doesn’t crack the top five in most. Benchmarks measure what models can accomplish; human preferences measure what models are like to work with. Production decisions usually weigh both, and according to the measures available so far, the two are diverging.
We’re thinking: Top AI companies continue to push the frontier at a dizzying pace. GPT-5.5 is the fourth flagship launch since February, following Anthropic’s Claude Opus 4.7, OpenAI’s own GPT-5.4, and Google’s Gemini 3.1 Pro Preview. Each one reshuffled the top of the Artificial Analysis Intelligence Index, which can be viewed as a proxy for general capability in real-world tasks. Developers should design their software stacks to swap models as easily as bumping a dependency.
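One way to build that flexibility in is to route every model call through a thin adapter so that changing flagships is a configuration change rather than a rewrite. Below is a minimal sketch; the backend classes and model identifiers are hypothetical, and a production routing layer would also handle provider-specific parameters, retries, and fallbacks.

```python
# Minimal sketch of a provider-agnostic model adapter. The backends and
# model identifiers are hypothetical; the point is that application code
# depends on a small interface rather than on any one vendor's SDK.

from typing import Protocol

class Completer(Protocol):
    def complete(self, prompt: str) -> str: ...

class OpenAIBackend:
    def __init__(self, model: str):
        self.model = model
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("Call the OpenAI API here; elided in this sketch.")

class AnthropicBackend:
    def __init__(self, model: str):
        self.model = model
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("Call the Anthropic API here; elided in this sketch.")

# Swapping flagships becomes a config change, not a code change.
MODEL_REGISTRY = {
    "gpt-5.5": lambda: OpenAIBackend("gpt-5.5"),
    "claude-opus-4.7": lambda: AnthropicBackend("claude-opus-4.7"),
}

def get_model(name: str) -> Completer:
    return MODEL_REGISTRY[name]()
```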