Artificial Analysis, which tests AI systems, updated the component evaluations in its Intelligence Index to better reflect large language models’ performance in real-world use cases.
What’s new: The Artificial Analysis Intelligence Index v4.0, an average of 10 benchmarks of model performance, swaps three widely used tests that top LLMs have largely mastered for less-familiar ones. The new benchmarks measure models’ abilities to do economically useful work, recall facts without guessing, and reason through problems. GPT-5.2 set to extra-high reasoning currently leads the index with a score of 51, followed by Claude Opus 4.5 with reasoning enabled (49) and Gemini 3 Pro Preview set to high reasoning (48). GLM-4.7 (42) leads open-weights LLMs. (Disclosure: Andrew Ng has an investment in Artificial Analysis.)
How it works: The Intelligence Index evaluates LLMs using zero-shot English text inputs. Artificial Analysis feeds identical prompts to models at different reasoning and temperature settings. Its tool-execution code gives models access only to a bash terminal and the web. For version 4.0, the company removed MMLU-Pro (answering questions based on general knowledge), AIME 2025 (competition math problems), and LiveCodeBench (competition coding tasks). In their place, it added these benchmarks:
- GDPval-AA tests a model’s ability to produce documents, spreadsheets, diagrams, and the like. Artificial Analysis asks two models to respond to the same prompt, uses Gemini 3 Pro as a judge to grade the pair of outputs as a win, loss, or tie, and computes Elo ratings from the results (a sketch of the Elo update appears after this list). Of the models tested, GPT-5.2 at extra-high reasoning (1428 Elo) led the pack, followed by Claude Opus 4.5 (1399 Elo) and GLM-4.7 (1185 Elo).
- AA-Omniscience poses questions in a wide variety of technical fields and measures a model’s ability to return correct answers without hallucinating. The test awards positive points for factual responses, negative points for false information, and zero points for refusals to answer, arriving at a score between 100 (all answers correct) and -100 (all answers incorrect); a scoring sketch appears after this list. Of the current crop, only five models scored above 0. Gemini 3 Pro Preview (13) did best, followed by Claude Opus 4.5 (10) and Grok 4 (1). Models with the highest rates of accurate answers often posted poor scores because they were dragged down by high rates of hallucination (wrong answers as a share of the questions a model failed to answer correctly). For example, Gemini 3 Pro Preview achieved 54 percent accuracy with an 88 percent hallucination rate, while Claude Opus 4.5 achieved lower accuracy (43 percent) but also a lower hallucination rate (58 percent).
- CritPt asks models to solve 71 unpublished, PhD-level problems in physics. All current models struggle with this benchmark. GPT-5.2 (11.6 percent accuracy) achieved the highest marks. Gemini 3 Pro Preview (9.1 percent) and Claude Opus 4.5 (4.6 percent) followed. Several models failed to solve a single problem.
- The index retained seven tests: 𝜏²-Bench Telecom (which tests the ability of conversational agents to collaborate with users in technical support scenarios), Terminal-Bench Hard (command-line coding and data processing tasks), SciCode (scientific coding challenges), AA-LCR (long-context reasoning), IFBench (following instructions), Humanity's Last Exam (answering expert-level, multidisciplinary, multimodal questions), and GPQA Diamond (answering questions about graduate-level biology, physics, and chemistry).
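To make the GDPval-AA ranking concrete, here is a minimal Python sketch of how Elo ratings can be derived from pairwise judge verdicts. The K-factor, starting ratings, match order, and sample verdicts are illustrative assumptions, not Artificial Analysis’ exact setup.

```python
# Minimal sketch: deriving Elo ratings from win/loss/tie verdicts issued by a judge model.
# K-factor, starting ratings, and the sample verdicts below are assumptions for illustration.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, a: str, b: str, outcome: float, k: float = 32.0) -> None:
    """outcome: 1.0 if the judge prefers A's output, 0.0 if it prefers B's, 0.5 for a tie."""
    ea = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (outcome - ea)
    ratings[b] += k * ((1.0 - outcome) - (1.0 - ea))

# Hypothetical judge verdicts over shared prompts: (model_a, model_b, outcome)
verdicts = [
    ("gpt-5.2", "claude-opus-4.5", 1.0),
    ("claude-opus-4.5", "glm-4.7", 1.0),
    ("gpt-5.2", "glm-4.7", 0.5),
]

ratings = {"gpt-5.2": 1200.0, "claude-opus-4.5": 1200.0, "glm-4.7": 1200.0}
for a, b, outcome in verdicts:
    update_elo(ratings, a, b, outcome)
print(ratings)
```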
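The AA-Omniscience figures above are consistent with a simple per-question scheme: +1 for a correct answer, -1 for a wrong answer, 0 for a refusal, averaged and scaled to a range of -100 to 100. The Python sketch below reconstructs Gemini 3 Pro Preview’s index score of roughly 13 from the accuracy and hallucination-rate figures reported above; the 1,000-question total is a hypothetical round number, not the benchmark’s actual size.

```python
# Sketch of the AA-Omniscience scoring described above. The 1,000-question total is hypothetical;
# only the aggregate percentages for Gemini 3 Pro Preview come from the article.

def omniscience_index(correct: int, incorrect: int, refused: int) -> float:
    """+1 per correct answer, -1 per wrong answer, 0 per refusal, scaled to -100..100."""
    total = correct + incorrect + refused
    return 100.0 * (correct - incorrect) / total

def hallucination_rate(correct: int, incorrect: int, refused: int) -> float:
    """Share of not-correctly-answered questions that drew a wrong answer rather than a refusal."""
    return incorrect / (incorrect + refused)

# Reconstructed from the article's figures: 54 percent accuracy, 88 percent hallucination rate.
correct = 540
incorrect = round(0.88 * (1000 - correct))   # wrong answers instead of refusals
refused = 1000 - correct - incorrect

print(omniscience_index(correct, incorrect, refused))   # ~13, matching the reported index score
print(hallucination_rate(correct, incorrect, refused))  # ~0.88
```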
Behind the news: As LLMs have steadily improved, several benchmarks designed to challenge them have become saturated; that is, the most capable LLMs achieve near-perfect results. The three benchmarks Artificial Analysis replaced suffer from this problem. They’re of little use for evaluating model capabilities, since they leave little room for improvement. For instance, Gemini 3 Pro Preview achieved 90.1 percent on MMLU-Pro, while on AIME 2025, GPT-5.2 achieved 96.88 percent and Gemini 3 Pro Preview achieved 96.68 percent. One reason may be contamination of models’ training data with the benchmark test sets, which would enable models to learn the answers prior to being tested. In addition, researchers may consciously or subconsciously tune their models to particular test sets, accelerating overfitting to those benchmarks.
Why it matters: The Intelligence Index has emerged as an important measuring stick for LLM performance. However, to remain useful, it must evolve along with the technology, and saturated benchmarks need to be replaced by more meaningful measures. The benchmarks in the new index aren’t just less saturated or contaminated; they also measure different kinds of performance. In the past year, tests of mathematics, coding, and general knowledge have become less definitive as models have gained the ability to create documents, reason through problems, and generate reliable information without guessing. The new index rewards more versatile models and those that are likely to be more economically valuable going forward.
We’re thinking: This tougher test suite leaves more room for models to improve but still doesn’t differentiate models enough to break up the pack. Overall, leaders remain neck-and-neck except in a few subcategories like hallucination rates (measured by AA-Omniscience) and document creation (measured by GDPval-AA).