Benchmarks

A line graph compares SWE-Bench Pro and DeepSWE, showing various models' performance percentages.

Agentic Tests Beyond the Bug Hunt: DeepSWE, ProgramBench, and ITBench-AA push agents harder than SWE-bench

SWE-bench, a family of benchmarks that focuses on an LLM’s ability to fix software bugs, is giving way to new tests that evaluate agent software-engineering performance in more challenging ways.
Bar chart shows Claude Fable 5's fallback rates, with ProgramBench at 100% and others varying.

Claude Fable 5’s Benchmark Problems: Independent tests of Claude Fable 5 run into Anthropic's protective policies

Before Anthropic pulled its latest Claude models from circulation, even professional testers couldn’t readily tell whether they were getting a Mythos-class model or a lesser version under the same name.
The chart compares AI benchmark efforts with employment and capital in U.S. job sectors, highlighting discrepancies.

Toward Agent Benchmarks That Reflect Human Work: AI agents may not be getting better at the full range of economically valuable labor

AI agents seem to be increasingly capable of performing economically valuable tasks, but current benchmarks measure this capability only narrowly.
GPT-5.5 leads in Terminal-Bench 2.0 with 82.7% score, highlighting performance contrast against competitors.

GPT-5.5 Outperforms, Hallucinates: OpenAI’s latest model tops leaderboards for coding, visual puzzles, and overall intelligence

The latest update of OpenAI’s flagship model sets new states of the art in important benchmarks but has difficulty distinguishing between what it does and doesn't know.

Qwen3.5 Outperforms Bigger Models, Leads Vision Benchmarks: Alibaba’s latest flagship models are open-weights MoE performers in sizes from less than 1B parameters

The Qwen3.5 family of open-weights vision-language models includes impressive larger models as well as a smaller one that outperforms an OpenAI open-weights model 10 times its size.
AI models’ performance shown in bars; GPT-5.2 highest at 51, reflecting updated benchmarks.

Artificial Analysis Revamps Intelligence Index: Independent AI testing authority turns from saturated knowledge benchmarks to harder business tests

Artificial Analysis, which tests AI systems, updated the component evaluations in its Intelligence Index to better reflect large language models’ performance in real-world use cases.
A table compares GPT-5.2's benchmark scores to Claude Opus 4.5 and Gemini 3 Pro in various reasoning tasks.

OpenAI’s Answer to Gemini 3: GPT-5.2 arrives, touting variable reasoning and coding performance

OpenAI launched GPT-5.2 only weeks after its CEO Sam Altman reportedly issued a “code red” alarm in response to Google's Gemini 3.
Flowchart showing Tiny Recursive Model process with stages: input, prediction, and latent refinement.

Small Models Solve Hard Puzzles: Tiny Recursive Model beats larger competitors at games like Sudoku and Maze

Large language models often fail at puzzles like Sudoku, for which a solution includes multiple elements and a single mistake invalidates all of them. Researchers showed that a tiny network, by repeatedly refining its solution, can solve this sort of puzzle well.
Table highlights Opus 4.5’s superior scores in coding and reasoning compared to other AI models.

Claude Does More With Fewer Tokens: Claude Opus 4.5 retakes the coding crown at one-third the price of its predecessor

Claude Opus 4.5, the latest version of Anthropic’s flagship model, extends the earlier version’s strengths in coding, computer use, and agentic workflows while generating fewer tokens.
Table shows Gemini 3 Pro leading in benchmarks, outperforming Gemini 2.5, Claude Sonnet 4.5, and GPT-5.1.

Google Dominates Arena Leaderboards (For the Moment): Gemini 3 Pro and Nano Banana Pro boast best-in-class multimodal reasoning and image generation

Google introduced Gemini 3 Pro and Nano Banana Pro, its flagship vision-language and image-generation models, and deployed them to billions of users worldwide.
Series of graphs transformed via tokenization and transformer layers, resulting in predicted outputs.

Forecasting Multiple Time Series: Amazon’s Chronos-2 sorts out tangled variables to make better predictions

Transformers are well suited to predicting future values of time series like energy prices, wages, or weather, but often, as in those examples, multiple time series influence one another. Researchers built a model that can forecast multiple time series simultaneously.
Energy-Based Transformer refines predictions step by step, lowering energy for higher context compatibility.

Transformers Energized: Energy-Based Transformers (EBTs) use gradient descent to gradually predict the next token

A new type of transformer can check its work. Instead of guessing the next output token in one shot like a typical transformer, it starts with a rough version of the token and improves it step by step.
Diagram showing SWE-Smith AI pipeline for generating synthetic coding tasks from real repositories using multiple strategies

Training Data for Coding Assistants: Stanford and Alibaba build bug-fixing dataset and pipeline to train AI

A bottleneck in fine-tuning large language models for software engineering is building a dataset that can show them how to edit code, search for subroutines, write test scripts, control a terminal, manage a file system, and so on. Researchers built a pipeline that produces such data automatically.
Bar chart that compares the costs to run a suite of popular benchmarks on reasoning and non-reasoning AI models.

Benchmarking Costs Climb: Reasoning LLMs Are Pricey to Test

An independent AI test lab detailed the rising cost of benchmarking reasoning models.
Diagram of FP4 training scheme showing BF16 tensor quantization and FP4 tensor core processing for efficient computation.

4-Bit Efficiency, 16-Bit Accuracy: Microsoft researchers show that heavily quantized versions of Llama can perform as well as near-full-precision versions

Using an 8-bit number format like FP8 during training saves computation compared to 16- or 32-bit formats, but it can yield less-accurate results. Researchers trained models using 4-bit numbers without sacrificing accuracy.