Benchmarks

41 Posts

Table shows Gemini 3 Pro leading in benchmarks, outperforming Gemini 2.5, Claude Sonnet 4.5, and GPT-5.1.
Benchmarks

Google Dominates Arena Leaderboards (For the Moment): Gemini 3 Pro and Nano Banana Pro boast best-in-class multimodal reasoning and image generation

Google introduced Gemini 3 Pro and Nano Banana Pro, its flagship vision-language and image-generation models, and deployed them to billions of users worldwide.
Series of graphs transformed via tokenization and transformer layers, resulting in predicted outputs.
Benchmarks

Forecasting Multiple Time Series: Amazon’s Chronos-2 sorts out tangled variables to make better predictions

Transformers are well suited to predicting future values of time series like energy prices, wages, or weather, but often — as in those examples — multiple time series often influence one another. Researchers built a model that can forecast multiple time series simultaneously.
Energy-Based Transformer refines predictions step by step, lowering energy for higher context compatibility.
Benchmarks

Transformers Energized: Energy-Based Transformers (EBTs) use gradient descent to gradually predict the next token

A new type of transformer can check its work. Instead of guessing the next output token in one shot like a typical transformer, it starts with a rough version of the token and improves it step by step.
Diagram showing SWE-Smith AI pipeline for generating synthetic coding tasks from real repositories using multiple strategies
Benchmarks

Training Data for Coding Assistants: Stanford and Alibaba build bug fixing dataset and pipeline to train AI

A bottleneck in fine-tuning large language models for software engineering is building a dataset that can show them how to edit code, search for subroutines, write test scripts, control a terminal, manage a file system, and so on. Researchers built a pipeline that produces such data automatically.
Bar chart that compares the costs to run a suite of popular benchmarks on reasoning and non-reasoning AI models.
Benchmarks

Benchmarking Costs Climb: Reasoning LLMs Are Pricey to Test

An independent AI test lab detailed the rising cost of benchmarking reasoning models.
Diagram of FP4 training scheme showing BF16 tensor quantization and FP4 tensor core processing for efficient computation.
Benchmarks

4-Bit Efficiency, 16-Bit Accuracy: Microsoft researchers show that heavily quantized versions of Llama can perform as well as near-full-precision

Using an 8-bit number format like FP8 during training saves computation compared to 16- or 32-bit formats, but it can yield less-accurate results. Researchers trained models using 4-bit numbers without sacrificing accuracy.
AI diagram showing generator and scorer loop to produce final output based on test image of a cat.
Benchmarks

Text-Only LLM Goes Multimodal: LLMs learn to caption images, video, and audio without further training

Large language models excel at processing text but can’t interpret images, video, or audio directly without further training on those media types. Researchers devised a way to overcome this limitation.
Comparison chart of GPT-4.1, o3, and o4-mini with other models on coding, math, tool use, and multimodal reasoning benchmarks.
Benchmarks

OpenAI Launches Cost-Effective Alternatives: OpenAI replaces GPT-4.5 with GPT-4.1 Family, plus o3 and o4-mini, new models focused on reasoning and coding

OpenAI refreshed its roster of models and scheduled the largest, most costly one for removal.
AI benchmark comparison chart showing Gemini 2.5 Pro, GPT-4.5, Claude, Grok, and others across science, math, code, and reasoning.
Benchmarks

Google Unveils Gemini 2.5: Google’s Gemini 2.5 Pro Experimental outperforms top AI models

Google’s new flagship model raised the state of the art in a variety of subjective and objective tests.
TabPFN neural network diagram showing synthetic training, prediction on real-world tabular data, and attention layers.
Benchmarks

Better Than Trees for Tabular Data: Transformers can outperform decision trees at predicting unlabeled spreadsheet cells

If you have a collection of variables that represent, say, a cancer patient and you want to classify the patient’s illness as likely cancer or not, algorithms based on decision trees, such as gradient-boosted trees, typically perform better than neural networks.
Table comparing Claude 3.7, 3.5, o1, o3-mini, DeepSeek R1, and Grok 3 Beta on reasoning, coding, tools, visuals, and math.
Benchmarks

Budget for Reasoning to the Token: Claude 3.7 Sonnet adds extended thinking mode

Anthropic’s Claude 3.7 Sonnet implements a hybrid reasoning approach that lets users decide how much thinking they want the model to do before it renders a response.
Table comparing GPT-4.5, GPT-4o, and o3-mini on GPQA, AIME 2024, MMLU, MMMU, and coding tests.
Benchmarks

OpenAI’s GPT-4.5 Goes Big: OpenAI releases GPT-4.5, its most powerful non-reasoning model and maybe its last

OpenAI launched GPT-4.5, which may be its last non-reasoning model.
Diagram of Localize-and-Stitch merging fine-tuned models by combining critical weights into one model.
Benchmarks

Better Performance From Merged Models: Localize-and-Stitch improves methods for merging and fine-tuning multiple models

Merging multiple fine-tuned models is a less expensive alternative to hosting multiple specialized models. But, while model merging can deliver higher average performance across several tasks, it often results in lower performance on specific tasks. New work addresses this issue.
o1 Family Benchmarks comparing pass rates across AIME, Codeforces, and GPQA.
Benchmarks

Higher Reasoning: OpenAI debuts o1 and pro mode for $200/month

OpenAI launched not only its highly anticipated o1 model but also an operating mode that enables the model to deliver higher performance — at a hefty price.
MLE-Bench workflow showing competition steps for model training, testing, and leaderboard scoring.
Benchmarks

When Agents Train Algorithms: OpenAI’s MLE-bench tests AI coding agents

Coding agents are improving, but can they tackle machine learning tasks? 
Load More

Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox