Benchmarks

22 Posts

Image Generators in the Arena: Text-to-image generators face off in arena leaderboard by Artificial Analysis

An arena-style contest pits the world’s best text-to-image generators against each other.
Challenging Human-Level Models: Hugging Face overhauls open LLM leaderboard with tougher benchmarks

An influential ranking of open models revamped its criteria, as large language models approach human-level performance on popular tests.
Benchmarks for Agentic Behaviors: New LLM benchmarks for Tool Use and Planning in workplace tasks

Tool use and planning are key behaviors in agentic workflows that enable large language models (LLMs) to execute complex sequences of steps. New benchmarks measure these capabilities in common workplace tasks.
Private Benchmarks for Fairer Tests: Scale AI launches SEAL leaderboards to benchmark model performance

Scale AI offers new leaderboards based on its own benchmarks.
Benchmarks for Industry: Vals AI evaluates large language models on industry-specific tasks.

How well do large language models respond to professional-level queries in various industry domains? A new company aims to find out.
Sample-Efficient Training for Robots: Reinforcement learning from human feedback to train robots

Training an agent that controls a robot arm to perform a task — say, opening a door — that involves a sequence of motions (reach, grasp, turn, pull, release) can take from tens of thousands to millions of examples...
When Trees Outdo Neural Networks: Decision Trees Perform Best on Most Tabular Data

While neural networks perform well on image, text, and audio datasets, they fall behind decision trees and their variations for tabular datasets. New research looked into why.
Humanized Training for Robot Arms: New Research Improves Robot Performance and Adaptability

Robots trained via reinforcement learning usually study videos of robots performing the task at hand. A new approach used videos of humans to pre-train robotic arms.
Toward Next-Gen Language Models: New Benchmarks Test the Limits of Large Language Models

A new benchmark aims to raise the bar for large language models. Researchers at 132 institutions worldwide introduced the Beyond the Imitation Game benchmark (BIG-bench), which includes tasks that humans perform well but current state-of-the-art models don’t.
AI Progress Report: Stanford University's fifth annual AI Report for 2022

A new study showcases AI’s growing importance worldwide. The fifth annual AI Index from Stanford University’s Institute for Human-Centered AI documents rises in funding, regulation, and performance.
Transformer Variants Head to Head: A benchmark for comparing different AI transformers.

The transformer architecture has inspired a plethora of variations. Yet researchers have used a patchwork of metrics to evaluate their performance, making them hard to compare. New work aims to level the playing field.
Computation as a National Resource: An effort to estimate computing capacity for 37 nations.

How much processing power do various nations have on hand to drive their AI strategy? An international trade group aims to find out. The Organisation for Economic Co-operation and Development (OECD) is launching an effort to measure the computing capacity available in countries around the world.
Prosperity of the Commons: Tools from MLCommons for improved model development

A new consortium of companies, schools, and research labs is building open tools for next-generation machine learning. MLCommons aims to foster innovation in machine learning by developing new benchmarks, datasets, and best practices.
Dynamic Benchmarks: A platform for fooling language models

Benchmarks provide a scientific basis for evaluating model performance, but they don’t necessarily map well to human cognitive abilities. Facebook aims to close the gap through a dynamic benchmarking method that keeps humans in the loop.
Do Muppets Have Common Sense?: The Bert NLP model scores high on a common-sense test

Two years after it pointed a new direction for language models, Bert still hovers near the top of several natural language processing leaderboards. A new study considers whether Bert simply excels at tracking word order or learns something closer to common sense.