Private Benchmarks for Fairer Tests

Scale AI launches SEAL leaderboards to benchmark model performance

Published Jun 19, 2024
Safety, Evaluations and Alignment Lab (SEAL) Leaderboards.

Scale AI offers new leaderboards based on its own benchmarks.

What’s new: Scale AI, which helps companies prepare and manage training data, introduced the Safety, Evaluations and Alignment Lab (SEAL) Leaderboards. Four leaderboards test models’ abilities to (i) generate code, (ii) work on Spanish-language inputs and outputs, (iii) follow detailed instructions, and (iv) solve fifth-grade math problems. The company currently tests 11 models from Anthropic, Google, Meta, Mistral, and OpenAI. Developers who want to have their model ranked can contact Scale AI via email.

How it works: The leaderboards track performance on proprietary datasets of roughly 1,000 examples. In all but the math tests, models to be evaluated are grouped and pitted against each other. Each pair receives 50 prompts at a time. Human annotators evaluate the models’ responses and grade which was superior and by how much. Then the models receive another 50 prompts. Models are ranked using a variation on Elo, which scores competitors relative to each other. To keep the test sets from leaking, a given model will be tested only once except in “exceptional cases” where Scale AI believes the risk of overfitting is low. 

  • The coding evaluation tests models’ abilities to generate code, analyze code, fix errors, and solve problems in SQL, Python, Java, JavaScript, HTML, CSS, C++, C, and C#. Annotators judge the code based on correctness, efficiency, readability, adherence to the prompt, and overall quality. 
  • The Spanish dataset tests the ability to respond to prompts written in European and Latin American Spanish, covering both general and cultural subject matter. Annotators evaluate the responses on 16 criteria including style, correctness, harmfulness, and internal contradiction. (The company plans to extend its multilingual evaluation to other languages.)
  • Instruction Following asks models to fulfill detailed, multi-step instructions in a single response. The dataset includes prompts that ask a model to generate poetry, fiction, social posts, or responses playing a particular role. Annotators evaluate the responses using 12 criteria, including how well they reflect the prompt and how useful they are, rating both how well each model followed the instructions and how the models performed relative to each other.
  • The Math leaderboard evaluates models on Scale AI’s GSM1k benchmark of fifth-grade arithmetic and algebra problems written in English. Unlike the other three tests, it tests whether responses are correct rather than pitting models against one another.
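The pairwise ranking described above can be sketched in code. Scale AI says only that it uses "a variation on Elo," so the sketch below uses the standard Elo update; the K-factor, the 400-point scale, the starting rating of 1000, and the idea of mapping annotator margins into a fractional score are illustrative assumptions, not Scale AI's actual method.

```python
# Minimal sketch of Elo-style pairwise ranking (illustrative, not
# Scale AI's exact variant). After each head-to-head comparison, the
# winner gains rating and the loser gives up the same amount.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie; graded
    annotator margins ("by how much") could be mapped into this range.
    k (the K-factor) controls how fast ratings move.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: two models start at 1000; A wins one comparison.
ra, rb = update(1000.0, 1000.0, 1.0)
print(round(ra, 1), round(rb, 1))  # 1016.0 984.0
```

Because ratings are relative, the absolute numbers are arbitrary; only the gaps between models matter, which is why the leaderboards report rankings rather than raw scores.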

Results: As of this writing, GPT-4 Turbo tops the Coding leaderboard with GPT-4o a very close second. GPT-4o tops the Spanish and Instruction Following leaderboards, just ahead of Gemini 1.5 Pro in Spanish and GPT-4 Turbo in Instruction Following. On the Math leaderboard, Claude 3 Opus holds a narrow lead over GPT-4 Turbo (second) and GPT-4o (third).

Behind the news: As more models are trained on data scraped from the web, leakage of test data into training sets has made it more difficult to evaluate their performance on common benchmarks. Earlier this year, researchers at Shanghai Jiao Tong University evaluated 31 open-source large language models and found that several had a high probability of inaccurate benchmark results due to data leakage. Scale AI built the GSM1k math dataset partly to show that some high-profile language models show evidence of overfitting to the common math benchmark GSM8k. 

Why it matters: Traditionally, benchmarks have been open-source efforts. But proprietary benchmarks are emerging to help developers evaluate their models and applications with greater confidence. By keeping their datasets under wraps, companies like Scale AI and Vals AI ensure that models haven’t previously been exposed to test questions and answers, making evaluations more reliable. However, private benchmarks lack the transparency of their open counterparts. A mix of public, private, and internal evals may be necessary to get a well-rounded picture of a given model’s capabilities.

We’re thinking: We welcome Scale AI’s contribution to the important field of evals, which also includes open benchmarks, LMSYS Chatbot Arena, and HELM.
