Toward Next-Gen Language Models: New Benchmarks Test the Limits of Large Language Models

[Figure: word cloud, chess positions given to the model as text, and a chart showing the percentage of suggested chess moves]

A new benchmark aims to raise the bar for large language models.

What’s new: Researchers at 132 institutions worldwide introduced the Beyond the Imitation Game benchmark (BIG-bench), which includes tasks that humans perform well but current state-of-the-art models don’t.

How it works: The authors selected over 200 tasks based on 10 criteria such as being sensible to humans, not solved by current language models, and “not solvable by memorizing the internet.” Many involve atypical problems such as identifying a single move that will win a game of chess, guessing a movie title from a series of emojis, and playing a role in a mock courtroom trial.

  • The tasks are zero- or few-shot, meaning that a model is given few or no example prompt-and-response pairs and is expected to respond to a novel prompt. (In this way, BIG-bench is used to test models, not to fine-tune them.) A minimal sketch of this setup appears after this list.
  • The authors ran the tasks on various sizes of OpenAI’s GPT-3, Google’s PaLM, and dense and sparse varieties of Google’s BIG-G (based on LaMDA).
  • They also posed the tasks to a team of humans, who were allowed to search the internet as they worked.
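To make the few-shot, multiple-choice setup concrete, here is a minimal sketch of an evaluation loop in that spirit. The toy task data, the prompt format, and the `score_choice()` stub are illustrative assumptions, not BIG-bench's actual task format or scoring harness; a real harness would query a model such as GPT-3, PaLM, or BIG-G for the likelihood of each answer choice.

```python
# Minimal sketch of a few-shot, multiple-choice evaluation loop.
# Everything here (task data, prompt format, score_choice) is illustrative,
# not BIG-bench's actual API.
import random

random.seed(0)

# Toy task: each item has a question, answer choices, and the correct target.
TASK = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5"], "target": "4"},
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Lima"], "target": "Paris"},
    {"question": "Opposite of 'hot'?", "choices": ["cold", "tall", "red"], "target": "cold"},
    {"question": "H2O is commonly called?", "choices": ["salt", "water", "sand"], "target": "water"},
]


def build_prompt(shots, test_question):
    """Format example question/answer pairs followed by the unanswered test question."""
    lines = [f"Q: {ex['question']}\nA: {ex['target']}" for ex in shots]
    lines.append(f"Q: {test_question}\nA:")
    return "\n\n".join(lines)


def score_choice(prompt, choice):
    """Stand-in for a language model's score (e.g., log-likelihood) of `choice`
    given `prompt`. A real harness would call the model here; this stub returns noise."""
    return random.random()


def evaluate(task, num_shots=3):
    """Few-shot accuracy: condition on `num_shots` other examples and pick the
    answer choice the (stub) model scores highest."""
    correct = 0
    for i, item in enumerate(task):
        shots = [ex for j, ex in enumerate(task) if j != i][:num_shots]
        prompt = build_prompt(shots, item["question"])
        prediction = max(item["choices"], key=lambda c: score_choice(prompt, c))
        correct += prediction == item["target"]
    return correct / len(task)


if __name__ == "__main__":
    print(f"3-shot accuracy (random stub model): {evaluate(TASK):.2f}")
```

With a random stub, accuracy hovers around chance; swapping in a real model's log-likelihoods is what produces the kind of scaling trend reported below.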

Results: No model, regardless of size, outperformed the best-performing human on any task. However, for some tasks, the best-performing model beat the average human. For example, answering multiple-choice questions about Hindu mythology, the best model scored around 76 percent, the average human scored roughly 61 percent, and the best human scored 100 percent (random chance was 25 percent). Generally, larger models performed better than smaller ones. For example, BIG-G’s average accuracy on three-shot, multiple-choice tasks was nearly 33 percent with a few million parameters but around 42 percent with over a hundred billion parameters.

Why it matters: BIG-bench’s creators argue that benchmarks like SuperGLUE, SQuAD2.0, and GSM8K focus on narrow skills. Yet the latest language models, after pretraining on huge datasets scraped from the internet, show unexpected abilities such as solving simple arithmetic problems. BIG-bench’s diverse, few-shot tasks give researchers new ways to track such emergent capabilities as models, data, and training methods evolve.

We’re thinking: Devising tasks that can’t be solved by memorizing the internet may push researchers to develop algorithms — including ones that enable complex forms of reasoning — that generalize well even with limited amounts of training data.

