A new online tool ranks chatbots by pitting them against each other in head-to-head competitions.

What’s new: Chatbot Arena allows users to prompt two large language models simultaneously and identify the one that delivers the best responses. The result is a leaderboard that includes both open source and proprietary models.

How it works: When a user enters a prompt, two separate models generate their responses side-by-side. The user can pick a winner, declare a tie, rule that both responses were bad, or continue to evaluate by entering a new prompt.

  • Chatbot Arena offers two modes: battle and side-by-side. Battle mode includes both open source and proprietary models but identifies them only after a winner has been chosen. Side-by-side mode lets users select from a list of 16 open source models.
  • The system aggregates these competitions and ranks models according to the metric known as Elo, which rates competitors relative to one another. Elo has no maximum or minimum score. A model that scores 100 points more than an opponent is expected to win 64 percent of matches against it, and a model that scores 200 points more is expected to win 76 percent of matches.

Who’s ahead?: As of July 19, 2023, OpenAI’s GPT-4 topped the leaderboard. Two versions of Anthropic’s Claude rank second and third. GPT-3.5-turbo holds fourth place followed by two versions of Vicuna (LLaMA fine-tuned on shared ChatGPT conversations).

Why it matters: Typical language benchmarks assess model performance quantitatively. Chatbot Arena provides a qualitative score, implemented in a way that can rank any number of models relative to one another.

We’re thinking: In a boxing match between GPT-4 and the 1960s-vintage ELIZA, we’d bet on ELIZA. After all, it used punch cards.


Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox