Chatbot Cage Match

Chatbot Arena compares chatbots side by side.

Published

Jul 19, 2023

A new online tool ranks chatbots by pitting them against each other in head-to-head competitions.

What’s new: Chatbot Arena allows users to prompt two large language models simultaneously and identify the one that delivers the better response. The votes feed a leaderboard that includes both open source and proprietary models.

How it works: When a user enters a prompt, two separate models generate their responses side-by-side. The user can pick a winner, declare a tie, rule that both responses were bad, or continue to evaluate by entering a new prompt.

Chatbot Arena offers two modes: battle and side-by-side. Battle mode includes both open source and proprietary models but identifies them only after a winner has been chosen. Side-by-side mode lets users select from a list of 16 open source models.

The system aggregates these competitions and ranks models according to the metric known as Elo, which rates competitors relative to one another. Elo has no maximum or minimum score. A model that scores 100 points more than an opponent is expected to win 64 percent of matches against it, and a model that scores 200 points more is expected to win 76 percent of matches.
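The win percentages above follow from the standard Elo expected-score formula, which can be sketched in a few lines of Python. This is a minimal illustration, not Chatbot Arena's actual code: the function names, the K-factor of 32, and the choice to score a tie as 0.5 are assumptions made for the example, and the article does not say how "both bad" votes are counted.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability of A against B under the standard Elo model."""
    # A 400-point gap corresponds to 10-to-1 odds in the conventional formula.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one match.

    score_a is 1.0 if A wins, 0.0 if A loses, and (by assumption) 0.5 for a tie.
    k is a hypothetical K-factor that controls how fast ratings move.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total number of points is conserved.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    print(round(expected_score(1100, 1000), 2))  # 0.64, a 100-point lead
    print(round(expected_score(1200, 1000), 2))  # 0.76, a 200-point lead
    print(update(1000, 1000, 1.0))               # (1016.0, 984.0)
```

Because the formula depends only on the difference between two ratings, the scale has no built-in floor or ceiling, which matches the article's note that Elo has no maximum or minimum score.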

Who’s ahead?: As of July 19, 2023, OpenAI’s GPT-4 tops the leaderboard. Two versions of Anthropic’s Claude rank second and third. GPT-3.5-turbo holds fourth place, followed by two versions of Vicuna (LLaMA fine-tuned on shared ChatGPT conversations).

Why it matters: Typical language benchmarks assess model performance quantitatively. Chatbot Arena provides a qualitative score, implemented in a way that can rank any number of models relative to one another.

We’re thinking: In a boxing match between GPT-4 and the 1960s-vintage ELIZA, we’d bet on ELIZA. After all, it used punch cards.
