Toward Open-Domain Chatbots: Meena Scores High on System for Grading NLP Chatbots

Published May 27, 2020

Progress in language models is spawning a new breed of chatbots and, unlike their narrow-domain forebears, they have the gift of gab. Recent research tests the limits of conversational AI.

What’s new: Daniel Adiwardana and collaborators at Google Brain propose a human-scored measure, Sensibleness and Specificity Average (SSA), to rate chatbots on important qualities of human dialog. They also offer Meena, a chatbot optimized for open-domain, multi-turn conversation that scores well on the new metric.

Key insight: Sensibleness (whether a statement makes logical and contextual sense) and specificity (how specific it is within the established context) are good indicators of performance in general conversation. While these criteria don’t lend themselves to gradient calculations, an existing loss function can serve as a proxy.
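As a rough illustration of how such a score could be tallied, here is a minimal Python sketch. It assumes each response receives two yes/no labels from a human rater, and that a response judged not sensible is never counted as specific; both the `Label` type and that rule are assumptions for illustration, not details taken from this article.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """One human judgment of a single chatbot response (hypothetical type)."""
    sensible: bool
    specific: bool


def ssa(labels):
    """Return (sensibleness, specificity, SSA) for a list of labels.

    Sensibleness and specificity are the fractions of responses given
    each label; SSA is the plain average of the two fractions.
    """
    if not labels:
        raise ValueError("need at least one labeled response")
    n = len(labels)
    sensibleness = sum(label.sensible for label in labels) / n
    # Assumption: a response that is not sensible cannot count as specific.
    specificity = sum(label.sensible and label.specific for label in labels) / n
    return sensibleness, specificity, (sensibleness + specificity) / 2


if __name__ == "__main__":
    # Made-up labels for four responses, for illustration only.
    ratings = [
        Label(sensible=True, specific=True),
        Label(sensible=True, specific=False),
        Label(sensible=False, specific=False),
        Label(sensible=True, specific=True),
    ]
    print(ssa(ratings))  # (0.75, 0.5, 0.625)
```

A generic reply such as "I don't know" would typically be labeled sensible but not specific, so averaging the two fractions penalizes bots that play it safe.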

How it works: Meena is a sequence-to-sequence model based on the Evolved Transformer architecture. It comprises 2.6 billion parameters, a large number only a few months ago but lately overshadowed by ever-larger models of up to 17 billion parameters.

The researchers trained the bot on 867 million (context, response) pairs gathered from social media conversations.
Given a context, Meena learned to predict the actual response, using perplexity (a measure of how well a language model predicts the next token) as its loss function.
To avoid generating repetitive responses, the model builds multiple candidate responses and uses a classifier to select the best one. The candidates come from a sample-and-rank approach, which samples a fixed number of independent responses. A user-defined temperature parameter controls how readily rare tokens are selected.
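The last two steps can be sketched in a few lines of Python. This is a toy illustration, not Meena's implementation: `toy_model` is a made-up next-token distribution, the rarity parameter is modeled as a sampling temperature, and candidates are ranked by their perplexity under the model, which is an assumed stand-in for whatever selection rule the researchers used.

```python
import math
import random


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


def apply_temperature(probs, temperature):
    """Reshape a distribution: T < 1 favors common tokens, T > 1 rare ones."""
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: value / total for tok, value in scaled.items()}


def sample_response(next_token_probs, temperature, rng, max_len=10):
    """Sample one response; return its tokens and the model's probability
    for each sampled token (including the end-of-sequence token)."""
    tokens, probs_used = [], []
    while len(tokens) < max_len:
        probs = next_token_probs(tokens)
        tempered = apply_temperature(probs, temperature)
        choices = list(tempered)
        weights = [tempered[tok] for tok in choices]
        tok = rng.choices(choices, weights=weights)[0]
        probs_used.append(probs[tok])  # score with the untempered model
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens, probs_used


def sample_and_rank(next_token_probs, num_candidates=5, temperature=0.9, seed=0):
    """Draw independent candidates, then keep the lowest-perplexity one.

    The default values are illustrative, not taken from the article.
    """
    rng = random.Random(seed)
    candidates = [
        sample_response(next_token_probs, temperature, rng)
        for _ in range(num_candidates)
    ]
    return min(candidates, key=lambda cand: perplexity(cand[1]))


def toy_model(prefix):
    """Hypothetical stand-in for a trained model's next-token distribution."""
    if len(prefix) >= 3:
        return {"<eos>": 1.0}
    return {"horses": 0.5, "go": 0.3, "<eos>": 0.2}


if __name__ == "__main__":
    tokens, probs = sample_and_rank(toy_model)
    print(tokens, round(perplexity(probs), 3))
```

In training, the same perplexity computation would be applied to the probabilities the model assigns to the tokens of the actual response, so lower values mean the model found the real reply less surprising.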

Results: The researchers compared Meena with DialoGPT, Cleverbot, Mitsuku, and XiaoIce. For each bot, they scored the SSA of both output transcripts and real-time conversational experiences. Meena performed considerably better, with an SSA of 79 percent versus the next-best score of 56 percent. Across Meena models of various sizes, SSA scores correlated with both human-likeness scores and perplexity.

Why it matters: We’re all for better chatbots, and we’re especially charmed by Meena’s higher-education pun, “Horses go to Hayvard” (see animation above). But this work’s broader contribution is a way to compare chatbot performance and track improvements in conversational ability.

Yes, but: SSA may not top every chatbot designer’s list of criteria. Google, with its mission to organize the world’s information, emphasizes sensibleness and specificity. But Facebook, whose business is built on friendly interactions that may be whimsical, emotional, or disjunct, is aiming for a different target (see “Big Bot Makes Small Talk” below).

We’re thinking: Even imperfect metrics — like the much-criticized but widely used BLEU score for natural language processing — give researchers a clear target and accelerate progress.
