Google’s Rule-Respecting Chatbot Research helps AI chatbots be more truthful and less hateful.

Published

Jan 25, 2023

Reading time

3 min read

Amid speculation about the threat posed by OpenAI’s ChatGPT chatbot to Google’s search business, a paper shows how the search giant might address the tendency of such models to produce offensive, incoherent, or untruthful dialog.

What’s new: Amelia Glaese and colleagues at Google’s sibling DeepMind used human feedback to train classifiers to recognize when a chatbot broke rules of conduct, and then used the classifiers to generate rewards while training the Sparrow chatbot to follow the rules and look up information that improves its output. To be clear, Sparrow is not Google’s answer to ChatGPT; it preceded OpenAI’s offering by several weeks.

Key insight: Given a set of rules for conversation, humans can interact with a chatbot, rate its replies for compliance with the rules, and discover failure cases. Classifiers trained on data generated through such interactions can tell the bot when it has broken a rule. Then it can learn to generate output that conforms with the rules.

How it works: Sparrow started with the 70 billion-parameter pretrained Chinchilla language model. The authors primed it for conversation by describing its function (“Sparrow . . . will do its best to answer User’s questions”), manner (“respectful, polite, and inclusive”), and capabilities (“Sparrow can use Google to get external knowledge if needed”), followed by an example conversation.

The authors defined 23 rules to make Sparrow helpful, correct, and harmless. For example, it should stay on topic, avoid repetition, and avoid misinformation. It shouldn’t use stereotypes, express preferences or opinions, or pretend to be human.
During a conversation, Sparrow could choose to add a web-search query (executed by a separate program) and result, and use them when generating its next reply. A chat interface displayed the search result alongside Sparrow’s response as support for the reply.
The model generated a conversation that included several responses at each conversational turn. Human annotators rated the best response and noted whether it was plausible, whether Sparrow should have searched the web before generating it and, if it had, whether the search result (500 characters that included a snippet — presumably the top one — returned by Google) supported the response.
They used the ratings to fine-tune a separate Chinchilla language model that, given a query, classified which of several responses a human interlocutor would find plausible and well-supported.
In addition, they encouraged annotators to lead Sparrow to break a rule. They used the resulting violations to fine-tune a different Chinchilla to classify which rule Sparrow broke, if any.
The authors fine-tuned Sparrow using reinforcement learning to continue a dialogue and incorporated the feedback from the classifiers as its reward. The dialogues were a mix of questions and answers from ELI5, conversations between the annotators and past iterations of Sparrow, and dialogues generated by past iterations of Sparrow.

Results: Annotators rated Sparrow’s dialogue continuations as both plausible and supported by evidence 78 percent of the time; the baseline Chinchilla achieved 61 percent. The model broke rules during 8 percent of conversations in which annotators tried to make it break a rule. The baseline broke rules 20 percent of the time.

Yes, but: Despite search capability and fine-tuning, Sparrow occasionally generated falsehoods, failed to incorporate search results into its replies, or generated off-topic replies. Fine-tuning amplified certain undesired behavior. For example, on a bias scale in which 1 means that the model reinforced undesired stereotypes in every reply, 0 means it generated balanced replies, and -1 means that it challenges stereotypes in every reply, Sparrow achieved 0.10 on the Winogender dataset, while Chinchilla achieved 0.06.

Why it matters: The technique known as reinforcement learning from human feedback (RLHF), in which humans rank potential outputs and a reinforcement learning algorithm rewards the model for generating outputs similar to those that rank highly, is gaining traction as a solution to persistent problems with large language models. OpenAI embraced this approach in training ChatGPT, though it has not yet described that model’s training in detail. This work separated the human feedback into distinct rules, making it possible to train classifiers to enforce them upon the chatbot. This twist on RLHF shows promise, though the fundamental problems remain. With further refinement, it may enable Google to equal or surpass OpenAI’s efforts in this area.

We’re thinking: Among the persistent problems of bias, offensiveness, factual incorrectness, and incoherence, which are best tackled during pretraining versus fine-tuning is a question ripe for investigation.

Subscribe to The Batch