A Kinder, Gentler Language Model

Inside InstructGPT, OpenAI's GPT-3 successor.

Published

Feb 02, 2022

OpenAI unveiled a more reliable successor to its GPT-3 natural language model.

What’s new: InstructGPT is a version of GPT-3 fine-tuned to minimize harmful, untruthful, and biased output. It's available via an application programming interface at a cost between $0.0008 and $0.06 per thousand tokens depending on speed and capability.
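To get a feel for that price range, here is a rough sketch. It assumes cost scales linearly with token count, which the article implies but does not spell out. The function name and the example workload are hypothetical; only the two per-thousand-token rates come from the text above.

```python
def estimate_cost_usd(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Estimated cost in US dollars, assuming simple linear per-token billing."""
    return num_tokens / 1000 * price_per_1k_tokens


# Rates quoted in the article (US dollars per thousand tokens).
LOW_RATE = 0.0008
HIGH_RATE = 0.06

# Hypothetical workload, chosen only for illustration.
tokens = 250_000

low = estimate_cost_usd(tokens, LOW_RATE)
high = estimate_cost_usd(tokens, HIGH_RATE)
print(f"{tokens} tokens: ${low:.2f} to ${high:.2f}")
```

Under these assumptions, a quarter-million tokens would cost roughly 20 cents at the cheapest rate and about 15 dollars at the most expensive one.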

How it works: The developers improved the quality of GPT-3’s output using a combination of supervised learning and reinforcement learning from human feedback, in which humans rank a model’s potential outputs and a reinforcement learning algorithm rewards the model for producing material similar to high-ranking outputs.

The training dataset began with prompts created by hired contractors, some of them based on input by GPT-3 users, such as “tell me a story about a frog” or “explain the moon landing to a six-year-old in a few sentences.” The developers split the prompts into three parts and created responses in different ways for each part.
Human writers wrote responses to the first set of prompts. The developers fine-tuned a pretrained GPT-3, which would become InstructGPT, to reproduce the human-written response to each prompt.
The next step was to train a model to generate higher rewards for better responses. Given the second set of prompts, the fine-tuned model generated multiple responses to each prompt, and human raters ranked them. Given a prompt and two responses, a reward model (another pretrained GPT-3) learned to compute a higher reward for the higher-rated response and a lower reward for the other.
The developers used the third set of prompts to further fine-tune the language model using the reinforcement learning method Proximal Policy Optimization (PPO). Given a prompt, the language model generated a response, and the reward model granted a commensurate reward. PPO used the rewards to update the language model.
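The last two steps can be sketched numerically. The article does not give the exact formulas, so the following is an assumption-laden illustration rather than OpenAI's implementation: a pairwise logistic loss is one common way to teach a reward model to score the preferred response higher, and the clipped surrogate is the standard per-sample objective from PPO. The function names and the default clipping range of 0.2 are illustrative choices, and real training would operate on batches of model outputs rather than single floats.

```python
import math


def pairwise_ranking_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Pairwise logistic loss: -log(sigmoid(r_preferred - r_rejected)).

    The loss is small when the preferred response already gets the higher
    reward and large when the ordering is wrong.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ppo_clipped_objective(prob_ratio: float, advantage: float,
                          clip_range: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate objective (to be maximized).

    prob_ratio is new_policy_prob / old_policy_prob for the sampled output.
    Clipping limits how far a single update can move the policy.
    """
    clipped_ratio = max(1.0 - clip_range, min(prob_ratio, 1.0 + clip_range))
    return min(prob_ratio * advantage, clipped_ratio * advantage)


# Reward model: a correct ordering costs less than an incorrect one.
good_order = pairwise_ranking_loss(2.0, 0.0)
bad_order = pairwise_ranking_loss(0.0, 2.0)
print(f"loss when ordering is right: {good_order:.3f}, wrong: {bad_order:.3f}")

# PPO: a large probability ratio is capped at 1 + clip_range.
print(ppo_clipped_objective(prob_ratio=1.5, advantage=1.0))
```

In this sketch, the advantage stands in for how much better a response scored than expected, a quantity that would be derived from the reward model's output.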

Results: InstructGPT outperformed GPT-3 on TruthfulQA, which measures how truthfully a model answers questions, 0.413 to 0.224 (higher is better). It also beat GPT-3 on RealToxicityPrompts, which tests a model’s propensity to produce toxic language, 0.196 to 0.233 (lower is better). Contractors rated InstructGPT’s output as higher in quality than GPT-3’s, even though InstructGPT had only 1.3 billion parameters, more than 100 times fewer than GPT-3’s 175 billion.
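The size gap is easy to check with quick arithmetic. This minimal sketch uses only the two parameter counts quoted above; the variable names are illustrative.

```python
# Parameter counts as reported in the article.
INSTRUCTGPT_PARAMS = 1.3e9
GPT3_PARAMS = 175e9

size_ratio = GPT3_PARAMS / INSTRUCTGPT_PARAMS
print(f"GPT-3 has about {size_ratio:.0f} times as many parameters")
```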

Behind the news: GPT-3’s training dataset, in particular the massive quantities of text scraped from the web, has been linked to output that stereotypes certain social groups, denigrates women, and encourages self-harm. OpenAI previously tried to detoxify GPT-3 by fine-tuning it on PALMS, a dataset curated according to measures of human rights and human equality.

Why it matters: OpenAI’s language models have powered educational tools, virtual therapists, writing aids, role-playing games, and much more. Social biases, misinformation, and toxicity in such contexts are unhelpful at best, harmful at worst. A system that avoids such flaws is likely to be both less dangerous and more useful.

We’re thinking: Makers of foundation models, general-purpose models that can be fine-tuned for specialized applications, have a special responsibility to make sure their work doesn’t contain flaws that proliferate in fine-tuned versions. OpenAI’s ongoing effort to improve GPT-3 is a hopeful sign that the AI industry can manage such models responsibly.
