A Kinder, Gentler Language Model: Inside InstructGPT, OpenAI's GPT-3 successor

OpenAI unveiled a more reliable successor to its GPT-3 natural language model.

What’s new: InstructGPT is a version of GPT-3 fine-tuned to minimize harmful, untruthful, and biased output. It's available via an application programming interface at a cost between $0.0008 and $0.06 per thousand tokens depending on speed and capability.
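The quoted per-token pricing can be sanity-checked with simple arithmetic. The sketch below is illustrative only; the tier names are hypothetical placeholders, not OpenAI's actual model names:

```python
# Per-1,000-token prices quoted above; tier names are hypothetical.
PRICE_PER_1K_USD = {"fastest": 0.0008, "most_capable": 0.06}

def estimate_cost(num_tokens: int, tier: str) -> float:
    """Estimated cost in USD for a request of num_tokens at the given tier."""
    return num_tokens / 1000 * PRICE_PER_1K_USD[tier]

# 2,500 tokens at the top tier: 2.5 * $0.06 = $0.15
print(estimate_cost(2500, "most_capable"))
```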

How it works: The developers improved the quality of GPT-3’s output using a combination of supervised learning and reinforcement learning from human feedback, in which humans rank a model’s potential outputs and a reinforcement learning algorithm rewards the model for producing material similar to high-ranking outputs.

  • The training dataset began with prompts created by hired contractors, some of them based on input by GPT-3 users, such as “tell me a story about a frog” or “explain the moon landing to a six-year-old in a few sentences.” The developers split the prompts into three parts and created responses in different ways for each part.
  • Human writers wrote responses to the first set of prompts. The developers fine-tuned a pretrained GPT-3, which would become InstructGPT, to reproduce the human-written response to each prompt.
  • The next step was to train a reward model to assign higher rewards to better responses. Given the second set of prompts, the fine-tuned model generated multiple responses, and human raters ranked them. Given a prompt and two responses, the reward model (another pretrained GPT-3) learned to compute a higher reward for the higher-rated response and a lower reward for the other.
  • The developers used the third set of prompts to further fine-tune the language model using the reinforcement learning method Proximal Policy Optimization (PPO). Given a prompt, the language model generated a response, and the reward model granted a commensurate reward. PPO used the rewards to update the language model.
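The reward model step above is typically trained with a pairwise ranking loss: given two responses to the same prompt, the model is penalized when it scores the human-preferred one lower. A minimal sketch of that loss (the function names are ours, not from the paper):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one human-labeled pair: -log(sigmoid(r_chosen - r_rejected)).
    It shrinks as the reward model widens the margin in favor of the
    response the human rater preferred."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))

print(pairwise_reward_loss(2.0, 0.0))  # correct ranking, wide margin: small loss
print(pairwise_reward_loss(0.0, 2.0))  # wrong ranking: large loss
```

Minimizing this loss over many ranked pairs teaches the reward model to act as a stand-in for human judgment during reinforcement learning.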
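PPO's defining trick in the final step is a clipped surrogate objective that keeps each update close to the previous policy. Here is a single-action sketch of that objective; the advantage would be derived from the reward model's score in the pipeline above:

```python
def ppo_clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate objective for one action.
    ratio = pi_new(a|s) / pi_old(a|s); eps bounds how far the new
    policy may move from the old one in a single update."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# A ratio of 1.5 is clipped to 1.2 when the advantage is positive,
# capping the incentive to drift far from the old policy.
print(ppo_clipped_objective(1.5, 1.0))
```

In practice InstructGPT also adds a per-token KL penalty against the original model, which this sketch omits for brevity.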

Results: InstructGPT outperformed GPT-3 on TruthfulQA, which measures how truthfully a model answers questions, 0.413 to 0.224 (higher is better). It also beat GPT-3 on RealToxicityPrompts, which tests a model’s propensity to produce toxic language, 0.196 to 0.233 (lower is better). Contractors rated InstructGPT’s output higher-quality than GPT-3’s, even though InstructGPT has only 1.3 billion parameters — more than 100 times fewer than GPT-3’s 175 billion.

Behind the news: GPT-3’s training dataset — in particular, massive quantities of text scraped from the web — has been linked to output that stereotypes certain social groups, denigrates women, and encourages self-harm. OpenAI previously tried to detoxify GPT-3 by fine-tuning it on PALMS, a dataset curated according to measures of human rights and human equality.

Why it matters: OpenAI’s language models have powered educational tools, virtual therapists, writing aids, role-playing games, and much more. Social biases, misinformation, and toxicity in such contexts are unhelpful at best, harmful at worst. A system that avoids such flaws is likely to be both less dangerous and more useful.

We’re thinking: Makers of foundation models, general-purpose models that can be fine-tuned for specialized applications, have a special responsibility to make sure their work doesn’t contain flaws that proliferate in fine-tuned versions. OpenAI’s ongoing effort to improve GPT-3 is a hopeful sign that the AI industry can manage such models responsibly.
