OpenAI Revamps Safety Protocol Inside OpenAI's framework to evaluate and mitigate model risks

Published

Jan 10, 2024

Reading time

2 min read

Retrenching after its November leadership shakeup, OpenAI unveiled a new framework for evaluating risks posed by its models and deciding whether to limit their use.

What’s new: OpenAI’s safety framework reorganizes pre-existing teams and forms new ones to establish a hierarchy of authority with the company’s board of directors at the top. It defines four categories of risk to be considered in decisions about how to use new models.

How it works: OpenAI’s Preparedness Team is responsible for evaluating models. The Safety Advisory Group, whose members are appointed by the CEO for year-long terms, reviews the Preparedness Team’s work and recommends approaches to deploying models and mitigating risks, if necessary. The CEO has the authority to approve and oversee recommendations, overriding the Safety Authority Group if needed. OpenAI’s board of directors can overrule the CEO.

The Preparedness Team scores each model in four categories of risk: enabling or enhancing cybersecurity threats, helping to create weapons of mass destruction, generating outputs that affect users’ beliefs, and operating autonomously without human supervision. The team can modify these risk categories or add new categories in response to emerging research.
The team scores models in each category using four levels: low, medium, high, or critical. Critical indicates a model with superhuman capabilities or, in the autonomy category, one that can resist efforts to shut it down. A model’s score is its highest risk level in any category.
The team scores each model twice: once after training and fine-tuning, and a second time after developers have tried to mitigate risks.
OpenAI will not release models that earn a score of high or critical prior to mitigation, or a medium, high, or critical after mitigation.

Behind the news: The Preparedness Team and Safety Advisory Group join a number of safety-focused groups within OpenAI. The Safety Systems Team focuses on mitigating risks after a model has been deployed; for instance, ensuring user privacy and preventing language models from providing false information. The Superalignment Team, led by Ilya Sutskever and Jan Leike, is charged with making sure hypothetical superintelligent systems, whose capabilities would surpass humans, adhere to values that benefit humans.

Why it matters: AI is an extraordinarily powerful technology whose ultimate impacts are difficult to foresee. OpenAI has invested consistently in AI safety since its inception — even if purportedly cautious moves like keeping its GPT-2 large language model under wraps often looked as much like publicity stunts as safety measures — and its practices are likely to influence those of other AI companies. Furthermore, OpenAI has faced internal chaos partly over concerns about safety and governance. Clear protocols in these areas could prevent future strife and stabilize the company to the benefit of its users, employees, and investors.

We’re thinking: OpenAI’s safety framework looks like a step forward, but its risk categories focus on long-term, low-likelihood outcomes (though they stop short of considering AI’s hypothetical, and likely mythical, existential risk to humanity). Meanwhile, clear and present safety issues, such as social bias and factual accuracy, are well known to afflict current models including OpenAI’s. We hope that the Preparedness Team promptly adds categories that represent safety issues presented by today’s models.

Subscribe to The Batch