What We Know — and Don’t Know — About Foundation Models A new Stanford index to assess the transparency of leading AI models

Published

Nov 01, 2023

Reading time

3 min read

A new index ranks popular AI models in terms of information their developers provide about their training, architecture, and usage. Few score well.

What’s new: The Stanford Center for Research on Foundation Models published its debut Foundation Model Transparency Index, scoring 10 popular models on how well their makers disclosed details of their training, characteristics, and use.

How it works: Rishi Bommasani, Kevin Klyman, and colleagues at Stanford, MIT, and Princeton examined 10 foundation models — that is, models that can be pretrained for general purposes and fine-tuned for specific tasks — from 10 companies. They scored each model by asking 100 yes-or-no questions that covered training, model architecture and behavior, and policies regarding access and usage.

Training: Roughly one-third of the questions asked questions related to training, like whether factors like processing, hardware, and training data used to build the model are disclosed. They also asked whether external parties have access to the dataset and whether steps were taken to protect data privacy or intellectual property.
Architecture and behavior: Around one-third of the questions enquired about the trained model, such as whether a developer disclosed details about a model’s architecture, capabilities, and limitations. They also asked whether independent researchers were able to test the model and evaluate its risks and trustworthiness.
Access and usage: The final third of the questions asked about how the model can be used, including whether the model is available to all prospective users, whether restrictions apply to such uses, and whether use requires an explicit license. They also gauged whether users are notified that they’re interacting with an AI model, whether user data is stored, whether a log of versions is provided, and whether a list of applications based on the model is available.

Results: The index assigned each model a score between 1 and 100. Meta’s Llama 2 ranked most transparent with a score of 54. BigScience’s BLOOM-Z came in just behind with a score of 53. At the bottom of the list were Inflection’s Inflection-1, which scored 21, and Amazon’s Titan Text, which scored 12.

Three of the four highest-scoring models — Llama 2, BLOOMZ, and Stability.AI’s Stable Diffusion 2 — were released with model weights. Meanwhile, the six lowest-scoring models were closed models.
On average, the models showed the greatest transparency with respect to access and usage. They were least transparent with respect to training.
Transparency ratings did not correlate with company size. For instance, the top spots were occupied by Llama 2 from the giant Meta and BLOOMZ from BigScience, a much smaller organization.

Yes, but: Because the index is limited to yes/no questions, it doesn’t allow for partial credit. In addition, the questions are weighted equally, so lack of transparency in an important area (say, access to training data) costs only one point in a model’s overall score. It’s easy to imagine companies gaming the scores rather than addressing the most meaningful deficits.

Behind the news: Researchers at MIT, Cohere For AI, and 11 other organizations recently launched the Data Provenance Platform, a project that audits and categorizes training datasets. The effort offers a Data Provenance Explorer for evaluating sources, licenses, creators, and other metadata with respect to roughly 1,800 text datasets.

Why it matters: AI has a transparency problem, and the rise of models that serve as foundations for other models exacerbates the issue. Without disclosure of fundamental factors like architectures, datasets, and training methods, it’s impossible to replicate research, evaluate cost per performance, and address biases. Without disclosure of applications based on a given foundation model, it’s impossible to weigh those applications’ capabilities and limitations. A consistent set of criteria for evaluating transparency may encourage greater disclosure.

We’re thinking: The rise of open source AI has been accompanied by an opposite rise in commercial concerns that have little incentive to reveal the inner workings of their models. An index encourages everyone to provide detailed information about the systems they build, and we hope it will help engineers who care about transparency to persuade their teammates. We look forward to refinements and expansion to cover models that aren’t included among the initial 10.

Subscribe to The Batch