Transparency for Training Data

Datasheets for datasets

AI is only as good as the data it trains on, but there’s no easy way to assess training data’s quality and character. Researchers want to put that information into a standardized form.

What’s new: Timnit Gebru, Jamie Morgenstern, Briana Vecchione, and others propose a spec sheet to accompany AI resources. They call it "datasheets for datasets."

How it works: Anyone offering a data set, pre-trained model, or AI platform could fill out the proposed form describing:

  • motivation for generating the data
  • composition of the data set
  • maintenance issues
  • legal and ethical issues
  • demographics and consent of any people involved
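
As a rough illustration, the five areas above could be captured in a small machine-readable record. This is a minimal sketch, not the authors' actual form: the `Datasheet` class, its field names, the `missing_sections` helper, and the sample values are all hypothetical and chosen only to mirror the list.

```python
from dataclasses import dataclass, asdict
from typing import List
import json


@dataclass
class Datasheet:
    """Hypothetical record with one free-text field per area listed above."""
    motivation: str = ""
    composition: str = ""
    maintenance: str = ""
    legal_and_ethical: str = ""
    demographics_and_consent: str = ""

    def missing_sections(self) -> List[str]:
        """Return the names of sections that were left blank."""
        return [name for name, value in asdict(self).items() if not value.strip()]


# Made-up example values, for illustration only.
sheet = Datasheet(
    motivation="Collected to benchmark sentiment classifiers.",
    composition="English-language product reviews, one label per review.",
)

# Serialize the sheet so it can ship alongside the data set.
print(json.dumps(asdict(sheet), indent=2))

# Flag the sections a publisher still needs to fill in.
print(sheet.missing_sections())
```

A check like `missing_sections` is one way a publisher might catch an incomplete sheet before release. The proposal itself describes a form to fill out, not any particular file format.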

Why it matters: Data collected from the real world tends to embody real-world biases, which can lead AI to make biased predictions. And data sets that don’t represent real-world variety can lead to overfitting, where a model performs well on its training data but poorly on new examples. A reliable description of what’s in the training data could help engineers avoid problems like these.

Bottom line: We live in a world of open APIs, pre-trained models, and off-the-shelf data sets. Users need to know what’s in them. Standardized spec sheets would give them a clearer view.

