Transparency for Training Data

Published: Apr 24, 2019

AI is only as good as the data it trains on, but there’s no easy way to assess training data’s quality and character. Researchers want to put that information into a standardized form.

What’s new: Timnit Gebru, Jamie Morgenstern, Briana Vecchione, and others propose a spec sheet to accompany AI resources. They call it "datasheets for datasets."

How it works: Anyone offering a data set, pre-trained model, or AI platform could fill out the proposed form describing:

motivation for generating the data
composition of the data set
maintenance issues
legal and ethical issues
demographics and consent of any people involved
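
As a rough illustration, the categories listed above could be captured in a small machine-readable record. The sketch below is only an assumption about how such a form might look in Python: the `Datasheet` class, its field names, and the `missing_sections` helper are hypothetical and are not taken from the researchers' proposal.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Datasheet:
    """Hypothetical record; field names mirror the categories listed above."""
    motivation: str = ""                # why the data was generated
    composition: str = ""               # what the data set contains
    maintenance: str = ""               # who maintains it and how
    legal_and_ethical: str = ""         # licensing, privacy, ethical concerns
    demographics_and_consent: str = ""  # who is represented; was consent given

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Example: a partially completed sheet for an imaginary data set.
sheet = Datasheet(
    motivation="Benchmark for a hypothetical image classifier.",
    composition="10,000 labeled photos (made-up figure for illustration).",
)
print(sheet.missing_sections())
```

A check like `missing_sections` is one way a publisher might confirm that every section is filled in before releasing a data set. The actual proposal describes a written form, not code.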

Why it matters: Data collected from the real world tends to embody real-world biases, leading AI to make biased predictions. And data sets that don’t represent real-world variety can lead models to overfit and perform poorly on cases the data missed. A reliable description of what’s in the training data could help engineers avoid problems like these.

Bottom line: We live in a world of open APIs, pre-trained models, and off-the-shelf data sets. Users need to know what’s in them. Standardized spec sheets would give them a clearer view.