Representing the Underrepresented

Many important AI datasets contain bias.

Published Dec 23, 2020

Some of deep learning’s bedrock datasets came under scrutiny as researchers combed them for built-in biases.

What happened: Researchers found that popular datasets impart biases against socially marginalized groups to trained models due to the ways the datasets were compiled, labeled, and used. Their observations prompted reforms as well as deeper awareness of social bias in every facet of AI.

Driving the story: Image collections were in the spotlight — including ImageNet, the foundational computer-vision dataset.

ImageNet creator Fei-Fei Li and colleagues combed the venerable dataset to remove racist, sexist, and otherwise demeaning labels that were inherited from WordNet, a lexical database dating back to the 1980s.
A study found that even models trained on unlabeled ImageNet data can learn biases that arise from the dataset’s limited human diversity.
MIT Computer Science & Artificial Intelligence Laboratory withdrew the Tiny Images dataset after outside researchers found that it was rife with disparaging labels.
Flickr-Faces-HQ (FFHQ), the dataset used to train StyleGAN, apparently also lacks sufficient diversity. This problem emerged when PULSE, a model based on StyleGAN that boosts the resolution of low-res photos, up-rezzed a pixelated image of President Barack Obama, the first Black U.S. president, into a portrait of a White man.

Behind the news: In the wake of the PULSE fiasco, Facebook’s chief AI scientist Yann LeCun and Timnit Gebru, then head of Google’s ethical AI efforts, argued publicly over whether social biases in machine learning originate primarily in faulty datasets or systemic biases within the AI community. LeCun took the position that models aren’t biased until they learn from biased data, and that biased datasets can be fixed. Gebru pointed out — and we agree, as we said in a weekly letter — that such bias arises within a context of social disparities, and that eliminating bias from AI systems requires addressing those disparities throughout the field. Gebru and Google subsequently parted amid further disagreements around bias.

Where things stand: The important work of ensuring that biases in datasets are documented and removed for particular tasks, such as generating training data, has only just begun.

Learn more: In the past year, The Batch reported on bias-mitigation techniques including Double-Hard Debias and Deep Representation Learning on Long-Tailed Data.
