Synthetic datasets can inherit flaws in the real-world data they’re based on. Startups are working on solutions.
What’s new: Generating synthetic datasets for training machine learning systems is a booming business. Companies that provide such datasets are exploring ways to avoid perpetuating biases in the source data.
How it works: The cost of producing a high-quality training dataset is beyond the reach of some companies, and in situations where sufficient real-world data isn’t available, synthetic data may be the only option. But such datasets can echo and even amplify biases including potentially harmful social biases. Vendors like AI.Reverie, GenRocket, Hazy, and Mostly AI are looking for ways to adjust their synthetic output — “distorting reality,” as Hazy’s chief executive put it — to minimize the risk that models trained on their wares will result in unfair outcomes.
- In a recent experiment, Mostly AI generated a dataset based on income data from the 1994 U.S. census, in which men who earned more than $50,000 outnumbered women who earned that amount by 20 percent. To generate a more even distribution of earning power, the company built a generator that applied a penalty when the ratio of synthetic high-earners who were male versus female became lopsided. That approach narrowed the gap to 2 percent.
- The company also generated a dataset based on the infamous Compas recidivism dataset, which has been shown to lead models to overestimate the likelihood that a Black person would commit a crime and underestimate that likelihood for a White person. The initial synthetic dataset skewed toward Black recidivism by 24 percent. The company adjusted the generator using the same parity correction technique and reduced the difference to 1 percent.
Why it matters: Social biases in training datasets often reflect reality. It’s true that altering synthetic datasets to change the balance of, say, men and women who earn high incomes is trading one type of bias for another, rather than eliminating it altogether. The aim here is not necessarily to generate accurate data but to produce fair outcomes.
We’re thinking: We need data, but more than that, we need to build models that result in fair outcomes.