Abeba Birhane Clean up web datasets

Published

Dec 29, 2021

Reading time

3 min read

From language to vision models, deep neural networks are marked by improved performance, higher efficiency, and better generalizations. Yet, these systems are also marked by perpetuation of bias and injustice, inaccurate and stereotypical representation of groups, lack of explainability and brittleness. I am optimistic that we will move slowly toward building more equitable AI, thanks to critical scholars who have been calling for caution and foresight. I hope we can adopt measures that mitigate these impacts as a routine part of building and deploying AI models.

The field does not lack optimism. In fact, everywhere you look, you find overenthusiasm, overpromising, overselling, and exaggeration of what AI models are capable of doing. Mainstream media outlets aren’t the only parties guilty of making unsustainable claims, overselling capabilities, and using misleading language; AI researchers themselves do it, too.

Language models, for example, are given human-like attributes such as “awareness” and “understanding” of language, when in fact models that generate text simply predict the next word in a sequence based on the previous words, with no understanding of underlying meaning. We won't be able to foresee the impact our models have on the lives of real people if we don't see the models themselves clearly. Acknowledging their limitations is the first step toward addressing the potential harms they are likely to cause.

What is more concerning is the disregard towards work that examines datasets. As models get bigger and bigger, so do datasets. Models with a trillion parameters require massive training and testing datasets, often sourced from the web. Without the active work of auditing, carefully curating, and improving such datasets, data sourced from the web is like a toxic waste. Web-sourced data plays a critical role in the success of models, yet critical examination of large-scale datasets is underfunded, and underappreciated. Past work highlighting such issues is marginalized and undervalued. Scholars such as Deborah Raji, Timnit Gebru, and Joy Buolamwini have been at the forefront of doing the dirty and tiresome work and cleaning up the mess. Their insights should be applied at the core of model development. Otherwise, we stand to build models that reflect the lowest common denominators of human expression: cruelty, bigotry, hostility, and deceit.

My own work has highlighted troubling content — from misogynistic and racial slurs to malignant stereotypical representations of groups — found in large-scale image datasets such as TinyImages and ImageNet. One of the most distressing things I have ever had to do as a researcher was to sift through LAION-400M, the largest open-access multimodal dataset to date. Each time I queried the dataset with a term that was remotely related to Black women, it produced explicit and dehumanizing images from pornographic websites.

Such work needs appropriate allocations of time, talent, funding and resources. Moreover, it requires support for the people who must do this work. It causes deep, draining emotional and psychological trauma. The researchers who do this work — especially people of color who are often in precarious positions — deserve pay commensurate to their contribution as well as access to counseling to help them cope with the experience of sifting through what can be horrifying, degrading material.

The nascent work in this area so far — and the acknowledgement, however limited, that it has received — fills me with hope in the coming year. Instead of blind faith in models and overoptimism about AI, let’s pause and appreciate the people who are doing the dirty background work to make datasets, and therefore models, more accurate, just, and equitable. Then, let's move forward — with due caution — toward a future in which the technology we build serves the people who suffer disproportionately negative impacts; in the word of Pratyusha Kalluri, towards technology that shifts power from the most to the least powerful.

My highest hope for AI in 2022 is that this difficult and valuable work — and those who do such work, especially Black women — will become part and parcel of mainstream AI research. These scholars are inspiring the next generation of responsible and equitable AI. Their work is reason not for defeatism or skepticism but for hope and cautious optimism.

Abeba Birhane is a cognitive science PhD researcher at the Complex Software Lab in the school of computer science at University College Dublin.

Subscribe to The Batch