
Datasets for training face recognition models have ballooned in size — while slipping in quality and respect for privacy.

What’s new: In a survey of 130 datasets compiled over the last four decades, Mozilla fellow Inioluwa Deborah Raji and AI consultant Genevieve Fried traced how the need for increasing quantities of data led researchers to relax their standards. The result: datasets riddled with blurred photos, biased labels, and images of minors, collected and used without permission, the authors told MIT Technology Review.

What they found: The study divides the history of face datasets into four periods.

  • Starting in 1964, face images were captured in photo shoots using paid models and controlled lighting. Gathering these datasets was expensive and time-consuming; the biggest comprised 7,900 images.
  • The U.S. Department of Defense kicked off the second period in 1996 by spending $6.5 million to develop FERET, which contained 14,126 images of 1,200 individuals. Like most other datasets of this era, it was compiled from photo shoots with consenting subjects. Models trained on these datasets faltered in the real world, partly because their lighting and poses were relatively homogeneous.
  • Released in 2007, Labeled Faces in the Wild was the first face dataset scraped from the web. LFW’s 13,000 images included varied lighting conditions, poses, and facial expressions. Other large datasets were gathered from Google, Flickr, and Yahoo as well as mugshots and surveillance footage.
  • In 2014, Facebook introduced DeepFace, the first face recognition model that used deep learning, which identified people with unprecedented accuracy. Researchers collected tens of millions of images to take advantage of this data-intensive approach. Obtaining consent for every example became impossible, as did ensuring that each one’s label was accurate and unbiased.

Why it matters: People deserve to be treated fairly and respectfully by algorithms as well as by other people. Moreover, datasets assembled without due attention to permission and data quality erode the public’s trust in machine learning. Companies such as FindFace stand accused of harvesting online images without consent and using them in ways that violate individuals’ privacy, while shaky algorithms have contributed to biased policing. In the U.K., Canada, and certain U.S. jurisdictions, lawmakers and lawsuits are calling for restrictions on the use of face images without consent.

We’re thinking: Andrew and his teams have worked on many face recognition systems over the years. Our practices have evolved — and continue to do so — as both society and AI practitioners have come to recognize the importance of privacy. As we gather data, we must also work toward fairer and more respectful standards governing its collection, documentation, and use.

Fun fact: Andrew’s face appears (with permission!) in a Carnegie Mellon University face dataset collected by Tom Mitchell in 1996.

