Unsupervised Data Pruning

New method removes useless machine learning data.

Published

Feb 15, 2023

Large datasets often contain overly similar examples that consume training cycles without contributing to learning. A new paper describes a method that identifies similar training examples, even if they’re not labeled.

What’s new: Ben Sorscher, Robert Geirhos, and collaborators at Stanford University, University of Tübingen, and Meta proposed an unsupervised method for pruning training data without compromising model performance.

Key insight: A subset of a training dataset that can train a model to perform on par with training on the full corpus is known as a coreset. Previous approaches to selecting a coreset require labeled data. Such methods often train many classification models, study their output, and identify examples that are similar based on how many of the models classified them correctly. Clustering offers an unsupervised alternative that enables a pretrained model to find similar examples in unlabeled data without fine-tuning.
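For contrast with the clustering approach, the label-dependent idea mentioned above can be sketched roughly as follows. This is an illustrative simplification, not any specific published method: the function name `ensemble_difficulty` and the choice to score each example by the fraction of models that misclassified it are assumptions made for this sketch. It makes the dependence on labels explicit, since the score cannot be computed without them.

```python
import numpy as np

def ensemble_difficulty(predictions, labels):
    """Score each example by the fraction of models that got it wrong.

    predictions: array of shape (n_models, n_examples) holding predicted class ids.
    labels:      array of shape (n_examples,) holding ground-truth class ids.
    Returns an array of shape (n_examples,) with values in [0, 1];
    0 means every model was correct, 1 means every model was wrong.
    """
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    correct = predictions == labels[None, :]  # broadcast labels across models
    return 1.0 - correct.mean(axis=0)

# Hypothetical toy input: three models, three examples.
toy_predictions = [[0, 1, 2],
                   [0, 1, 0],
                   [0, 2, 0]]
toy_labels = [0, 1, 2]
toy_scores = ensemble_difficulty(toy_predictions, toy_labels)
```

Examples with a score near 0 are ones that nearly every model classifies correctly, which makes them candidates for pruning under this kind of scheme.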

How it works: The authors trained and tested separate ResNets on various pruned versions of datasets both large (ImageNet, 1.2 million examples) and small (CIFAR-10, 60,000 examples). They processed the datasets as follows:

A self-supervised, pretrained SwAV model produced a representation of each example.
K-means clustering grouped the representations.
The authors considered an example to be more similar to others (and thus easier to classify correctly) if it was closer to its cluster’s center, and less similar (harder to classify and thus more valuable to training) if it was farther away.
They pruned a percentage of more-similar examples, a percentage of less-similar examples, or a random selection.
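The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: random vectors stand in for the SwAV representations, k-means is a plain Lloyd's-algorithm implementation, and the function names (`kmeans`, `prototype_distance`, `prune`), the number of clusters, and the number of iterations are all hypothetical choices made for this example.

```python
import numpy as np

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, cluster assignment per row)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centroid: shape (n, k).
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)

def prototype_distance(x, k, seed=0):
    """Distance from each example to the center of its own cluster.
    Small distance = more similar to others; large = less similar."""
    centroids, assign = kmeans(x, k, seed=seed)
    return np.linalg.norm(x - centroids[assign], axis=1)

def prune(scores, fraction, strategy, seed=0):
    """Return sorted indices of the examples kept after removing `fraction`."""
    n = len(scores)
    n_remove = int(round(fraction * n))
    order = np.argsort(scores)  # ascending: most-similar examples first
    if strategy == "most_similar":
        removed = order[:n_remove]
    elif strategy == "least_similar":
        removed = order[n - n_remove:]
    elif strategy == "random":
        removed = np.random.default_rng(seed).choice(n, size=n_remove, replace=False)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.setdiff1d(np.arange(n), removed)

# Stand-in for SwAV embeddings (assumption: real use would load model outputs).
embeddings = np.random.default_rng(0).normal(size=(1000, 16))
scores = prototype_distance(embeddings, k=10)
kept = prune(scores, fraction=0.3, strategy="most_similar")
```

Here `kept` holds the indices of the 700 examples that remain after dropping the 30 percent closest to their cluster centers. Switching `strategy` to `"least_similar"` or `"random"` gives the other two pruning variants described above.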

Results: Tests confirmed the authors’ theory that the optimal pruning strategy depends on dataset size. When the authors pruned the full CIFAR-10 dataset, a ResNet performed better after they removed a portion of the most-similar examples than after they removed least-similar examples, for portions up to 70 percent of the entire dataset. In contrast, starting with 10,000 random CIFAR-10 examples, the model achieved better performance when the authors removed any portion of the least-similar examples than when they removed the same portion of the most-similar examples. On ImageNet, their approach performed close to a state-of-the-art method called memorization, which requires labels. For instance, a ResNet trained on a subset of ImageNet that was missing the most-similar 30 percent of examples achieved 89.4 percent Top-5 accuracy, while using memorization to remove the same percentage of examples yielded nearly the same result. A ResNet trained on a subset of ImageNet that was missing the most-similar 20 percent of examples achieved 90.8 percent Top-5 accuracy, equal to that of a ResNet trained on ImageNet pruned to the same degree via memorization and that of a ResNet trained on ImageNet without pruning.

Why it matters: The authors’ method can cut processing costs during training. Pruning examples before hiring people to label the data can save labor costs as well.

We’re thinking: By identifying overrepresented portions of the data distribution, data pruning methods like this can also help identify biases during training.
