datasets

38 Posts

Flowcharts show how a new contrastive learning approach uses metadata to improve AI image classifiers

Learning From Metadata: Descriptive Text Improves Performance for AI Image Classification Systems

Images in the wild may not come with labels, but they often include metadata. A new training method takes advantage of this information to improve contrastive learning.
2 min read
Yoav Shoham

Yoav Shoham: Language Models That Reason

I believe that natural language processing in 2022 will re-embrace symbolic reasoning, harmonizing it with the statistical operation of modern neural networks. Let me explain what I mean by this.
2 min read
Chip Huyen

Chip Huyen: AI That Adapts to Changing Conditions

Until recently, big data processing has been dominated by batch systems like MapReduce and Spark, which allow us to periodically process a large amount of data very efficiently.
2 min read
Abeba Birhane

Abeba Birhane: Clean Up Web Datasets

From language to vision models, deep neural networks are marked by improved performance, higher efficiency, and better generalizations. Yet, these systems are also marked by perpetuation of bias and injustice.
3 min read
Halloween family portrait showing the inheritance of some spooky characteristics

New Models Inherit Old Flaws: AI Models May Inherit Flaws From Previous Systems

Is AI becoming inbred? The fear: The best models increasingly are fine-tuned versions of a small number of so-called foundation models that were pretrained on immense quantities of data scraped from the web.
1 min read
Series of examples of accurate and inaccurate matching of images to text

Crawl the Web, Absorb the Bias: Language Models Absorb Biases from Web Training Data

The emerging generation of trillion-parameter models needs datasets of billions of examples, but the most readily available source of examples on that scale — the web — is polluted with bias and antisocial expressions. A new study examines the issue.
2 min read
Animation showing AlphaFold working

Biomedical Treasure Chest

DeepMind opened access to AlphaFold, a model that finds the shapes of proteins, and to its output so far — a potential cornucopia for biomedical research. The research lab, a division of Google’s parent company Alphabet, made AlphaFold freely available.
2 min read
Walking through a narrow hallway in a library

Bias By the Book

Researchers found serious flaws in an influential language dataset, highlighting the need for better documentation of data used in machine learning.
1 min read
Timeline for biomedical AI projects

Boosting Biomedicine

The U.S. government aims to turbocharge biomedical AI research. The National Institutes of Health, which invests $41.7 billion annually in medical research, announced a program called Bridge to Artificial Intelligence (Bridge2AI) to promote machine learning in human biology and medicine.
1 min read
Model identifying erroneous labels in popular datasets

Labeling Errors Everywhere

Key machine learning datasets are riddled with mistakes. Several benchmark datasets are shot through with incorrect labels. On average, 3.4 percent of examples in 10 commonly used datasets are mislabeled, and the detrimental impact of such errors rises with model size.
2 min read
Blurred human faces in different pictures

De-Facing ImageNet

ImageNet now comes with privacy protection. What’s new: The team that manages the machine learning community’s go-to image dataset blurred all the human faces pictured in it and tested how models trained on the modified images performed on a variety of image recognition tasks.
2 min read
Square brackets with lines disappearing inside

Spotlight on Unreproducible Results

A new website calls out AI research that may not lend itself to being reproduced. Papers Without Code maintains a directory of AI systems that researchers tried but failed to reproduce.
2 min read
Graphs and data related to ImageNet performance

ImageNet Performance: No Panacea

It’s commonly assumed that models pretrained to achieve high performance on ImageNet will perform better on other visual tasks after fine-tuning. But is it always true? A new study reached surprising conclusions.
2 min read
Dozens of different faces shown in a series of images

Cutting Corners to Recognize Faces

Datasets for training face recognition models have ballooned in size — while slipping in quality and respect for privacy. In a survey of 130 datasets compiled over the last four decades, researchers traced how the need for increasing quantities of data led researchers to relax their standards.
2 min read
Different data related to the phenomenon called underspecification

Facing Failure to Generalize

The same models trained on the same data may show the same performance in the lab, and yet respond very differently to data they haven’t seen before. New work finds this inconsistency to be pervasive.
2 min read
