Datasets

43 Posts

Data Does Not Want to Be Free: Reddit and Stack Overflow ask AI devs to pay for data.

Developers of language models will have to pay for access to troves of text data that they previously got for free. The discussion platform Reddit and question-and-answer site Stack Overflow announced plans to protect their data from being used to train large language models.

PCA Raises Red Flags: Principal component analysis can negatively impact science.

Principal component analysis is a key machine learning technique for reducing the number of dimensions in a dataset, but new research shows that its output can be inconsistent and unreliable.

Unsupervised Data Pruning: New method removes useless machine learning data.

Large datasets often contain overly similar examples that consume training cycles without contributing to learning. A new paper identifies similar training examples, even if they’re not labeled.

Language Models Defy Logic: Large NLP models struggle with logical reasoning.

Who would disagree that, if all people are mortal and Socrates is a person, Socrates must be mortal? GPT-3, for one. Recent work shows that bigger language models are not necessarily better when it comes to logical reasoning.

Will We Have Enough Data?

The world’s supply of data may soon fail to meet the demands of increasingly hungry machine learning models. Researchers at Epoch AI found that a shortage of text data could cause trouble as early as this year. Vision data may fall short within a decade.

Learning From Metadata: Descriptive Text Improves Performance for AI Image Classification Systems

Images in the wild may not come with labels, but they often include metadata. A new training method takes advantage of this information to improve contrastive learning.

Yoav Shoham: Language models that reason

I believe that natural language processing in 2022 will re-embrace symbolic reasoning, harmonizing it with the statistical operation of modern neural networks. Let me explain what I mean by this.

Chip Huyen: AI that adapts to changing conditions

Until recently, big data processing has been dominated by batch systems like MapReduce and Spark, which allow us to periodically process a large amount of data very efficiently.

Abeba Birhane: Clean up web datasets

From language to vision models, deep neural networks are marked by improved performance, higher efficiency, and better generalizations. Yet, these systems are also marked by perpetuation of bias and injustice.

New Models Inherit Old Flaws: AI Models May Inherit Flaws From Previous Systems

Is AI becoming inbred? The fear: The best models increasingly are fine-tuned versions of a small number of so-called foundation models that were pretrained on immense quantities of data scraped from the web.

Crawl the Web, Absorb the Bias: NLP Models Absorb Biases from Web Training Data

The emerging generation of trillion-parameter models needs datasets of billions of examples, but the most readily available source of examples on that scale — the web — is polluted with bias and antisocial expressions. A new study examines the issue.

Biomedical Treasure Chest: DeepMind open sources AlphaFold and protein databases.

DeepMind opened access to AlphaFold, a model that finds the shapes of proteins, and to its output so far — a potential cornucopia for biomedical research. The research lab, a division of Google’s parent company Alphabet, made AlphaFold freely available.

Bias By the Book: Researchers find bias in influential NLP dataset BookCorpus.

Researchers found serious flaws in an influential language dataset, highlighting the need for better documentation of data used in machine learning.

Boosting Biomedicine: The NIH's Bridge2AI program will fund new health datasets.

The U.S. government aims to turbocharge biomedical AI research. The National Institutes of Health, which invests $41.7 billion annually in medical research, announced a program called Bridge to Artificial Intelligence (Bridge2AI) to promote machine learning in human biology and medicine.

Labeling Errors Everywhere: Many deep learning datasets contain mislabeled data.

Several key machine learning benchmark datasets are shot through with incorrect labels. On average, 3.4 percent of examples in 10 commonly used datasets are mislabeled, and the detrimental impact of such errors rises with model size.
