Will We Have Enough Data?

Published

Jan 04, 2023

The world’s supply of data soon may fail to meet the demands of increasingly hungry machine learning models.

What’s new: Researchers at Epoch AI found that a shortage of high-quality text data could cause trouble as early as 2023. Vision data may fall short within a decade.

How it works: The authors compared the future need for, and availability of, unlabeled language and vision data. To evaluate language data, the authors focused on text from sources like Wikipedia, arXiv, and libraries of digital books. These sources are subject to editorial or quality control, which makes them especially valuable for training large language models. For vision data, they averaged the number of digital images produced and video uploaded to YouTube, Instagram, Snapchat, WhatsApp, and Facebook.

The authors forecast future supplies of unlabeled data by estimating the current sizes of high-quality data sources. They projected each source’s growth rate based on global population growth, internet penetration, or economic growth (assuming that research and development consumes a fixed percentage of the global economy). Then they summed the sizes of all sources.
Previous work had found the optimal dataset size for a given processing budget. The authors projected the size of datasets required to train future models based on an earlier projection of processing budgets for machine learning.
Considering projected data supplies and the dataset sizes required to train future models, they determined when the two would intersect; that is, when available data would fail to meet demand.

Results: Dataset sizes needed to train large models will grow much faster than data supplies, the authors concluded.

The current supply of high-quality language data amounts to 10¹² to 10¹³ words, growing at 4 to 5 percent annually. Today’s largest high-quality text datasets, like Pile-CC, already contain roughly 10¹² words, a figure that will need to double about every 11 to 21 months to keep pace. Thus the supply is likely to fall short between 2023 and 2027.
Developers of language models can gain a few years of runway by compromising on data quality. The supply of language data rises to around 10¹⁴ to 10¹⁵ words if it includes unedited sources like social media posts, transcribed human speech, and Common Crawl. The authors expect this expanded pool to grow between 6 and 17 percent each year, which could delay the shortage to sometime between 2030 and 2040.
The supply of vision data amounts to 10¹² to 10¹³ images, growing by about 8 percent annually. The largest vision datasets comprise around 10⁹ total images and will need to double every 30 to 48 months to keep up. Given those growth rates, the authors expect vision data to fall short between 2030 and 2060.
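
To make the crossover logic concrete, here is a minimal sketch that treats both supply and dataset size as simple exponentials and solves for the year they meet. This is a deterministic simplification for illustration, not the authors’ model, and it will not reproduce their forecast ranges exactly. The function name, the 2022 base year, and the pairing of endpoint values into "early" and "late" scenarios are assumptions; only the input figures come from the language-data results above.

```python
import math


def years_until_shortfall(supply, supply_growth, dataset_size, doubling_months):
    """Years until an exponentially growing dataset size catches up with supply.

    supply          -- current stock of data (e.g., words)
    supply_growth   -- annual fractional growth of the stock (0.05 = 5 percent)
    dataset_size    -- size of today's largest training dataset
    doubling_months -- months the dataset size takes to double

    Returns 0.0 if the dataset already matches the supply and math.inf if
    the dataset grows no faster than the supply.
    """
    if dataset_size >= supply:
        return 0.0
    demand_rate = math.log(2) / (doubling_months / 12)  # log growth per year
    supply_rate = math.log(1 + supply_growth)           # log growth per year
    if demand_rate <= supply_rate:
        return math.inf
    return math.log(supply / dataset_size) / (demand_rate - supply_rate)


BASE_YEAR = 2022  # assumption: the year the article's "current" figures describe

# Assumed pairing of the article's endpoint values for high-quality text.
early = years_until_shortfall(1e12, 0.04, 1e12, 11)  # small stock, fast doubling
late = years_until_shortfall(1e13, 0.05, 1e12, 21)   # large stock, slow doubling

print(f"Simplified shortfall window: {BASE_YEAR + early:.1f} to {BASE_YEAR + late:.1f}")
```

Under these assumptions, the shortfall arrives immediately in the early scenario and several years later in the late one. The authors’ own estimates carry wide uncertainty, so their ranges differ from this point calculation.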

Behind the news: Epoch previously calculated the size and historical growth of training datasets.

The largest high-quality text datasets have grown, on average, 0.23 orders of magnitude a year for three decades, increasing from 10⁵ words in 1992 to 10¹² words in 2022.
Vision datasets have grown more slowly, increasing around 0.11 orders of magnitude per year. For much of the 2010s, the largest vision datasets were based on ImageNet (10⁶ images). Since 2016, however, much larger image datasets have appeared, such as Google’s JFT-3B (10⁹ images).
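
Growth in orders of magnitude per year converts to a doubling time. The short sketch below does so for the two historical rates above. The function name is hypothetical; the arithmetic is simply log₁₀(2) divided by the yearly rate.

```python
import math


def doubling_time_months(oom_per_year):
    """Months needed to double, given growth in base-10 orders of magnitude per year."""
    return 12 * math.log10(2) / oom_per_year


text_months = doubling_time_months(0.23)    # largest text datasets
vision_months = doubling_time_months(0.11)  # largest vision datasets

print(f"Text datasets double roughly every {text_months:.0f} months")
print(f"Vision datasets double roughly every {vision_months:.0f} months")
```

This works out to roughly 16 months for text and 33 months for vision. Both fall inside the doubling ranges quoted in the results (11 to 21 months and 30 to 48 months, respectively).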

Yes, but: The authors’ estimates have large margins of error, so the time left before data runs out is highly uncertain. Moreover, they mention a number of events that could throw their projections off. These include improvements to the data efficiency of models, increases in the quality of synthetic data, and commercial breakthroughs that establish new sources of data; for instance, widespread use of self-driving cars would produce immense amounts of video.

Why it matters: Despite progress in learning from small datasets, training on a larger quantity of high-quality data, if it’s available, is a reliable recipe for improved performance. If the AI community can’t count on that improvement, it will need to look elsewhere, such as architectures that don’t require so much data to train.

We’re thinking: Many AI naysayers have turned out wrong when technical innovation overran their imaginations, and sometimes the innovator has thanked the naysayer for drawing attention to an important problem. Data-centric methods improve the quality of data that already exists, enabling models to learn more from less data. In addition, novel training techniques have enabled less data-hungry models to achieve state-of-the-art results. And we might be surprised by the clever ways researchers find to get more data.