Can I Use This Data? Conflict over information sources sparked legal and business turmoil in 2023.

Published

Dec 20, 2023

Reading time

2 min read

Information may not want to be free after all.

What happened: The age-old practice of training AI systems on data scraped from the web came into question as copyright owners sought to restrict AI developers from using their works without permission.

Driving the story: Individual copyright holders filed lawsuits against AI companies for training models on their data without obtaining explicit consent, giving credit, or providing compensation. Concurrently, formerly reliable repositories of data on the open web started to require payment or disappeared entirely.

A group of visual artists filed a class-action lawsuit claiming that Midjourney, Stability AI, and online artists’ community DeviantArt infringed their copyright by enabling users to create images in the styles of artists. Getty, a provider of stock images, also sued Stability AI for allegedly using Getty pictures without permission.
High-profile writers and The Authors’ Guild filed a similar lawsuit against OpenAI, claiming that the company infringed their copyrights by training models on their work. Universal Music Group sued Anthropic for training language models on copyrighted song lyrics.
The websites Reddit and Stack Overflow, which have been popular resources for training language models, began charging developers to use their data. The New York Times changed its terms of service to explicitly forbid training AI models from its data.
The Books3 corpus, which contains nearly 200,000 digitized books copied without permission, was part of The Pile, an 800GB corpus that has been used to train popular large language models. In August, the Rights Alliance, an anti-piracy group, forced a web host to remove the corpus.
With open data sources at risk of copyright enforcement, OpenAI entered into agreements with Shutterstock and Axel Springer to use their images and news, respectively. Adobe, Anthropic, Google, IBM, Microsoft, OpenAI, and Shutterstock pledged to take responsibility for some copyright actions that arise from using their generative models.

Copyright conundrum: Whether copyright restricts training machine learning models is largely an open question. Laws in most countries don’t address the question directly, leaving it to the courts to interpret which uses of copyrighted works do and don’t require a license. (In the U.S., the Copyright Office deemed generated images ineligible for copyright protection, so training corpuses made up of generated images are fair game.) Japan is a notable exception: The country’s copyright law apparently allows training machine learning models on copyrighted works.

Where things stand: Most copyright laws were written long ago. The U.S. Copyright Act was established in 1790 and was last revised in 1976! Copyright will remain a battlefield until legislators update laws for the era of generative AI.

Subscribe to The Batch