News Outlet Challenges AI Developers The New York Times forbids the use of its work in training datasets.

Published

Aug 23, 2023

Reading time

2 min read

The New York Times launched a multi-pronged attack on the use of its work in training datasets.

What’s new: The company updated its terms of service to forbid use of its web content and other data for training AI systems, Adweek reported. It’s also exploring a lawsuit against OpenAI for unauthorized use of its intellectual property, according to NPR. Meanwhile, The New York Times backed out of a consortium of publishers that would push for payment from AI companies.

From negotiation to mandate: The 173-year-old publisher, which has nearly 10 million subscribers across online and print formats, was negotiating with OpenAI to use its material, but talks recently broke down. The New York Times had more success with Google: In February, Google agreed to pay around $100 million to use Times content in search results, although an agreement on AI training was not reported.

The updated The New York Times terms of service prohibit visitors from using text, images, video, audio, or metadata to develop software or curate third-party datasets without explicit permission. The prohibition on software development explicitly includes training machine learning or AI systems. (The terms of service previously prohibited the use of web crawlers to scrape the publisher’s data without prior consent.)
People with knowledge of the potential lawsuit said The New York Times worried that readers could get its reporting directly from ChatGPT.
It’s unclear whether existing United States copyright law protects against AI training. If a judge were to rule in favor of The New York Times, OpenAI might have to pay up to $150,000 per instance of copyright infringement and possibly destroy datasets that contain related works. OpenAI might defend itself by claiming fair use, a vague legal standard that requires a judge’s decision to determine.

Behind the news: Earlier this month, 10 press and media organizations including Agence France-Presse, Associated Press, and stock media provider Getty Images signed an open letter that urges regulators to place certain restrictions on AI developers. The letter calls for disclosure of training datasets, labeling of model outputs as AI-generated, and obtaining consent of copyright holders before training a model on their intellectual property. The letter followed several ongoing lawsuits that accuse AI developers of appropriating data without proper permission or compensation.

Why it matters: Large machine learning models rely on training data scraped from the web as well as other freely available sources. Text on the web is sufficiently plentiful that losing a handful of sources may not affect the quality of trained models. However, if the norms were to shift around using scraped data to train machine learning models in ways that significantly reduced the supply of high-quality data, the capabilities of trained models would suffer.

We’re thinking: Society reaps enormous rewards when people are able to learn freely. Similarly, we stand to gain incalculable benefits by allowing AI to learn from information available on the web. An interpretation of copyright law that blocks such learning would hurt society and derail innovation. It’s long past time to rethink copyright for the age of AI.

Subscribe to The Batch