The latest advances in AI are built on freely available training data. What will happen if it becomes off-limits?
The fear: Creative workers don’t want AI developers to train models on their works without permission or compensation, if at all. Data is vanishing as creators scramble to lock it down.
Horror stories: Generative AI models readily produce outputs that imitate the styles of individual authors and artists. Creative people and organizations that work on their behalf are reacting by suing AI developers (all proceedings are ongoing at publication time) and restricting access to their works.
- A class-action lawsuit against Microsoft, OpenAI, and GitHub claims that OpenAI improperly used open source code to train GitHub’s Copilot code-completion tool.
- Several artists filed a class-action lawsuit against Stability AI, Midjourney, and the online artist community DeviantArt, arguing that the companies violated the plaintiffs’ copyrights by training text-to-image generators on their artwork.
- Universal Music Group, which accounts for roughly one-third of the global revenue for recorded music, sued Anthropic for training its Claude 2 language model on copyrighted song lyrics.
- The New York Times altered its terms of service to forbid scraping its webpages to train machine learning models. Reddit and Stack Overflow began charging for their data.
- Authors brought a class-action lawsuit against Meta, claiming that it trained LLaMA on their works illegally. The Authors Guild sued OpenAI on similar grounds.
- The threat of a lawsuit by a Danish publishers’ group persuaded the distributor of Books3, a popular dataset of about 183,000 digitized books, to take it offline.
Survival in a data desert: Some AI companies have negotiated agreements for access to data. Others let publishers opt out of their data-collection efforts. Still others are using data already in their possession to train proprietary models.
- OpenAI cut deals with image provider Shutterstock and news publisher The Associated Press to train its models on materials they control.
- Google and OpenAI recently began allowing website owners to opt out of having their webpages used to train those companies’ machine learning models.
- Large image providers Getty and Adobe offer proprietary text-to-image models trained on images they control.
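For readers curious how the publisher opt-out works in practice: it is typically expressed in a site's robots.txt file. The sketch below is illustrative only. It assumes the publicly documented tokens `GPTBot` (OpenAI) and `Google-Extended` (Google), a made-up site at example.com, and a hypothetical helper named `allowed_agents`. It uses Python's standard `urllib.robotparser` to check which agents a sample robots.txt would admit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site (not taken from any real publisher).
# It opts out of the two AI-training tokens while leaving other crawlers alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, url, agents):
    """Return {agent: True/False} for whether each agent may fetch `url`.

    `allowed_agents` is a helper invented for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    "https://example.com/articles/1",  # assumed example URL
    ["GPTBot", "Google-Extended", "SomeSearchBot"],  # last name is made up
)
print(result)
```

Note that robots.txt is a voluntary convention: it signals a publisher's wishes, and it is up to each crawler operator to honor them.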
Facing the fear: Copyright holders and creative workers are understandably worried that generative AI will sap their market value. Whether the law is on their side remains to be seen. Laws in many countries don’t explicitly address the use of copyrighted works to train AI systems. Until legislators set a clear standard, disagreements will be decided case by case and country by country.