Dear friends,
On Monday, a United States District Court ruled that training LLMs on copyrighted books constitutes fair use. A number of authors had filed suit against Anthropic for training its models on their books without permission. Just as we allow people to read books and learn from them to become better writers (but not to regurgitate copyrighted text verbatim), the judge concluded that it is fair use for AI models to learn from books in the same way.
Indeed, Judge Alsup wrote that the authors’ lawsuit is “no different than it would be if they complained that training schoolchildren to write well would result in an explosion of competing works.” While it remains to be seen whether the decision will be appealed, this ruling is reasonable and will be good for AI progress. (Caveat: I am not a lawyer and am not giving legal advice.)
AI has massive momentum, but a few things could put progress at risk:
- Regulatory capture that stifles innovation, especially open source, in the false name of “AI safety”
- Loss of access to cutting-edge semiconductor chips (the most likely cause would be war breaking out in Taiwan)
- Regulations that severely impede access to data for training AI systems
Access to high-quality data is important. Even though the mass media tends to focus on the importance of building large data centers and scaling up models, when I speak with friends at companies that train foundation models, many say that a large share of their daily challenges is data preparation. Specifically, a significant fraction of their day-to-day work follows the usual Data-Centric AI practices: identifying high-quality data (books are one important source), cleaning data (the ruling describes Anthropic taking steps like removing book pages' headers, footers, and page numbers), carrying out error analyses to figure out what types of data to acquire more of, and inventing new ways to generate synthetic data.
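To make the cleaning step concrete, here is a minimal Python sketch of what stripping running headers, footers, and page numbers from digitized book pages might look like. The function name, parameters, and regex heuristics are my own illustrative assumptions, not a description of Anthropic's actual pipeline:

```python
import re

def clean_book_page(page_text: str, running_header: str | None = None) -> str:
    """Strip common scan artifacts (running headers, page numbers, hyphenation)
    from one page of digitized book text. A minimal sketch; real pipelines
    use more robust, document-specific heuristics."""
    lines = page_text.splitlines()

    # Drop a repeated running header (e.g., the book or chapter title) if known.
    if running_header and lines and lines[0].strip().lower() == running_header.strip().lower():
        lines = lines[1:]

    # Drop a trailing line that is nothing but a page number, e.g. "142" or "- 142 -".
    if lines and re.fullmatch(r"\s*-?\s*\d+\s*-?\s*", lines[-1]):
        lines = lines[:-1]

    # Rejoin words hyphenated across line breaks: "trans-\nformed" -> "transformed".
    text = "\n".join(lines)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text.strip()

page = "MY BOOK TITLE\nIt was a trans-\nformed idea.\n142"
print(clean_book_page(page, running_header="My Book Title"))
# -> "It was a transformed idea."
```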
I am glad that a major risk to data access just decreased. Appropriately, the ruling further said that Anthropic’s conversion of books from paper to digital format, a step needed to enable training, was also fair use. However, in a loss for Anthropic, the judge indicated that, while training on legitimately acquired data is fine, using pirated materials (such as texts downloaded from pirate websites) is not fair use. Thus, Anthropic may still be liable on this point. Other LLM providers, too, will now likely have to revisit their practices if their datasets may contain pirated works.
Overall, the ruling is positive for AI progress. Perhaps the biggest benefit is that it reduces ambiguity with respect to AI training and copyright and (if it stands up to appeals) makes the roadmap for compliance clearer. This decision indicates that it is okay to train on legitimately acquired data to build models that generate transformative outputs, and to convert printed books to digital format for this purpose. However, downloading from pirate sites, as well as permanently building a “general purpose” library of texts stored indefinitely for purposes to be determined without permission from the relevant copyright holders, is not fair use.
I am very sympathetic to the many writers who are worried about their livelihoods being affected by AI. I don’t know the right solution for that. Society is better off with free access to more data; but if a subset of people is significantly harmed, I hope we can figure out an arrangement that compensates them fairly.
Keep building!
Andrew