Last week, the New York Times (NYT) filed a lawsuit against OpenAI and Microsoft, alleging massive copyright infringements. The suit:
1. Claims, among other things, that OpenAI and Microsoft used millions of copyrighted NYT articles to train their models
2. Gives examples in which OpenAI models regurgitated NYT articles almost verbatim
I’m sympathetic to publishers who worry about Generative AI disrupting their businesses. I consider independent journalism a key pillar of democracy and thus something that should be protected. Nonetheless, I support OpenAI’s and Microsoft’s position more than the NYT’s. Reading through the NYT suit, I found it surprisingly unclear what actually happened and what the actual harm is. (Clearly, NYT's lawyers aren’t held to the same standard of clarity in writing that its reporters are!)
I am not a lawyer, and I am not giving any legal advice. But the most confusing part of the suit is that it seems to muddy the relationship between points 1 and 2. This left many social media commentators wondering how training on NYT articles led ChatGPT to generate articles verbatim.
I suspect many of the examples of regurgitated articles were not generated using only the model's trained weights, but instead arose from a mechanism like retrieval-augmented generation (RAG), in which ChatGPT, which can browse the web in search of relevant information, downloaded an article in response to the user’s prompt.
First, regarding point 1, today’s LLMs are trained on a lot of copyrighted text. As I wrote previously, I believe it would be best for society if training AI models were considered fair use that did not require a license. (Whether it actually is might be a matter for legislatures and courts to decide.) Just as humans are allowed to read articles posted online, learn from them, and then use what they learn to write brand-new articles, I would like to see computers allowed to do so, too.
Regarding point 2, I saw a lot of confusion about the specific technical mechanism by which ChatGPT might regurgitate an article verbatim, and specifically about whether point 1 leads to point 2. This confusion would have been unnecessary if the NYT suit had explained more clearly what was happening.
I would love to see the NYT explain more clearly whether the apparent regurgitations were from (i) the LLM generating text using its pretrained weights or (ii) a RAG-like capability in which it searched the web for information relevant to the prompt. These are very different things! Stopping an LLM from regurgitating text retrieved using a RAG mechanism seems technically very feasible, so (ii) seems solvable. Further, I find that after pretraining, an LLM's output (without a RAG-like mechanism) is generally a transformation of the text it was trained on, and almost never a verbatim regurgitation. If this analysis is inaccurate, I would like to see the NYT clarify this.
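To illustrate why (ii) seems technically tractable, here is a minimal sketch of one possible guard: it compares the generated output with the retrieved text and, if too many word sequences match exactly, returns a link to the source instead. This is not a description of how OpenAI or anyone else actually handles it. The function names, the 8-word window, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(retrieved, generated, n=8):
    """Fraction of the generated text's n-grams that also appear in the retrieved text."""
    generated_ngrams = word_ngrams(generated, n)
    if not generated_ngrams:
        return 0.0  # output shorter than n words: nothing to compare
    shared = generated_ngrams & word_ngrams(retrieved, n)
    return len(shared) / len(generated_ngrams)


def guard_output(retrieved, generated, source_url, n=8, threshold=0.5):
    """Return the generated text, or a link-back message if it copies too much.

    The window size n and the threshold are hypothetical values, not tuned ones.
    """
    if verbatim_overlap(retrieved, generated, n) >= threshold:
        return f"This response would closely reproduce the source. Read it at: {source_url}"
    return generated
```

A check like this only works when the retrieved text is available at generation time, which is the case in a RAG-like setup. It says nothing about text reproduced from pretrained weights alone, which is part of why the distinction between (i) and (ii) matters.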
So, how bad exactly is (ii)? I can use an online Jupyter notebook (or other development environment) and write instructions that cause it to download and print out copyrighted articles. If I do that, should the provider of the Jupyter notebook be held liable for copyright infringement? If the Jupyter notebook has many other uses that don’t infringe, and the vast majority of users use it in ways that don’t infringe, and it is only my deliberate provision of instructions that causes it to regurgitate an article, I hope that the courts wouldn’t hold the provider of the Jupyter notebook responsible for my actions.
Similarly, I believe that the vast majority of OpenAI’s and Microsoft’s generated output is novel text. So how much should we hold them responsible when someone is able to give ChatGPT instructions that cause it to download and print out copyrighted articles?
Further, to OpenAI’s credit, I believe that its software has already been updated to make regurgitation of downloaded articles less likely. For instance, ChatGPT now seems to refuse to regurgitate downloaded articles verbatim and also occasionally links back to the source articles, thus driving traffic back to the page it had used for RAG. (This is similar to search engines driving traffic back to many websites, which is partly why displaying snippets of websites in search results is considered fair use.) Thus, as far as I can tell, OpenAI has reacted reasonably and constructively.
When YouTube first got started, it had some interesting, novel content (lots of cat videos, for example) but was also a hotbed of copyright violations. Many lawsuits were filed against YouTube, and as the platform matured, it cleaned up the copyright issues.
I see OpenAI and Microsoft Azure rapidly maturing. Many publishers might not like that LLMs are training on their proprietary content. But let’s not confuse the issues. So far, I see relatively little evidence that this leads to regurgitation of nearly verbatim content to huge numbers of users. Further, by closing loopholes in what LLMs with web browsing can and can’t do, many of the issues of regurgitating content verbatim can be resolved. Other potential issues, such as generating images containing famous characters (even when not explicitly prompted to do so), might be harder to resolve, but as the Generative AI industry continues to mature, I’m optimistic that we’ll find good solutions to these problems.