The New York Times’ Copyright Lawsuit Against OpenAI Threatens the Future of AI and Fair Use
The New York Times recently sued Microsoft and OpenAI, claiming that their AI services, such as ChatGPT and Copilot, have unlawfully used the newspaper’s content and demanding that they dismantle all large language models (LLMs) trained on its articles. However, the complaint misrepresents how LLMs function and cherry-picks examples to construct a narrative that appeals to moral sensibilities yet fails to establish a solid legal argument. Moreover, The New York Times’ proposed remedy would erase nearly every existing LLM, halting a technology poised to deliver significant benefits to society. While the courts will likely side against The New York Times eventually, if publishers take their fight to Congress, policymakers should be prepared to uphold the right of AI developers to train AI systems on publicly accessible data on the Internet.
Like other unsuccessful lawsuits from content creators against developers of LLMs, this case is motivated more by fear of being replaced by AI than by a sound interpretation of copyright law. At the heart of the lawsuit is the Times’ accusation that LLMs are “mass copying” machines that, when prompted, “will output near-verbatim copies of significant portions of Times’ works.” However, as OpenAI has rightly argued, training LLMs on content publicly available on the Internet falls under the fair use doctrine, given the transformative nature of the process. Just as people are permitted to learn, develop writing skills, and produce new work by studying existing copyright-protected works, LLMs should be afforded the same opportunity. Moreover, the Times’ argument that LLMs merely replicate content verbatim oversimplifies and misrepresents the mechanisms underlying these AI models. Far from memorizing and regurgitating data, LLMs synthesize vast amounts of information to build probabilistic models that predict likely text sequences. While the Times has pointed out instances where LLMs reproduced parts of its popular articles, such as “Snow Fall: The Avalanche at Tunnel Creek,” a simple Google search for the article reveals multiple online sources containing large portions of the same text. Material that is widely copied online is statistically more likely to be replicated by LLMs on occasion. However, this does not indicate a systematic problem with LLMs, as evidenced by ChatGPT’s inability to reproduce less popular Times articles verbatim in the Times’ own tests.
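To make this dynamic concrete, consider the minimal sketch below: a toy word-level bigram model, not anything resembling OpenAI’s actual architecture (real LLMs condition on long contexts using neural networks, not raw word counts). The corpus is a hypothetical example in which one phrase appears twice, mimicking a passage widely duplicated across the web. The sketch shows both how such a model generates text one probabilistically chosen word at a time and why heavily duplicated passages are the ones most likely to be echoed back.

```python
import random
from collections import Counter, defaultdict

# Toy corpus standing in for web-scale training text. The repeated first
# sentence mimics a passage that has been widely duplicated online.
corpus = (
    "the avalanche swept down the mountain . "
    "the avalanche swept down the mountain . "
    "the court will hear the case today ."
).split()

# Count, for each word, how often each other word follows it. This is a
# bigram model: the crudest possible "probabilistic model of likely text."
follows = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    follows[word][nxt] += 1

def next_word(word: str) -> str:
    """Sample the next word in proportion to how often it followed `word`."""
    options = follows[word]
    words, counts = zip(*options.items())
    return random.choices(words, weights=counts, k=1)[0]

# Generate text one word at a time, as the paragraph above describes.
word, output = "the", ["the"]
for _ in range(6):
    word = next_word(word)
    output.append(word)
print(" ".join(output))
```

Running this sketch usually emits the duplicated avalanche phrase rather than the unique court sentence, because duplication doubles its weight in the counts. The same statistical pull, rather than deliberate copying, is why widely mirrored articles occasionally surface in LLM output while obscure ones do not.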
The application of copyright law hinges significantly on the context and purpose of the alleged copying. Fair use, particularly the use of works for research or to enable technological advancement, is a well-established doctrine, and Google’s display of book snippets in its search results is one example of the principle in practice. The transformative nature of LLM development and the diversity of its applications are markedly different from the original purpose of publishing news articles. LLMs serve functions ranging from translation and coding assistance to essay writing and grammar support. One example is ChatGPT’s successful diagnosis of a rare disease in a boy, a challenge that had stumped 17 doctors, vividly demonstrating capabilities well beyond the journalistic intent of news articles.
Another important aspect of this debate is the perceived threat that AI poses to traditional news outlets like The New York Times. The lawsuit suggests that AI models could undermine the Times’ paywall by providing similar content. However, these AI systems neither claim to replicate the Times’ content nor compete directly as a source of news. AI, which lacks human reasoning and judgment, cannot fully replicate the nuance and depth of analysis that reputable journalism offers. The Times’ longstanding reputation for credibility and authoritative reporting is something that AI, in its current state, cannot challenge. Hence, when it comes to current news, LLMs will not shrink the market for traditional outlets. Further, OpenAI has established a policy allowing website owners to block their content from being used in AI model training, yet it was not until August 2023 that The New York Times updated its terms and conditions to restrict the use of its content for AI training. For The New York Times’ archive of articles on various topics, there will be instances of direct competition: someone researching the important technological innovations of this century could go to a newspaper’s archives or ask an LLM. However, copyright law does not extend its protection to facts and ideas. Moreover, when a user makes that query, an LLM does not draw on the archives of a single news organization; it predicts one word at a time based on innumerable writings about the technological innovations of this century.
Throughout history, emerging technologies have faced resistance. The initial resistance to the printing press, for instance, mirrors The New York Times’ apprehensions about AI. Yet just as the printing press revolutionized information dissemination and drove societal progress, AI promises similar transformative potential. The 1920s saw the rise of commercial broadcast radio, a powerful new technology in mass communication that significantly disrupted the news industry. Newspapers, once dominant, faced a crisis as consumers and advertisers increasingly turned to radio. In a more recent parallel, traditional news media have accused platforms like Google and Facebook of exploiting local news content for their own profit. Each of these technological advancements initially disrupted established media but eventually led to a richer, more varied media ecosystem. This pattern highlights a critical theme: the constant evolution of technology inevitably challenges established media, necessitating adaptation and innovation. The integration of AI into content creation and consumption is the latest iteration of this historical progression, promising new dimensions to how information is processed, presented, and accessed.
The New York Times’ lawsuit mischaracterizes the nuanced dynamics of AI development and the principles of fair use for news articles available online. While it is crucial for policymakers to address legitimate copyright infringement concerns, such as rampant piracy on the Internet, training AI models on information freely available online is not one of them.