New York Times sues OpenAI, Microsoft over ‘millions of articles’ used to train ChatGPT

The New York Times has sued Microsoft and OpenAI, claiming the duo infringed the newspaper’s copyright by using its articles without permission to build ChatGPT and similar models. It is the first major American media outfit to drag the tech pair to court over the use of stories in training data.

As with similar suits, including those brought by artists and creators such as Sarah Silverman, the NYT complaint [PDF] centers on the use of copyrighted material, in this case from The Times, to train the large language models (LLMs) behind various Microsoft and OpenAI chatbots and generative AI services.

The complaint calls out Microsoft not just for its investment in OpenAI, but also for assistants such as Microsoft 365 Copilot and Bing Chat, which the complaint alleges “display Times content in generative output in at least two ways: (1) by showing ‘memorized’ copies or derivatives of Times works retrieved from the models themselves, and (2) by showing synthetic search results that are substantially similar to Times works generated from copies stored in Bing’s search index.”

The newspaper is pretty upset that “millions” of its copyrighted articles were harvested to form a chunk of Microsoft and OpenAI’s models without permission, and that these neural networks will regurgitate that work on demand for users, again without permission.

In its complaint, the NYT gives examples it alleges prove ChatGPT has been trained on its content. Furthermore, a simple paywall-dodging question to ChatGPT appears to result in responses containing copyrighted text.

And it is the paywall-dodging of OpenAI’s content scraping that has attracted particular scrutiny. According to the complaint, the newspaper began stashing its work behind a paywall 12 years ago and, as of the third quarter of 2023, laid claim to 10.1 million digital and print subscribers. It aims to increase that number to 15 million by the end of 2027.

Occasional readers are also catered to, with free access to a limited number of articles before a subscription is demanded. NYT reckons it attracts 50 to 100 million users per week with such an approach, with advertising further filling its coffers.

The complaint explains: “The Times depends on its exclusive rights of reproduction, adaptation, publication, performance, and display under copyright law to resist these forces. The Times has registered the copyright in its print edition every day for over 100 years, maintains a paywall, and has implemented terms of service that set limits on the copying and use of its content. To use Times content for commercial purposes, a party should first approach The Times about a licensing agreement.”

However, to drive traffic to its site, the NYT also permits search engines to access and index its content. “Inherent in this value exchange is the idea that the search engines will direct users to The Times’s own websites and mobile applications, rather than exploit The Times’s content to keep users within their own search ecosystem.”

The Times added it has never permitted anyone – including Microsoft and OpenAI – to use its content for generative AI purposes. And therein lies the rub. According to the paper, it contacted Microsoft and OpenAI in April 2023 to deal with the issue amicably. It stated bluntly: “These efforts have not produced a resolution.”

And so we find ourselves with a complaint that alleges “a business model based on mass copyright infringement” and details the journey of OpenAI from its beginnings as a “non-profit artificial intelligence research company” in 2015 to today’s behemoth.

According to the complaint: “Despite its early promises of altruism, OpenAI quickly became a multi-billion-dollar for-profit business built in large part on the unlicensed exploitation of copyrighted works belonging to The Times and others.”

So what to do? Unsurprisingly, NYT is seeking damages. It also demands a jury trial and wants the court to order the destruction “of all GPT or other LLM models and training sets that incorporate Times works.”

Earlier this month, Axel Springer and OpenAI announced a plan to make summaries of the former’s content – including paid content – available from the latter’s products, including ChatGPT. The plan is to ensure answers to user queries include attribution and links to the full articles.

How much the deal was worth is unclear. According to the Financial Times, an eight-figure sum was involved. As noted in its complaint, the NYT has also held discussions with Microsoft and OpenAI, but clearly the outcome was unsatisfactory. ®

