In what will be a closely watched legal salvo, the publisher claims the generative artificial intelligence giant was using its writing “without permission to develop their models and tools.”
As artificial intelligence products from tech giants threaten to upend the media landscape, major newspaper and magazine companies have faced a dilemma: take the money from AI giants in a licensing deal or fight with a lawsuit.
On Wednesday, The New York Times chose the latter, filing a lawsuit in the U.S. District Court for the Southern District of New York against OpenAI, the Sam Altman-run firm behind ChatGPT, which has amassed more than 100 million users since a version of it went public in late 2022. The popular chat bot was built on data scraped from across the internet and delivers its customizable responses to user queries and prompts by leaning on this vast trove of writing. If consumers increasingly turn to chat bots to consume or discover news, that shift could, for example, render individual web pages, the URLs that publishers monetize via advertising, a relic, forcing publishers to radically reshape their future online.
The newspaper company says that it had reached out to OpenAI in April to strike a deal for licensing its content as well as for terms that included “technological guardrails that would allow a mutually beneficial value exchange between Defendants and The Times.” The company added: “These efforts have not produced a resolution.”
The complaint claims the newspaper company had “objected after it discovered that Defendants were using Times content without permission to develop their models and tools.” The suit states it took the legal action “to hold them responsible for the billions of dollars in statutory and actual damages that they owe for the unlawful copying and use of The Times’s uniquely valuable works.”
Interestingly, in one example of how the Times claims OpenAI has infringed the copyright in its articles, the complaint notes that the tech company created a data set, titled WebText, culled from large amounts of content on Reddit. And that data set, drawn from users’ posts on Reddit, “contains a staggering amount of scraped content from The Times.”
The closely guarded datasets powering ChatGPT and other large language models have come under heavy scrutiny as legal challenges against OpenAI have mounted. In a May congressional hearing, Altman kept the explanation vague, saying, “Our models are trained on a broad range of data that includes publicly available content, licensed content, and content generated by human reviewers.”
The newspaper company also said that ChatGPT allows users to get around its paywall, citing a screenshot of a query asking for the first and second paragraphs of its 2012 Pulitzer Prize-winning “Snow Fall” feature, to which the bot responded by delivering the copy. The paper, echoing the critiques of those who have called ChatGPT a mass plagiarism machine, notes that, “In some cases, Defendants’ models simply spit out several paragraphs of The Times’s articles.”
“As part of training the GPT models, Microsoft and OpenAI collaborated to develop a complex, bespoke supercomputing system to house and reproduce copies of the training dataset, including copies of The Times-owned content,” the complaint alleges. “Millions of Times Works were copied and ingested — multiple times — for the purpose of ‘training’ Defendants’ GPT models.”
The Times claims that Microsoft has already infringed its copyrights by adding AI-powered features to its search engine, Bing. “Microsoft and OpenAI continue to create unauthorized copies of Times Works in the form of synthetic search results returned by their Bing Chat and Browse with Bing products. Microsoft actively gathers copies of the Times Works used to generate such results in the process of crawling the web to create the index for its Bing search engine,” the complaint alleges.
The lawsuit arrives two weeks after another major publisher, Axel Springer, the owner of Politico, Business Insider and German newspaper Bild, chose a different approach to OpenAI and took the money instead. Under the deal unveiled Dec. 13, content from brands like Politico will now be used to train OpenAI products in what Axel Springer CEO Mathias Döpfner called an effort “to explore the opportunities of AI empowered journalism.”
Other publishers, including the Rupert Murdoch-controlled News Corp. empire of The Wall Street Journal and New York Post and the Barry Diller-chaired IAC, home to People and The Daily Beast, have been weighing their options. “We are also looking into the future in maximizing the value of our premium content for AI. We are in advanced discussions with a range of digital companies that we anticipate will bring significant revenue in return for the use of our unmatched content sets,” News Corp. CEO Robert Thomson said on a Nov. 9 earnings call, noting that, “Generative AI engines are only as sophisticated as their inputs and need constant replenishing to remain relevant.”
The ever-quotable Diller had presciently set the stage for the standoff earlier this year, noting at a Semafor event in April that major publishers faced a reckoning. “When the internet first began, everything was free. And it was kind of decreed at that time that everything was free, and therefore all publishers said they really had no other choice,” the mogul said, adding that “if publishers do not say, ‘You cannot scrape our content, you cannot take it, you cannot take it transformatively,’” they face peril.