Hello and welcome to Eye on AI. Copyright issues continue to dominate AI news this week.
First, U.K. newspaper the Telegraph surfaced a submission OpenAI made to the British House of Lords, which is considering whether to update the country’s copyright laws to address issues raised by generative AI. In the submission, OpenAI claimed creating “leading AI models” would be “impossible” without using copyrighted material and that relying instead on text in the public domain would result in AI that fails to “meet the needs of today’s citizens.”
The U.K. already has a copyright exemption for “text and data mining” that applies to noncommercial research projects (an exemption that has also led to concerns about nonprofits being set up to engage in “data laundering” on behalf of businesses). But now OpenAI is arguing that the country should create a similar exemption for all AI model training. Without a right to use copyrighted material for training, OpenAI claims, ChatGPT would cease to exist.
As many have pointed out, OpenAI’s submission is, at best, disingenuous. It fails to discuss the option of licensing copyrighted material for training—in fact, it fails to note that OpenAI itself is currently pursuing such licenses for at least some training data, even while arguing that licensing is impossible. Other companies have also managed to create good generative AI models without using copyrighted material taken without consent (more on that in a minute). And if companies want to argue that finding rights holders and obtaining licenses is too onerous, there are other creative options. The government could, for example, establish a fund, paid for by a tax on the sale of generative AI models, to which rights holders could apply for compensation. This is essentially the approach the U.S. Congress took for recording artists with the 1992 Audio Home Recording Act.
Of course, not all of OpenAI’s negotiations over licenses have gone swimmingly. On Monday, the company also posted a blog responding to the New York Times’ copyright infringement lawsuit against it. In the blog, OpenAI said the newspaper’s lawsuit was “without merit.” It also said it had thought negotiations with the newspaper were proceeding well, with discussions continuing up until Dec. 19, and that it had been “surprised and disappointed” when the newspaper unexpectedly filed suit against it on Dec. 27. But OpenAI hinted that there was likely a big chasm between what the New York Times was demanding and what OpenAI was willing to pay. It said it had tried to explain to the newspaper’s representatives that “like any single source, their content didn’t meaningfully contribute to the training of our existing models and also wouldn’t be sufficiently impactful for future training.” In other words, OpenAI was trying to get away with offering the Times peanuts. The Times no doubt felt it deserved sirloin steak (or a crab cake, at least).
OpenAI’s framing of its position helps explain why negotiations with copyright holders are going to be so contentious. If the tech companies are forced to pay for data, they only want to pay for the marginal information value that data provides the model. These large language models ingest so much text that in most cases, as OpenAI argues, the value of any one source—even the New York Times with its reputation for journalistic excellence and vast archive of millions of articles—is minimal. But many rights holders are not primarily concerned with the information value of their data. They are mostly worried about the threat that trained models will pose to their future revenue. If people come to rely on chatbots to summarize the news and do research for them, far fewer people will actually visit the New York Times website. So the revenue loss could be large. Rights holders feel they should be compensated to some degree for that potential loss. Bridging this gap will likely require an adjustment of expectations on the part of both sides—much as happened with music streaming.
In its blog post, OpenAI also says that the New York Times, in its lawsuit and accompanying exhibits, has not been candid about how easy it is to produce copyright-infringing material using OpenAI’s models. The lawsuit included hundreds of examples in which the paper said it was able to get ChatGPT and other OpenAI models to spit out verbatim copies of stories when prompted with a snippet from the original but no explicit instruction to produce a Times story. OpenAI tried to characterize this “regurgitation” as a rare bug (more on that in a moment too) and suggested that the New York Times either was not disclosing its prompts accurately or had cherry-picked its examples from thousands of attempts. OpenAI said regurgitation was more likely to occur when a single piece of text has been ingested multiple times during training, as was more likely to happen with Times stories because they are syndicated and reprinted in many other publications.
But here again, what OpenAI says is misleading. It is becoming increasingly apparent that regurgitation is not some highly unusual bug but a relatively common feature of most generative AI models. Last week, Gary Marcus, emeritus New York University cognitive scientist and prolific AI commentator, and Reid Southen, a commercial film concept artist, collaborated on research published in IEEE Spectrum that showed how easy it is to get both Midjourney’s latest text-to-image generator and OpenAI’s DALL-E 3 to regurgitate—or, as Marcus and Southen said, “plagiarize”—copyrighted content. They showed that it was trivial to get the models to produce Marvel, Warner Brothers, and Disney characters, including images that were nearly identical to film stills the studios released. They showed this could be done even without naming the movie or the characters. For Midjourney, they demonstrated that simply using the prompt “screencap” was enough to produce content nearly identical to copyrighted film stills. (Midjourney did not respond to a request to comment for this story.)
Researchers have previously shown it is possible to get LLMs to leak training data in their outputs, including personally identifiable information. It has also been shown that some images are so iconic that it can be difficult for image-generating models not to copy them in their outputs. A classic example by now is the prompt “Afghan girl,” which in early versions of Midjourney always returned an image strikingly similar to Steve McCurry’s famous National Geographic cover photo. Midjourney has since disallowed that prompt and OpenAI seems to have tweaked DALL-E to force the model to return a different sort of image. But the point is regurgitation isn’t some rare quirk. It’s an inherent problem with how these generative models work, one that has to be addressed by post-generation filtering, specific fine-tuning, or prohibiting certain prompts.
Marcus has been quick to claim these copyright issues mean that OpenAI’s business model is broken and that it, and the rest of the generative AI boom, is about to collapse as a result. I don’t think that will happen—although I do think business models may have to change. This week, I spoke to the CEO of one company that shows it is possible to get on the right side of these issues and still make a buck: Getty Images. It has partnered with Nvidia to create a generative AI still-image product, which it made available through its iStock service this week, as well as with Runway on a forthcoming video generation product. Getty CEO Craig Peters tells me the company is committed to “commercially safe” generative AI. He says this means its AI offerings have been trained only on Getty’s own library of licensed images, they won’t output images of celebrities and other people that might cause commercial rights issues, and they won’t output any trademarked logos or characters either.
Even though Getty already had a right to use these images, Peters says the company wants to ensure creators receive additional compensation for their contribution to generative AI. Getty has done this by giving anyone whose images are part of the training set a share of the revenue the company brings in from its generative AI product. Right now, this is allotted according to the proportion of the overall training set the creator’s images represent, combined with a metric of how often their imagery is currently being purchased from Getty’s stock catalogue. Peters says this second figure serves as a proxy for content quality. He says that in the future he would be in favor of a system that would reward creators for their contribution to any particular AI output, but that the technology to do so doesn’t currently exist.
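The exact weighting Getty uses isn’t public, but the mechanics Peters describes can be sketched in a few lines. The blend below (training-set share plus licensing frequency as a quality proxy) and all of its figures are hypothetical illustrations, not Getty’s actual formula.

```python
# Hypothetical sketch of a revenue-sharing scheme along the lines Peters describes.
# Getty's actual formula is not public; the 50/50 blend and the numbers below are
# illustrative assumptions.

def revenue_shares(creators, total_revenue, quality_weight=0.5):
    """creators maps a name to (images in the training set, recent stock licenses)."""
    total_images = sum(imgs for imgs, _ in creators.values())
    total_licenses = sum(lic for _, lic in creators.values())
    shares = {}
    for name, (imgs, licenses) in creators.items():
        volume_share = imgs / total_images          # share of the training data
        quality_share = licenses / total_licenses   # licensing frequency as a quality proxy
        blended = (1 - quality_weight) * volume_share + quality_weight * quality_share
        shares[name] = round(blended * total_revenue, 2)
    return shares

print(revenue_shares({"alice": (9_000, 120), "bob": (1_000, 380)}, total_revenue=100_000))
```

In this toy example, a creator with fewer images but much higher licensing demand ends up with a payout closer to the prolific contributor’s, which is the effect the quality proxy is meant to produce.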
Getty’s experience proves that copyright issues around generative AI can be overcome. Sure, it helps if business models align. Neither Getty’s nor Nvidia’s existing business was cannibalized by the new product. OpenAI and the New York Times are in a trickier situation. Any product the Times helps OpenAI build would likely cannibalize its existing advertising and subscription model. But a deal would only be “impossible” if one uses OpenAI’s favored definition of the word.
And with that, more AI news below.
Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn
AI IN THE NEWS
Google DeepMind spin-out Isomorphic signs deals with Lilly and Novartis. Isomorphic, an Alphabet-owned drug discovery company that was spun out of Google DeepMind, has signed partnerships with the two big pharmaceutical companies to work on new drugs. The deal with Lilly includes a $45 million upfront payment and more than $1.7 billion in potential milestone payments. The one with Novartis is for $37.5 million upfront, with $1.2 billion possible based on performance targets. The announcement did not specify what diseases or conditions Isomorphic will be targeting in either partnership. Both deals, however, mention that the therapies will be based on small molecules, which is an indication that Isomorphic is looking at more than just protein-based therapies. The company, whose CEO is Demis Hassabis, the cofounder and CEO of Google DeepMind, was originally set up following DeepMind’s breakthroughs in using AI to predict protein structures. You can read more in this Fierce Biotech story here.
OpenAI was offering publishers low-seven-figure deals. That’s according to a story in The Information. The relatively low numbers may help explain why the New York Times decided to break off talks and sue the AI company and why some other publishers have not carried discussions with OpenAI forward. The publication said that the company was offering publishers payments of between $1 million and $5 million annually for using their content to train its AI models, significantly less than the multiyear $50 million deals that it said Apple was offering news organizations as it seeks to catch up with OpenAI and Google in generative AI. But it said Apple was seeking broader rights than OpenAI. Google has also been in discussions with publishers about using their content in generative AI products.
Meta’s open-source AI models are being used to build sex chatbots—including some that simulate child sexual exploitation. That’s the disturbing report from my Fortune colleagues Ben Weiss and Alexandra Sternlicht who delved into the burgeoning marketplace for AI-powered erotic chatbots. Many of these sex chatbots offer role-playing scenarios designed to simulate child sexual exploitation—raising thorny legal and ethical issues. And, Sternlicht and Weiss report, the most popular LLMs powering these services are Meta’s Llama and Llama 2 models. This seems to be largely because they are open source, so developers can download them for free and modify and fine-tune them easily, and Meta has few levers to pull to control their use. Child sexual abuse scenarios would violate Meta’s licensing terms for Llama 2, but it is unclear what steps the company plans to take to enforce those terms and how effective they will be. You can read their story here.
Mickey Mouse enters the public domain and immediately gets its own GenAI model. The first three cartoons by Walt Disney to feature Mickey Mouse finally entered the public domain this year. Almost immediately, a clever developer used a few dozen still frames from those early films to train a fine-tuned version of the text-to-image generator Stable Diffusion, called Mickey-1928, that lets users create images of the iconic mouse—as well as Minnie Mouse and Peg Leg Pete—in all kinds of poses and situations in the style of those three early films. The model is available on Hugging Face. Ars Technica has a story with further details. The three films are Steamboat Willie, Plane Crazy, and The Gallopin’ Gaucho.
EYE ON AI RESEARCH
For high-use models, train smaller models on more data for longer. Over the past two years, there has been a lot of discussion among machine learning engineers about the optimal model size, amount of data, and training time for AI models. The original “scaling laws,” first developed by OpenAI, more or less naively equated performance with model size. But they were overturned by Google DeepMind’s now famous 2022 “Chinchilla paper.” Chinchilla said that for a transformer-based model, you got better performance if the number of tokens the model was fed increased proportionally with the model’s size. If the model got twice as big, so should the amount of data (well, technically, the number of tokens, but the two are roughly equivalent). This also meant that a smaller model—which is cheaper to train and run in production—could equal or exceed the performance of a larger model if it was fed more data. DeepMind showed that its Chinchilla LLM, with 70 billion parameters (the tunable variables whose count is the standard measure of model size), could beat its Gopher model with 280 billion parameters if Chinchilla was given four times the data.
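To see why that trade is even possible, note the standard rule of thumb that training compute for a transformer is roughly 6 × parameters × tokens: at a fixed compute budget, a model a quarter the size can be fed roughly four times the data. A minimal sketch, using illustrative token counts rather than the paper’s actual figures:

```python
# Back-of-the-envelope view of the Chinchilla trade-off.
# Rule of thumb: training FLOPs ~= 6 * parameters * tokens.
# Token counts are illustrative, not the figures from the paper.

def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

large = train_flops(280e9, 300e9)      # Gopher-sized model, fewer tokens
small = train_flops(70e9, 4 * 300e9)   # a quarter the size, four times the tokens
print(f"large model: {large:.2e} training FLOPs")
print(f"small model: {small:.2e} training FLOPs (same budget; the smaller model won on benchmarks)")
```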
Now researchers at MosaicML, which is owned by Databricks, have published a paper showing that for models that are going to be used to answer more than 1 billion queries over their lifetime—which might be true for many models deployed by large businesses with lots of employees or customers—you get better performance and much lower cost by training a smaller model for longer and on more data than even the Chinchilla paper would suggest. For instance, they demonstrate that to equal the performance of a 30 billion parameter model, it might be possible to train a 13.6 billion parameter model instead, but on 2.84 times the data. (Chinchilla would have suggested just twice the data.) In some cases, the researchers showed that following their method could cut the lifetime financial cost of these models in half. Given the concerns among a lot of business leaders about the runaway expense of generative AI, I expect this paper to get a lot of attention. You can read it on the non-peer-reviewed research repository arxiv.org here.
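The intuition is that inference cost also scales with model size, so a smaller model trained on more data costs more up front but less on every one of those billion-plus queries. Below is a rough sketch of that crossover; the FLOP approximations are standard rules of thumb, and the training-token and per-query figures are my own assumptions, not numbers from the paper.

```python
# Rough lifetime-compute comparison in the spirit of the MosaicML argument.
# Approximations: training FLOPs ~= 6 * params * tokens;
# inference FLOPs ~= 2 * params per generated token.
# Training-token budget and tokens-per-query are illustrative assumptions.

def lifetime_flops(n_params, train_tokens, queries, tokens_per_query=500):
    training = 6 * n_params * train_tokens
    inference = 2 * n_params * queries * tokens_per_query
    return training + inference

base_tokens = 600e9  # assumed training budget for the 30B-parameter model
for queries in (1e8, 1e9, 1e10):
    big = lifetime_flops(30e9, base_tokens, queries)
    small = lifetime_flops(13.6e9, 2.84 * base_tokens, queries)
    print(f"{queries:.0e} queries: small model's lifetime compute is {small / big:.0%} of the large model's")
```

With these made-up figures the smaller model only pays off well past a billion queries; where the crossover actually lands depends on the training budget and query lengths assumed, which is the accounting the paper works through in dollar terms.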
FORTUNE ON AI
Nvidia is working on its approach to China, the world’s second-largest economy, as it tries to comply with Biden’s chip export controls —by Lionel Lim
Jeff Bezos–backed AI search startup’s CEO says ‘Google is going to be viewed as something that’s legacy and old’ —by Steve Mollman
Sam Altman believes Muslim workers in the tech industry feel ‘uncomfortable speaking’ up out of fear for their careers —by Orianna Rosa Royale
Commentary: Here’s the real reason AI can never replace financial advisors —by Bob Rubin
This is the online version of Eye on AI, Fortune's weekly newsletter on how AI is shaping the future of business. Sign up for free.