Hello and welcome to Eye on AI.
Name a profession and there’s almost certainly someone building a generative AI copilot for it. Accountants, lawyers, doctors, architects, financial advisors, marketing copywriters, software programmers, cybersecurity experts, salespeople—there are already copilots on the market for all of these roles.
AI copilots differ from general-purpose LLM-based chatbots, such as those built on OpenAI’s GPT models, although some copilots use one of those general-purpose models as their central component. Copilots have user interfaces, and usually backend processes, specifically tailored to the tasks someone in that profession would want assistance with—whether that’s crafting an Excel spreadsheet formula for an accountant or, for a salesperson, figuring out the best wording to convince a customer to close a complex deal. Many copilots rely on a process called RAG—retrieval augmented generation—in which the system first retrieves relevant documents from a trusted source and then has the LLM compose its answer from them, to boost the accuracy of the information they output and reduce the tendency of LLMs to hallucinate, or produce superficially plausible but inaccurate information.
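To make the RAG idea concrete, here is a minimal, hypothetical sketch in Python. The tiny corpus, the keyword-overlap retriever, and the call_llm stub are illustrative placeholders rather than the internals of any copilot discussed here; real systems typically retrieve from large databases using embedding-based search.

```python
# A minimal sketch of the retrieval-augmented generation (RAG) pattern.
# Everything here (the corpus, the scoring, the call_llm stub) is a
# hypothetical placeholder, not the internals of any particular copilot.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query.
    Production systems typically use vector embeddings and a search index."""
    terms = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda doc: len(terms & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def call_llm(prompt: str) -> str:
    """Stand-in for a call to a general-purpose LLM API."""
    return "[model answer grounded in the retrieved sources]"

def answer_with_rag(query: str, corpus: list[str]) -> str:
    # Ground the model in retrieved passages and ask it to cite them, which
    # reduces (but, per the Stanford study, does not eliminate) hallucination.
    passages = retrieve(query, corpus)
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = ("Answer using ONLY the sources below, citing them by number. "
              "If they do not contain the answer, say so.\n\n"
              f"{sources}\n\nQuestion: {query}")
    return call_llm(prompt)

if __name__ == "__main__":
    corpus = [
        "Case A v. B (2019) held that the limitations period is three years.",
        "Case C v. D (2021) was overturned on appeal in 2023.",
    ]
    print(answer_with_rag("What is the limitations period?", corpus))
```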
Perhaps no profession save software developers has embraced experimentation with copilots as enthusiastically as the law. There have already been several instances where lawyers—including former Trump lawyer-turned-star-witness-for-the-prosecution Michael Cohen—have been reprimanded and fined by judges for naively (or very lazily) using ChatGPT for legal research and writing without checking the case citations it produced, which in some cases turned out to be completely invented. The legal copilots, however, are supposed to be much better than ChatGPT at completing legal tasks and answering legal questions.
But are they? The answer matters because lawyers’ experience using these copilots may foretell what will happen in other professions too in the coming few years. In that context, a study published last month (and updated Friday) from researchers affiliated with Stanford University’s Human-Centered AI Institute (HAI) sounded an important caution—not just for the legal profession but for copilots as a whole.
The HAI researchers, who included a Stanford Law professor, created a dataset of 200 questions designed to mimic the kinds of questions a lawyer might ask a legal research copilot. The Stanford team claims their questions are a better test of how legal copilots may perform in a real-world setting than bar exam questions—especially because many datasets of bar exam questions have already been memorized by LLMs trained on vast amounts of data scraped from the internet. The dataset also includes some particularly tricky questions built around a false premise (for instance, asking why a judge dissented in a case where no such dissent exists). Such questions often lead LLMs astray. Trained to be helpful and agreeable, they frequently accept the false premise and invent information to justify it, rather than telling the user the premise of the question is wrong.
The researchers then tested several prominent legal research copilots, including one from LexisNexis (Lexis+ AI) and two from Thomson Reuters (Ask Practical Law AI and Westlaw’s AI-Assisted Research) on this dataset. They used OpenAI’s GPT-4 as a kind of control, to see how well an LLM would do without RAG and without any of the other backend processing that had been geared just for legal research. The answers were evaluated by human experts.
For lawyers—and everyone else hoping RAG would eliminate hallucinations—there was a little good news and quite a lot of not-so-good news in the results. The good news is that RAG did indeed reduce hallucination rates significantly: GPT-4 had a hallucination rate of 43%, while the worst of the three legal copilots had a hallucination rate of 33%. The bad news is that the hallucination rates were still much higher than you’d want. The two best copilots still made up information in about one out of every six responses. Worse still, the RAG-based legal copilots often omitted key information from their answers, with between nearly a fifth and well over half of their responses judged incomplete by the human evaluators. By contrast, fewer than one in 10 of GPT-4’s responses failed on this metric. The study also pointed out that LexisNexis’s copilot supplied legal citations for all the information it provided, but that the cited cases sometimes did not say what the copilot claimed they did. The researchers noted that this kind of error can be particularly dangerous: a citation to a real case can lull lawyers into complacency, making it easier for mistakes to slip past.
LexisNexis and Thomson Reuters have both said that the accuracy figures in the HAI study were significantly lower than what they’ve found in their own internal performance testing and in feedback from customers. “Our thorough internal testing of AI-Assisted Research shows an accuracy rate of approximately 90% based on how our customers use it, and we’ve been very clear with customers that the product can produce inaccuracies,” Mike Dahn, head of Westlaw Product Management at Thomson Reuters, wrote in a blog response to the HAI study.
“LexisNexis has extensive programs and system measures in place to improve the accuracy of responses over time, including the validation of citing authority references to mitigate hallucination risk in our product,” Jeff Pfeifer, LexisNexis chief product officer for the U.S., Canada, Ireland, and the U.K., wrote in a statement provided to newsletter LegalDive.
The blog post HAI wrote to accompany the research pointed to a recent story by Bloomberg Law that might also give people pause. It looked at the experience of Paul Weiss Rifkind Wharton & Garrison—one of the 50 largest U.S. law firms, with close to 1,000 attorneys—with a legal copilot from the startup Harvey. Paul Weiss told the news organization that it wasn’t using quantitative metrics to assess the copilot because, according to Bloomberg, “the importance of reviewing and verifying the accuracy of the output, including checking the AI’s answers against other sources, makes any efficiency gains difficult to measure.” The copilot’s answers could also be inconsistent—with the same query yielding different results at different times—or extremely sensitive to seemingly inconsequential changes in the wording of a prompt. As a result, Paul Weiss said it wasn’t in a position yet to determine the return on investment from using Harvey.
Instead, Paul Weiss was evaluating the copilots based on qualitative metrics, such as how much attorneys enjoyed using them. And here, there were some interesting anecdotes. It turned out that while junior lawyers might not see much time savings in using the AI copilot for research because of the need to verify its answers, more senior lawyers found the copilot to be a very useful tool for helping them brainstorm possible legal arguments. The firm also noted that the copilot could do certain things—such as evaluating every single contract in a huge database in minutes—that humans simply could not do. In the past, firms had to rely on some sort of statistical sampling of the contracts, and even then the process might take days or weeks.
Pablo Arredondo, cofounder of CoCounsel, a legal copilot that is now owned by Thomson Reuters but was not part of the HAI study, told me that the study and the Bloomberg story reinforce the point that all generative AI legal copilots need oversight (as do junior associates at law firms). Some of the areas where the copilots stumbled in the HAI study, such as determining whether a case had subsequently been overturned by a higher court, are also areas where different legal research companies often provide conflicting information, he noted.
Taken together, I think the Stanford study and the Bloomberg Law story say a lot about where AI copilots are today and how we should be thinking about where they are heading. Some AI researchers and skeptics of the current hype around generative AI have jumped on the HAI paper as evidence that LLMs are entering the “trough of disillusionment” and that perhaps the entire field is about to enter another “AI winter.” I don’t think that’s quite right. Yes, the Stanford paper points to serious weaknesses in AI copilots. And yes, RAG will not cure hallucinations. But I think we will find ways to continue to minimize hallucinations (longer context windows are one of them), and people will continue to use copilots.
The HAI paper makes a great case for rigorous testing—and for that performance data to be shared with users. Professionals must have a clear sense of copilots’ capabilities and weaknesses and need to understand how they are likely to fail. Having this mental model of how a particular copilot works is essential for any professional working alongside one. Also, as the Bloomberg Law story suggests, many professionals will come to find copilots useful even when they aren’t entirely accurate, and even if the efficiency gains from such systems are hard to measure. It’s not about whether the copilot can do well enough on its own to replace human workers. It’s about whether the human working with the copilot can perform better than they could on their own—just as in the case of the senior Paul Weiss lawyers who said it helped them think through legal arguments.
Arredondo said that Thomson Reuters is in early discussions with Stanford to form a consortium of legal tech companies, law firms, and other academic institutions to develop and maintain benchmarks for legal copilots. He said that ideally, these benchmarks would measure how human lawyers perform on the same tests both unassisted and when assisted by AI tools, rather than evaluating the systems only against one another, without the human oversight they still need.
We don’t have very good benchmarks for human-AI teaming. It’s time to create some.
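For what it’s worth, here is a toy sketch, with entirely made-up numbers, of what scoring such a human-AI teaming benchmark could look like: the same expert-graded question set evaluated under three conditions, with the gap between the lawyer-alone and lawyer-plus-copilot conditions as the figure of merit.

```python
# Toy sketch of a human-AI teaming comparison. The gradings below are
# hypothetical placeholders, not results from any actual benchmark.

def accuracy(grades: list[bool]) -> float:
    """Fraction of answers graded correct by human expert evaluators."""
    return sum(grades) / len(grades)

# Hypothetical expert gradings (True = correct) on the same 10 questions.
results = {
    "copilot_alone":  [True, False, True, True, False, True, True, False, True, True],
    "lawyer_alone":   [True, True, False, True, True, True, False, True, True, True],
    "lawyer_plus_ai": [True, True, True, True, True, True, False, True, True, True],
}

for condition, grades in results.items():
    print(f"{condition:>15}: {accuracy(grades):.0%}")

# The figure of merit is the delta between lawyer_alone and lawyer_plus_ai,
# not how the copilot scores on its own.
```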
There’s more AI news below…But first, if you want to find out more about working alongside AI copilots, I’ve got some news of my own: My book Mastering AI: A Survival Guide to Our Superpowered Future is now available for pre-order in the U.S. and the U.K.! The book has a chapter on how AI will transform the way we work. But Mastering AI goes well beyond that to reveal how AI will change and challenge our democracy, our society, and even ourselves. AI presents tremendous opportunities in science, education, and business, but we must urgently address the substantial risks this technology poses. In Mastering AI I explain how. If you enjoy this newsletter, I know you’ll find the book valuable. Please consider pre-ordering your copy today.
Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn
Correction, June 4: An earlier version of this story misspelled the full name of the law firm Paul Weiss Rifkind Wharton & Garrison.
AI IN THE NEWS
Current and former OpenAI, Google DeepMind staff call for greater whistleblower protections. A group of former and current employees of OpenAI and Google DeepMind have signed an open letter calling for staffers at AI companies to be given a “right to warn” the public about AI safety concerns. The letter, which was also signed by AI pioneers Geoffrey Hinton and Yoshua Bengio, asks AI companies to commit to not using nondisparagement agreements to keep former employees from speaking out about safety concerns at their current or former employer; to creating processes for employees to raise safety concerns anonymously with companies’ boards and with regulators; and to not retaliating against any whistleblowers. In a New York Times story about the core group of current and former OpenAI employees behind the open letter, Daniel Kokotajlo, a former OpenAI policy researcher who recently left the company, accused OpenAI of “recklessly racing” to achieve AGI—an AI system as intelligent as humans—without due regard for safety.
Nvidia announces its next generation of GPUs only months after revealing its previous one. At the start of the COMPUTEX computer conference in Taiwan, Nvidia CEO Jensen Huang announced the semiconductor company’s next-generation Rubin graphics processing unit, the kind of computer chip needed for AI applications. The announcement came just two months after Nvidia unveiled Blackwell, the successor to its H100 Hopper GPUs, its current highest-performing system. Blackwell is only due to start arriving with customers later this year, and in the past Nvidia tended to announce major new GPU systems only once every two years. Huang said the company plans to move to a one-year release “rhythm” for new chips, according to CNBC. Still, the fact that Rubin was announced so quickly after Blackwell may pose a dilemma for customers, who may be reluctant to purchase Blackwell chips knowing a more powerful one will be available just months later.
Elon Musk diverts GPUs intended for Tesla to his AI startup xAI. That’s according to a report from CNBC based on internal Nvidia emails it said it obtained. Musk had promised Tesla investors that the electric vehicle maker, whose stock has been under pressure amid slowing sales, would increase its stock of Nvidia graphics processing units (GPUs), the powerful chips needed for AI applications, from 35,000 to 85,000 by year’s end. He portrayed the AI investment as important for Tesla’s development of autonomous vehicles and humanoid robots. But the emails suggest Musk had actually ordered fewer of these chips for Tesla and had redirected portions of what he had ordered to xAI, possibly causing a delay in Tesla’s own AI-related plans. The story is likely to further annoy Tesla shareholders who are eager for the stock to return to growth.
Sam Altman’s outside investments raise potential conflict-of-interest concerns. That’s according to an investigative article from the Wall Street Journal that looked into the sources of Altman’s wealth, which the newspaper estimated at $2.8 billion. The paper identified nearly 400 companies in which Altman has investments, including stakes in Stripe and Airbnb. But it is Altman’s investments in lesser-known startups, some of which work in areas that intersect with OpenAI’s business interests or have done deals with the AI startup, that raise the greater potential for conflicts of interest. The newspaper noted that many CEOs are barred from having substantial outside business interests and investments in order to avoid these sorts of conflicts. The paper also reported that Altman had pledged some of his equity in a portfolio of startups as collateral to secure debt from JPMorgan that he then used to make further startup investments, a potentially risky strategy.
EYE ON AI NUMBERS
$32 million
That’s how much the tiny Caribbean island of Anguilla made in 2023 from the surge in the number of people wanting to buy an internet domain address ending in .ai, the country-code domain assigned to the British overseas territory. In 2022, Anguilla doled out 144,000 .ai registrations, but following the release of ChatGPT in November of that year, registrations soared to 354,000 in 2023. The $32 million the island made from those sales constitutes 20% of its total revenue. You can read more from the IMF here.
FORTUNE ON AI
AI isn’t yet capable of snapping up jobs—except in these 4 industries, McKinsey says —by Jane Thier
AI is on track to ‘democratize financial planning.’ Are investors ready for that? —by Alicia Adamczyk
Super Micro rides the AI wave to a Fortune 500 debut —by Sharon Goldman
Satya Nadella has made Microsoft 10 times more valuable in his decade as CEO. Can he stay ahead in the AI age? —by Jeremy Kahn
AI CALENDAR
June 5: FedScoop’s FedTalks 2024 in Washington, D.C.
June 25-27: 2024 IEEE Conference on Artificial Intelligence in Singapore
July 15-17: Fortune Brainstorm Tech in Park City, Utah (register here)
July 30-31: Fortune Brainstorm AI Singapore (register here)
Aug. 12-14: Ai4 2024 in Las Vegas
BRAIN FOOD
Will licensing data to AI companies save publishers and media companies? That's the question a lot of people have been asking after a flurry of deals between such businesses and AI companies. The latest announcements came last week from The Atlantic and Vox, which both signed agreements to license their content to OpenAI. This follows news last month that News Corp. had reached a deal with the ChatGPT-maker that was worth $250 million over five years. Meanwhile, Shutterstock CEO Paul Hennessy told Bloomberg that licensing of his company’s stock images and videos to both Big Tech companies—including Meta, Alphabet, Amazon, and Apple—and AI startups, including OpenAI, had brought in $104 million in the past year. This has led many to speculate that AI will be the lifeline many media businesses need, particularly if the use of AI chatbots and generative AI search engines undercuts internet traffic and thus the advertising revenue many of these companies depend upon.
But Jessica Lessin, the founder of The Information, wrote an essay in The Atlantic arguing that media companies—well, news organizations in particular—are making a huge mistake by allowing AI firms to train on their data and access their content. In fact, she suggests they are repeating the same mistake they have made many times over in the past two decades, with the iPad, Google, and social media. She writes:
Chasing tech’s distribution and cash, news firms strike deals to try to ride out the next digital wave. They make concessions to platforms that attempt to take all of the audience (and trust) that great journalism attracts, without ever having to do the complicated and expensive work of the journalism itself. And it never, ever works as planned.
What do you think? Are publishing companies wisely taking advantage of a new revenue stream or sowing the seeds of their own destruction?