Eye on AI

The problem with ‘human in the loop’ AI? Often, it’s the humans

By Jeremy Kahn, Editor, AI
December 9, 2025, 3:13 PM ET
AI is outperforming some professionals. Getty Images

Welcome to Eye on AI. In this edition…AI is outperforming some professionals…Google plans to bring ads to Gemini…leading AI labs team up on AI agent standards…a new effort to give AI models a longer memory…and the mood turns on LLMs and AGI.

Greetings from San Francisco, where we are just wrapping up Fortune Brainstorm AI. On Thursday, we’ll bring you a roundup of insights from the conference. But today, I want to talk about some notable studies from the past few weeks with potentially big implications for the business impact AI may have.

First, there was a study from the AI evaluations company Vals AI that pitted several legal AI applications as well as ChatGPT against human lawyers on legal research tasks. All of the AI applications beat the average human lawyers (who were allowed to use digital legal search tools) in drafting legal research reports across three criteria: accuracy, authoritativeness, and appropriateness. The lawyers’ aggregate median score was 69%, while ChatGPT scored 74%, Midpage 76%, Alexi 77%, and Counsel Stack, which had the highest overall score, 78%.

One of the more intriguing findings is that for many question types, it was the generalist ChatGPT that was the most accurate, beating out the more specialized applications. And while ChatGPT lost points for authoritativeness and appropriateness, it still topped the human lawyers across those dimensions.

The study has been faulted for not testing some of the better-known and most widely adopted legal AI research tools, such as Harvey, Legora, CoCounsel from Thomson Reuters, or LexisNexis Protégé, and for only testing ChatGPT among the frontier general-purpose models. Still, the findings are notable and comport with what I’ve heard anecdotally from lawyers.

A little while ago I had a conversation with Chris Kercher, a litigator at Quinn Emanuel who founded that firm’s data and analytics group. Quinn Emanuel has been using Anthropic’s general-purpose AI model Claude for a lot of tasks. (This was before Anthropic’s latest model, Claude Opus 4.5, debuted.) “Claude 3 Opus writes better than most of my associates,” Kercher told me. “It just does. It is clear and organized. It’s a great model.” He said he is “constantly amazed” by what LLMs can do, finding new issues, strategies, and tactics that he can use to argue cases.

Kercher said that AI models have allowed Quinn Emanuel to “invert” its prior work processes. In the past, junior lawyers—known as associates—would spend days researching and writing legal memos, finding citations for every sentence, before presenting those memos to more senior lawyers, who would incorporate some of that material into briefs or arguments actually presented in court. Today, AI generates drafts that Kercher said are by and large better, in a fraction of the time, and those drafts are then given to associates to vet. The associates are still responsible for the accuracy of the memos and citations—just as they always were—but now they are fact-checking and editing what the AI produces rather than performing the initial research and drafting, he said.

He said that the most experienced, senior lawyers often get the most value out of working with AI, because they have the expertise to know how to craft the perfect prompt, along with the professional judgment and discernment to quickly assess the quality of the AI’s response. Is the argument the model has come up with sound? Is it likely to work in front of a particular judge or be convincing to a jury? These sorts of questions still require judgment that comes from experience, Kercher said.

OK, so that’s law, but it likely points to ways in which AI is beginning to upend work in other “knowledge industries” too. Here at Brainstorm AI yesterday, I interviewed Michael Truell, the cofounder and CEO of the hot AI coding tool Cursor. He noted that in a University of Chicago study on the effects of developers using Cursor, it was often the most experienced software engineers who saw the greatest benefit, perhaps for some of the same reasons Kercher says experienced lawyers get the most out of Claude—they have the professional experience to craft the best prompts and the judgment to better assess the tool’s outputs.

Then there was a study on the use of generative AI to create visuals for advertisements. Business professors at New York University and Emory University tested whether advertisements for beauty products created by human experts alone, created by human experts and then edited by AI models, or created entirely by AI models were most appealing to prospective consumers. They found the ads that were entirely AI-generated were judged the most effective—increasing clickthrough rates by 19% in an online trial the researchers conducted. Meanwhile, ads created by humans and then edited by AI were actually less effective than those created by human experts with no AI intervention. But, critically, if people were told the ads were AI-generated, their likelihood of buying the product declined by almost a third.

Those findings present a big ethical challenge to brands. Most AI ethicists think people should generally be told when they are consuming AI-generated content. And advertisers do need to navigate Federal Trade Commission rules around “truth in advertising.” But many ads already use actors posing in various roles without necessarily telling people that they are actors—or disclose it only in very fine print. How different is AI-generated advertising? The study seems to point to a world where more and more advertising will be AI-generated and where disclosures will be minimal.

The study also seems to challenge the conventional wisdom that “centaur” solutions (which combine the strengths of humans and those of AI in complementary ways) will always perform better than either humans or AI alone. (Sometimes this is condensed to the aphorism “AI won’t take your job. A human using AI will take your job.”) A growing body of research seems to suggest that in many areas, this simply isn’t true. Often, the AI on its own actually produces the best results.

But it is also the case that whether centaur solutions work well depends tremendously on the exact design of the human-AI interaction. A study on human doctors using ChatGPT to aid diagnosis, for example, found that humans working with AI could indeed produce better diagnoses than either doctors or ChatGPT alone—but only if ChatGPT was used to render an initial diagnosis and human doctors, with access to the ChatGPT diagnosis, then gave a second opinion. If that process was reversed, and ChatGPT was asked to render the second opinion on the doctor’s diagnosis, the results were worse—and in fact, the second-best results were just having ChatGPT provide the diagnosis. In the advertising study, it would have been good if the researchers had looked at what happens if AI generates the ads and then human experts edit them.

But in any case, momentum towards automation—often without a human in the loop—is building across many fields.

On that happy note, here’s more AI news.

Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn

FORTUNE ON AI

Exclusive: Glean hits $200 million ARR, up from $100 million nine months ago —by Allie Garfinkle

Cursor developed an internal AI help desk that handles 80% of its employees’ support tickets, says the $29 billion startup’s CEO —by Beatrice Nolan

HP’s chief commercial officer predicts the future will include AI-powered PCs that don’t share data in the cloud —by Nicholas Gordon

How Intuit’s chief AI officer supercharged the company’s emerging technologies teams—and why not every company should follow his lead —by John Kell

Google Cloud CEO lays out 3-part strategy to meet AI’s energy demands, after identifying it as ‘the most problematic thing’ —by Jason Ma

OpenAI COO Brad Lightcap says code red will ‘force’ the company to focus, as the ChatGPT maker ramps up enterprise push —by Beatrice Nolan

AI IN THE NEWS

Trump allows Nvidia to sell H200 GPUs to China, but China may limit adoption. President Trump signaled he would allow exports of Nvidia’s high-end H200 chips to approved Chinese customers. Nvidia CEO Jensen Huang has called China a $50 billion annual sales opportunity for the company, but Beijing wants to limit the reliance of its companies on U.S.-made chips, and Chinese regulators are weighing an approval system that would require buyers to justify why domestic chips cannot meet their needs. They may even bar the public sector from purchasing H200s. But Chinese companies often prefer to use Nvidia chips and even train their models outside of China to get around U.S. export controls. Trump’s decision has triggered political backlash in Washington, with a bipartisan group of senators seeking to block such exports, though the legislation’s prospects remain uncertain. Read more from the Financial Times here.

Trump plans executive order on national AI standard, aimed at pre-empting state-level regulation. President Trump said he will issue an executive order this week creating a single national artificial-intelligence standard, arguing that companies cannot navigate a patchwork of 50 different state approval regimes, Politico reported. The move follows a leaked November draft order that sought to block state AI laws and reignited debate over whether federal rules should override state and local regulations. A previous attempt to add AI-preemption language to the year-end defense bill collapsed last week, prompting the administration to return to pursuing the policy through executive action instead.

Google plans to bring advertising to its Gemini chatbot in 2026. That’s according to a report in Adweek that cited information from two unnamed Google advertising clients. The story said that details on format, pricing, and testing remained unclear. It also said the new ad format for Gemini is separate from ads that will appear alongside “AI Mode” searches in Google Search.

Former Databricks AI head’s new startup valued at $4.5 billion in seed round. Unconventional AI, a startup cofounded by former Databricks AI head Naveen Rao, raised $475 million in a seed round led by Andreessen Horowitz and Lightspeed Venture Partners at a valuation of $4.5 billion—just two months after its founding, Bloomberg News reported. The company aims to build a novel, more energy-efficient computing architecture to power AI workloads.

Anthropic forms partnership with Accenture to target enterprise customers. Anthropic and Accenture have formed a three-year partnership that makes Accenture one of Anthropic’s largest enterprise customers and aims to help businesses—many of which remain skeptical—realize tangible returns from AI investments, the Wall Street Journal reported. Accenture will train 30,000 employees on Claude and, together with Anthropic, launch a dedicated business group targeting highly regulated industries and embedding engineers directly with clients to accelerate adoption and measure value.

OpenAI, Anthropic, Google, and Microsoft team up for new standard for agentic AI. The Linux Foundation is organizing a group called the Agentic Artificial Intelligence Foundation with participation from major AI companies, including OpenAI, Anthropic, Google, and Microsoft. It aims to create shared open-source standards that allow AI agents to reliably interact with enterprise software. The group will focus on standardizing key tools such as the Model Context Protocol, OpenAI’s Agents.md format, and Block’s Goose agent, aiming to ensure consistent connectivity, security practices, and contribution rules across the ecosystem. CIOs increasingly say common protocols are essential for fixing vulnerabilities and enabling agents to function smoothly in real business environments. Read more here from The Information.

EYE ON AI RESEARCH

Google has created a new architecture to give AI models longer-term memory. The architecture, called Titans—which Google debuted at the start of 2025 and which Eye on AI covered at the time—is paired with a framework named MIRAS that is designed to give AI something closer to long-term memory. Instead of forgetting older details when its short-term memory window fills up, the system uses a separate memory module that continually updates itself. It assesses how surprising each new piece of information is compared with what is already stored in long-term memory, updating the memory module only when it encounters high surprise. In testing, Titans with MIRAS performed better than older models on tasks that require reasoning over long stretches of information, suggesting it could eventually help with things like analyzing complex documents, doing in-depth research, or learning continuously over time. You can read Google’s research blog here.
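To make the surprise-gating idea concrete, here is a deliberately simplified toy sketch—not Google’s actual Titans or MIRAS code, whose internals use learned neural memory—in which a memory vector is rewritten only when a new input’s reconstruction error (standing in for a learned surprise metric) exceeds a threshold, so familiar inputs leave the stored memory untouched:

```python
import numpy as np

class SurpriseGatedMemory:
    """Toy illustration of surprise-gated memory updates.

    Each incoming item is compared against the current memory state;
    the memory is rewritten only when the item is "surprising" enough
    (large mean squared error), so familiar inputs are skipped.
    """

    def __init__(self, dim: int, threshold: float = 0.5, lr: float = 0.5):
        self.memory = np.zeros(dim)   # stored long-term memory vector
        self.threshold = threshold    # surprise level that triggers an update
        self.lr = lr                  # how strongly a surprising item overwrites memory

    def surprise(self, x: np.ndarray) -> float:
        # Mean squared error between the new input and stored memory.
        return float(np.mean((x - self.memory) ** 2))

    def observe(self, x: np.ndarray) -> bool:
        """Update memory only on high-surprise inputs; return whether an update happened."""
        if self.surprise(x) > self.threshold:
            self.memory = (1 - self.lr) * self.memory + self.lr * x
            return True
        return False

mem = SurpriseGatedMemory(dim=4, threshold=0.5)
novel = np.ones(4)                         # very different from empty memory
updated_first = mem.observe(novel)         # high surprise -> memory updated
updated_second = mem.observe(novel * 0.9)  # now familiar -> update skipped
```

The class name, threshold, and error metric here are all illustrative assumptions; the sketch only captures the qualitative behavior the research describes, where routine inputs pass through without consuming memory capacity.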

AI CALENDAR

Jan. 6: Fortune Brainstorm Tech CES Dinner. Apply to attend here.

Jan. 19-23: World Economic Forum, Davos, Switzerland.

Feb. 10-11: AI Action Summit, New Delhi, India.

BRAIN FOOD

At NeurIPS, the mood shifts against LLMs as a path to AGI. The Information reported that a growing number of researchers attending NeurIPS, the AI research field’s most important conference—which took place last week in San Diego (with satellite events in other cities)—are increasingly skeptical of the idea that large language models (LLMs) will ever lead to artificial general intelligence (AGI). Instead, they feel the field may need an entirely new kind of AI architecture to advance to more human-like AI that can continually learn, can learn efficiently from fewer examples, and can extrapolate and analogize concepts to previously unseen problems.

Figures such as Amazon’s David Luan and OpenAI co-founder Ilya Sutskever contend that current approaches, including large-scale pre-training and reinforcement learning, fail to produce models that truly generalize, while new research presented at the conference explores self-adapting models that can acquire new knowledge on the fly. Their skepticism contrasts with the view of leaders like Anthropic CEO Dario Amodei and OpenAI’s Sam Altman, who believe scaling current methods can still achieve AGI. If critics are correct, it could undermine billions of dollars in planned investment in existing training pipelines.

Join us at the Fortune Workplace Innovation Summit May 19–20, 2026, in Atlanta. The next era of workplace innovation is here—and the old playbook is being rewritten. At this exclusive, high-energy event, the world’s most innovative leaders will convene to explore how AI, humanity, and strategy converge to redefine, again, the future of work. Register now.
About the Author
Jeremy Kahn, Editor, AI

Jeremy Kahn is the AI editor at Fortune, spearheading the publication's coverage of artificial intelligence. He also co-authors Eye on AI, Fortune’s flagship AI newsletter.

