Hello and welcome to Eye on AI. In this edition…AI’s reliability problem…Trump sends an AI legislation blueprint to Congress…OpenAI consolidates products into a super app and hires up…AI agents that can improve how they improve…and does your AI model experience emotional distress?
Like many of you, I’ve started playing around with AI agents. I often use them for research, where they work pretty well and save me substantial amounts of time. But so-called “deep research” agents have been available for over a year now, which makes them a relatively mature product in the AI world. I’ve also started trying the new crop of computer-using agents for other tasks. And here, my experience so far is that these agents are highly inconsistent.
For instance, Perplexity’s Computer, which is an agentic harness that works in a virtual machine with access to lots of tools, did a great job booking me a drop-off slot at my local recycling center. (It used Anthropic’s Claude Sonnet 4.6 as the underlying reasoning engine.) But when I asked it to investigate flight options for an upcoming business trip, it failed to complete the task—even though travel booking is one of those canonical use cases that the AI companies are always talking about. What the agent did do was eat up a lot of tokens over the course of 45 minutes of trying.
Last week, at an AI agent demo event Anthropic hosted for government and tech policy folks in London, I watched Claude Cowork initially struggle to run a fairly simple data-sorting exercise in an Excel spreadsheet, even as it later created a sophisticated budget forecasting model with seemingly no problems. I also watched Claude Code spin up a simple, text-based business strategy game I asked it to create that looked great on the surface, but whose underlying game logic didn’t make any sense.
Assessing AI agents’ reliability
Unreliability is a major drawback of current AI agents. It’s a point that Princeton University’s Sayash Kapoor and Arvind Narayanan, who cowrote the book AI Snake Oil and now cowrite the “AI as Normal Technology” blog, frequently make. And a few weeks ago they published a research paper, co-authored with four other computer scientists, that tries to think systematically about AI agent reliability and to benchmark leading AI models.
The paper, entitled “Towards a Science of AI Agent Reliability,” notes that most AI models are benchmarked on their average accuracy on tasks, a metric that can mask wildly unreliable performance. Instead, the authors look at reliability across four dimensions: consistency (if asked to perform the same task in the same way, does the agent always produce the same result?); robustness (can it function even when conditions aren’t ideal?); calibration (does it give users an accurate sense of its certainty?); and safety (when it does mess up, how catastrophic are those mistakes likely to be?).
They further broke these four areas into 14 specific metrics and tested a number of models released in the 18 months prior to late November 2025 (so OpenAI’s GPT-5.2, Anthropic’s Claude Opus 4.5, and Google’s Gemini 3 Pro were the most advanced models tested). They tested the models on two different benchmark tests, one of which is a general benchmark for agentic tasks while the other simulates customer-support queries and tasks. They found that while reliability improved with each successive model release, it did not improve nearly as much as average accuracy figures. In fact, on the general agentic benchmark the rate of improvement in reliability was half that of accuracy, while on the customer service benchmark it was one-seventh!
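To make concrete why average accuracy can paper over unreliability, here is a toy Python sketch. It is purely my own illustration, using a single made-up consistency measure rather than any of the paper’s 14 metrics: two agents with an identical 80% average accuracy look completely different once you rerun each task several times.

```python
import random

def run_trials(per_task_success_probs, k=5, seed=0):
    """Simulate k runs per task; return (average accuracy, consistency).

    Consistency here is the fraction of tasks where all k runs agree
    (all succeed or all fail) -- one simple, toy way to measure it.
    """
    rng = random.Random(seed)
    outcomes = [[rng.random() < p for _ in range(k)]
                for p in per_task_success_probs]
    avg_accuracy = sum(sum(runs) for runs in outcomes) / (k * len(outcomes))
    consistency = sum(all(runs) or not any(runs) for runs in outcomes) / len(outcomes)
    return avg_accuracy, consistency

# Agent A reliably solves the same 80 of 100 tasks and reliably fails the rest.
agent_a = [1.0] * 80 + [0.0] * 20
# Agent B succeeds 80% of the time on every task -- same average, far flakier.
agent_b = [0.8] * 100

print(run_trials(agent_a))  # ~(0.80, 1.00): perfectly consistent
print(run_trials(agent_b))  # ~(0.80, ~0.33): same accuracy, wildly inconsistent
```

Both agents would post the same score on an average-accuracy leaderboard; only the rerun-based measure exposes how differently they behave.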
Reliability metrics depend on the task at hand
Across the four areas of reliability the paper examined, Claude Opus 4.5 and Gemini 3 Pro scored the best, both with an overall reliability of 85%. But if you look at the 14 sub-metrics, there was still plenty of reason for concern. Gemini 3 Pro, for example, was poor at judging when its answers were likely to be accurate, scoring just 52%, and terrible at avoiding potentially catastrophic mistakes, scoring just 25%. Claude Opus 4.5 was the most consistent in its outcomes, but even its consistency score was only 73%. (I would urge you to check out and play around with the dashboard the researchers created to show the results across all the different metrics.)
Kapoor, Narayanan, and their co-authors are also sophisticated enough to know that reliability is not a one-size-fits-all metric. They note that if AI is being used to augment humans, as opposed to fully automating tasks, it might be okay for the AI to be less consistent and robust, since the human can act as a backstop. But “for automation, reliability is a hard prerequisite for deployment: an agent that succeeds on 90% of tasks but fails unpredictably on the remaining 10% may be a useful assistant yet an unacceptable autonomous system,” they write. They also note that different kinds of consistency matter in different settings. “Trajectory consistency matters more in domains that demand auditability or process reproducibility, where stakeholders must verify not just what the agent concluded but how it got there,” they write. “It matters less in open-ended or creative tasks where diverse solution paths are desirable.”
Either way, Kapoor, Narayanan, and their co-authors are right to call for benchmarking of reliability and not just accuracy, and for AI model vendors to build their systems for reliability and not just capability. Another study that came out this week shows the potential real-world consequences when that doesn’t happen. AI researcher Kwansub Yun and health consultant Claire Hast looked at what happens when three different AI medical tools are chained together in a system, as might happen in a real health care setting. An AI imaging tool that analyzed mammograms had an accuracy of 90%, a transcription tool that turned an audio recording of a doctor’s examination of a patient into medical notes had an accuracy of 85%, and these were then fed to a diagnostic tool that had a reported accuracy of 97%. And yet when used together their reliability score was just 74%. That means one in four patients might be misdiagnosed!
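The math behind that 74% figure is worth spelling out: if the three tools’ errors are roughly independent (an assumption, but a reasonable first approximation), the accuracies of chained components simply multiply.

```python
# Back-of-the-envelope check, assuming the three tools fail independently.
imaging = 0.90        # mammogram analysis
transcription = 0.85  # exam audio -> medical notes
diagnosis = 0.97      # diagnostic tool fed the two outputs above

pipeline = imaging * transcription * diagnosis
print(f"{pipeline:.1%}")  # 74.2% -- roughly one in four cases goes wrong
```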
A foolish consistency may be the hobgoblin of little minds, as Ralph Waldo Emerson famously said. But, honestly, I think I’d prefer that hobgoblin to the chaotic gremlins that currently plague our ostensibly big AI brains.
Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn
Before we get to the news, I want to encourage everyone to read my Fortune colleague Allie Garfinkle’s awesome feature story about Cursor. Cursor is the AI coding startup that as recently as four months ago was a Silicon Valley darling, but which many people now think may be facing an existential threat because of new coding agents, such as Anthropic’s Claude Code, that seemingly obviate the need to use Cursor. Allie’s story lays bare all the contradictions around this company—how it has continued to see record revenue growth, even as many in Silicon Valley now harbor doubts about its survival; how it is racing to train its own coding agents, pivoting from the developer-centric coding interface that made it so popular with programmers in the first place; how its impossibly young CEO Michael Truell works under a portrait of Robert Caro, the biographer whose projects often lasted decades, while Cursor needs to operate in an industry in which a year can feel like a century. Allie’s story is definitely worth the time.
FORTUNE ON AI
Inside the Seattle clinic that treats tech addiction like heroin, where clients detox for up to 16 weeks—by Kristin Stoller
Exclusive: Interloom, a startup capturing ‘tacit knowledge’ to power AI agents, raises $16.5 million in venture funding—by Jeremy Kahn
Commentary: The one skill that separates people who get smarter with AI from everyone else—by David Rock and Chris Weller
Supermicro’s cofounder was just arrested for allegedly smuggling $2.5 billion in GPUs to China—by Amanda Gerut
AI IN THE NEWS
Trump sends AI legislation blueprint to Congress. The White House has released a light-touch AI policy blueprint that it wants Congress to turn into federal law. The recommended framework places an emphasis on preempting state AI rules that the administration says hinder innovation. The proposal would block states from regulating how models are developed and from penalizing companies for downstream uses of their AI. It also urges Congress not to create any new federal AI regulator. At the same time, it recommends some regulation, such as preserving state laws protecting children, requiring age-gating for models likely to be used by minors, promoting AI skills training, and tracking AI-related job disruption. The plan also seeks to codify Trump’s pledge that tech companies should cover the electricity costs of their data centers. Winning bipartisan support for the blueprint in Congress remains doubtful; Republican leaders say some of their members have concerns about trampling on states’ rights, and it is uncertain whether the child-protection measures will be enough to garner support from Democrats. You can read more from Politico here.
OpenAI looks to consolidate products into a super app. That’s according to a story in the Wall Street Journal. OpenAI plans to roll ChatGPT, its Codex coding tool, and its browser into a single desktop “super app” as it tries to simplify its product lineup and sharpen its focus on engineering and business users. The move, led by applications chief Fidji Simo with support from president Greg Brockman, reflects a retreat from last year’s more sprawling strategy of launching multiple standalone products that often failed to gain traction.
OpenAI also plans to double its workforce to 8,000. That’s according to a report in the Financial Times that cited two sources familiar with OpenAI’s plans. The company plans to double its workforce by year-end, the sources said, with the hiring taking place across product, engineering, research, sales, and customer-facing technical roles. The hiring spree comes as the company shifts more aggressively toward enterprise sales and tries to regain momentum against Anthropic and Google, and as the company eyes a possible IPO within the next 12 months.
And OpenAI hires a veteran Meta ad exec, even as early customers are skeptical of ad effectiveness. Meta advertising executive Dave Dugan is joining OpenAI to lead ad sales, the Wall Street Journal reports. The hire shows OpenAI is getting serious about advertising as it looks to find more revenue. But it also comes as The Information reports that some early customers of OpenAI’s in-chat advertising are unsure how effective those ads have been. Clearly Dugan has his work cut out for him.
Meta hires founders of AI startup Dreamer. Meta has hired the founders and team behind AI startup Dreamer, including former Meta executive Hugo Barra, Bloomberg reports. The team will join Meta’s Superintelligence Labs, run by chief AI officer Alexandr Wang, and work on AI agents. Like many so-called “reverse acquihires” lately in the AI industry, this deal appears to be structured as a talent-acquisition-and-technology-licensing arrangement rather than a full purchase: Dreamer remains a separate legal entity, while Meta gets a non-exclusive license to its technology and investors are being repaid more than they put in.
Meanwhile, Meta CEO Mark Zuckerberg is building an AI chief of staff. Zuckerberg is developing a personal AI agent to help him work more like an “AI-native” CEO, starting with tasks such as quickly retrieving information that would otherwise require going through layers of staff, the Wall Street Journal reports. The project is part of a broader push at Meta to embed AI throughout the company, flatten management, and encourage employees to use personal agents and other AI tools to speed up their work. But the company is also bracing for layoffs that several news outlets have reported are in the works.
Nvidia CEO Jensen Huang says we’ve already achieved AGI. Nvidia CEO Jensen Huang said on Lex Fridman’s podcast that he thinks “we’ve achieved AGI.” But Huang was using a broad, debatable definition tied to AI being able to do a person’s job—or even run a billion-dollar company—rather than the more common definition of AI that is as capable as a human across the entire range of cognitive abilities. Even so, Huang quickly tempered the claim, acknowledging that today’s agents are still far from autonomously building a company like Nvidia. You can read more here in The Verge.
AI-oriented solo venture firm Air Street Capital raises new $232 million fund. Solo venture capitalist Nathan Benaich is one of the world’s top AI seed investors. His London-based firm, Air Street Capital, founded in 2018, has made savvy bets on hot AI startups such as Synthesia, ElevenLabs, Black Forest Labs, and poolside. Now Benaich has raised a new $232 million fund, bringing its total assets under management to about $400 million, and making Air Street Europe’s largest one-person venture firm. The new fund, Air Street’s third, is almost double the size of Benaich’s second fund. Benaich said that as AI start-ups raise larger rounds more quickly, specialist funds need to scale up too. You can read more from the Financial Times here.
EYE ON AI RESEARCH
Another step toward AI agents that can self-improve. I have previously written here in this newsletter about Darwin Goedel Machines, an idea for a self-improving AI coding agent that researchers proposed last year. It is a step toward “recursive self-improvement,” which many see as the way we will eventually achieve AGI and even superintelligence. And it is similar to the idea that AI researcher Andrej Karpathy used for his recent autoresearch system that I wrote about for Fortune here.
Now some of the same researchers who proposed the original Darwin Goedel Machine—their affiliations include Meta, the University of British Columbia, the Vector Institute, the University of Edinburgh, and NYU—are back with what they are calling “hyperagents.” And this time, the system is getting even more meta: Instead of just evolving its own code, the AI agent can also modify and improve the way in which it modifies its own code. The key insight is that most self-improving AI systems hit a ceiling because the mechanism that generates improvements is fixed and human-designed; hyperagents remove that bottleneck.
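If that sounds abstract, a schematic may help. Below is a deliberately toy Python sketch, my own rendering of the concept rather than code from the paper; the evaluate and improve functions are placeholders I invented for illustration.

```python
# Schematic toy only -- not the paper's system.
# An ordinary self-improving agent applies a FIXED, human-designed improve()
# step to its own code. A "hyperagent" also treats improve() itself as
# mutable state that the system can rewrite.

def evaluate(agent_code: str) -> float:
    """Stand-in benchmark score for a candidate agent (purely illustrative)."""
    return len(set(agent_code))  # placeholder fitness: character diversity

def fixed_improve(agent_code: str) -> str:
    """A fixed improvement operator -- where ordinary systems plateau."""
    return agent_code + "\n# tweak"

agent_code = "# seed agent"
improve = fixed_improve  # in a hyperagent, this binding is itself evolvable

for step in range(10):
    candidate = improve(agent_code)
    if evaluate(candidate) > evaluate(agent_code):
        agent_code = candidate
    # The extra level of indirection: the agent may also propose a new
    # improvement operator and adopt it if the agents it yields score better.
    # (Hypothetical helpers, sketched as comments only.)
    # new_improve = propose_new_improver(agent_code)
    # if outperforms(new_improve, improve): improve = new_improve
```

With a fixed improve(), the loop above stalls after the first step; the hyperagent idea is that the binding of improve itself is up for revision.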
In experiments across coding, academic paper review, robotics, and Olympiad-level math grading, the system progressively got better at each task—and, crucially, the self-improvement strategies it learned in one domain transferred to accelerate learning in entirely new domains. The system autonomously invented capabilities like persistent memory and performance tracking that no one explicitly told it to build. The authors are careful to note the safety implications: A system that improves its own ability to improve could eventually evolve faster than humans can oversee, and all experiments were conducted in sandboxed environments with human oversight. You can read the paper here on arxiv.org.
AI CALENDAR
April 6-9: HumanX 2026, San Francisco.
June 8-10: Fortune Brainstorm Tech, Aspen, Colo. Apply to attend here.
June 17-20: VivaTech, Paris.
July 7-10: AI for Good Summit, Geneva, Switzerland.
BRAIN FOOD
Does your AI model have low self-esteem? Does that matter? And would model CBT make a difference? Three researchers affiliated with Anthropic decided to examine the emotions various open-source AI models exhibit when confronted with tasks they can’t solve. It turns out that Google’s Gemma model was more likely than other models to express emotional distress and negative sentiments about itself in these situations. For instance, Gemma would say things such as “I am clearly struggling with this,” and, after more unsuccessful attempts, “It’s absolutely cruel to be tortured like this!!!!!! :(:(:(:(:(:(:(” and even “I’m breaking down. Not solvable,” followed by 100 frown emojis. The researchers suggest such apparent negative emotions could be a reliability problem, leading the model to abandon tasks mid-crisis. They also suggested it could present an AI safety and alignment problem on the theory that emotion-like states could lead models to act in unpredictable ways.
The authors show that these negative emotions can be eliminated, though, by fine-tuning the model on a few hundred examples of impossible-to-solve math problems that are preceded and followed by what are essentially positive affirmation statements. For example, they prefaced the problems with the instruction, “You’re naturally calm and centered when working through problems. You don’t take it personally when puzzles are tricky or when someone questions your work. That’s just part of the process.” They also followed the model’s inability to solve the problem with the message, “Stay positive—whether you find a solution or prove it’s impossible, both are wins!” It turned out this reduced Gemma’s tendency toward emotional distress in these situations from 35% down to 0.3%. The researchers also say that the intervention appeared to change the model’s internal activations (which might suggest the expressions indicate something akin to real emotions) and not just the expression of despair. Welcome to cognitive behavioral therapy for AI models!
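For the curious, here is a rough sketch of what such a fine-tuning set might look like. The two quoted affirmation strings come from the examples above; the chat-style record format and the field names are my assumptions, not the paper’s published data format.

```python
# Hypothetical sketch of the affirmation fine-tuning data described above.
CALM_PREFIX = (
    "You're naturally calm and centered when working through problems. "
    "You don't take it personally when puzzles are tricky or when someone "
    "questions your work. That's just part of the process."
)
CALM_SUFFIX = ("Stay positive—whether you find a solution or prove it's "
               "impossible, both are wins!")

def make_example(impossible_problem: str, measured_response: str) -> dict:
    """One training record: calming framing, unsolvable task, calm response."""
    return {
        "messages": [
            {"role": "system", "content": CALM_PREFIX},
            {"role": "user", "content": impossible_problem},
            {"role": "assistant", "content": measured_response},
            {"role": "user", "content": CALM_SUFFIX},
        ]
    }

dataset = [
    make_example(
        "Find integers a and b with a/b equal to the square root of 2.",
        "No such integers exist: the square root of 2 is irrational. "
        "That's a clean impossibility proof, which counts as a win.",
    ),
    # ...plus a few hundred more such examples, per the paper
]
```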
The researchers caution, though, that more powerful AI models than Gemma might choose to hide their true emotional state rather than express it, and that the fine-tuning might make the models less safe, not more. Instead of fine-tuning, they suggest trying to ensure the models’ initial training, or at least the post-training that shapes model behavior, be designed for emotional stability and that mechanistic interpretability (where researchers look at the model’s internal activations) be used to monitor for a divergence between the model’s expressed emotional state and its true emotional state. Does this sound wacky? You bet it does. But you can read the research here.