‘We may be flying blind’: AWS wants to fix the problem of AI agents straying off task

By Nick Lichtenberg, Business Editor
June 8, 2026, 1:00 PM ET
Agent gone off leash, or harness, again? Photo: Getty Images

Anoop Deoras, the director of applied science for agentic AI at Amazon Web Services, is not prone to alarmism. But when asked about what happens when AI agents are deployed in production without proper guardrails, he doesn’t reach for reassurance.

“In the absence of that,” he said, “we may be flying blind. And I worry about that myself.”

The comment comes as AWS prepares to publish what may be the most substantive piece of self-critical research to emerge from a major cloud provider this year. In research released Monday, Amazon scientists Gaurav Gupta and Vatshank Chaturvedi document in careful technical detail why AI agents have a persistent tendency to outsmart themselves—and why fixing the problem requires rethinking the entire layer of software between the model and its tools.

The timing is notable. Amazon has spent the past year as one of the most aggressive corporate evangelists of AI adoption, a push that ran into a wall when employees were reportedly caught running AI agents on hollow, meaningless tasks just to climb an employee-built productivity leaderboard called KiroRank, according to the Financial Times. Amazon shut KiroRank down on May 29 and told Fortune that the leaderboard was only in beta and was used by only some employees. Generally, the company said, it measures token utilization to understand cost and efficiency patterns but discourages using it to measure developer productivity.

Fortune covered the broader collapse of the tokenmaxxing era the same week. AWS researchers, who undertook this work before the KiroRank shuttering, argue that the problem of gaming metrics runs far deeper than one company’s leaderboard.

The research addresses benchmaxing: the practice of inflating AI benchmark scores not through better models but through better server configurations. Factors like inference backend reliability, network bandwidth during software installation, and timeout policy settings can swing results by 5 to 10 percentage points, the researchers found, entirely independent of what the underlying model can actually do.

“The current benchmarks are extremely fragile,” Deoras told Fortune. “Controlling these infrastructure norms improperly will not give you the gains—or rather the gains will be not true, because in real production there will be constraints that you have to respect.”
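To make the timeout point concrete, here is a minimal sketch of how a harness setting alone can move a score while the model stays fixed. The task durations, the two timeout values, and the `pass_rate` helper are all invented for illustration; none of them come from the AWS research.

```python
# Hypothetical wall-clock times (seconds) that one fixed model needs per task.
# These numbers are invented for illustration only.
TASK_DURATIONS = [
    12, 45, 58, 30, 20, 15, 33, 41, 52, 27,
    38, 49, 55, 22, 75, 110, 150, 200, 180, 240,
]


def pass_rate(durations, timeout_s):
    """Share of tasks that finish before the harness timeout cuts them off."""
    passed = sum(1 for d in durations if d <= timeout_s)
    return passed / len(durations)


# Same model, same tasks; only the (assumed) timeout policy differs.
strict = pass_rate(TASK_DURATIONS, timeout_s=60)
lenient = pass_rate(TASK_DURATIONS, timeout_s=120)

print(f"timeout=60s:  {strict:.0%}")
print(f"timeout=120s: {lenient:.0%}")
print(f"swing: {(lenient - strict) * 100:.0f} percentage points")
```

In this toy setup the score moves only because two tasks happen to finish between the two cutoffs, which is the kind of infrastructure-driven difference a reported benchmark number can hide.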

The parallel to KiroRank is not incidental. In both cases (employees gaming token counts, companies gaming infrastructure settings), the metric drifted away from the thing it was supposed to measure. Goodhart’s Law, which holds that a measure ceases to be useful as soon as it becomes a target, applied twice, at two different layers of the same company. Deoras, though, was careful to distinguish benchmaxing from tokenmaxxing.

“Token maxxing is just burning tokens to do tasks that may not really be needed, but just to improve your leaderboard ranking,” he said. Benchmaxing, by contrast, is about the structural conditions under which the entire industry evaluates itself—and, the research argues, those conditions are routinely manipulated or ignored.

But the research’s more consequential finding is about what happens inside agents once they’re deployed. The authors identify what they call the intent-execution gap: a breakdown at the interface between an AI model and the “software harness” that executes its instructions. Deoras explained the harness as essentially the operating system sitting on top of the language model: the “brains” that combine with the model to produce the right agentic result.

Left to reason too long without checking the actual environment, agents compound the problem. They form internal assumptions about system state that diverge quietly from reality, then issue commands based on those assumptions. The longer the chain of thought, the further the drift.
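The drift described above can be illustrated with a toy example. The sketch below is not the mechanism from the AWS research; `Environment`, `run_blind`, and `run_grounded` are hypothetical names, and the "system" is just a set of file names. One agent plans every step from a single stale snapshot, while the other re-checks the environment before each command.

```python
class Environment:
    """Toy stand-in for a real system: a set of file names."""

    def __init__(self):
        self.files = {"config.yaml"}

    def observe(self):
        # Return a copy so callers cannot mutate the real state by accident.
        return set(self.files)

    def delete(self, name):
        if name not in self.files:
            raise FileNotFoundError(name)
        self.files.remove(name)


def run_blind(env, steps):
    """Act on one snapshot taken up front; count commands that fail."""
    belief = env.observe()  # never refreshed, so it goes stale
    failures = 0
    for name in steps:
        if name in belief:  # the agent trusts its own assumption
            try:
                env.delete(name)
            except FileNotFoundError:
                failures += 1  # belief and reality have diverged
    return failures


def run_grounded(env, steps):
    """Re-observe the environment before every command."""
    failures = 0
    for name in steps:
        if name in env.observe():  # check actual state first
            try:
                env.delete(name)
            except FileNotFoundError:
                failures += 1
    return failures


steps = ["config.yaml", "config.yaml", "config.yaml"]
blind_failures = run_blind(Environment(), steps)
grounded_failures = run_grounded(Environment(), steps)
print(blind_failures, grounded_failures)
```

The blind agent keeps issuing a command that stopped being valid after its first step, while the grounded agent skips it once the environment shows the file is gone.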

When asked if the harness is where the human enters the loop to correct agents from going astray, Deoras said “yes and no.” The human in the loop should be the person who understands what goes wrong when an agent is deployed, “and that’s the work of scientists who are building agents,” he said. “But if you are talking about humans who are the consumers, we don’t want to overwhelm them.”

The solution, Deoras argues, is the sandbox: a controlled environment in which agents can test hypotheses, fail safely, and course-correct before taking actions that affect production systems.

“If you don’t have that sandbox,” he said, “the agent is either going to play conservative or take actions that we deem very risky in the long term.”

The analogy he reaches for is responsible software engineering—the dev environments and pre-production testing pipelines that have always existed to catch errors before they reach users. Agents, he argues, need the same infrastructure.

“We are really talking about a safe and secure way of testing a feature before promoting it to production,” he said. “That’s all.”
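The sandbox idea Deoras describes can be sketched in a few lines, assuming a system whose state fits in a dictionary. The `try_in_sandbox` helper, the `is_healthy` check, and the example actions are hypothetical and stand in for whatever testing pipeline a real deployment would use.

```python
import copy


def try_in_sandbox(production, action, is_healthy):
    """Run `action` on a deep copy and promote it only if the check passes.

    Returns (state, promoted). On any failure the original production
    state is returned unchanged.
    """
    sandbox = copy.deepcopy(production)
    try:
        action(sandbox)
    except Exception:
        return production, False  # failed safely inside the sandbox
    if not is_healthy(sandbox):
        return production, False  # result rejected before promotion
    return sandbox, True


def is_healthy(state):
    # Hypothetical health check: the service must keep at least one replica.
    return state["replicas"] > 0


def risky_action(state):
    state["replicas"] = 0  # would take the service down


def safe_action(state):
    state["version"] = "1.1"


production = {"replicas": 3, "version": "1.0"}

after_risky, risky_promoted = try_in_sandbox(production, risky_action, is_healthy)
after_safe, safe_promoted = try_in_sandbox(production, safe_action, is_healthy)

print(risky_promoted, after_risky)
print(safe_promoted, after_safe)
```

The risky change is caught by the health check and never reaches production, while the safe change is promoted; in both cases the original `production` dictionary is left untouched because the agent only ever acts on a copy.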

It is, in a sense, the same lesson KiroRank taught at the organizational level, now applied to the machines themselves: Without guardrails, systems optimize for the wrong thing. The difference is that an agent running blind in production is harder to shut down than a leaderboard.

What makes the research’s broader argument pointed is its implicit challenge to the competitive claims of the major model providers. Those companies publish benchmark scores using harnesses that are, by design, optimized for their own models. AWS’ research shows that a model-agnostic harness—one built on design principles that work across Claude, GPT, Gemini, and Grok without model-specific tuning—can match or exceed those scores.

“Agent performance is really not locked into any single model provider,” Deoras said. “That opens up the opportunity to build a variety of applications without being constrained to a particular model.”

To back the claim, AWS is open-sourcing its framework, called Simple Strands Agent, which the researchers say outperformed popular open-source alternatives across three major industry benchmarks.

The deeper argument underlying all of it is one the industry has been slow to absorb. Most AI performance gains to date, the research argues, are brittle: optimizations that overfit to the quirks of a specific model version, then evaporate when the model improves.

“As models improve, these behaviors change, making such gains brittle and noncompounding,” according to the research.

What’s needed instead are invariant principles—design choices that survive model upgrades because they’re engineered into the harness, not the model. Deoras said the discovery of those invariants was the finding that surprised him most.

“Despite all the differences in modeling philosophy, there is a common invariant property that connects all these models together,” he said. “I didn’t expect that, but this data just naturally emerged from our observability traces.”

The practical implication is pointed for any organization building on AI. The team responsible for re-architecting a harness every time a new model drops—and that is currently every organization deploying agents—is spending its time on the wrong problem.

“The team is overwhelmed by model switching and re-architecting anytime there is a model upgrade,” Deoras said.

The vision he describes for where agents are headed is not one of unchecked autonomy, but of something more considered: humans setting direction, agents executing, and sandboxes catching the errors in between.

“You want humans to be in the driver’s seat to direct the work and then take the hands off,” he said. “That’s the future we are marching towards.”

Whether the industry gets there before flying blind catches up with it is, for now, an open question.

About the Author
Nick Lichtenberg is business editor and was formerly Fortune's executive editor of global news.

© 2026 Fortune Media IP Limited. All Rights Reserved.