AI agents promise to work while you sleep. The reality is far messier

Summer Yue may work on safety and alignment on Meta’s superintelligence team, but even she admits she isn’t immune to overconfidence when it comes to autonomous AI agents.

In a post on X Monday, Yue described how her OpenClaw autonomous AI agents—built to run locally on a Mac mini computer—deleted her entire inbox, ignoring instructions to pause and ask for confirmation first.

“I had to RUN to my Mac Mini like I was defusing a bomb,” she said. It was, she added, a “rookie mistake.” The workflow had worked for weeks in a test inbox she used to trial the agent safely, she explained, but in the real inbox the agent lost her original instruction.

Yue’s experience stands in stark contrast to viral posts such as The Lobster Revolution: Why 24/7 AI Agents Just Changed Everything, in which Peter Diamandis describes always-on AI as a far smoother experience.

“Let me tell you what it feels like to use this,” Diamandis wrote. “You wake up in the morning and your agent—mine is named Skippy, cheerfully sarcastic and absurdly capable—has done eight hours of work while you slept. It read a thousand pages of markdown. It organized your files. It drafted three project plans. It booked your travel. It researched that question you had at 11 PM and forgot about.

“When my Mac mini went offline for six hours, I felt withdrawal,” he added. “Like my best friend disappeared.”

Together, these dueling accounts of the power of AI agents capture the tension at the heart of today’s push toward “always-on” AI. As tools like OpenClaw and Claude Code make it technically possible for agents to run for long periods, excitement is growing around the idea of AI that works while you sleep. But in practice, early users say that autonomy remains fragile, unpredictable, and labor-intensive to manage. Rather than replacing human work, today’s agents often require constant monitoring, guardrails, and intervention, especially when the stakes rise beyond low-risk experiments.

AI agents work best when tasks are simple and low-stakes

Shyamal Anadkat, who previously worked as an applied AI engineer at OpenAI, said most of today’s successful agents still require frequent human check-ins or are limited to tightly bounded, well-defined tasks—though he emphasized that this will change as measurement and evaluation techniques improve.

“A system that’s 95% accurate on individual steps becomes chaotic over a 20-step autonomous workflow,” Anadkat said. “Long-horizon planning is still weak.” As a result, he explained, agents may perform well on short task chains but tend to fall apart when asked to manage complex, multiday projects. Memory is another major limitation: “In many agents, memory is either nonexistent or fragile. You need systems that can maintain a coherent model of your work context, priorities, and constraints.”
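Anadkat’s arithmetic can be checked with a back-of-the-envelope calculation. The sketch below is illustrative rather than anything he described: it assumes every step succeeds independently with the same probability, and that a single failed step sinks the whole workflow. The function name chain_success_probability is hypothetical.

```python
def chain_success_probability(step_accuracy: float, steps: int) -> float:
    """Chance that every step in a workflow succeeds.

    Assumption (not from the article): steps fail independently and
    one failed step ruins the run, so the per-step odds multiply.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_accuracy ** steps


# Compare short task chains with the 20-step workflow from the quote.
for steps in (1, 5, 10, 20):
    probability = chain_success_probability(0.95, steps)
    print(f"{steps:>2} steps: {probability:.1%}")

# Prints:
#  1 steps: 95.0%
#  5 steps: 77.4%
# 10 steps: 59.9%
# 20 steps: 35.8%
```

Under those assumptions, an agent that is 95% accurate per step finishes a 20-step workflow without a single error only about 36% of the time. Real agents can sometimes catch and recover from their own mistakes, so this is a rough illustration rather than a measurement.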

That doesn’t mean the promise of AI agents is all smoke and mirrors, according to Yoav Shoham, a former principal scientist at Google, a professor emeritus at Stanford, and cofounder of AI21 Labs. But it does mean there is a danger of people getting ahead of themselves. Today’s AI agents, he explained, work best when the task is low-risk, loosely defined, and cheap to get wrong.

“Developers like toys, and you have this toy that can do wonderful things,” he told Fortune. “As long as what they’re doing is fairly simple and fairly low-stakes with high tolerance for error, that’s fine.” For example, an agent could read 10,000 websites overnight and do something interesting with the results, surfacing tidbits of information that could be useful.

But for mission-critical enterprise workflows, the bar is much higher. Companies need systems that are verifiable, repeatable, and cost-effective—requirements that quickly erode the set-it-and-forget-it promise of fully autonomous, always-on agents. In highly structured domains like coding or math, deeper automation is already possible. But for most real-world business processes, Shoham says, the work required to make agents reliable often outweighs the benefit.

Bret Greenstein, chief AI officer at consulting firm West Monroe, pointed out that tools like OpenClaw feel like a tipping point similar to what happened with generative AI when ChatGPT launched in 2022: For the first time, a tool has made the idea of AI agents accessible. Still, it’s not a 24/7 “magic solution.”

“It can work for a long time, cranking away on things, but it’s like a toddler that needs to be overseen,” he said. Some tasks are reasonable to do while you are sleeping, like scanning LinkedIn messages or tracking news. “I’m not sure I would have it answering customer feedback while I’m sleeping,” he said.

Ability to delegate to an AI agent feels powerful

Still, there is little doubt that the ability to delegate real-world tasks to an AI agent is deeply compelling for users, Greenstein emphasized. He pointed to his own experience handing an AI agent the mundane task of getting his clothes picked up to be dry-cleaned—and watching it quietly complete the job end to end.

The agent independently contacted the cleaner, worked out pickup logistics through email exchanges, coordinated timing, monitored a doorbell camera to confirm the pickup, and notified Greenstein once the task was complete. The episode illustrated how agents can operate across multiple systems and adapt when things don’t go as planned. But it also underscored why such tools still require strict guardrails and oversight—especially before they are deployed in enterprise settings.

“OpenClaw is set up so it shouldn’t feel safe for most people,” Greenstein said. “It doesn’t feel mature enough to be a trusted part of our lives yet.” For AI to be welcomed into everyday life or business operations, he added, it has to earn trust over time—much the way trust is established socially.

Even so, demand is already evident. Greenstein pointed to meetups and early industry gatherings dedicated to OpenClaw, a rapid emergence he described as unusual for such a young tool. “It shows the hunger people have for AI that’s actually useful,” he said—systems that move beyond answering questions and start taking action.

Aaron Levie, CEO of cloud-based content management and collaboration company Box, called what is happening now with AI agents “little glimmers” of what might happen in the future.

“Some glimmers end up not manifesting, some glimmers just become the standard,” he explained, pointing to two years ago when AI company Cognition introduced an early agent called Devin that would integrate with Slack for task delegation, bug fixes, data analysis, and code review. At the time, it was still seen as futuristic, but today, “no one is confused that this is a standard practice,” he said. “You can just Slack Claude Code to go work on stuff—what seemed like a totally crazy idea is now basically the standard of any modern engineering team.”

But while AI agents are becoming very good at automating specific, discrete tasks, they remain poor at handling the broader, context-heavy work that makes up most jobs, Levie emphasized. AI agents may fully automate a handful of tasks, but struggle with the rest—including navigating relationships and participating in meetings.

“When you hear an AI lab say we’re going to automate all knowledge work in 24 months, that’s usually a very narrow definition of jobs,” he said. “The definition of what an agent can do is not the same definition of what the job is that gets hired in the economy.”

The trust factor matters most when things can go wrong

Avinash Vootkuri, a staff data scientist at a top Fortune 500 retailer, said that most enterprise AI agents “absolutely require a babysitter” and, for now, can work only in enterprise settings with tightly bounded autonomy and extensive guardrails. “The stakes are massive,” he explained.

For example, he described building an agentic system for enterprise cybersecurity where AI agents don’t simply trigger alerts and wait for human review but actively investigate them. Instead of flooding analysts with thousands of warnings, the agents gather evidence in real time—querying threat-intelligence databases, analyzing behavioral patterns, and filtering out false positives—before deciding whether a situation warrants escalation.

The system relies on tightly bounded autonomy and extensive guardrails, reducing human workload without removing oversight.

In cybersecurity, he explained, if the agent gets it wrong, the consequences are immediate and severe. “The AI either blocks legitimate customers (causing massive revenue loss) or it lets a sophisticated threat actor into the network,” he said. “It absolutely matters if things go wrong.”

According to Breeanna Whitehead, who runs an AI operations consultancy where she builds AI-powered systems for executives and founders, the industry is in a “trust calibration phase.”

AI agents can do more than most people let them, but less than the hype suggests.

“The real skill isn’t building the agent; it’s designing the handoff,” she explained. “Most people either over-trust agents and end up cleaning up messes, or they micromanage every output and wonder why AI feels like more work instead of less.” The idea, she said, is to design clear handoff points, where one task might be fully delegated, another gets a quick review, and a third stays with humans.

For now, she said, agents are “genuinely excellent” at what she called the middle layer of knowledge work: “the stuff that used to eat two to three hours of a smart person’s day, like synthesizing meeting notes into action items, drafting follow-up emails in someone’s voice, pulling together research briefs, organizing competing priorities into a clear plan.”

But anything that requires reading a room, navigating ambiguity, or making judgment calls that depend on relationships is not ready for AI agent prime time. “I had a client who wanted to fully automate their investor communications,” she said. “The AI could draft beautifully, but it couldn’t sense when a funder was losing interest and needed a different approach. The agent drafted the email, but the human had to decide whether to send it.”

For now, sleep may be elusive when working with AI agents

Working with AI agents may have less to do with sleeping while they work than with staying half-awake while they do. Tools like OpenClaw can run for hours at a time, but for many early users, that autonomy comes with a new kind of vigilance: checking logs, reviewing outputs, and stepping in before things go wrong.

That dynamic was captured in a recent viral post titled Token Anxiety, in which investor Nikunj Kothari described a friend leaving a party early—not because he was tired, but because he wanted to get back to his agents. “Nobody questions it anymore,” Kothari wrote. “Half the room is thinking the same thing. The other half are probably checking the progress of their agents. At a party.”

The dream of AI that works while you sleep may be real. But for now, it’s still keeping a lot of people awake.
