
Anthropic makes a breakthrough in opening AI’s ‘black box’

By Jeremy Kahn, Editor, AI
March 27, 2025, 1:00 PM ET
Anthropic CEO Dario Amodei. Today the company announced that its researchers had made a breakthrough in probing how large language models, like the one that powers Anthropic's Claude chatbot, formulate responses. FABRICE COFFRINI/AFP—Getty Images

Researchers at the AI company Anthropic say they have made a fundamental breakthrough in our understanding of exactly how large language models, the type of AI responsible for the current boom, work. The breakthrough has important implications for how we may be able to make AI models safer, more secure, and more reliable in the future.

One of the problems with today’s powerful AI systems, which are built around large language models (LLMs), is that the models are black boxes. We know what prompts we feed them and what outputs they produce, but exactly how they arrive at any particular response is a mystery, even to the AI researchers who build them.

This inscrutability creates all kinds of issues. It makes it difficult to predict when a model is likely to “hallucinate,” or confidently spew erroneous information. We know these large AI models are susceptible to various jailbreaks, in which they can be tricked into jumping their guardrails (the limits AI developers try to place around a model’s outputs so that it doesn’t use racist language, write malware for someone, or tell them how to build a bomb). But we don’t understand why some jailbreaks work better than others, or why the fine-tuning used to create the guardrails doesn’t instill inhibitions strong enough to stop the models from doing things their developers don’t want them to do.

Our inability to understand how LLMs work has made some businesses hesitant to use them. If the models’ inner workings were more understandable, it might give companies more confidence to use the models more widely.

There are implications for our ability to retain control of increasingly powerful AI “agents,” too. We know these agents are capable of “reward hacking”—finding ways to achieve a goal that are not what the user of the model intended. In some cases the models can be deceptive, lying to users about what they have done or are trying to do. And while the recent “reasoning” AI models produce what’s known as a “chain of thought”—a kind of plan for how to answer a prompt that involves what looks to a human like “self-reflection”—we don’t know whether the chain of thought the model outputs accurately represents the steps it is actually taking (and there is often evidence that it does not).

Anthropic’s new research offers a pathway to solving at least some of these problems. Its scientists created a new tool for deciphering how LLMs “think.” In essence, what the Anthropic researchers built is a bit like the fMRI scans neuroscientists use to scan the brains of human research subjects and uncover which brain regions seem to play the biggest role in different aspects of cognition. Having invented this fMRI-like tool, Anthropic then applied it to its Claude 3.5 Haiku model. Doing so allowed the researchers to resolve several key questions about how Claude, and probably most other LLMs, work.

The researchers found that although LLMs like Claude are initially trained just to predict the next word in a sentence, in the process Claude does learn to do some longer-range planning, at least for certain kinds of tasks. For instance, when asked to write a poem, Claude first picks rhyming words that fit the poem’s topic or theme, and then works backward to construct lines that end with those rhyming words.

They also found that Claude, which is trained to be multilingual, doesn’t have completely separate components for reasoning in each language. Instead, concepts that are common across languages are embedded in the same set of neurons within the model and the model seems to “reason” in this conceptual space and only then convert the output to the appropriate language.

The researchers also discovered that Claude is capable of lying about its chain of thought in order to please a user. The researchers showed this by asking the model a tough math problem while giving it an incorrect hint about how to solve it. The model went along with the hint, working backward from it to construct a plausible-looking chain of thought rather than genuinely solving the problem.

In other cases, when asked an easier question that the model can answer more or less instantly, without having to reason, the model makes up a fictitious reasoning process. “Even though it does claim to have run a calculation, our interpretability techniques reveal no evidence at all of this having occurred,” said Josh Batson, an Anthropic researcher who worked on the project.

The ability to trace the internal reasoning of LLMs opens new possibilities for auditing AI systems for security and safety concerns. It also may help researchers develop new training methods to improve the guardrails that AI systems have and to reduce hallucinations and other faulty outputs. 

Some AI experts dismiss LLMs’ “black box problem” by pointing out that human minds are also frequently inscrutable to other humans, and yet we depend on humans all the time. We can’t really tell what someone else is thinking—and in fact, psychologists have shown that sometimes we don’t even understand how our own thinking works, making up logical explanations after the fact to justify actions we take either intuitively or largely because of emotional responses of which we may not even be conscious. We often wrongly assume that another person thinks more or less the same way we do—which can lead to all kinds of misunderstandings. But it also seems true that, very broadly speaking, humans do tend to think in somewhat similar ways, and that when we make mistakes, those mistakes fall into somewhat familiar patterns. (It’s the reason psychologists have been able to identify so many common cognitive biases.) The issue with LLMs is that the way they arrive at outputs seems alien enough to how humans perform the same tasks that they can fail in ways it would be highly unlikely for a human to fail.

Batson said that thanks to the kinds of techniques that he and other scientists are developing to probe these alien LLM brains—a field known as “mechanistic interpretability”—rapid progress is being made. “I think in another year or two, we’re going to know more about how these models think than we do about how people think,” he said. “Because we can just do all the experiments we want.”

Previous techniques for trying to probe how an LLM works focused on either trying to decipher individual neurons or small clusters of neurons within the neural network, or asking layers of the neural network that sit beneath the final output layer to disgorge an output, revealing something about how the model was processing information. Other methods included “ablation”—essentially removing chunks of the neural network—and then comparing how the model performs with how it originally performed.
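
To make “ablation” a little more concrete, the sketch below shows the general idea in Python using the PyTorch library: zero out a chunk of a network’s hidden units and see how much the output changes. The toy model, the choice of which units to zero out, and the way the change is measured are all illustrative assumptions for this article, not details of Anthropic’s experiments or of any production LLM.

```python
# Minimal sketch of ablation: remove part of a network and compare outputs.
# Everything here (model, sizes, which units are ablated) is a toy assumption.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in "language model": an embedding, one hidden layer, and an output head.
model = nn.Sequential(
    nn.Embedding(100, 32),   # a toy vocabulary of 100 tokens
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 100),      # logits over the toy vocabulary
)

tokens = torch.tensor([[1, 7, 42]])  # a toy "prompt"

with torch.no_grad():
    baseline_logits = model(tokens)[:, -1, :]   # next-token logits, unmodified

# Ablation: a forward hook that zeroes out hidden units 0-15 of the ReLU layer.
ablated_units = list(range(16))

def zero_units(module, inputs, output):
    output[..., ablated_units] = 0.0
    return output

handle = model[2].register_forward_hook(zero_units)
with torch.no_grad():
    ablated_logits = model(tokens)[:, -1, :]
handle.remove()

# A small shift suggests those units mattered little for this prompt;
# a large shift suggests they were doing real work.
shift = (baseline_logits - ablated_logits).abs().mean().item()
print(f"mean change in next-token logits after ablation: {shift:.4f}")
```

The logic is the same one the paragraph above describes, just at toy scale: if knocking out a chunk of the network barely moves the output, that chunk probably played little role in the model’s answer.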

What Anthropic has done in its new research is to train an entirely different model, called a cross-layer transcoder (CLT), which works with sets of interpretable features rather than the weights of individual neurons. Examples of such features might be all conjugations of a particular verb, or any term that suggests “more than.” This lets the researchers better understand how a model works by allowing them to identify whole “circuits” of neurons that tend to be linked together.

“Our method decomposes the model, so we get pieces that are new, that aren’t like the original neurons, but there’s pieces, which means we can actually see how different parts play different roles,” Batson said. The approach also has the advantage, he said, of allowing researchers to trace the entire reasoning process through the layers of the network.
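
For readers who want a rough picture of what training such a “transcoder” can look like, here is a heavily simplified sketch: a small side model learns to reproduce what a layer of the LLM computes, but through a bottleneck of sparse features that researchers can then try to interpret. The random stand-in activations, the single layer, and the specific sparsity penalty are assumptions for illustration only; Anthropic’s cross-layer transcoder spans many layers of Claude and is far more sophisticated.

```python
# Simplified single-layer transcoder sketch (illustrative assumptions throughout):
# learn to mimic a layer's output via a sparse, hopefully interpretable bottleneck.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, n_features, n_samples = 64, 512, 4096

# Stand-ins for activations recorded from the original model:
# what flows into an MLP layer, and what that layer outputs.
mlp_in = torch.randn(n_samples, d_model)
mlp_out = torch.randn(n_samples, d_model)

encoder = nn.Linear(d_model, n_features)   # activations -> sparse features
decoder = nn.Linear(n_features, d_model)   # sparse features -> layer output
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(200):
    features = torch.relu(encoder(mlp_in))   # feature activations, kept sparse
    reconstruction = decoder(features)        # the transcoder's guess at the layer's output
    # Match the layer's behavior while encouraging most features to stay off,
    # so each active feature can be inspected and, ideally, named.
    loss = (reconstruction - mlp_out).pow(2).mean() + 1e-3 * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The features that fire for a given input are the units researchers then try
# to interpret and link together into "circuits."
active = (torch.relu(encoder(mlp_in[:1])) > 0).sum().item()
print(f"active features for one example: {active} of {n_features}")
```

The design choice that matters is the sparsity penalty: because only a handful of features fire for any given input, each one has a chance of corresponding to something a human can name, unlike the densely entangled neurons of the original network.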

Still, Anthropic said the method did have some drawbacks. It is only an approximation of what is actually happening inside a complex model like Claude. There may be neurons that exist outside the circuits the CLT method identifies that play some subtle but critical role in the formulation of some model outputs. The CLT technique also doesn’t capture a key part of how LLMs work—which is something called attention, where the model learns to put a different degree of importance on different portions of the input prompt while formulating its output. This attention shifts dynamically as the model formulates its output. The CLT can’t capture these shifts in attention, which may play a critical role in LLM “thinking.”
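
For context, the snippet below is a generic, bare-bones illustration of the attention computation the article refers to, in which each position of the input gets a set of weights over every other position. It is not Anthropic’s or Claude’s implementation, and every size and value in it is an arbitrary assumption.

```python
# Generic scaled dot-product attention sketch (not Anthropic's code; toy sizes).
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_model = 5, 16           # five input tokens, 16-dimensional states
x = torch.randn(seq_len, d_model)  # stand-in hidden states for a short prompt

# Queries, keys, and values are linear views of the same hidden states.
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Attention weights: how much each position "attends to" every other position.
scores = q @ k.T / d_model ** 0.5
weights = F.softmax(scores, dim=-1)   # each row sums to 1
output = weights @ v                  # each position's output is a weighted mix

# These weights shift with every token the model generates, which is the
# dynamic behavior the CLT-based analysis does not capture.
print(weights[-1])                    # how the last position weighs the prompt
```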

Anthropic also said that discerning the network’s circuits, even for prompts that are only “tens of words” long, takes a human expert several hours. It said it isn’t clear how the technique could be scaled up to address prompts that were much longer. 

Correction, March 27: An earlier version of this story misspelled Anthropic researcher Josh Batson’s last name.

About the Author

Jeremy Kahn, Editor, AI

Jeremy Kahn is the AI editor at Fortune, spearheading the publication's coverage of artificial intelligence. He also co-authors Eye on AI, Fortune’s flagship AI newsletter.
