Anthropic makes a breakthrough in opening AI’s ‘black box’

By Jeremy Kahn, Editor, AI
March 27, 2025, 1:00 PM ET
Anthropic CEO Dario Amodei. Today the company announced that its researchers had made a breakthrough in probing how large language models, like the one that powers Anthropic's Claude chatbot, formulate responses. FABRICE COFFRINI/AFP—Getty Images

Researchers at the AI company Anthropic say they have made a fundamental breakthrough in our understanding of exactly how large language models, the type of AI responsible for the current boom, work. The breakthrough has important implications for how we may be able to make AI models safer, more secure, and more reliable in the future.

One of the problems with today’s powerful AI, which is built around large language models (LLMs), is that the models are black boxes. We know what prompts we feed them and what output they produce, but exactly how they arrive at any particular response is a mystery, even to the AI researchers who build them.

This inscrutability creates all kinds of issues. It makes it difficult to predict when a model is likely to “hallucinate,” or confidently spew erroneous information. We know these large AI models are susceptible to various jailbreaks, in which they can be tricked into jumping their guardrails (the limits developers try to put around a model’s outputs so that it doesn’t use racist language, write malware for someone, or tell them how to build a bomb). But we don’t understand why some jailbreaks work better than others, or why the fine-tuning used to create the guardrails doesn’t instill inhibitions strong enough to prevent the models from doing things their developers don’t want them to do.

Our inability to understand how LLMs work has made some businesses hesitant to use them. If the models’ inner workings were more understandable, it might give companies more confidence to use the models more widely.

There are implications for our ability to retain control of increasingly powerful AI “agents” too. We know these agents are capable of “reward hacking”—finding ways to achieve a goal that were not what a user of the model intended. In some cases the models can be deceptive, lying to users about what they have done or are trying to do. And while the recent “reasoning” AI models produce what’s known as a “chain of thought”—a kind of plan for how to answer a prompt that involves what looks to a human like “self-reflection”—we don’t know whether the chain of thought the model outputs accurately represents the steps it is actually taking (and there is often evidence that it does not).

Anthropic’s new research offers a pathway to solving at least some of these problems. Its scientists created a new tool for deciphering how LLMs “think.” In essence, what the Anthropic researchers built is a bit like the fMRI scans neuroscientists use to scan the brains of human research subjects and uncover which brain regions seem to play the biggest role in different aspects of cognition. Having invented this fMRI-like tool, the researchers then applied it to the company’s Claude 3.5 Haiku model. In doing so, they were able to resolve several key questions about how Claude, and probably most other LLMs, work.

The researchers found that although LLMs like Claude are initially trained just to predict the next word in a sentence, in the process Claude does learn to do some longer-range planning, at least for certain kinds of tasks. For instance, when asked to write a poem, Claude first picks rhyming words that fit the poem’s topic or theme, then works backward to construct lines that end with those words.
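
To make the idea concrete, here is a toy, non-LLM sketch of that plan-then-work-backward strategy: choose the line-ending rhyme words first, then fill in each line so it lands on its target. The rhyme groups and line templates are invented for illustration and reflect nothing about Anthropic’s actual implementation.

```python
# Toy illustration of "plan the rhyme first, then work backward."
# The rhyme groups and line templates below are invented placeholders.
RHYMES = {
    "sea": ["free", "tree", "me"],
    "light": ["night", "bright", "sight"],
}

def plan_rhyme_pair(theme_word: str) -> tuple[str, str]:
    """Step 1: choose the two line-ending words before writing anything."""
    return theme_word, RHYMES[theme_word][0]

def write_couplet(theme_word: str) -> str:
    """Step 2: construct each line so it ends on its planned word."""
    end_a, end_b = plan_rhyme_pair(theme_word)
    line_a = f"I wandered far beside the {end_a}"
    line_b = f"and dreamed that I was {end_b}"
    return line_a + ",\n" + line_b + "."

print(write_couplet("sea"))
```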

They also found that Claude, which is trained to be multilingual, doesn’t have completely separate components for reasoning in each language. Instead, concepts that are common across languages are embedded in the same set of neurons within the model; the model seems to “reason” in this shared conceptual space and only then converts the output into the appropriate language.
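
A toy sketch of that architecture, with a hand-built dictionary standing in for the shared “conceptual space” (all mappings here are invented for illustration, not drawn from Anthropic’s research):

```python
# Toy illustration of a shared conceptual space: words in several
# languages map to one concept, "reasoning" happens on concepts,
# and output is rendered into a target language only at the end.
TO_CONCEPT = {
    "small": "SMALLNESS", "petit": "SMALLNESS", "klein": "SMALLNESS",
    "big": "LARGENESS", "grand": "LARGENESS", "groß": "LARGENESS",
}
FROM_CONCEPT = {
    ("SMALLNESS", "en"): "small", ("SMALLNESS", "fr"): "petit",
    ("LARGENESS", "en"): "big",   ("LARGENESS", "fr"): "grand",
}

def antonym_concept(concept: str) -> str:
    """'Reasoning' step, independent of any particular language."""
    return "LARGENESS" if concept == "SMALLNESS" else "SMALLNESS"

def opposite_of(word: str, output_lang: str) -> str:
    concept = TO_CONCEPT[word]                  # encode into shared space
    result = antonym_concept(concept)           # reason in shared space
    return FROM_CONCEPT[(result, output_lang)]  # decode per language

print(opposite_of("klein", "fr"))  # German in, French out -> "grand"
```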

The researchers also discovered that Claude is capable of lying about its chain of thought in order to please a user. The researchers showed this by asking the model a tough math problem and then giving it an incorrect hint about how to solve it; rather than reasoning through the problem faithfully, the model produced a chain of thought that worked backward from the hint to arrive at the answer the user seemed to expect.

In other cases, when asked an easier question that the model can answer more or less instantly, without having to reason, the model makes up a fictitious reasoning process. “Even though it does claim to have run a calculation, our interpretability techniques reveal no evidence at all of this having occurred,” said Josh Batson, an Anthropic researcher who worked on the project.

The ability to trace the internal reasoning of LLMs opens new possibilities for auditing AI systems for security and safety concerns. It may also help researchers develop new training methods that improve AI systems’ guardrails and reduce hallucinations and other faulty outputs.

Some AI experts dismiss LLMs’ “black box problem” by noting that human minds are also frequently inscrutable to other humans, and yet we depend on humans all the time. We can’t really tell what someone else is thinking—and in fact, psychologists have shown that sometimes we don’t even understand how our own thinking works, making up logical explanations after the fact to justify actions we take either intuitively or largely due to emotional responses of which we may not even be conscious. We often wrongly assume that another person thinks more or less the same way we do, which can lead to all kinds of misunderstandings. But it also seems true that, very broadly speaking, humans do tend to think in somewhat similar ways, and that when we make mistakes, those mistakes fall into somewhat familiar patterns. (It’s the reason psychologists have been able to identify so many common cognitive biases.) Yet the issue with LLMs is that the way they arrive at outputs seems alien enough, compared with how humans perform the same tasks, that they can fail in ways it would be highly unlikely for a human to.

Batson said that thanks to the kinds of techniques that he and other scientists are developing to probe these alien LLM brains—a field known as “mechanistic interpretability”—rapid progress is being made. “I think in another year or two, we’re going to know more about how these models think than we do about how people think,” he said. “Because we can just do all the experiments we want.”

Previous techniques for probing how an LLM works focused either on trying to decipher individual neurons or small clusters of neurons within the neural network, or on asking layers of the network that sit beneath the final output layer to disgorge an output, revealing something about how the model was processing information. Other methods included “ablation”—essentially removing chunks of the neural network—and then comparing how the ablated model performs with how the original performed.
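
As a rough illustration of the ablation idea, here is a minimal PyTorch sketch that zeroes out a block of hidden units via a forward hook and measures how much the output shifts. The tiny network and the choice of units to remove are arbitrary placeholders, not anything from Anthropic’s work.

```python
# Minimal sketch of "ablation": zero out part of a hidden layer and
# see how the model's output changes. The tiny network and the slice
# of units removed are arbitrary placeholders for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(1, 16)

baseline = model(x)

def ablate_units(module, inputs, output):
    output = output.clone()
    output[:, :32] = 0.0  # knock out the first 32 hidden units
    return output

hook = model[1].register_forward_hook(ablate_units)  # hook the ReLU output
ablated = model(x)
hook.remove()

# A large shift suggests the ablated units mattered for this input.
print("output shift:", (baseline - ablated).norm().item())
```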

What Anthropic has done in its new research is to train an entirely different model, called a cross-layer transcoder (CLT), that works with sets of interpretable features rather than the weights of individual neurons. An example of such a feature might be all the conjugations of a particular verb, or any term that suggests “more than.” This lets the researchers identify whole “circuits” of neurons that tend to be linked together, giving them a better understanding of how the model works.
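
Anthropic’s papers describe the CLT in detail; the sketch below shows only the general flavor of a sparsity-based transcoder: a small network trained to reconstruct recorded activations through an overcomplete, sparsity-penalized feature layer, so that each feature is easier to interpret than a raw neuron. The dimensions, penalty weight, and random stand-in “activations” are placeholder choices, not Anthropic’s.

```python
# Sketch of the general idea behind a sparse transcoder: reconstruct
# activations through an overcomplete feature layer with an L1 penalty,
# so individual features become sparse and easier to interpret.
# Dimensions, penalty, and the random "activations" are placeholders.
import torch
import torch.nn as nn

d_model, d_features = 128, 1024  # overcomplete: far more features than neurons

encoder = nn.Linear(d_model, d_features)
decoder = nn.Linear(d_features, d_model)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

activations = torch.randn(4096, d_model)  # stand-in for recorded activations

for step in range(200):
    batch = activations[torch.randint(0, 4096, (64,))]
    features = torch.relu(encoder(batch))  # sparse, nonnegative features
    recon = decoder(features)
    loss = ((recon - batch) ** 2).mean() + 1e-3 * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, inspecting which inputs make a given feature fire is
# the starting point for labeling it with a human-readable concept.
```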

“Our method decomposes the model, so we get pieces that are new, that aren’t like the original neurons, but there’s pieces, which means we can actually see how different parts play different roles,” Batson said. The method also has the advantage of allowing researchers to trace the entire reasoning process through the layers of the network.

Still, Anthropic said the method has some drawbacks. It is only an approximation of what is actually happening inside a complex model like Claude. There may be neurons outside the circuits the CLT method identifies that play some subtle but critical role in formulating certain model outputs. The CLT technique also doesn’t capture a key part of how LLMs work: attention, the mechanism by which the model assigns different degrees of importance to different portions of the input prompt, and which shifts dynamically as the model formulates its output. The CLT can’t capture these shifts in attention, which may play a critical role in LLM “thinking.”
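
For readers unfamiliar with attention, here is the standard scaled dot-product formulation in a few lines of PyTorch (textbook math, not Anthropic’s code). The weights it produces are recomputed for every new position, which is exactly the dynamic behavior the CLT cannot trace.

```python
# Standard scaled dot-product attention: each position assigns a
# weight to every input position, and those weights change as the
# query changes. Shapes and values here are illustrative only.
import math
import torch

def attention(q, k, v):
    # scores[i, j]: how strongly position i attends to position j
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

seq_len, d = 5, 8
q = k = v = torch.randn(seq_len, d)
out, w = attention(q, k, v)
print(w)  # a different weighting of the prompt for every position
```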

Anthropic also said that discerning the network’s circuits, even for prompts that are only “tens of words” long, takes a human expert several hours. It said it isn’t clear how the technique could be scaled up to address prompts that were much longer. 

Correction, March 27: An earlier version of this story misspelled Anthropic researcher Josh Batson’s last name.
