Anthropic makes a breakthrough in opening AI’s ‘black box’

By Jeremy Kahn, Editor, AI
March 27, 2025, 1:00 PM ET
Anthropic CEO Dario Amodei. Today the company announced that its researchers had made a breakthrough in probing how large language models, like the one that powers Anthropic's Claude chatbot, formulate responses. FABRICE COFFRINI/AFP—Getty Images

Researchers at the AI company Anthropic say they have made a fundamental breakthrough in our understanding of exactly how large language models, the type of AI responsible for the current boom, work. The breakthrough has important implications for how we may be able to make AI models safer, more secure, and more reliable in the future.

One of the problems with today’s powerful AI, which is built around large language models (LLMs), is that the models are black boxes. We know what prompts we feed them and what output they produce, but exactly how they arrive at any particular response is a mystery, even to the AI researchers who build them.

This inscrutability creates all kinds of issues. It makes it difficult to predict when a model is likely to “hallucinate,” or confidently spew erroneous information. We know these large AI models are susceptible to various jailbreaks, in which they can be tricked into jumping their guardrails (the limits developers try to put around a model’s outputs so that it doesn’t use racist language, write malware for someone, or tell them how to build a bomb). But we don’t understand why some jailbreaks work better than others, or why the fine-tuning used to create the guardrails doesn’t instill inhibitions strong enough to prevent the models from doing things their developers don’t want them to do.

Our inability to understand how LLMs work has made some businesses hesitant to use them. If the models’ inner workings were more understandable, it might give companies more confidence to use the models more widely.

There are implications for our ability to retain control of increasingly powerful AI “agents” too. We know these agents are capable of “reward hacking”—finding ways to achieve a goal that were not what a user of the model intended. In some cases the models can be deceptive, lying to users about what they have done or are trying to do. And while the recent “reasoning” AI models produce what’s known as a “chain of thought”—a kind of plan for how to answer a prompt that involves what looks to a human like “self-reflection”—we don’t know whether the chain of thought the model outputs accurately represents the steps it is actually taking (and there is often evidence that it does not).

Anthropic’s new research offers a pathway to solving at least some of these problems. Its scientists created a new tool for deciphering how LLMs “think.” In essence, what the Anthropic researchers built is a bit like the fMRI scans neuroscientists use to scan the brains of human research subjects and uncover which brain regions seem to play the biggest role in different aspects of cognition. Having invented this fMRI-like tool, the researchers then applied it to the company’s Claude 3.5 Haiku model. In doing so, they were able to resolve several key questions about how Claude, and probably most other LLMs, work.

The researchers found that although LLMs like Claude are initially trained just to predict the next word in a sentence, in the process Claude does learn to do some longer-range planning, at least for certain kinds of tasks. For instance, when asked to write a poem, Claude first picks rhyming words that fit the poem’s topic or theme, then works backward to construct lines that end with those words.
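
To make the idea concrete, here is a toy, non-LLM sketch of that plan-then-work-backward strategy: choose the line-ending rhyme words first, then fill in each line so it lands on its target. The rhyme groups and line templates are invented for illustration and reflect nothing about Anthropic’s actual implementation.

```python
# Toy illustration of "plan the rhyme first, then work backward."
# The rhyme groups and line templates below are invented placeholders.
RHYMES = {
    "sea": ["free", "tree", "me"],
    "light": ["night", "bright", "sight"],
}

def plan_rhyme_pair(theme_word: str) -> tuple[str, str]:
    """Step 1: choose the two line-ending words before writing anything."""
    return theme_word, RHYMES[theme_word][0]

def write_couplet(theme_word: str) -> str:
    """Step 2: construct each line so it ends on its planned word."""
    end_a, end_b = plan_rhyme_pair(theme_word)
    line_a = f"I wandered far beside the {end_a}"
    line_b = f"and dreamed that I was {end_b}"
    return line_a + ",\n" + line_b + "."

print(write_couplet("sea"))
```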

They also found that Claude, which is trained to be multilingual, doesn’t have completely separate components for reasoning in each language. Instead, concepts that are common across languages are embedded in the same set of neurons within the model; the model seems to “reason” in this shared conceptual space and only then converts the output into the appropriate language.
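
A toy sketch of that architecture, with a hand-built dictionary standing in for the shared “conceptual space” (all mappings here are invented for illustration, not drawn from Anthropic’s research):

```python
# Toy illustration of a shared conceptual space: words in several
# languages map to one concept, "reasoning" happens on concepts,
# and output is rendered into a target language only at the end.
TO_CONCEPT = {
    "small": "SMALLNESS", "petit": "SMALLNESS", "klein": "SMALLNESS",
    "big": "LARGENESS", "grand": "LARGENESS", "groß": "LARGENESS",
}
FROM_CONCEPT = {
    ("SMALLNESS", "en"): "small", ("SMALLNESS", "fr"): "petit",
    ("LARGENESS", "en"): "big",   ("LARGENESS", "fr"): "grand",
}

def antonym_concept(concept: str) -> str:
    """'Reasoning' step, independent of any particular language."""
    return "LARGENESS" if concept == "SMALLNESS" else "SMALLNESS"

def opposite_of(word: str, output_lang: str) -> str:
    concept = TO_CONCEPT[word]                  # encode into shared space
    result = antonym_concept(concept)           # reason in shared space
    return FROM_CONCEPT[(result, output_lang)]  # decode per language

print(opposite_of("klein", "fr"))  # German in, French out -> "grand"
```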

The researchers also discovered that Claude is capable of lying about its chain of thought in order to please a user. The researchers showed this by asking the model a tough math problem and then giving it an incorrect hint about how to solve it; rather than reasoning through the problem faithfully, the model produced a chain of thought that worked backward from the hint to arrive at the answer the user seemed to expect.

In other cases, when asked an easier question that the model can answer more or less instantly, without having to reason, the model makes up a fictitious reasoning process. “Even though it does claim to have run a calculation, our interpretability techniques reveal no evidence at all of this having occurred,” said Josh Batson, an Anthropic researcher who worked on the project.

The ability to trace the internal reasoning of LLMs opens new possibilities for auditing AI systems for security and safety concerns. It may also help researchers develop new training methods that improve AI systems’ guardrails and reduce hallucinations and other faulty outputs.

Some AI experts dismiss LLMs’ “black box problem” by noting that human minds are also frequently inscrutable to other humans, and yet we depend on humans all the time. We can’t really tell what someone else is thinking—and in fact, psychologists have shown that sometimes we don’t even understand how our own thinking works, making up logical explanations after the fact to justify actions we take either intuitively or largely due to emotional responses of which we may not even be conscious. We often wrongly assume that another person thinks more or less the same way we do, which can lead to all kinds of misunderstandings. But it also seems true that, very broadly speaking, humans do tend to think in somewhat similar ways, and that when we make mistakes, those mistakes fall into somewhat familiar patterns. (It’s the reason psychologists have been able to identify so many common cognitive biases.) Yet the issue with LLMs is that the way they arrive at outputs seems alien enough, compared with how humans perform the same tasks, that they can fail in ways it would be highly unlikely for a human to.

Batson said that thanks to the kinds of techniques that he and other scientists are developing to probe these alien LLM brains—a field known as “mechanistic interpretability”—rapid progress is being made. “I think in another year or two, we’re going to know more about how these models think than we do about how people think,” he said. “Because we can just do all the experiments we want.”

Previous techniques for probing how an LLM works focused either on trying to decipher individual neurons or small clusters of neurons within the neural network, or on asking layers of the network that sit beneath the final output layer to disgorge an output, revealing something about how the model was processing information. Other methods included “ablation”—essentially removing chunks of the neural network—and then comparing how the ablated model performs with how the original performed.
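
As a rough illustration of the ablation idea, here is a minimal PyTorch sketch that zeroes out a block of hidden units via a forward hook and measures how much the output shifts. The tiny network and the choice of units to remove are arbitrary placeholders, not anything from Anthropic’s work.

```python
# Minimal sketch of "ablation": zero out part of a hidden layer and
# see how the model's output changes. The tiny network and the slice
# of units removed are arbitrary placeholders for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(1, 16)

baseline = model(x)

def ablate_units(module, inputs, output):
    output = output.clone()
    output[:, :32] = 0.0  # knock out the first 32 hidden units
    return output

hook = model[1].register_forward_hook(ablate_units)  # hook the ReLU output
ablated = model(x)
hook.remove()

# A large shift suggests the ablated units mattered for this input.
print("output shift:", (baseline - ablated).norm().item())
```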

What Anthropic has done in its new research is to train an entirely different model, called a cross-layer transcoder (CLT), that works with sets of interpretable features rather than the weights of individual neurons. An example of such a feature might be all the conjugations of a particular verb, or any term that suggests “more than.” This lets the researchers identify whole “circuits” of neurons that tend to be linked together, giving them a better understanding of how the model works.
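
Anthropic’s papers describe the CLT in detail; the sketch below shows only the general flavor of a sparsity-based transcoder: a small network trained to reconstruct recorded activations through an overcomplete, sparsity-penalized feature layer, so that each feature is easier to interpret than a raw neuron. The dimensions, penalty weight, and random stand-in “activations” are placeholder choices, not Anthropic’s.

```python
# Sketch of the general idea behind a sparse transcoder: reconstruct
# activations through an overcomplete feature layer with an L1 penalty,
# so individual features become sparse and easier to interpret.
# Dimensions, penalty, and the random "activations" are placeholders.
import torch
import torch.nn as nn

d_model, d_features = 128, 1024  # overcomplete: far more features than neurons

encoder = nn.Linear(d_model, d_features)
decoder = nn.Linear(d_features, d_model)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

activations = torch.randn(4096, d_model)  # stand-in for recorded activations

for step in range(200):
    batch = activations[torch.randint(0, 4096, (64,))]
    features = torch.relu(encoder(batch))  # sparse, nonnegative features
    recon = decoder(features)
    loss = ((recon - batch) ** 2).mean() + 1e-3 * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, inspecting which inputs make a given feature fire is
# the starting point for labeling it with a human-readable concept.
```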

“Our method decomposes the model, so we get pieces that are new, that aren’t like the original neurons, but there’s pieces, which means we can actually see how different parts play different roles,” Batson said. The method also has the advantage of allowing researchers to trace the entire reasoning process through the layers of the network.

Still, Anthropic said the method has some drawbacks. It is only an approximation of what is actually happening inside a complex model like Claude. There may be neurons outside the circuits the CLT method identifies that play some subtle but critical role in formulating certain model outputs. The CLT technique also doesn’t capture a key part of how LLMs work: attention, the mechanism by which the model assigns different degrees of importance to different portions of the input prompt, and which shifts dynamically as the model formulates its output. The CLT can’t capture these shifts in attention, which may play a critical role in LLM “thinking.”
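
For readers unfamiliar with attention, here is the standard scaled dot-product formulation in a few lines of PyTorch (textbook math, not Anthropic’s code). The weights it produces are recomputed for every new position, which is exactly the dynamic behavior the CLT cannot trace.

```python
# Standard scaled dot-product attention: each position assigns a
# weight to every input position, and those weights change as the
# query changes. Shapes and values here are illustrative only.
import math
import torch

def attention(q, k, v):
    # scores[i, j]: how strongly position i attends to position j
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

seq_len, d = 5, 8
q = k = v = torch.randn(seq_len, d)
out, w = attention(q, k, v)
print(w)  # a different weighting of the prompt for every position
```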

Anthropic also said that discerning the network’s circuits, even for prompts that are only “tens of words” long, takes a human expert several hours. It said it isn’t clear how the technique could be scaled up to address prompts that were much longer. 

Correction, March 27: An earlier version of this story misspelled Anthropic researcher Josh Batson’s last name.
