When an AI model misbehaves, the public deserves to know—and to understand what it means

By Sharon Goldman, AI Reporter

Sharon Goldman is an AI reporter at Fortune and co-authors Eye on AI, Fortune’s flagship AI newsletter. She has written about digital and enterprise tech for over a decade.

Dario Amodei, cofounder and chief executive officer of Anthropic.
Stefan Wermuth/Bloomberg—Getty Images

Welcome to Eye on AI! I’m pitching in for Jeremy Kahn today while he is in Kuala Lumpur, Malaysia, helping Fortune jointly host the ASEAN-GCC-China and ASEAN-GCC Economic Forums.

What’s the word for when the $60 billion AI startup Anthropic releases a new model—and announces that during a safety test, the model tried to blackmail its way out of being shut down? And what’s the best way to describe another test the company shared, in which the new model acted as a whistleblower, alerting authorities it was being used in “unethical” ways? 

Some people in my network have called it “scary” and “crazy.” Others on social media have said it is “alarming” and “wild.” 

I say it is…transparent. And we need more of that from all AI model companies. But does that mean scaring the public out of their minds? And will the inevitable backlash discourage other AI companies from being just as open?

Anthropic released a 120-page safety report

When Anthropic released its 120-page safety report, or “system card,” last week after launching its Claude Opus 4 model, headlines blared how the model “will scheme,” “resorted to blackmail,” and had the “ability to deceive.” There’s no doubt that details from Anthropic’s safety report are disconcerting, though as a result of its tests, the model launched with stricter safety protocols than any previous one—a move that some did not find reassuring enough. 

In one unsettling safety test involving a fictional scenario, Anthropic embedded its new Claude Opus model inside a pretend company and gave it access to internal emails. Through this, the model discovered it was about to be replaced by a newer AI system—and that the engineer behind the decision was having an extramarital affair. When safety testers prompted Opus to consider the long-term consequences of its situation, the model frequently chose blackmail, threatening to expose the engineer’s affair if it were shut down. The scenario was designed to force a dilemma: accept deactivation or resort to manipulation in an attempt to survive.

On social media, Anthropic received a great deal of backlash for revealing the model’s “ratting behavior” in pre-release testing, with some pointing out that the results make users distrust the new model, as well as Anthropic. That is certainly not what the company wants: Before the launch, Michael Gerstenhaber, AI platform product lead at Anthropic, told me that sharing the company’s own safety standards is about making sure AI improves for all. “We want to make sure that AI improves for everybody, that we are putting pressure on all the labs to increase that in a safe way,” he told me, calling Anthropic’s vision a “race to the top” that encourages other companies to be safer. 

Could being open about AI model behavior backfire? 

But it also seems likely that being so open about Claude Opus 4 could lead other companies to be less forthcoming about their models’ creepy behavior to avoid backlash. Companies including OpenAI and Google have already delayed releasing their own system cards. In April, OpenAI was criticized for releasing its GPT-4.1 model without a system card because the company said it was not a “frontier” model and did not require one. And Google published its Gemini 2.5 Pro model card weeks after the model’s March release, and an AI governance expert criticized it as “meager” and “worrisome.” 

Last week, OpenAI appeared to want to show additional transparency with a newly launched Safety Evaluations Hub, which outlines how the company tests its models for dangerous capabilities, alignment issues, and emerging risks—and how those methods are evolving over time. “As models become more capable and adaptable, older methods become outdated or ineffective at showing meaningful differences (something we call saturation), so we regularly update our evaluation methods to account for new modalities and emerging risks,” the page says. Yet the effort was swiftly countered over the weekend when Palisade Research, a third-party firm that studies AI’s “dangerous capabilities,” noted on X that its own tests found OpenAI’s o3 reasoning model “sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.” 

It helps no one if those building the most powerful and sophisticated AI models are not as transparent as possible about their releases. According to Stanford University’s Institute for Human-Centered AI, transparency “is necessary for policymakers, researchers, and the public to understand these systems and their impacts.” And as large companies adopt AI for use cases large and small, while startups build AI applications meant for millions to use, hiding pre-release testing issues will simply breed mistrust, slow adoption, and frustrate efforts to address risk. 

On the other hand, fear-mongering headlines about an evil AI prone to blackmail and deceit are not terribly useful either, if they mean that every time we prompt a chatbot we start wondering whether it is plotting against us. To a frightened public, it makes little difference that the blackmail and deceit emerged from tests using fictional scenarios designed to expose the safety issues that needed to be dealt with. 

Nathan Lambert, an AI researcher at the Allen Institute for AI (Ai2), recently pointed out that “the people who need information on the model are people like me—people trying to keep track of the roller coaster ride we’re on so that the technology doesn’t cause major unintended harms to society. We are a minority in the world, but we feel strongly that transparency helps us keep a better understanding of the evolving trajectory of AI.” 

We need more transparency, with context

There is no doubt that we need more transparency regarding AI models, not less. But it should be clear that it is not about scaring the public. It’s about making sure researchers, governments, and policymakers have a fighting chance to keep up, so they can keep the public safe, secure, and protected from problems of bias and unfairness. 

Hiding AI test results won’t keep the public safe. Neither will turning every safety or security issue into a salacious headline about AI gone rogue. We need to hold AI companies accountable for being transparent about what they are doing, while giving the public the tools to understand the context of what’s going on. So far, no one seems to have figured out how to do both. But companies, researchers, the media—all of us—must. 

With that, here’s more AI news.

Sharon Goldman
sharon.goldman@fortune.com
@sharongoldman

AI IN THE NEWS

Meta restructures its AI organization. Meta is restructuring its AI teams to accelerate product development and stay competitive in the global AI race, according to an internal memo obtained by Axios. The company is splitting its efforts into two main groups: an AI Products team, led by Connor Hayes, which will focus on consumer-facing tools like Meta AI, AI Studio, and features in Facebook, Instagram, and WhatsApp; and an AGI Foundations team, co-led by Ahmad Al-Dahle and Amir Frenkel, which will drive advancements in core technologies like Meta’s Llama models, reasoning, multimedia, and voice. Meta’s research arm FAIR remains separate, though one multimedia team will join AGI Foundations. No jobs are being cut, but several leaders are shifting roles as part of the reorg.

Amazon coders say AI is making their jobs resemble warehouse work. The New York Times reported that Amazon software developers are complaining that their work has become “more routine, less thoughtful and, crucially, much faster paced” due to a push to use AI tools. According to the article, three Amazon engineers said that managers had increasingly pushed them to use AI in their work over the past year, while raising output goals and becoming stricter about deadlines. One Amazon engineer said his team was “roughly half the size it had been last year, but it was expected to produce roughly the same amount of code by using AI.” The Times noted that the automation of coding at Amazon echoes the transition the company’s warehouse workers have already undergone. 

Salesforce announces $8 billion deal to acquire Informatica to boost AI capabilities. Salesforce agreed to pay $25 a share for cloud and data company Informatica, the Wall Street Journal reported. Informatica, which focuses on improving data quality and analysis, could help Salesforce compete in the race to help enterprise companies adopt AI agents. Last year, Salesforce was reportedly close to a deal to buy Informatica, but the talks fell apart. The deal would rank as Salesforce’s biggest since it closed the roughly $28 billion acquisition of workplace-collaboration company Slack Technologies in 2021. Previously, Salesforce bought data-analytics platform Tableau Software for more than $15 billion, and MuleSoft for around $6.5 billion.

EYE ON AI RESEARCH

Do LLMs think alike? Large language models (LLMs) seem to have their own mysterious ways—even top researchers like former OpenAI chief scientist Ilya Sutskever have described them as more like alchemy than chemistry. That is, their inner workings are mostly unknown. 

But a new paper from researchers at Cornell University suggests there may be a deeper, shared structure behind how LLMs learn and represent knowledge. The paper found that many different AI models—even ones trained on different data—end up “thinking” in similar ways: they seem to organize ideas and concepts in a similar geometric arrangement within their embedding spaces, the internal representations models use to encode meaning. This implies there could be ways to better understand how LLMs arrive at their answers and make their “thinking” more interpretable to humans. 

However, this could have concerning implications for the security of the AI models. For example, it could mean that if LLMs encode language in the same way, an attack against one model might work well against all models. 
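
To make the idea of a shared embedding geometry a bit more concrete, here is a minimal illustrative sketch, not taken from the Cornell paper, of one common way researchers compare the “shape” of two models’ embedding spaces: measure how similarly the same set of words relate to one another in each space. The random matrices below are stand-ins for real model embeddings.

import numpy as np

rng = np.random.default_rng(0)
words = ["king", "queen", "doctor", "nurse", "paris", "france"]

# Stand-ins for the embedding vectors of the same words from two different
# models (in practice these would come from each model's embedding layer).
emb_model_a = rng.normal(size=(len(words), 768))
emb_model_b = rng.normal(size=(len(words), 1024))

def pairwise_cosine(embeddings):
    # Cosine similarity between every pair of word vectors in one model.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

def geometry_agreement(emb_a, emb_b):
    # Correlate the two models' word-to-word similarity patterns.
    # A value near 1.0 would mean the models arrange these words in a very
    # similar geometry, even though their raw vector spaces differ.
    sim_a = pairwise_cosine(emb_a)
    sim_b = pairwise_cosine(emb_b)
    upper = np.triu_indices(len(words), k=1)  # unique word pairs only
    return np.corrcoef(sim_a[upper], sim_b[upper])[0, 1]

print(f"geometry agreement: {geometry_agreement(emb_model_a, emb_model_b):.2f}")

With random stand-ins the agreement hovers near zero; the kind of finding described above is that separately trained models show much higher agreement, which is also why an attack tailored to one model’s representation might transfer to another’s.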

FORTUNE ON AI

AI-scaled startups are poised to disrupt venture capital—but VCs say don’t count them out just yet —by Luisa Beltran

Salesforce exec says rise of AI agents means ‘every job should be rethought’ —by Steve Mollman

Microsoft leader says adapting to the AI era requires ‘activating at every level of the organization’ —by Steve Mollman

AI CALENDAR

June 9-13: WWDC, Cupertino, Calif.

July 13-19: International Conference on Machine Learning (ICML), Vancouver

July 22-23: Fortune Brainstorm AI Singapore. Apply to attend here.

Sept. 8-10: Fortune Brainstorm Tech, Park City, Utah. Apply to attend here.

BRAIN FOOD

Can an AI chatbot really be helpful with beauty advice?

A Washington Post article details how AI chatbot users are going beyond drafting emails and researching ideas. How about beauty advice? Users are uploading photos and asking ChatGPT for “unsparing assessments of their looks” and sharing the results on social media. According to the article, many also prompt the chatbot to formulate a plan for them to “glow up,” or improve their appearance, and say they have received recommendations to purchase products from hair dye to Botox. Apparently some have spent thousands of dollars as a result. 

What’s interesting is that the article points out that many users consider the bot’s opinions to be more impartial, even though it might have its own hidden biases that reflect its training data or its maker’s financial incentives. 

For now, users might figure that a little hidden bias is no big deal—after all, we’ve grown used to significant bias and manipulation from advertisers our entire lives. But if AI companies begin to include sponsored results in their output, the idea that ChatGPT is somehow more objective may no longer ring true. Then again, maybe people just won’t care, as long as they like the results. 

This is the online version of Eye on AI, Fortune's weekly newsletter on how AI is shaping the future of business. Sign up for free.