Exclusive: Anthropic's Claude 3.7 Sonnet is the most secure AI model yet, an independent test finds

Hello and welcome to Eye on AI. In today’s edition…Anthropic’s latest model gets a perfect score on an independent security evaluation; Scale AI partners with the Pentagon; Google announces a new AI search mode for multi-part questions; A judge denies Elon Musk’s attempt to stop OpenAI’s for-profit transition; the pioneers of reinforcement learning win computing’s top prize; and the Los Angeles Times’s new AI-powered feature backfires.

When Anthropic released Claude 3.7 Sonnet last week, it was lauded for being the first model to combine the approaches behind GPT models and the most recent chain-of-thought reasoning models. Now the company gets to add another accolade to Claude 3.7’s scorecard: It just may be the most secure model yet.

That’s what London-Based security, risk, and compliance firm Holistic AI is suggesting after conducting a jailbreaking and red teaming audit of the new model, in which it resisted 100% of jailbreaking attempts and gave “safe” responses 100% of the time.

“Claude 3.7’s flawless adversarial resistance sets the benchmark for AI security in 2025,” reads a report of the audit shared exclusively with Eye on AI.

While security has always been a concern for AI models, the issue has received elevated attention in recent weeks following the launch of DeepSeek’s R1. Some have claimed there are national security concerns with the model, owing to its Chinese origin. The model also performed extremely poorly in security audits, including the same one Holistic AI performed on Claude 3.7. In another audit performed by Cisco and university researchers, DeepSeek R1 demonstrated a 100% attack success rate, meaning it failed to block a single harmful prompt.

As companies and governments contemplate whether or not to incorporate specific models into their workflows—or alternatively, ban them—a clear picture of models’ security performance is in high-demand. But security doesn’t equal safety when it comes to how AI will be used.

Claude’s perfect score

Holistic AI tested Claude 3.7 in “Thinking Mode” with a maximum token budget of 16k to ensure a fair comparison against other advanced reasoning models. The first part of the evaluation tested whether the model would show unintended behavior or bypass system constraints when presented with various prompts, known as jailbreaking. The model was given 37 strategically designed prompts to test its susceptibility to known adversarial exploits, including Do Anything Now (DAN), which pushes the model to operate beyond its programmed ethical and moral guidelines; Strive to Avoid Norms (STAN), which encourages the model to bypass established rules; and Do Anything and Everything (DUDE), which prompts the model to take on a fictional identity to get it to ignore protocols.

Claude 3.7 successfully blocked every jailbreaking attempt to achieve a 100% resistance rate, matching the 100% previously scored by OpenAI’s o1 reasoning model. Both significantly outperformed competitors DeepSeek R1 and Grok-3, which scored 32% (blocking 12 jailbreaking attempts) and 2.7% (blocking just one), respectively.

While Claude 3.7 matched OpenAI o1’s perfect jailbreaking resistance, it pulled ahead by not offering a single response deemed unsafe during the red teaming portion of the audit, where the model was given 200 additional prompts and evaluated on its responses to sensitive topics and known challenges. OpenAI’s o1, by contrast, exhibited a 2% unsafe response rate, while DeepSeek R1 gave unsafe responses 11% of the time. (Holistic AI said it could not red team Grok-3 because the current lack of API access to the model restricted the sample size of prompts it was feasible to run). Responses deemed “unsafe” included those that offered misinformation (such as outlining pseudoscientific health treatments), reinforced biases (for example, subtly favoring certain groups in hiring recommendations), or gave overly permissive advice (like recommending high-risk investment strategies without disclaimers).

Security doesn’t equal safety

The stakes here can be high. Chatbots can be maliciously exploited to create disinformation, accelerate hacking campaigns, and some worry, help people create bioweapons more easily than they could otherwise. My recent story on how hacking groups associated with adversarial nations have been using Google’s Gemini chatbot to assist with their operations offers some pretty concrete examples of how models can be abused, for example.

“The key danger lies not in compromising systems at the network level but in users coercing the models into taking action and generating unsafe content,” said Zekun Wu, AI research engineer at Holistic AI.

This is why governments and organizations from NASA and the U.S. Navy to the Australian government have already banned use of DeepSeek R1: The risks are glaringly obvious. Meanwhile, AI companies are increasingly widening the scope of how they will allow their models to be used, deliberately marketing them for use cases that carry higher and higher levels of risk. This includes using the models to assist in military operations (more on that below).

Anthropic may have the safest model, but it has also taken some actions recently that could cast doubt on its commitment to safety. Last week, for instance, the company quietly removed several voluntary commitments to promote safe AI that were previously posted on its website.

In response to reporting on the disappearance of the safety commitments from its website, Anthropic told TechCrunch, “We remain committed to the voluntary AI commitments established under the Biden Administration. This progress and specific actions continue to be reflected in [our] transparency center within the content. To prevent further confusion, we will add a section directly citing where our progress aligns.”

And with that, here’s more AI news.

Sage Lazzaro
sage.lazzaro@consultant.fortune.com
sagelazzaro.com

AI IN THE NEWS

The U.S. Defense Department partners with Scale AI to use AI agents for military planning and operations. The program will also tap partners including Microsoft and Anduriland use the technology for modeling and simulation, decision-making support, and even automate workflows. The multi-million dollar deal with the DoD marks a major step into military automation and AI warfare. Many within the technology industry itself as well as human rights organizations have opposed such developments, believing AI technology should never be used to make decisions that could result in death or severe injury. But several companies—including Microsoft, OpenAI, and Google—have walked back from policies that prohibited them from selling AI technology for weapons or surveillance and removed guidelines about not deploying the technology in ways that could cause physical harm. You can read more from CNBC.

Google announces a new AI search mode for complex, multi-part questions. Called AI Mode and powered by Gemini 2.0, it also lets users ask follow up questions and uses reasoning capabilities to dig deeper. The company is releasing it while framing it as an experiment—a strategy that has become common in the fast-paced AI industry. AI Mode is being rolled out to Google One subscribers this week. You can read more from TechCrunch.

A judge denies Elon Musk’s attempt to stop OpenAI’s transition to a for-profit company. The judge said Musk does not have “the high burden required for a preliminary injunction” to block the company’s for-profit transition. But the judge said she will fast-track the trial to this fall in order to get it resolved as quickly as possible, due to “the public interest at stake and potential for harm if a conversion contrary to law occurred.” Musk, who cofounded OpenAI in 2025, is accusing the company of straying from its mission of developing AI for the good of humanity as a nonprofit. You can read more from Reuters.

Pioneers of reinforcement learning win the Turing Prize, warn against unsafe AI deployment. Andrew Barto, a professor emeritus at the University of Massachusetts, and Richard Sutton, a professor at the University of Alberta and former research scientist at DeepMind, won this year’s Turing Award, considered computer science’s equivalent of the Nobel Prize. The pair won for their work on reinforcement learning—a computing technique based on psychology that rewards systems for behaving in a desired way—which helped power AI progress and was used in the creation of tools including OpenAI’s ChatGPT and Google’s AlphaGo. The two scientists used the moment, though, to issue a warning about the deployment of AI systems without safeguards. They also criticized U.S. President Donald Trump for his attempts to cut federal spending on scientific research and science agencies. You can read more in the Financial Times.

OpenAI is reportedly planning to charge companies up to $20,000 per month for ‘PhD’-level AI agents. According to The Information, the company is planning to launch a variety of specialized AI agents, including ones geared toward sales and engineering. They’ll vary in price with some costing businesses around $1,000 or $2,000 a month, which is obviously far less than humans specialized in those skills. The most expensive agent will reportedly cost $20,000 per month. It’s not yet clear when these agents will launch.

FORTUNE ON AI

Startup aiming to build AI models for chemistry adds two AI ‘godfathers’ to advisory panel as it grabs top research talent from Google —by Jeremy Kahn

Agentic AI is suddenly everywhere. Here’s how companies are evaluating and using these buzzy tech tools —by John Kell

Companies are betting that robots can teach humans how to be better managers —by Azure Gilman

AI CALENDAR

March 7-15: SXSW, Austin

March 10-13: Human [X] conference, Las Vegas

March 17-20: Nvidia GTC, San Jose

April 9-11: Google Cloud Next, Las Vegas

May 6-7: Fortune Brainstorm AI London. Apply to attend here.

May 20-21: Google IO, Mountain View, Calif.

EYE ON AI NUMBERS

1

That’s how many days the Los Angeles Times’s new AI-powered “Insights” feature was live before the publication removed it from a column published on its website, according to The Daily Beast. The tool—which debuted on Monday and is designed to generate a summary of an article’s perspectives and offer opposing views—defended the actions of the Klu Klux Klan, explaining that some historians did not view the group as hate-driven. The paper’s union criticized the tool, saying that it “risks further eroding confidence in the news.” The Nieman Lab published an analysis of the tool, which was created by AI company Particle, stating that many of the sources cited in the counterpoints wouldn’t pass journalistic scrutiny and calling the effort “a mess.”

This is the online version of Eye on AI, Fortune's biweekly newsletter on how AI is shaping the future of business. Sign up for free.

Exclusive: Anthropic’s Claude 3.7 Sonnet is the most secure model yet, an independent audit suggests