• Home
  • Latest
  • Fortune 500
  • Finance
  • Tech
  • Leadership
  • Lifestyle
  • Rankings
  • Multimedia
AIOpenAI

OpenAI’s new safety tools are designed to make AI models harder to jailbreak. Instead, they may give users a false sense of security

By
Beatrice Nolan
Beatrice Nolan
Tech Reporter
Down Arrow Button Icon
By
Beatrice Nolan
Beatrice Nolan
Tech Reporter
Down Arrow Button Icon
November 5, 2025, 9:58 AM ET
OpenAI logo on a keyboard.
OpenAI last week unveiled two new open-weight tools.Samuel Boivin—NurPhoto/Getty Images

OpenAI last week unveiled two new free-to-download tools that are supposed to make it easier for businesses to construct guardrails around the prompts users feed AI models and the outputs those systems generate.

Recommended Video

The new guardrails are designed so a company can, for instance, more easily set up controls to prevent a customer service chatbot responding with a rude tone or revealing internal policies about how it should make decisions around offering refunds, for example.

But while these tools are designed to make AI models safer for business customers, some security experts caution that the way OpenAI has released them could create new vulnerabilities and give companies a false sense of security. And while OpenAI says it has released these security tools for the good of everyone, some question whether OpenAI’s motives are driven in part by a desire to blunt one advantage that its AI rival Anthropic has; it’s been gaining traction among business users in part because of a perception that its Claude models have more robust guardrails than other competitors.

The OpenAI security tools—which are called gpt-oss-safeguard-120b and gpt-oss-safeguard-20b—are themselves a type of AI model known as a classifier, which is designed to assess whether the prompt a user submits to a larger, more general-purpose AI model, as well as what that larger AI model produces, meets a set of rules. Companies that purchase and deploy AI models could, in the past, train these classifiers themselves, but the process was time-consuming and potentially expensive, since developers would have to collect examples of content that violates the policy in order to train the classifier. And then, if the company wanted to adjust the policies used for the guardrails, they would have to collect new examples of violations and retrain the classifier.

OpenAI is hoping the new tools can make that process faster and more flexible. Rather than being trained to follow one fixed rulebook, these new security classifiers can simply read a written policy and apply it to new content.

OpenAI says this method, which it calls “reasoning-based classification,” allows companies to adjust their safety policies as easily as editing the text in a document instead of rebuilding an entire classification model. The company is positioning the release as a tool for enterprises that want more control over how their AI systems handle sensitive information, such as medical records or personnel records.

However, while the tools are supposed to be safer for enterprise customers, some safety experts say that they instead may give users a false sense of security. That’s because OpenAI has open-sourced the AI classifiers. That means they have made all the code for the classifiers available for free, including the weights, or the internal settings of the AI models.

Classifiers act like extra security gates for an AI system, designed to stop unsafe or malicious prompts before they reach the main model. But by open-sourcing them, OpenAI risks sharing the blueprints to those gates. That transparency could help researchers strengthen safety mechanisms, but it might also make it easier for bad actors to find the weak spots and risks, creating a kind of false comfort.

“Making these models open-source can help attackers as well as defenders,” David Krueger, an AI safety professor at Mila, told Fortune. “It will make it easier to develop approaches to bypassing the classifiers and other similar safeguards.”

For instance, when attackers have access to the classifier’s weights, they can more easily develop what are known as “prompt injection” attacks, where they create prompts that trick the classifier into disregarding the policy it is supposed to be enforcing. Security researchers have found that in some cases even a string of characters that look nonsensical to a person can, for reasons researchers don’t entirely understand, persuade an AI model to disregard its guardrails and do something it is not supposed to, such as offer advice for making a bomb or spew racist abuse.

Representatives for OpenAI directed Fortune to the company’s blog post announcement and technical report on the models.

Short-term pain for long-term gain

Open-source can be a double-edged sword when it comes to safety. It allows researchers and developers to test, improve, and adapt AI safeguards more quickly, increasing transparency and trust. For instance, there may be ways in which security researchers could adjust the model’s weights to make it more robust against prompt injection without degrading the model’s performance.

But it can also make it easier for attackers to study and bypass those very protections—for instance, by using other machine learning software to run through hundreds of thousands of possible prompts until it finds ones that will cause the model to jump its guardrails. What’s more, security researchers have found that these kinds of automatically generated prompt injection attacks developed on open-source AI models will also sometimes work against proprietary AI models, where the attackers don’t have access to the underlying code and model weights. Researchers have speculated this is because there may be something inherent in the way all large language models encode language that enables similar prompt injections to have success against any AI model.

In this way, open-sourcing the classifiers may not just give users a false sense of security that their own system is well guarded, it may actually make every AI model less secure. But experts said that this risk was probably worth taking because open-sourcing the classifiers should also make it easier for all of the world’s security experts to find ways to make the classifiers more resistant to these kinds of attacks.

“In the long term, it’s beneficial to kind of share the way your defenses work. It may result in some kind of short-term pain. But in the long term, it results in robust defenses that are actually pretty hard to circumvent,” said Vasilios Mavroudis, principal research scientist at the Alan Turing Institute.

Mavroudis said that while open-sourcing the classifiers could, in theory, make it easier for someone to try to bypass the safety systems on OpenAI’s main models, the company likely believes this risk is low. He said that OpenAI has other safeguards in place, including having teams of human security experts continually trying to test their models’ guardrails in order to find vulnerabilities and hopefully improve them.

“Open-sourcing a classifier model gives those who want to bypass classifiers an opportunity to learn about how to do that. But determined jailbreakers are likely to be successful anyway,” said Robert Trager, codirector of the Oxford Martin AI Governance Initiative.

“We recently came across a method that bypassed all safeguards of the major developers around 95% of the time—and we weren’t looking for such a method. Given that determined jailbreakers will be successful anyway, it’s useful to open-source systems that developers can use for the less-determined folks,” he added.

The enterprise AI race

The release also has competitive implications, especially as OpenAI looks to challenge rival AI company Anthropic’s growing foothold among enterprise customers. Anthropic’s Claude family of AI models have become popular with enterprise customers partly because of their reputation for stronger safety controls compared with other AI models. Among the safety tools Anthropic uses are “constitutional classifiers” that work similarly to the ones OpenAI just open-sourced.

Anthropic has been carving out a market niche with enterprise customers, especially when it comes to coding. According to a July report from Menlo Ventures, Anthropic holds 32% of the enterprise large language model market share by usage compared with OpenAI’s 25%. In coding‑specific use cases, Anthropic reportedly holds 42%, while OpenAI has 21%. By offering enterprise-focused tools, OpenAI may be attempting to win over some of these business customers, while also positioning itself as a leader in AI safety.

Anthropic’s “constitutional classifiers” consist of small language models that check a larger model’s outputs against a written set of values or policies. By open-sourcing a similar capability, OpenAI is effectively giving developers the same kind of customizable guardrails that helped make Anthropic’s models so appealing.

“From what I’ve seen from the community, it seems to be well received,” Mavroudis said. “They see the model as potentially a way to have auto-moderation. It also comes with some good connotation, as in, ‘We’re giving to the community.’ It’s probably also a useful tool for small enterprises where they wouldn’t be able to train such a model on their own.”

Some experts also worry that open-sourcing these safety classifiers could centralize what counts as “safe” AI.

“Safety is not a well-defined concept. Any implementation of safety standards will reflect the values and priorities of the organization that creates it, as well as the limits and deficiencies of its models,” John Thickstun, an assistant professor of computer science at Cornell University, told VentureBeat. “If industry as a whole adopts standards developed by OpenAI, we risk institutionalizing one particular perspective on safety and short-circuiting broader investigations into the safety needs for AI deployments across many sectors of society.”

Join us at the Fortune Workplace Innovation Summit May 19–20, 2026, in Atlanta. The next era of workplace innovation is here—and the old playbook is being rewritten. At this exclusive, high-energy event, the world’s most innovative leaders will convene to explore how AI, humanity, and strategy converge to redefine, again, the future of work. Register now.
About the Author
By Beatrice NolanTech Reporter
Twitter icon

Beatrice Nolan is a tech reporter on Fortune’s AI team, covering artificial intelligence and emerging technologies and their impact on work, industry, and culture. She's based in Fortune's London office and holds a bachelor’s degree in English from the University of York. You can reach her securely via Signal at beatricenolan.08

See full bioRight Arrow Button Icon

Latest in AI

Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025

Most Popular

Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Rankings
  • 100 Best Companies
  • Fortune 500
  • Global 500
  • Fortune 500 Europe
  • Most Powerful Women
  • Future 50
  • World’s Most Admired Companies
  • See All Rankings
Sections
  • Finance
  • Leadership
  • Success
  • Tech
  • Asia
  • Europe
  • Environment
  • Fortune Crypto
  • Health
  • Retail
  • Lifestyle
  • Politics
  • Newsletters
  • Magazine
  • Features
  • Commentary
  • Mpw
  • CEO Initiative
  • Conferences
  • Personal Finance
  • Education
Customer Support
  • Frequently Asked Questions
  • Customer Service Portal
  • Privacy Policy
  • Terms Of Use
  • Single Issues For Purchase
  • International Print
Commercial Services
  • Advertising
  • Fortune Brand Studio
  • Fortune Analytics
  • Fortune Conferences
  • Business Development
About Us
  • About Us
  • Editorial Calendar
  • Press Center
  • Work At Fortune
  • Diversity And Inclusion
  • Terms And Conditions
  • Site Map
  • Facebook icon
  • Twitter icon
  • LinkedIn icon
  • Instagram icon
  • Pinterest icon

Most Popular

placeholder alt text
North America
Gates Foundation plans to give away $9 billion in 2026 to prepare for the 2045 closure while slashing hundreds of jobs
By Sydney LakeJanuary 23, 2026
2 days ago
placeholder alt text
Europe
Denmark offered to trade Greenland to the U.S. in 1910—and America thought it was crazy
By Steven Lamy and The ConversationJanuary 22, 2026
3 days ago
placeholder alt text
Personal Finance
Sweden abolished its wealth tax 20 years ago. Then it became a 'paradise for the super-rich'
By Miranda Sheild Johansson and The ConversationJanuary 22, 2026
3 days ago
placeholder alt text
C-Suite
Jamie Dimon’s reality check for ambitious workers: ‘There’s going to be a grunt part to every part of a job. Get over it’
By Jake AngeloJanuary 23, 2026
2 days ago
placeholder alt text
Politics
Latest deadly shooting by federal agents pushes government closer to shutdown as Trump claims Minnesota officials are 'inciting insurrection'
By Jason MaJanuary 24, 2026
18 hours ago
placeholder alt text
Success
Apple cofounder Ronald Wayne sold his 10% stake for $800 in 1976—today it’d be worth up to $400 billion
By Preston ForeJanuary 23, 2026
2 days ago

© 2026 Fortune Media IP Limited. All Rights Reserved. Use of this site constitutes acceptance of our Terms of Use and Privacy Policy | CA Notice at Collection and Privacy Notice | Do Not Sell/Share My Personal Information
FORTUNE is a trademark of Fortune Media IP Limited, registered in the U.S. and other countries. FORTUNE may receive compensation for some links to products and services on this website. Offers may be subject to change without notice.


Latest in AI

AIRecruiting
Silicon Valley talent keeps getting recycled, so this CEO uses a ‘moneyball’ approach for uncovering hidden AI geniuses in the new era
By Sydney LakeJanuary 25, 2026
5 hours ago
AIthe future of work
Meet a 70-year-old Home Depot store associate who uses AI on his phone about once an hour: ‘I think my job would suffer if I couldn’t’
By Matt O'Brien, Linley Sanders and The Associated PressJanuary 25, 2026
5 hours ago
lakehouse
AIConsulting
Inside KPMG’s Orlando Lakehouse: the $450 million Covid boondoggle that’s becoming a secret weapon for the AI revolution
By Nick LichtenbergJanuary 25, 2026
8 hours ago
Meta CEO Mark Zuckerberg in Menlo Park, California on Sept. 17, 2025. (Photo: David Paul Morris/Bloomberg/Getty Images)
AIData centers
Why Meta is positioning itself as an AI infrastructure giant—and doubling down on a costly new path
By Sharon GoldmanJanuary 24, 2026
1 day ago
IMF managing director Kristalina Georgieva speaks to reporters outside during the 2026 World Economic Forum in Davos, Switzerland.
LawEconomics
AI productivity gains are making the rich richer, and they’ll wipe out jobs—but the IMF chief sees a silver lining for low-wage workers
By Tristan BoveJanuary 24, 2026
1 day ago
Dario Amodei looking up
AIAnthropic
Anthropic’s head of Claude Code on how the tool won over non-coders—and kickstarted a new era for software engineers
By Beatrice NolanJanuary 24, 2026
1 day ago