
With GPT-4, OpenAI’s chief scientist says the company has ‘a recipe for producing magic’

By Jeremy Kahn, Editor, AI
March 15, 2023, 3:54 PM ET
Jakub Porzycki/NurPhoto via Getty Images

So it’s finally here: GPT-4. This is the latest and greatest artificial intelligence system from OpenAI, and the successor to the A.I. model that powers the wildly popular ChatGPT.

OpenAI, the San Francisco A.I. lab that is now closely tied to Microsoft, says that GPT-4 is much more capable than the GPT-3.5 model underpinning the consumer version of ChatGPT. For one thing, GPT-4 is multi-modal: it can take in images as well as text, although it only outputs text. This opens up the ability of the A.I. model to “understand” photos and scenes. (Although for now this visual understanding capability is only being offered through OpenAI’s partnership with Be My Eyes, a free mobile app for the visually impaired.)
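For readers curious what “images in, text out” looks like in practice, below is a minimal sketch of how image-plus-text prompting is exposed through OpenAI’s chat API. At GPT-4’s launch, image input was limited to the Be My Eyes partnership; the model name and image URL here are placeholders, not a statement of what was available at the time.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Send one user message containing both a text part and an image part.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any vision-capable GPT-4-class model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this photo."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)

# The model's reply is plain text, even though the input included an image.
print(response.choices[0].message.content)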

The new model performs much better than GPT-3.5 on a range of benchmark tests for natural language processing and computer vision algorithms. It also performs well on a battery of diverse tests designed for humans, including an impressive score on a simulated bar exam and a five out of five on a wide range of Advanced Placement exams, from Math to Art History. (Interestingly, the system scores poorly on both the AP English Literature and AP English Composition exams, and there are already questions from machine learning experts about whether there may be less than meets the eye to GPT-4’s stellar exam performance.)

The model, according to OpenAI, is 40% more likely to return factual answers to questions—although it may still in some cases simply invent information, a phenomenon A.I. researchers call “hallucination.” It is also less likely to jump the guardrails OpenAI has given the model to try to keep it from spewing toxic or biased language, or recommending actions that might cause harm. OpenAI said GPT-4 is more likely to refuse such requests than GPT-3.5 was.

Still, GPT-4 has many of the same potential risks and flaws as other large language models. It isn’t entirely reliable. Its answers are unpredictable. It can be used to produce misinformation. It can still be pushed to jump its guardrails and give outputs that might be unsafe, either because they might be hurtful to the person reading the output or because they might encourage the person to take actions that would harm themselves or others. It can be used, for instance, to help someone find ways to make improvised chemical weapons or explosives from household products.

Because of this, OpenAI cautioned users that “Great care should be taken when using language model outputs, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matching the needs of a specific use-case.” And yet, OpenAI has released the model as a paid service to ChatGPT Plus customers and businesses purchasing services through its cloud-based application programming interface (or API).

GPT-4’s release had been widely anticipated among those who follow A.I. developments. While ChatGPT took almost everyone by surprise when OpenAI released it in late November, it had been widely known for at least a year that OpenAI was working on something called GPT-4, although there was wild speculation about exactly what it would be. In fact, after ChatGPT became an unexpected viral sensation, massively ramping up hype around A.I., Sam Altman, the CEO of OpenAI, felt it necessary to try to tamp down expectations surrounding GPT-4’s imminent release. “The GPT-4 rumor mill is a ridiculous thing. I don’t know where it all comes from,” Altman said in an interview at an event in San Francisco in January. Referring to the idea of artificial general intelligence (or AGI), the kind of machine superintelligence that has been a staple of science fiction, he said, “people are begging to be disappointed and they will be. The hype is just like… We don’t have an actual AGI and that’s sort of what’s expected of us.”

Yesterday, I talked to several of the OpenAI researchers who helped build GPT-4 about its capabilities, limitations, and how they built it. The researchers spoke in general terms about the methods they used, but there is much about GPT-4 they are keeping under wraps, including the size of the model, exactly what data was used to train it, how many specialized computer chips (graphics processing units, or GPUs) were needed to train and run it, what its carbon footprint is, and more.

OpenAI CEO Sam Altman
Jovelle Tamayo/for The Washington Post via Getty Images

OpenAI was co-founded by Elon Musk, who has said he chose the name because he wanted the new research lab to be dedicated to democratizing A.I. and being transparent, publishing all its research. Over the years, OpenAI has increasingly moved away from its founding dedication to transparency, and with little detail about GPT-4 being released, some computer scientists quipped that the lab should change its name. “I think we can call it shut on ‘Open’ AI,” tweeted Ben Schmidt, the vice president of design at a company called Nomic AI. “The 98 page paper introducing GPT-4 proudly declares that they’re disclosing *nothing* about the contents of their training set.”

Ilya Sutskever, OpenAI’s chief scientist, told Fortune the reason for this secrecy was primarily because “it is simply a competitive environment” and the company did not want commercial rivals to quickly replicate its achievement. He also said that in the future, as A.I. models became even more capable and “those capabilities could be easily very harmful,” it will be important for safety reasons to limit information about how the models were created.

At times, Sutskever spoke of GPT-4 in terms that seemed designed to sidestep serious discussion of its inner workings. He described a “recipe for producing magic” when discussing the high-level process of creating generative pre-trained transformers, or GPTs, the basic model architecture that underpins most large language models. “GPT-4 is the latest manifestation of this magic,” Sutskever said. In response to a question about how OpenAI had managed to reduce GPT-4’s tendency to hallucinate, Sutskever said, “We just teach it not to hallucinate.”

Six months of fine-tuning for safety and ease of use

Two of Sutskever’s OpenAI colleagues did provide slightly more detail on how OpenAI “just taught it not to hallucinate.” Jakub Pachocki, a member of OpenAI’s technical staff, said the model’s increased size alone, and the larger amount of data it ingested during pre-training, seemed to be part of the reason for its increased accuracy. Ryan Lowe, who co-leads OpenAI’s team that works on “alignment,” or making sure A.I. systems do what humans want them to do and don’t do things we don’t want them to do, said that OpenAI also spent about six months after pre-training GPT-4 fine-tuning the model to be both safer and easier to use. One method it used, he said, was to collect human feedback on GPT-4’s outputs and then use that feedback to push the model toward generating responses it predicted were more likely to get positive ratings from those human reviewers. This process, called “reinforcement learning from human feedback,” was part of what made ChatGPT such an engaging and useful chatbot.
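To make the mechanics a little more concrete, here is a toy numerical sketch of the reinforcement-learning-from-human-feedback loop Lowe describes: a reward model is fitted to human preference rankings, and the policy is then nudged toward responses the reward model scores highly. The candidate responses, features, and learning rates are illustrative assumptions, not OpenAI’s actual training setup.

import numpy as np

# Toy "responses" represented by two features (e.g. helpfulness, verbosity).
responses = np.array([
    [0.9, 0.2],   # concise, accurate answer
    [0.4, 0.9],   # long-winded answer
    [0.1, 0.1],   # unhelpful answer
])

# Human preference data: pairs of (preferred_index, rejected_index).
preferences = [(0, 1), (0, 2), (1, 2)]

# 1) Fit a linear reward model with a Bradley-Terry-style preference loss.
w = np.zeros(2)
for _ in range(500):
    grad = np.zeros(2)
    for good, bad in preferences:
        diff = responses[good] - responses[bad]
        p = 1 / (1 + np.exp(-(w @ diff)))   # P(good is preferred over bad)
        grad += (1 - p) * diff              # gradient of the log-likelihood
    w += 0.1 * grad

rewards = responses @ w

# 2) Policy step: shift a softmax policy over the candidates toward
#    responses with above-average predicted reward (REINFORCE-style update).
logits = np.zeros(len(responses))
for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    advantage = rewards - (probs * rewards).sum()
    logits += 0.1 * probs * advantage

print("reward model scores:", np.round(rewards, 2))
print("policy now prefers response", int(np.argmax(logits)))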

Lowe said some of the feedback used to refine GPT-4 came from ChatGPT users. That suggests how getting the chatbot into the hands of hundreds of millions of people before many competitors debuted rival systems may have created a faster-spinning “data flywheel” for OpenAI, giving the company an advantage in building future advanced A.I. software that its rivals may find hard to match.

OpenAI specifically trained GPT-4 on more examples of accurate question-answering in order to boost the model’s ability to perform that task, and reduce the chances of it hallucinating, Lowe said. He also said that OpenAI used GPT-4 itself to generate simulated conversations and other data that was then fed back into the fine-tuning of GPT-4 to help it hallucinate less. This is another example of the “data flywheel” in action.
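In rough outline, that kind of model-generated training data can look like the sketch below: a strong model answers a batch of questions, weak answers are filtered out, and the rest is saved in the chat-formatted JSONL that OpenAI’s public fine-tuning tooling accepts. The prompts, file name, and filter rule here are placeholders, not OpenAI’s actual pipeline.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

questions = [
    "What year was the U.S. National Park Service founded?",
    "Explain the difference between RAM and disk storage.",
]

records = []
for q in questions:
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer accurately. If unsure, say you do not know."},
            {"role": "user", "content": q},
        ],
    )
    answer = reply.choices[0].message.content
    # Crude quality filter (placeholder): skip empty or trivially short answers.
    if answer and len(answer.split()) > 3:
        records.append({"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": answer},
        ]})

# Write the synthetic conversations in the chat JSONL format used for
# supervised fine-tuning.
with open("synthetic_qa.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")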

Is the “magic” reliable enough for release?

Sutskever defended OpenAI’s decision to release GPT-4, despite its limitations and risks. “The model is flawed, ok, but how flawed?” he said. “There are some safety mitigations that exist on the model right now,” he said, explaining that OpenAI judged these guardrails and safety measures to be effective enough to allow the company to release the model. He also noted that OpenAI’s terms and conditions of use prohibited certain malicious uses and that the company now had monitoring procedures in place to try to check that users were not violating those terms. He said this, in combination with GPT-4’s better safety profile on key metrics like hallucinations and the ease with which it could be “jailbroken,” or made to bypass guardrails, “made us feel that it is appropriate to proceed with the GPT-4 release, as we’re doing right now.”

In a demonstration for Fortune, OpenAI researchers asked the system to summarize an article about itself, but using only words that start with the letter ‘G’—which GPT-4 was able to do relatively coherently. Sutskever said that GPT-3.5 would have flubbed the task, resorting to some words that did not start with ‘G.’ In another example, GPT-4 was presented with part of the U.S. tax code and then given a scenario about a specific couple and asked to calculate how much tax they owed, with reference to the passage of regulations it had just been given. GPT-4 seemingly came up with the right amount of tax in about a second. (Although I was not able to go back through and double-check its answer.)

Despite impressive demonstrations, some A.I. researchers and technologists say that systems like GPT-4 are still not reliable enough for many enterprise use cases, particularly when it comes to information retrieval, because of the chance of hallucination. In cases where a user is asking the system a question to which they don’t know the answer, GPT-4 is still probably not appropriate. “Even if the hallucination rate goes down, until it is infinitesimal, or at least as small as would be the case with an expert human analyst, it is probably not appropriate to use it,” said Aaron Kalb, co-founder and chief strategy officer at Alation, a software company that builds data cataloging and retrieval software.

He also said that even prompting the model to answer only from a particular set of data or only using the model to summarize information surfaced through a traditional search algorithm might not be sufficient to be certain the model wasn’t making up some part of its answer or surfacing inaccurate or outdated information that it had ingested during its pre-training.
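The grounding approach Kalb is skeptical of usually amounts to pasting retrieved documents into the prompt and instructing the model to answer only from them, as in the illustrative sketch below. The documents and question are made up, and, as he notes, the instruction narrows but does not eliminate the risk that the model falls back on what it absorbed during pre-training.

# Build a "grounded" prompt from retrieved documents; the resulting string
# would be sent as the user message to the model.
retrieved_docs = [
    "Doc 1: The 2023 employee handbook sets the travel per-diem at $75.",
    "Doc 2: International travel requires VP approval as of January 2023.",
]

question = "What is the travel per-diem?"

prompt = (
    "Answer the question using ONLY the documents below. "
    "If the answer is not in the documents, reply 'Not found in the provided documents.'\n\n"
    + "\n".join(retrieved_docs)
    + f"\n\nQuestion: {question}"
)

print(prompt)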

Kalb said whether it was appropriate to use large language models would depend on the use case and whether it was practical for a human to review the A.I.’s answers. He said that asking GPT-4 to generate marketing copy, in cases where that copy is going to be reviewed and edited by a human, was probably fine. But in situations where it wasn’t possible for a human to fact-check everything the model produced, relying on GPT-4’s answers might be dangerous.

