GPT-4, Bard, and more are here, but we’re running low on GPUs and hallucinations remain

By Jeremy Kahn, AI Editor

Jeremy Kahn is the AI editor at Fortune, spearheading the publication's coverage of artificial intelligence. He also co-authors Eye on AI, Fortune’s flagship AI newsletter.

Greg Brockman, OpenAI's co-founder and president, speaks at South by Southwest. The company's long-awaited GPT-4 A.I. model was unveiled last week.
Errich Petersen—Getty Images for SXSW

Wow, what a week—perhaps the most eventful week in A.I. (at least in terms of sheer volume of announcements) that I can remember in my seven years of writing about this topic.

– Google and Microsoft each began pushing generative A.I. capabilities into their rival office productivity software.

– Chinese search giant Baidu launched its Ernie Bot, a large language model-based chatbot that can converse in both Chinese and English, only to see its stock get hammered because the company used a pre-recorded demo in the launch presentation. (Baidu, in a defensive statement emailed to me yesterday, implied it was the victim of an unfair double standard: Microsoft and Google also used pre-recorded demos when they unveiled their search chatbots. And while Google's stock did take a hit for an error its Bard chatbot made, no one seemed upset that those demos weren't live.)

– Midjourney released the fifth generation of its text-to-image generation software, which can produce very professional-looking, photorealistic images. And Runway, one of the companies that helped create Midjourney's open-source competitor Stable Diffusion, released Gen-2, which creates very short videos from scratch based on a text prompt.

– And just as I was preparing this newsletter, Google announced it is publicly releasing Bard, its A.I.-powered chatbot with internet search capabilities. Google unveiled Bard, its answer to Microsoft's Bing chat, a few weeks ago, but it was only available to employees—now a limited number of public users in the U.S. and the U.K. will be able to try the chatbot.

But let's focus on what was by far the most widely anticipated news of the past week: OpenAI's unveiling of GPT-4, a successor to the large language model GPT-3.5 that underpins ChatGPT. The new model is also multimodal, meaning you can upload an image to it and it will describe the image. In a clever demonstration, OpenAI cofounder and president Greg Brockman drew a very rough sketch of a website homepage on a piece of paper, uploaded it to GPT-4, and asked the model to write the code needed to generate the website—and it did.

A couple of key points to note, though: There's a great deal about GPT-4 that we don't know, because OpenAI has revealed almost nothing about how large a model it is, what data it was trained on, how many specialized computer chips (known as graphics processing units, or GPUs) it took to train, or what its carbon footprint might be. OpenAI has said it is keeping all these details secret for both competitive reasons and what it says are safety concerns. (In an interview, OpenAI's chief scientist Ilya Sutskever told me it was primarily competitive concerns that had made the company decide to say so little about how it built GPT-4.)

Because we know almost nothing about how it was trained and built, there have been a number of questions about how to interpret some of the headline-grabbing performance figures for GPT-4 that OpenAI did publish. The stellar performance that GPT-4 turned in on programming problems from Codeforces' coding contests, in particular, has been called into question. Since GPT-4 was trained on so much data, some believe there's a decent chance it was trained on some of the exact same coding problems it was tested on. If that's the case, GPT-4 may simply have shown that it's good at memorizing answers rather than at actually answering never-before-seen questions. The same data "contamination" issue might apply to GPT-4's performance on other tests too. (And, as many have pointed out, just because GPT-4 can pass the bar exam with flying colors doesn't mean it is about to be able to practice law as well as a human.)
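To make the contamination worry concrete, here is a toy sketch of the kind of overlap check researchers use to look for it: flag any benchmark question whose word n-grams also show up in the training corpus. The corpus, questions, and n-gram length below are illustrative placeholders, not anything OpenAI has said it used.

```python
# Toy sketch of a data-contamination check: flag benchmark questions whose
# word n-grams also appear in the training corpus. The documents and questions
# here are placeholders; real checks run over far larger corpora with smarter matching.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_report(training_docs: list[str], benchmark_questions: list[str], n: int = 8) -> list[str]:
    corpus_grams = set()
    for doc in training_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = []
    for question in benchmark_questions:
        if ngrams(question, n) & corpus_grams:  # any shared n-gram counts as possible contamination
            flagged.append(question)
    return flagged

if __name__ == "__main__":
    docs = ["Given an array of integers, return the two numbers that add up to a target value."]
    questions = [
        "Given an array of integers, return the two numbers that add up to a target value.",
        "Write a function that reverses a linked list in place.",
    ]
    print(contamination_report(docs, questions))  # flags only the first question
```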

Another thing about GPT-4: Although we don’t know how many GPUs it takes to run, the answer is probably a heck of a lot. One indication of this is the way that OpenAI is having to throttle usage of GPT-4 through ChatGPT Plus. “GPT-4 currently has a cap of 25 messages every 3 hours. Expect significantly lower caps, as we adjust for demand,” reads the disclaimer that greets those who want to chat with GPT-4. Lack of GPU capacity may become a serious challenge to how quickly generative A.I. is adopted by businesses. The Information reported that teams within Microsoft that wanted to use GPUs for various research efforts were being told they would need special approval since the bulk of the company’s vast GPU capacity across its datacenters was now going to support new generative A.I. features in Bing and its first Office customers, as well as all of the Azure customers using OpenAI’s models. Charles Lamanna, Microsoft’s corporate vice president for business applications and low code platforms, told me that “there’s not infinite GPUs and if everybody uses it for every event, every team’s meeting, there’s probably not enough, right?” He told me Microsoft was prioritizing GPUs for areas that had the highest impact and “highest confidence of a return for our customers.” Look for discussions about limited GPU capacity holding back the implementation of generative A.I. in business to become more prevalent in the weeks and months ahead.

Most importantly, GPT-4, like all large language models, still has a hallucination problem. OpenAI says that GPT-4 is 40% less likely to make things up than its predecessor, GPT-3.5, but the problem still exists—and it might even be more dangerous in some ways: because GPT-4 hallucinates less often, humans may be more likely to be caught off guard when it does. So the other term you are going to start hearing a lot more about is "grounding"—that is, how do you make sure that the output of a large language model is rooted in specific, verified data you've fed it, and not something it has just invented or drawn from its pretraining data?

Microsoft made a big deal about how its "Copilot" system—which underpins its deployment of GPT models into its Office and Power Platform applications—goes through a number of steps to make sure the output of the large language model is grounded in the data the user is giving it. These steps apply both to the input given to the LLM and to the output it generates.
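Microsoft hasn't published the details of those steps, but a generic version of input-side grounding is easy to picture: pull the relevant snippets out of the user's own data and tell the model to answer only from them. The sketch below is my own illustration of that pattern, not Microsoft's pipeline; the naive keyword retriever and the llm() call are placeholders for whatever retrieval system and model API are actually used.

```python
# My own illustration of input-side grounding: retrieve snippets from the user's
# documents and instruct the model to answer only from them. Retrieval here is
# naive keyword overlap; llm() is a placeholder for the real model API.

def retrieve(question: str, documents: list[str], k: int = 3) -> list[str]:
    q_words = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def grounded_prompt(question: str, documents: list[str]) -> str:
    context = "\n".join(f"- {snippet}" for snippet in retrieve(question, documents))
    return (
        "Answer using ONLY the facts listed below. If the answer is not in the facts, "
        "reply exactly: I don't know.\n\n"
        f"Facts:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

# Usage, with your own documents and model call:
# answer = llm(grounded_prompt("What was Q3 revenue?", company_documents))
```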

Arijit Sengupta, the cofounder and CEO of machine learning platform Aible, reached out to me to point out that, even with a 40% improvement in accuracy, GPT-4 is still, according to the "technical report" OpenAI released, inaccurate between 20% and 25% of the time. "That means you can never use it in the enterprise," Sengupta says—at least not on its own. Aible, he says, has developed methods for ensuring that large language models can be used in situations where the output absolutely has to be grounded in accurate data. The system, which Aible calls the Business Nervous System, sounds like it functions similarly to what Microsoft has tried to do with its Copilot system.

Aible's system starts by using meta-prompts to instruct the large language model to reference only a particular dataset in producing its answer. Sengupta compares this to giving a cook a recipe for how to bake a cake. Next, it uses more standard semantic parsing and information retrieval algorithms to check that all the factual claims the large language model is making are actually found within the dataset it was supposed to reference. In cases where it cannot find support for the model's output in the dataset, it prompts the model to try again, and if it still fails—which Sengupta says happens in about 5% of cases in Aible's experience so far—it flags that output as a failure case so that a customer knows not to rely on it. He says this is much better than a situation where you know the model is wrong 25% of the time but don't know which 25%. Expect to hear a lot more about "grounding" in the weeks and months ahead too.
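Based on Sengupta's description, the output-side check amounts to a verify-and-retry loop: generate an answer constrained to the dataset, test each claim against that dataset, ask the model to try again if something can't be found, and flag the result if it still fails. The sketch below is my own rough rendering of that loop, not Aible's code; llm(), extract_claims(), and claim_in_dataset() are placeholders for the model call and the semantic parsing and retrieval steps he describes.

```python
# Rough rendering of the verify-and-retry loop described above: answer from a
# given dataset, check every claim against it, retry once, and flag the output
# as unverified if claims still can't be found. Placeholders, not Aible's code.
from dataclasses import dataclass

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def extract_claims(answer: str) -> list[str]:
    raise NotImplementedError("semantic parsing step: split the answer into factual claims")

def claim_in_dataset(claim: str, dataset: list[str]) -> bool:
    raise NotImplementedError("retrieval step: look the claim up in the reference data")

@dataclass
class GroundedAnswer:
    text: str
    verified: bool  # False means: flagged, do not rely on this output

def grounded_answer(question: str, dataset: list[str], max_retries: int = 1) -> GroundedAnswer:
    prompt = f"Using only the following data, answer the question.\nData: {dataset}\nQuestion: {question}"
    answer = ""
    for _ in range(max_retries + 1):
        answer = llm(prompt)
        unsupported = [c for c in extract_claims(answer) if not claim_in_dataset(c, dataset)]
        if not unsupported:
            return GroundedAnswer(answer, verified=True)
        # Point out which claims were not found and ask the model to try again.
        prompt += f"\nThese claims were not found in the data; try again: {unsupported}"
    return GroundedAnswer(answer, verified=False)  # the roughly-5%-of-cases failure path Sengupta mentions
```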

And with that, here's the rest of this week's news in A.I.

Jeremy Kahn
@jeremyakahn
jeremy.kahn@fortune.com

A.I. IN THE NEWS

U.K. spy agency urges caution when using ChatGPT. The warning, reported by British newspaper the Telegraph, came in the form of an advisory from the National Cyber Security Centre (which is part of the U.K.'s signals intelligence agency GCHQ) urging users not to reveal sensitive information in queries and prompts given to ChatGPT and other large language models. That information could be visible to employees at the companies that make these large language models available through APIs, such as OpenAI—and might be stored digitally by them, offering a potential attack surface for nation-state hackers looking for sensitive data. The advisory also said that cybercriminals might use LLMs to plan and execute cyberattacks, taking advantage of the ability of these A.I. models to write code and craft convincing phishing emails.

A.I. startup Adept raises $350 million venture capital round. The Series B round was led by General Catalyst and Spark Capital, Reuters reported. Founded by veterans from Google and OpenAI, the San Francisco-based Adept is building a system that understands natural language instructions and then can perform business actions using software—such as grabbing data from a spreadsheet, conducting an analysis, and then putting that analysis into a chart.

Investor Reid Hoffman uses GPT-4 to write a book. Hoffman, the LinkedIn cofounder and former PayPal executive who recently stepped down from the board of OpenAI's nonprofit foundation and whose own family foundation was an early investor in OpenAI's for-profit arm, published a book that he cowrote with GPT-4. Hoffman, who is an investor at venture capital firm Greylock, had early access to the OpenAI language model and used it to help him write a book about the impact generative A.I. is likely to have on work, education, and society. Hoffman claims, probably accurately, that the book is the first cowritten with GPT-4 and that he wrote it in part to show others what is possible when using the powerful A.I. software as an assistant.

Rolls-Royce shutters internal A.I. startup. The company closed the A.I. spin-out, called the R2 Factory, after just a year, according to the Financial Times. Rolls-Royce has been going through a major restructuring under new CEO Tufan Erginbilgic, who called the company "a burning platform," and the closure of the A.I. unit was blamed on a "tough economic environment and embryonic nature of the business." R2 Factory was supposed to help outside companies employ the same data analytics and A.I. solutions for supply chain optimization and energy efficiency that Rolls-Royce had been applying internally.

Controversial facial recognition company PimEyes may have scraped the internet for images of dead people. That's according to a lengthy story from Wired, which found evidence that the company—which charges for a service that matches a photo of someone to other photos of them on the internet, and then charges even more to keep those same photos from turning up in other people's searches on the site—had seemingly scraped photos of deceased people from online memorial pages and from sites like Ancestry.com. The publication talked to experts who expressed concern that such data could be used to identify living relatives of the deceased based on similar facial features. The scraping also apparently violated the terms and conditions of at least some sites, such as Ancestry. Giorgi Gobronidze, the company's director, told Wired that "PimEyes only crawls websites who officially allow us to do so. It was…very unpleasant news that our crawlers have somehow broken the rule." PimEyes is now blocking Ancestry's domain, and indexes related to it are being erased, he said.

EYE ON A.I. RESEARCH

Can you have your very own LLM for just $600? That's what an experiment from researchers at Stanford University suggests. They took the code and model weights for Meta's large language model LLaMA, which had leaked onto 4chan, and then used responses from OpenAI's text-davinci-003 model (close to what powered the original ChatGPT) to create instruction-following examples for LLaMA to learn from. This resulted in a new model, which the researchers named Alpaca and which they say performs similarly to text-davinci-003, but it cost them barely anything to create (just about $600, with roughly $500 going to obtain the responses from OpenAI and $100 going to the compute needed to fine-tune the model on those responses). This may be important for a bunch of reasons: It shows that the commercial companies hoping to make a mint from their LLMs may end up seeing any advantage they have eroded by open-source models (or even pirated models, as is the case with LLaMA). It also shows that hopes that bad actors—or state actors with ill intentions—could be prevented from obtaining the most powerful LLMs by restricting access are probably in vain. These models are likely to become ubiquitous. You can read more about Stanford's Alpaca model here.
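For a sense of how simple the recipe is, here is a rough sketch of Alpaca-style instruction tuning using the Hugging Face libraries: take instruction-and-response pairs generated by a stronger model and fine-tune a small base model on them. The base model, prompt format, and hyperparameters below are illustrative assumptions, not Stanford's exact setup (Alpaca fine-tuned LLaMA 7B on roughly 52,000 such examples).

```python
# Sketch of Alpaca-style instruction tuning: fine-tune a small base model on
# (instruction, response) pairs generated by a stronger model. Model name,
# prompt format, and hyperparameters are illustrative, not Stanford's setup.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE_MODEL = "gpt2"  # stand-in for a real base model such as LLaMA 7B

pairs = [  # in practice, tens of thousands of generated pairs
    {"instruction": "Explain what grounding means for LLMs.",
     "response": "Grounding ties a model's answer to specific, verified data."},
]

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

def to_features(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    tokens = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    # Causal LM objective: predict the same sequence. (A real setup would mask
    # pad tokens out of the loss; this sketch keeps things minimal.)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

dataset = Dataset.from_list(pairs).map(to_features, remove_columns=["instruction", "response"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="alpaca-sketch", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
)
trainer.train()
```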

FORTUNE ON A.I.

OpenAI GPT-4 users win followers by sharing how they’re using it—including to start businesses in ‘HustleGPT challenge’—by Steve Mollman

OpenAI CEO Sam Altman warns that other A.I. developers working on ChatGPT-like tools won’t put on safety limits—and the clock is ticking—by Steve Mollman

Cruise says it doesn’t need a greenlight from federal highway regulators to test its autonomous Origin shuttle on Texas roads—by Andrea Guzman

With GPT-4, OpenAI’s chief scientist says the company has ‘a recipe for producing magic’—by Jeremy Kahn

BRAINFOOD

Bias in large foundation models can be subtle, insidious, and hard to root out. There has been a lot of discussion of fairly obvious racial and gender bias in large foundation models. If you ask a large language model to write a story about a computer programmer, most of the time that character will be a man. If you ask a text-to-image generator to depict a typical suburban American family, that family is likely to be white. But the models are also prone to other strange biases and inaccuracies. Margaret Mitchell, a researcher and chief ethics scientist at open-source A.I. firm Hugging Face, highlighted one intriguing case this past week on Twitter. She retweeted a critique from Hiroko Yoda, who identified herself as a "certified kimono consultant in Japan," in which Yoda found numerous flaws in the design and draping of a kimono depicted in an image of a geisha generated by a text-to-image generator. To an untrained observer, the image looks fantastic. But to Yoda it was an abomination, especially because one of the garments the woman in the picture wears is folded in a way used only when burying the dead. So not only was the picture inaccurate, it was potentially deeply offensive. Subtle cultural insensitivities and biases like this could easily proliferate as more and more people and businesses turn to generative A.I. imagery.

This is the online version of Eye on A.I., a free newsletter delivered to inboxes on Tuesdays and Fridays. Sign up here.