ChatGPT’s inaccuracies are causing real harm

February 28, 2023, 7:06 PM UTC
Microsoft CEO Satya Nadella. The company has had to limit conversations with its new OpenAI-powered Bing chat feature to prevent it from veering off into a disturbing persona that calls itself Sydney.
SeongJoon Cho—Bloomberg via Getty Images

City News Bureau of Chicago, a now-defunct news outfit once legendary as a training ground for tough-as-nails, shoe-leather reporters, famously had as its unofficial motto: “If your mother says she loves you, check it out.” Thanks to the advent of ChatGPT, the new Bing Search, Bard, and a host of copycat search chatbots based on large language models, we are all going to have to start living by City News’ old shibboleth.

Researchers already knew that large language models were imperfect engines for search queries, or any fact-based request really, because of their tendency to make stuff up (a phenomenon A.I. researchers call “hallucination”). But the world’s largest technology companies have decided that the upsides trump the potential downsides of inaccuracy and misinformation: dialogue is an appealing user interface, these models can perform a vast array of natural language-based tasks, from translation to summarization, and they can potentially be coupled with other software tools that let them act on a user’s behalf, whether that means running a search or booking theater tickets.

Except, of course, there can be real victims when these systems hallucinate—or even when they don’t, but merely pick up something that is factually wrong from their training data. Stack Overflow had to ban users from submitting answers to coding questions that were produced using ChatGPT after the site was flooded with code that looked plausible but was incorrect. The science fiction magazine Clarkesworld had to stop taking submissions because so many people were submitting stories crafted not by their own creative genius, but by ChatGPT. Now a German company called OpenCage—which offers an application programming interface that does geocoding, converting physical addresses into latitude and longitude coordinates that can be placed on a map—has said it has been dealing with a growing number of disappointed users who have signed up for its service because ChatGPT erroneously recommended its API as a way to look up the location of a mobile phone based solely on the number. ChatGPT even helpfully wrote Python code that let users call OpenCage’s API for this purpose.
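To see why ChatGPT’s recommendation was nonsense, it helps to look at what OpenCage’s service actually accepts. A minimal sketch of building a forward-geocoding request (using OpenCage’s publicly documented endpoint; the API key here is a placeholder) makes the point: the interface takes a free-text address, and nothing in it takes a phone number.

```python
from urllib.parse import urlencode

# Endpoint as given in OpenCage's public documentation; "YOUR_API_KEY"
# is a placeholder, not a real credential.
OPENCAGE_ENDPOINT = "https://api.opencagedata.com/geocode/v1/json"

def build_geocode_url(query: str, api_key: str) -> str:
    """Build a forward-geocoding request URL.

    The API's inputs are a free-text place name or address ("q") and an
    API key. There is no parameter that accepts a phone number, which is
    why the ChatGPT-suggested use case could never have worked.
    """
    return f"{OPENCAGE_ENDPOINT}?{urlencode({'q': query, 'key': api_key})}"

url = build_geocode_url("10 Downing Street, London", "YOUR_API_KEY")
print(url)
```

Calling that URL with a valid key returns JSON containing latitude and longitude for the address; feeding it a phone number instead of an address simply yields no meaningful result.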

But, as OpenCage was forced to explain in a blog post, this is not a service it offers, nor one that is even feasible using the company’s technology. OpenCage says that ChatGPT seems to have developed this erroneous belief because it picked up on YouTube tutorials in which people also wrongly claimed OpenCage’s API could be used for reverse mobile phone geolocation. But whereas those erroneous YouTube tutorials only convinced a few people to sign up for OpenCage’s API, ChatGPT has driven people to OpenCage in droves. “The key difference is that humans have learned to be skeptical when getting advice from other humans, for example via a video coding tutorial,” OpenCage wrote. “It seems though that we haven’t yet fully internalized this when it comes to AI in general or ChatGPT specifically.” I guess we’d better start internalizing.

Meanwhile, after a slew of alarming publicity about the dark side of its new, OpenAI-powered Bing chat feature—in which the chatbot calls itself Sydney, becomes petulant, and at times turns downright hostile and menacing—Microsoft has decided to restrict the length of conversations users can have with Bing chat. But as I, and many others, have found, while this arbitrary restriction on the length of a dialogue apparently makes the new Bing chat safer to use, it also makes it a heck of a lot less useful.

For instance, I asked Bing chat about planning a trip to Greece. I was in the process of trying to get it to detail timings and flight options for an itinerary it had suggested when I suddenly hit the message: “Oops, I think we’ve reached the end of this conversation. Click ‘New topic,’ if you would!”

The length restriction is clearly a kluge that Microsoft has been forced to implement because it didn’t do rigorous enough testing of its new product in the first place. And there are huge outstanding questions about exactly what Prometheus, the name Microsoft has given to the model that powers the new Bing, really is, and what it is really capable of (no one is claiming the new Bing is sentient or self-aware, but there’s been some very bizarre emergent behavior documented with the new Bing, even beyond the Sydney personality, and Microsoft ought to be transparent about what it understands and doesn’t understand about this behavior, rather than simply pretending it doesn’t exist). Microsoft has been cagey in public about how it and OpenAI created this model. No one outside of Microsoft is exactly sure why it is so prone to taking on the petulant Sydney persona, especially when ChatGPT, based on a smaller, less capable large language model, seems so much better behaved—and again, Microsoft is saying very little about what it does know.

(Earlier research from OpenAI had found that smaller models, trained with better-quality data, often produced results that human users much preferred, even though those models scored worse than larger ones on a number of benchmark tests. That has led some to speculate that Prometheus is OpenAI’s GPT-4, a model believed to be many times more massive than any it has previously debuted. But if that is the case, there is still a real question about why Microsoft opted to use GPT-4 rather than a smaller but better-behaved system to power the new Bing. And frankly, there is also a real question about why OpenAI might have encouraged Microsoft to use the more powerful model if it in fact realized it had more potential to behave in ways that users might find disturbing. The Microsoft folks may have, like many A.I. researchers before them, become blinded by stellar benchmark performance, which can convey bragging rights among other A.I. developers but is a poor proxy for what real human users want.)

What is certain is that if Microsoft doesn’t fix this soon—and if someone else, such as Google, which is hard at work trying to hone its search chatbot for imminent release, or any of the others, including startups such as Perplexity, that have debuted their own chatbots, shows that their chatbot can hold long dialogues without it turning into Damien—then Microsoft risks losing its first-mover advantage in the new search wars.

Also, let’s just take a moment to appreciate the irony that it’s Microsoft, a company that once prided itself, not without reason, on being among the most responsible of the big technology companies, which has now tossed us all back to the bad old “move fast and break things” days of the early social media era—with perhaps even worse consequences. (But I guess when your CEO is obsessed with making his arch-rival “dance” it is hard for the musicians in the band to argue that maybe they shouldn’t be striking up the tune just yet.) Beyond OpenCage, Clarkesworld, and Stack Overflow, people could get hurt from incorrect advice on medicines, from abusive Sydney-like behavior that drives someone to self-harm or suicide, or from reinforcement of hateful stereotypes and tropes.

I’ve said this before in this newsletter, but I’ll say it again: Given these potential harms, now is the time for governments to step in and lay down some clear regulation about how these systems need to be built and deployed. The idea of a risk-based approach, such as that broached in the original draft of the European Union’s proposed A.I. Act, is a potential starting point. But the definitions of risk and those risk assessments should not be left entirely up to the companies themselves. There need to be clear external standards and clear accountability if those standards aren’t met.

With that, here’s the rest of this week’s A.I. news.

Jeremy Kahn


Partnership on A.I. publishes framework for ethical creation of synthetic media. The advocacy group, which counts most big American tech companies, as well as a slew of universities and non-governmental groups, among its membership, released a set of best practices and a framework for companies using A.I. to create synthetic media. Transparency is at the heart of much of the framework: the document says that those encountering synthetic media should always be aware they are not seeing a real image, and that companies using synthetic media should, through the use of digital watermarks or other technology, make synthetic media very easy to detect. But, as always with PAI’s frameworks, these are just recommendations, with no way of enforcing compliance among the group’s membership and no call for action beyond self-governance.

Snap is releasing its own chatbot powered by ChatGPT. That’s according to a story in the tech publication The Verge. The “My AI” bot will be available to users of Snap’s subscription Snapchat Plus service for $3.99 a month. “The big idea is that in addition to talking to our friends and family every day, we’re going to talk to A.I. every day,” Snap CEO Evan Spiegel told the publication. “And this is something we’re well positioned to do as a messaging service.” Snap says it has trained the version of ChatGPT that powers “My AI” to adhere to Snap’s trust and safety guidelines and has also tried to make it harder for students to use the chatbot to cheat at school.

The International Baccalaureate allows students to use ChatGPT to craft essays. The degree program, which is used by many private international high schools, will allow students to use the OpenAI-developed chatbot to write essays so long as the students don’t attempt to pass the work off as their own, Matt Glanville, head of assessment principles and practice at the IB, told the Times of London. The IB said that over the long run, however, the program would reduce its reliance on take-home essays and reports in favor of in-class assignments.

Tesla pauses rollout of Full Self-Driving to new users. The company has been forced to stop rolling out its Full Self-Driving software to new drivers while it tries to fix problems that the U.S. National Highway Traffic Safety Administration said made the software unsafe and error-prone, tech publication The Register reported. Among the problems: the faulty software could cause a car to drive straight through an intersection from a turn-only lane, fail to stop fully at a stop sign, or veer into oncoming traffic.

Company behind popular Lensa app sued for violating Illinois biometric data law. Prisma Labs, the company that created the popular Lensa app, which uses open-source text-to-image generative-A.I. system Stable Diffusion to create digital avatars from people’s selfies, faces a federal class action lawsuit filed in California that alleges it violates Illinois’s strict biometric data protection law by collecting and storing users’ facial geometry without consent, Bloomberg reported. Prisma Labs did not immediately respond to requests for comment on the lawsuit and its allegations.

Legal tech startup powered by Anthropic’s A.I. lands funding from prominent European founders. The company, called Robin AI, announced a $10.5 million Series A round led by Taavet Hinrikus, a co-founder of financial technology company Wise and an early engineer at Skype, and Ian Hogarth, who cofounded concert discovery site Songkick, according to a story in the European technology publication Sifted. Robin has created software, based on Anthropic’s large language models, that can draft and edit legal contracts. Anthropic was created by a team that broke away from OpenAI in 2021 and is competing with OpenAI in the creation of large “foundation” models and generative A.I. Robin is competing with a number of legal startups that have been using OpenAI’s A.I. to create “co-pilots” for the legal profession, including CaseText and Harvey AI, which received $5 million in Series A funding, in part from OpenAI’s own startup fund.


Meta unveils an open-source large language model family in challenge to OpenAI. The social media company is making several versions of a large language model it calls LLaMA available to academics, civil society, policymakers, and the public to use in research and to build free applications, it said in a blog post. The largest of the LLaMA models is 65 billion parameters, which is about a third of the size of OpenAI’s GPT-3, but Meta says that LLaMA performs as well or better than GPT-3 on many tasks. LLaMA comes at a time when there is growing concern that university researchers and government institutions will have difficulty using the largest class of “foundation models” because they are so large that only massive technology companies can afford to train and run them. The service terms for LLaMA state that the models cannot be used for commercial products.


Elon Musk and Tesla face a fresh lawsuit alleging his self-driving tech is a fraud—by Christiaan Hetzner

Amazon driver breaks down the A.I. system watching workers for safety violations like drinking coffee while driving and counting the times they buckle their seatbelt—by Orianna Rosa Royle

A.I. firms are trying to replace voice actors, and they’re getting help from voice actors to do it—by Steve Mollman


Sam Altman has thoughts about AGI—and people have thoughts about Sam's thoughts. Altman, OpenAI’s cofounder and CEO, wrote a blog post four days ago in which he tried to outline OpenAI’s approach to artificial general intelligence, the über-powerful form of A.I. that OpenAI was founded to create. Altman’s blog generated a lot of attention, some of it laudatory, much of it critical. (Altman’s blog may in fact be one of the things that prompted Elon Musk to tweet that he’s been experiencing a lot of angst about AGI.) Emily Bender, the University of Washington computational linguist who has been on a mission to pierce much of the hype around today’s A.I., particularly large language models, has a scathing critique of Altman’s post. Bender’s take has received a lot of attention and is worth reading, even if you don’t agree with all of her criticism. I happen to agree with a lot of what Bender says about Altman’s rhetorical sleight-of-hand in positioning today’s LLM-based models, including ChatGPT, as being on the path to AGI. But I think there is a key paragraph buried deep in Altman’s blog that has not received as much attention as it should have. It is where Altman says the following:

“We think it’s important that efforts like ours submit to independent audits before releasing new systems; we will talk about this in more detail later this year. At some point, it may be important to get independent review before starting to train future systems, and for the most advanced efforts to agree to limit the rate of growth of compute used for creating new models. We think public standards about when an AGI effort should stop a training run, decide a model is safe to release, or pull a model from production use are important. Finally, we think it’s important that major world governments have insight about training runs above a certain scale.” (Bolding mine.)

This should be much bigger news. In essence, OpenAI is beginning to tip-toe into the idea of some kind of governmental entity, perhaps even an international body, licensing the training of models above a certain size. (The line about advanced efforts "agreeing to limit the rate of growth" sounds like industry-driven self-regulation, which I doubt will work. But an international body could potentially enforce such a mechanism.) There might even be a prohibition or a temporary moratorium on the development of certain kinds of models beyond a certain size. And because these ultra-massive models require huge amounts of data center infrastructure, it might actually be possible for governments to enforce these prohibitions, much as bodies like the International Atomic Energy Agency monitor and inspect nuclear facilities around the world. These large data centers are not so easy to hide. Software might exist in the ether—but hardware is a real physical thing.

This is an idea that even the critics of large language models might be able to get behind—not because they are worried about AGI, but because they think that LLMs are hugely wasteful pieces of technology that amplify existing societal biases and historical prejudices, make global inequality worse, and ruin the planet with their massive carbon footprint. If there were a national or international body regulating the training of ultra-large models, the body could potentially take action, stepping in and doing what Bender and other critics of the current wave of LLM development have long advocated—stop further development of A.I. systems based on ultra-large models.

Meanwhile, if you do worry about AGI and its potential ramifications, having a national or international body that is at least thinking about this and how to avoid a doomsday scenario is no bad thing. We have international agreements, of various kinds, regulating nuclear technology, certain advanced biological research, and the trade in certain chemicals. It is probably time to start thinking about advanced A.I. in the same way.

This is the online version of Eye on A.I., a free newsletter delivered to inboxes on Tuesdays and Fridays. Sign up here.
