‘Sentient’ chatbot story shows why it’s time for A.I. to retire the Turing Test

June 14, 2022, 6:23 PM UTC

What to make of the strange case of Blake Lemoine? The Google A.I. engineer made headlines this past week when he claimed one of the company’s chatbots had become “sentient,” which, if it were true, would be an earth-shattering achievement meaning the technology had become conscious or self-aware. Furthermore, he said that he had been suspended from his job for raising ethical concerns internally about the bot’s treatment.

Google responded that it had investigated Lemoine’s claims about the bot becoming sentient and concluded they were groundless. It also said it had suspended Lemoine with pay not because he raised ethical concerns, but because he breached confidentiality by leaking transcripts of his dialogues with the bot and other information to the press and U.S. Congressional staff, and because he had engaged in provocative actions, including trying to hire a lawyer to represent the chatbot’s interests.

The interesting thing about this incident is not Lemoine’s claims: Not only does Google say they are untrue; almost every A.I. expert agrees that Lemoine is simply deluded. The chatbot, which Google calls LaMDA, is not sentient in the way Lemoine says it is. In fact, the experts were practically unanimous in asserting that it’s impossible for a chatbot constructed the way LaMDA is to acquire these attributes.

No, the more interesting things about this case are all meta-issues: What does it say about the hype around A.I. that Lemoine’s story was able to gain such traction? Should companies such as Google, OpenAI, DeepMind, and Meta bear some responsibility for Lemoine’s misapprehension because of how they have inflated perceptions of what A.I. can do and of its near-term possibilities? What does it say about A.I. journalism that The Washington Post chose to run a long and largely credulous profile of Lemoine? I’ll leave all those aside for today and talk about another meta-issue that has more bearing on today’s business uses of A.I.: Is Lemoine the inevitable result of the field’s persistent fetishization of the Turing Test as a benchmark?

The Turing Test was first proposed by the esteemed British mathematician and computing pioneer Alan Turing in 1950. It posited that a machine could be said to be intelligent if it could succeed at a contest Turing called “the imitation game.” In the game, a human judge holds a text-based conversation, over a teletype-like connection, with two hidden interlocutors in separate rooms: one a person, the other a machine. If the judge cannot reliably determine which responses were typed by the human and which were generated by the machine, the machine could be considered intelligent.
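To make the protocol concrete, here is a minimal, non-interactive sketch of that setup. Everything in it is a hypothetical stand-in: the two responder functions and the judge are placeholders, and a real test would involve free-form dialogue rather than canned answers.

```python
import random

# Minimal sketch of the imitation-game protocol described above.
# hidden_human and hidden_machine are hypothetical stand-ins; in a real test
# one side would be a person typing and the other a chatbot under evaluation.

def hidden_human(question: str) -> str:
    return f"Honestly, I'd have to think about '{question}' for a while."

def hidden_machine(question: str) -> str:
    return "That's an interesting question. What do you think?"

def run_imitation_game(questions, judge) -> bool:
    # Randomly assign the two respondents to the anonymous labels A and B.
    labels = {"A": hidden_human, "B": hidden_machine}
    if random.random() < 0.5:
        labels = {"A": hidden_machine, "B": hidden_human}

    # The judge only ever sees labeled question/answer pairs, never identities.
    transcript = [(label, q, respond(q))
                  for q in questions
                  for label, respond in labels.items()]

    guess = judge(transcript)  # the judge names the label believed to be the machine
    truth = "A" if labels["A"] is hidden_machine else "B"
    return guess == truth      # True: machine unmasked; False: it "passed"

# A judge who does no better than chance is exactly the condition under which
# Turing said the machine should be credited with intelligence.
naive_judge = lambda transcript: random.choice(["A", "B"])
print(run_imitation_game(["Do you like poetry?", "What did you do yesterday?"], naive_judge))
```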

There are many problems with the Turing Test. Gary Marcus, the former New York University cognitive psychology professor and a critic of today’s leading approaches to artificial intelligence, highlighted many of them in a 2014 article in The New Yorker. They include the fact that humans are easily deceived and tend to anthropomorphize, which means the test is easier for a machine to pass than it ought to be. In many actual runs of the Turing Test, the human judges simply don’t try that hard to stump the machine. People too easily ascribe a mind like their own to all sorts of things, from pets to pet rocks.

In many cases, people are eager to deceive themselves into thinking the bots are real. Take Eugenia Kuyda or Joshua Barbeau, both of whom, in desperate bereavement, trained chatbots to mimic the messaging patterns of recently deceased loved ones. (Whether their conversations with these chatbots brought them comfort or simply prolonged the intensity of their grief is unclear.) Microsoft recently patented a chatbot designed to imitate the conversational style of a deceased person, a celebrity, or a fictional character. (The will to self-deception evident in Kuyda’s and Barbeau’s cases may have been a factor in Lemoine’s belief that LaMDA had become sentient too.)

More importantly, Turing’s basic idea, that a machine good enough to fool a human into thinking it was a person in a dialogue would have to possess many other characteristics of intelligence, has been proven wrong. As Marcus has noted, arguably the first software to pass the Turing Test was ELIZA, created in 1965 at MIT by Joseph Weizenbaum and designed to mimic the kinds of open-ended, non-directive questions a Rogerian psychotherapist might pose to a patient. ELIZA tricked a lot of students into believing it was a real therapist at first, although it could not sustain a very long, convincing conversation. More recently, in 2014, “Eugene Goostman,” a chatbot designed to mimic the musings of a 13-year-old boy, successfully fooled human judges in a stripped-down version of the Turing Test. Neither of those two examples, Marcus says, brought us closer to artificial general intelligence (AGI), the kind of A.I. software that can perform a variety of disparate tasks as well as or better than a human.
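The trick ELIZA relied on is easy to see in a toy sketch like the one below. This is not Weizenbaum’s actual program, which used a much richer script of ranked keywords and transformation rules; it simply shows how a handful of regular-expression rules can produce superficially therapist-like replies without any understanding at all.

```python
import re

# Toy illustration of ELIZA-style pattern matching (not Weizenbaum's code).
# Each rule pairs a regex with a reply template; the fallback is a
# content-free, non-directive prompt.
RULES = [
    (r"i need (.*)", "Why do you need {0}?"),
    (r"i am (.*)", "How long have you been {0}?"),
    (r"my (.*)", "Tell me more about your {0}."),
    (r"(.*)", "Please go on."),
]

def eliza_reply(utterance: str) -> str:
    text = utterance.lower().strip(".!?")
    for pattern, template in RULES:
        match = re.match(pattern, text)
        if match:
            return template.format(*match.groups())
    return "Please go on."

print(eliza_reply("I am feeling anxious about work"))
# -> "How long have you been feeling anxious about work?"
```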

But the Turing Test’s most troubling legacy is an ethical one: The test is fundamentally about deception. And here the test’s impact on the field has been very real and disturbing. DeepMind, when it wanted to test an updated version of its AlphaGo A.I. in late 2016 and early 2017, simply set the system loose on online Go servers under the moniker “Master,” where it won 60 games in a row against unsuspecting human professional players. When, in 2018, Google wanted to showcase how good its Duplex digital assistant had become, and how realistically it was able to mimic human speech, complete with pauses and filler words like “um” and “hmmm,” it did so by setting the system loose to interact with unsuspecting humans. It then held a big media event where it played audio recordings of Duplex making reservations at a restaurant and a hair salon, boasting about how the people on the other end of the calls had no idea they were speaking with a piece of software. DeepMind seems to have escaped criticism for its deception, but Google was roundly condemned by A.I. ethicists for its Duplex trials, and the company was forced to say that in the future Duplex would always identify itself as A.I.

The need to always inform people when they are interacting with an A.I. system is one of the key principles in a set of recommended best practices released earlier this month by three leading companies working on language-generating A.I. systems. OpenAI, Cohere, and AI21 Labs collaborated on a set of principles for the ethical deployment of ultra-large language models, the kind of A.I. system behind the LaMDA chatbot that Lemoine claimed had become sentient.

Among the other principles is the idea that companies creating large language models should publish clear conditions governing their usage and develop mechanisms to verify and enforce that customers are abiding by those terms. Another is that the companies building these A.I. systems should take active steps at all stages of development to mitigate unintentional harm, including attempting to remove biased, violent, or extremist content from training data, incorporating human feedback to guide the systems toward appropriate outputs, and carefully probing for and documenting areas where the A.I. may produce unfair or discriminatory outcomes or generate racist or toxic language.

Finally, the three companies suggest that all companies creating large language models build diverse teams, treat the human workers involved in building these A.I. systems, including those employed to label data or provide feedback on the models’ outputs, with respect, and publish lessons learned about A.I. safety and misuse.

It remains to be seen how many other companies will sign up to these recommendations and whether regulatory bodies will pick up on them. In the meantime, the industry may be well served by retiring the Turing Test once and for all.
 

Jeremy Kahn
@jeremyakahn
jeremy.kahn@fortune.com

A.I. IN THE NEWS

Lawsuits allege Meta's algorithms controlling users' feeds cause kids harm. Lawsuits alleging that exposure to Facebook and Instagram led teenagers to attempt or commit suicide and contributed to eating disorders and sleeplessness, among other harms, have been filed in federal courts in Texas, Tennessee, Colorado, Delaware, Florida, Georgia, Illinois, and Missouri, according to Bloomberg News. Meta declined to comment on the lawsuits, but a spokesperson told the news agency that the company has tools in place that let parents limit the amount of time their kids spend on its platforms.

Regulators concerned that Tesla may be obscuring how often its Autopilot tech is implicated in crashes. The National Highway Traffic Safety Administration said it would expand its probe into possible faults with Tesla's vehicles to include some 830,000 vehicles across all four current models the company sells, my Fortune colleague Christiaan Hetzner reports. The expansion also signals that the agency may be moving toward a recall of Tesla vehicles. The regulator is concerned that Tesla's Autopilot technology may be leading drivers to engage in unsafe behavior. It also said it had documented at least 16 cases in which Autopilot disengaged less than one second prior to impact, suggesting drivers were not prepared to take back control of the car and undercutting Tesla CEO Elon Musk's contention that Autopilot cannot be faulted for crashes because no data has ever shown it was in control of the vehicle at the moment of impact.

Startup Nate used contract workers in the Philippines, not A.I., to complete shoppers' payment fields. The startup, which was valued at over $300 million in venture capital funding rounds, claimed to use A.I. to populate the fields on checkout and payment screens for various online retailers. But according to a story in The Information, which cited people familiar with the company's operations, Nate's tech did not work reliably and the company often resorted to hiring low-paid contractors in the Philippines to manually fill in the information. The company said these claims were "incorrect" and "completely baseless."

Swedish game giant King acquires no-code A.I. platform Peltarion. Peltarion, based in Stockholm, was among the many companies trying to make it easier for large companies to deploy machine learning through a no-code platform. Previously, the company had worked on bespoke machine learning solutions for the likes of NASA, online grocery delivery company Ocado, and Tesla. It had raised $37 million from investors including EQT Ventures and Euclidean Capital, the family office of hedge fund billionaire James Simons. But this week, King, the Swedish gaming company behind the mobile game Candy Crush, acquired Peltarion for an undisclosed amount, according to a press release.

Israeli A.I. company AI21 Labs creates a Ruth Bader Ginsburg bot. The company, which specializes in natural language processing A.I., trained its dialogue agent, which it calls Ask Ruth Bader Ginsburg, on the late Supreme Court Justice's writings, The Washington Post says. “We wanted to pay homage to a great thinker and leader with a fun digital experience,” the company says on the A.I. app’s website. “It is important to remember that AI in general, and language models specifically, still have limitations.” 

EYE ON A.I. RESEARCH

OpenAI pilots self-critiquing A.I. systems to aid human evaluators. OpenAI has been using summaries written by a fine-tuned version of its ultra-large language model as a kind of sandbox for working on what's known as "the Alignment Problem": namely, how to get A.I. to do what people want it to do and not do what we don't want it to do. But this summarization work depends on human evaluators to provide feedback on the summaries the A.I. system produces, and for those evaluators it can be taxing work. Now OpenAI has created a system in which the same large language model that produces a summary is also trained to critique its own output, and these critiques are then given to humans to help them in their own evaluations of the same summaries. (OpenAI did this through a supervised learning process, in which the A.I. was fine-tuned on human-written critiques of summaries.) OpenAI found that the self-critiques helped the humans spot problems in the summaries that they might otherwise have missed. It also found that as the A.I. systems got larger, their ability to write useful critiques improved faster than their ability to produce better summaries. You can read more on OpenAI's blog about the research. Why does this matter? Summarization is a key skill for many real-world use cases of automatic document analysis that could be useful in settings ranging from education to finance to legal affairs and defense.
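To make the setup concrete, here is a rough sketch of that assisted-evaluation loop. The function names, prompt wording, and the `model` and `human_rater` callables are all hypothetical stand-ins, not OpenAI's actual code or API.

```python
# Hedged sketch of the assisted-evaluation loop described above, assuming
# `model` is a text-in/text-out callable and `human_rater` is a person (or a
# stand-in function) who scores a summary given the model's self-critique.

def generate_summary(model, document: str) -> str:
    return model(f"Summarize the following text:\n{document}")

def generate_critique(model, document: str, summary: str) -> str:
    # The same model, fine-tuned on human-written critiques, is asked to point
    # out flaws in its own summary (omissions, inaccuracies, and so on).
    return model(
        "Point out problems with this summary of the text.\n"
        f"Text:\n{document}\nSummary:\n{summary}"
    )

def assisted_evaluation(model, document: str, human_rater) -> dict:
    summary = generate_summary(model, document)
    critique = generate_critique(model, document, summary)
    # The human evaluator sees both the summary and the self-critique, which
    # is what OpenAI found helped raters catch flaws they might otherwise miss.
    rating = human_rater(document, summary, critique)
    return {"summary": summary, "critique": critique, "rating": rating}
```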

FORTUNE ON A.I.

Artificial intelligence may be the only way researchers can solve the perplexing puzzle of Long COVID. It’s already categorizing patients and even identifying them—by Erin Prater

A.I. experts say the Google researcher’s claim that his chatbot became ‘sentient’ is ridiculous—but also highlights big problems in the field—by Jeremy Kahn

You can now put A.I. tools in the hands of all your employees. But should you?—by Francois Candelon, Maxime Courtaux, and Gabriel Nahas

Meet the Meta executive tasked with bringing Mark Zuckerberg’s high-stakes metaverse vision to life—by Jonathan Vanian and Jeremy Kahn 

BRAIN FOOD

How would we even know if an A.I. system were sentient? Thomas Dietterich, an emeritus professor of computer science at Oregon State University and a well-respected machine learning authority, posted an interesting thread on Twitter in response to the Blake Lemoine story. In it, he said many of those discussing Lemoine's claims were conflating several definitions of "sentience," and that any system that can take in feedback from a sensor and respond to that feedback is technically "sentient," even though you would never ascribe any intelligence, let alone personhood or rights, to such a device. He said that when it came to the idea of sentience as "feeling," this was much more difficult, and perhaps impossible, to assess in a non-biological system. "I have no idea what it would mean to program a computer to feel these things or how we would assess that we had succeeded in doing so," Dietterich wrote.

Miles Brundage, who works on policy research at OpenAI, noted in a Twitter exchange that some of the discussion of A.I. systems such as LaMDA (and also OpenAI's GPT-3 and DALL-E) conflates the ideas of intelligence, creativity, and consciousness. "Creativity, intelligence, consciousness, etc are all distinct things (they may be empirically correlated in humans/non-human animals but things could look v different for AI). I consider some current AI systems somewhat creative/intelligent but think evidence for more is lacking," he wrote.
