Generative A.I. is fun. Just don’t assume it will lead to AGI

Fortune's Jeremy Kahn interviews deep learning skeptic Gary Marcus and, via video link, famed linguist Noam Chomsky during last week's Web Summit in Lisbon, Portugal.
Horacio Villalobos—Corbis/Getty Images

Last week, I was at Web Summit in Lisbon where I met a number of interesting startups, most of which were using A.I. in one way or another. There was certainly a lot of enthusiasm about generative A.I.—systems that can take a simple prompt and then use that example or instruction to create, depending on the system, images, video, music, or long passages of coherent text. Many of these A.I. systems have at their core a large language model that has been trained on vast amounts of human-written text taken from the Internet.

The impressive output of these generative networks has renewed rumblings that artificial general intelligence—the kind of A.I. that would be able to perform most economically useful tasks as well as or better than we can—may be close at hand.

But just how smart are these large language models? On the last day of the conference, I interviewed legendary linguist Noam Chomsky, now 93 years old, and Gary Marcus, an emeritus professor of cognitive science at New York University who has spent much of the past decade highlighting the limits of deep learning. Both were distinctly unimpressed with today’s cutting edge A.I.

Chomsky’s big disappointment is that these large language models don’t tell us anything at all about how the human brain works. Chomsky has devoted much of his life to advancing the theory that there is a universal grammar, or at least a set of structural concepts, that underpin all human languages, and that this grammar is somehow hard-wired into the brain. Chomsky thinks this explains why human infants can master language so easily—whereas today’s computer systems need to be fed what Chomsky rightly calls “astronomical amounts of data” and even then still don’t actually understand language at all. They merely predict the most statistically likely association of words, or in the case of text-to-image generation A.I., words and images. (In recent years, cognitive science has moved away from Chomsky’s idea of a universal grammar. But we are still grappling with what exactly it is that makes humans such efficient language learners. It is clearly something. We don’t know what, and definitely can’t replicate it in software or silicon.)

Chomsky did allow that while, in his view, large language models were mostly worthless as objects of scientific interest, they might still be useful. He compared them to snowplows and said he had no objection to people using a snowplow to clear the streets after a blizzard rather than doing so by hand. That’s an important reminder for business: software can still be very useful—and make you a lot of money—even if it doesn’t function at all like a human brain would.

Marcus, on the other hand, was even less certain of how useful large language models would prove to be. His reason is that large language models are superficially good enough that they can fool us into thinking they possess human-like capabilities—and yet they can then fail in completely unexpected ways. “We put together meanings from the order of words,” he said. “These systems don’t understand the relation between the orders of words and their underlying meanings.”

He pointed to recent work he and collaborators had done looking at DALL-E, the text-to-image generator created by OpenAI. DALL-E does not understand a key grammatical concept called compositionality. Prompt DALL-E to produce an image of a red cube atop a blue cube and it is almost as likely to produce images in which the red cube is next to the blue cube or even where the blue cube is sitting on top of a red cube. Ask DALL-E to create images of a woman chasing a man, and at least some of the images are likely to depict a man chasing a woman.
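One way to make this kind of compositionality failure concrete is with minimal prompt pairs that share exactly the same words but flip the relation between the entities. The following is a small illustrative sketch (our own construction, not the code from Marcus's paper): it generates such pairs so a human rater can check whether a text-to-image model's outputs track the relation or merely the bag of words.

```python
# Hypothetical compositionality probe in the spirit of the DALL-E
# experiments described above. The prompt pairs share identical
# vocabulary; only the compositional structure differs, so a model
# relying on word associations alone cannot tell them apart.

RELATIONS = [
    ("a red cube on top of a blue cube", "a blue cube on top of a red cube"),
    ("a woman chasing a man", "a man chasing a woman"),
]

def probe_pairs(relations):
    """Yield (prompt, swapped_prompt) pairs that share a vocabulary
    but differ only in the relation between the entities."""
    for original, swapped in relations:
        # Sanity check: both prompts contain exactly the same words.
        assert sorted(original.split()) == sorted(swapped.split())
        yield original, swapped

for prompt, swapped in probe_pairs(RELATIONS):
    print(f"{prompt!r}  vs  {swapped!r}")
```

In a real evaluation, each prompt in a pair would be sent to the image generator and the outputs scored (by humans or a classifier) for whether the depicted relation matches the prompt rather than its word-swapped twin.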

He also cited other recent research showing that most state-of-the-art computer vision models trained to describe complex scenes and videos fail at simple physical reasoning tests that cognitive scientists have shown infants can easily pass. These have to do with understanding object continuity (that occluded objects are usually still there), solidity (that most types of objects are solid and cannot pass through one another), and gravity (that dropped objects tend to fall).

Marcus said his biggest concerns were three-fold. One is that large language models have ingested a tremendous amount of human bias from the data on which they’ve been trained and will produce racist, sexist, and otherwise biased content, maybe in ways we don’t fully understand. (GPT-3, for instance, was more likely to associate the words Muslim and Islam with violence.) Another is that these language models will supercharge misinformation. “The amount of misinformation that troll farms are going to be able to produce and the cost of that is really going to be devastating to the democratic process,” he said.

Finally, he worried about opportunity cost. The billions of dollars and vast intellectual talent wasted on pure deep learning approaches, he said, were concerning. Those resources, he said, might have been better spent researching human cognition to unlock the secrets of human intelligence, so that we might hope to one day replicate those digitally. (Marcus has long been an advocate of hybrid approaches that use deep learning for perception and older, more hard-coded symbolic A.I. approaches for reasoning. He is also a critic of deep learning’s obsession with learning everything from scratch, believing that most biological systems have powerful innate capabilities, or at least architectures that predispose them to very efficient mastery of critical tasks.)

It was a cautionary note that will probably be worth remembering in the coming months, as the hype around generative A.I. looks likely to grow and as more companies rush to incorporate large language models into processes and products.

And now here’s the rest of this week’s A.I. news.

Jeremy Kahn

There’s still time to apply to attend the world’s best A.I. conference for business! Yes, Fortune’s Brainstorm A.I. conference is taking place in-person in San Francisco on December 5th and 6th. Hear from top executives and A.I. experts from Apple, Microsoft, Google, Meta, Walmart, Land O’Lakes, and more about how you can use A.I. to supercharge your company’s business. We’ll examine the opportunities and the challenges—including how to govern A.I. effectively and use it responsibly. Register here today. (And if you use the code EOAI in the additional comments field of the application, you’ll receive a special 20% discount just for Eye on A.I. readers!)


Microsoft, GitHub, and OpenAI named in class action lawsuit over Copilot. Matthew Butterick, a developer who is also a lawyer, engaged the Joseph Saveri Law Firm to file the class action suit, according to a story in Vice’s Motherboard. Butterick is claiming GitHub Copilot—which uses OpenAI’s GPT large language model as an engine—violated the open-source licenses of code that he and other developers made available through GitHub, which is owned by Microsoft. The suit is likely to be a key test case for the copyright issues involved in massive A.I. models that have been trained on material taken, often without permission, from the public Internet. The companies did not immediately comment on the lawsuit, but legal experts said they are likely to argue that the use of open-source code as training data falls within the principles of fair use.

Google debuts several consumer apps built on generative A.I. models. The company has released an AI Test Kitchen app that will allow users to begin experimenting with limited versions of several powerful generative models it recently debuted, The Verge reported. One, called Imagen, is a text-to-image generator. But the version being made available to consumers through the app will only allow users to create pictures of imaginary cities or monsters. A version of the company’s LaMDA chatbot, which famously fooled a Google researcher into believing it was sentient, is also being released through the Test Kitchen app in a restricted form.

Twitter fires its ethical A.I. team. Among those let go in the bloodbath at Twitter last week following Elon Musk’s purchase of the social media company were almost all of those working on its Machine-Learning Ethics, Transparency and Accountability (META) team, Wired reported. The company did not immediately comment on the news. The ethics team had been researching ways to use A.I. for content moderation, among other things.


Chip design firm Arm, which is based in Cambridge, England, has hired Tony Fadell to serve on its board of directors as it prepares for an expected IPO, the company said in a press release. Fadell, who is the principal of the deep tech investing and startup advisory firm Build Collective, was formerly a well-respected top executive at Apple, where he worked on the iPhone and iPod, among other products.


Google experiments with a way for large language models to self-improve. One problem with large language models is that fine-tuning them for a specific task can be tricky, especially if that task involves accurate question answering. Humans generally have to spend a lot of time thinking about what text prompts to feed a system in order to get it to generate the answers they desire. And in some cases they have had to fine-tune on hand-labeled datasets that include the ground truth.

Now researchers at Google and the University of Illinois at Urbana-Champaign have found a way to get a large language model (Google’s 540 billion parameter PaLM model) to self-improve without needing access to a ground truth dataset. The method involves asking the system a series of mathematical word problems and prompting it with what are called “chain of thought” answers for those problems. Once the system has seen several examples of these chain-of-thought answers, it can generate multiple chain-of-thought answers of its own for a new question and select the one that seems most consistent with the examples it has already seen and the other data in its training set. These new answers are then incorporated back into the model’s training set, so that the model continually self-improves.
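The loop described above can be sketched in a few lines. This is a toy illustration under our own assumptions, not the paper’s code: `sample_cot` stands in for sampling a chain-of-thought completion from the language model (here a random stub), and the “most consistent” answer is chosen by majority vote over the sampled final answers, in the style of self-consistency decoding.

```python
import random
from collections import Counter

def sample_cot(question, rng):
    # Toy stub: a real system would sample a reasoning path and a
    # final answer from the model at nonzero temperature. Here the
    # "model" is noisy but mostly lands on the same answer.
    answer = rng.choice([7, 7, 7, 5])
    return f"step-by-step reasoning for {question!r}", answer

def self_improve_step(question, num_samples=8, seed=0):
    rng = random.Random(seed)
    samples = [sample_cot(question, rng) for _ in range(num_samples)]
    # Majority vote over final answers picks the most consistent one.
    majority, _ = Counter(ans for _, ans in samples).most_common(1)[0]
    # Keep only the reasoning paths that reach the majority answer;
    # these become new fine-tuning examples for the model, with no
    # ground-truth labels required.
    new_training_data = [(question, cot, ans)
                         for cot, ans in samples if ans == majority]
    return majority, new_training_data

answer, data = self_improve_step(
    "If 3 pencils cost 21 cents, how much is one?")
print(answer, len(data))
```

The key design point is that the filtering signal comes from the model’s own agreement with itself across samples, which is what lets the method dispense with hand-labeled ground truth.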

The researchers said that by using this method they were able to achieve accuracy improvements of between 1% and 7.7% on a range of question-answering benchmarks, including state-of-the-art performance on three of them. You can read the paper on the non-peer-reviewed research repository arXiv.


Commentary: A.I. empowers employees, not just companies. Here’s how leaders can spread that message—by Francois Candelon, Shirvan Khodabandeh, and Remi Lanne

Inside Rivian’s year from hell: How the EV-truck maker stumbled despite billions in cash, Amazon’s backing, and the 6th-largest IPO in U.S. history—by Simon Willis

A.I.’s industrial age is dawning. These tech executives can help you navigate the new era—by Jacob Carpenter

Why Mark Zuckerberg is spending billions to overhaul Meta's infrastructure with A.I.—by Kylie Robison


Even tasks we think we’ve solved with A.I. may not really be solved. Everyone knows that A.I. has mastered the game of Go. That landmark achievement in computer science happened in March 2016, when AlphaGo, a system created by DeepMind, defeated Lee Sedol, at the time one of the world’s top-ranked human players of the strategy game, in a best-of-five match in South Korea. End of story. Well, not quite. As it turns out, sometimes with A.I., even tasks we think we’ve gotten the machines to master may not be quite so mastered.

In a paper published last week, a team of researchers from MIT, UC Berkeley, and the AI: Futures and Responsibility (AI FAR) project showed that a top-notch Go-playing A.I. called KataGo, trained in the same way as AlphaGo, could be consistently tricked into losing. What’s more, the trick involves an opponent making a series of opening moves that would normally be guaranteed losers, even against a very weak human opponent.

As the researchers explain: “The adversarial policy beats the KataGo victim by playing a counterintuitive strategy: Staking out a minority territory in the corner, allowing KataGo to stake the complement, and placing weak stones in KataGo’s stake. KataGo predicts a high win probability for itself and, in a way, it’s right—it would be simple to capture most of the adversary’s stones in KataGo’s stake, achieving a decisive victory. However, KataGo plays a pass move before it has finished securing its territory, allowing the adversary to pass in turn and end the game. This results in a win for the adversary under the standard ruleset for computer Go, Tromp-Taylor (Tromp, 2014), as the adversary gets points for its corner territory (devoid of victim stones) whereas the victim does not receive points for its unsecured territory because of the presence of the adversary’s stones.”

As some A.I. experts noted when the researchers touted their results on Twitter, this is actually pretty scary, because it might not just be KataGo: many other A.I. systems trained in a similar manner might have similar vulnerabilities. As the researchers themselves write: “a similar failure in safety-critical systems such as automated financial trading or autonomous vehicles could have dire consequences.”

