A former OpenAI researcher sees a clear path to AGI this decade. Bill Gates disagrees

By Jeremy Kahn, Editor, AI

Jeremy Kahn is the AI editor at Fortune, spearheading the publication's coverage of artificial intelligence. He also co-authors Eye on AI, Fortune’s flagship AI newsletter.

Bill Gates says we're likely to get "two more turns of the crank" on building bigger large language models, but that scale alone won't deliver AGI. (Photo: Sean Gallup—Getty Images)

Hello and welcome to Eye on AI. And a happy early July Fourth to my U.S. readers.

This week, I’m going to talk about two starkly different views of AI progress. One view holds that we’re on the brink of achieving AGI—or artificial general intelligence. That’s the idea of a single AI model that can perform every cognitive task a human can, as well as or better than a person. AGI has been artificial intelligence’s Holy Grail since the field was founded in the mid-20th century. People in this “AGI is nigh” camp think the milestone will likely be reached within the next two to five years. Some of them believe that once AGI is achieved, we will then rapidly progress to artificial superintelligence, or ASI, a single AI system that’s smarter than all of humanity.

A 164-page treatise making this “AGI is nigh, and superintelligence ain’t far behind” argument was published last month by Leopold Aschenbrenner under the title “Situational Awareness.” Aschenbrenner is a former researcher on OpenAI’s Superalignment team who was fired for allegedly “leaking information,” although he says he was fired after raising concerns to OpenAI’s board about the company’s lax security and vetting practices. He’s since reemerged as the founder of a new venture capital fund focused on AGI-related investments.

It would be easy to dismiss “Situational Awareness” as simply marketing for Aschenbrenner’s fund. But let’s examine his argument on its merits. In the lengthy document, he extrapolates recent AI progress more or less linearly on a logarithmic scale—up and to the right. He argues that every year, growth in what he calls “effective compute,” a term that includes both increases in the size of AI models and innovations that squeeze more power out of a model of a given size, yields roughly a 3x increase in capability (the term Aschenbrenner actually uses for the gain is “half an order of magnitude,” or half an OOM for short). Over time, the increases compound: within two years, you’ve got a 10x increase in “effective compute”; within four years, 100x; and so on.

He adds to this what he calls “unhobbling”—a catchall term for methods that get AI software to do better on tasks at which the underlying “base” large language model does poorly. Under this “unhobbling” rubric, Aschenbrenner lumps together techniques such as training a model on human feedback so that it becomes more helpful and letting it use external tools like a calculator.

Combining the “effective compute” OOMs and the “unhobbling” OOMs, Aschenbrenner forecasts at least a five-OOM increase in AI capabilities by 2027, and quite possibly more depending on how well we do on “unhobbling.” Five OOMs is a 100,000x increase in capability, which he assumes will take us to AGI and beyond. He titled the section of his treatise where he lays this out “It’s this decade, or bust.”
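To make that arithmetic concrete, here is a minimal, purely illustrative Python sketch of how the compounding works (the half-OOM-per-year rate is the one described above; the specific split between compute and “unhobbling” OOMs in the last example is a hypothetical placeholder, not a figure from the treatise):

```python
# Purely illustrative: how "orders of magnitude" (OOMs) of capability gain compound.
# The half-OOM-per-year rate for "effective compute" comes from the argument above;
# the unhobbling_ooms value in the final call is a hypothetical placeholder.

EFFECTIVE_COMPUTE_OOMS_PER_YEAR = 0.5  # "half an OOM" is roughly a 3.16x gain per year


def capability_multiplier(years: float, unhobbling_ooms: float = 0.0) -> float:
    """Total capability multiplier after `years` of compounding 'effective compute'
    gains, plus any extra OOMs attributed to 'unhobbling'."""
    total_ooms = EFFECTIVE_COMPUTE_OOMS_PER_YEAR * years + unhobbling_ooms
    return 10 ** total_ooms


print(capability_multiplier(2))                     # ~10x (1 OOM): the two-year figure above
print(capability_multiplier(4))                     # ~100x (2 OOMs): the four-year figure
print(capability_multiplier(4, unhobbling_ooms=3))  # ~100,000x: one hypothetical path to five OOMs
```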

Which brings me to the other camp—which might be called the “or bust” camp. Among its members is Gary Marcus, the AI expert who has long been skeptical that deep learning alone will achieve AGI. (Deep learning is the kind of AI based on large, multi-layer neural networks, and it underpins essentially all of the progress in AI since at least 2010.) Marcus is particularly skeptical of LLMs, which he regards as unreliable plagiarism machines that are polluting our information ecosystem with low-quality, inaccurate content and are ill-suited to any high-stakes, real-world task. Also in this camp is deep learning pioneer and Meta chief AI scientist Yann LeCun, who still believes deep learning of some kind will get us to AGI but thinks LLMs are a dead end.

To these critics of the AGI-is-nigh camp, Aschenbrenner’s “unhobbling” is simply wishful thinking. They are convinced that the problems today’s LLMs have with reliability, accurate corroboration, truthfulness, plagiarism, and staying within guardrails are all inherent to the underlying architecture of the models. They won’t be solved with either scale or some clever methodological trick that doesn’t change the underlying architecture. In other words, LLMs can’t be unhobbled. All of the methods Aschenbrenner lumps under that rubric are just kludges that aren’t robust, reliable, or efficient.

On the fringes of this “or bust” camp is Bill Gates, who said last week that he thought current approaches to building bigger and bigger LLMs could carry on for “two more turns of the crank.” But he added that we would run out of data to feed these unfathomably large LLMs before we achieve AGI. Instead, what’s really needed, Gates said, is “metacognition,” or the ability of an AI system to reason about its own thought processes and learning.

Marcus quickly jumped on social media and his blog to trumpet his agreement with Gates’ views on metacognition. He also asked: if scaling LLMs won’t get us to AGI, why waste vast amounts of money, electricity, time, and human brainpower on “two more turns of the crank” on LLMs?

The obvious answer is that there are now billions upon billions of dollars riding on LLMs—and that investment won’t pay off if LLMs don’t work better than they do today. LLMs may not get us to AGI, but they are useful-ish for many business tasks. What those two turns of the crank are really about is erasing the “ish.” At the same time, no one actually knows how to imbue an AI system with metacognition, so it’s not as if there are clear alternatives into which to pour gobs of money.

A huge number of businesses have now committed to AI, but are befuddled by how to get current LLM-based systems to do things that produce a good return on investment. Many of the best use cases big companies talk about—better customer service, code generation, and taking notes in meetings—are nice incremental wins, but not strategic game changers in any sense.

Two turns of the crank might help close this ROI gap. I think that’s particularly true if we worry a bit less about whether we achieve AGI this decade—or even ever. You can think of AGI as a new kind of Turing Test—AI will be intelligent when it can do everything well enough that it’s impossible to tell if we’re interacting with a human or a computer. And the problem with the Turing Test is that it frames AI as a contest between people and computers. If we think about AI as a complement to human labor and intelligence, rather than as a replacement for it, then a somewhat more reliable LLM might well be worth a turn of the crank.

AI scientists remain fixated on the lofty goals of AGI and superintelligence. The rest of us just want software that works and makes our businesses and lives more productive. We want AI factotums, not human facsimiles.

With that, here’s more AI news. (And a reminder, we won’t be publishing a newsletter on July 4, so you’ll next hear from the Eye on AI crew on Tuesday, July 9.)

Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn

Before we get to the news…If you want to learn more about where AI is taking us, and how we can harness the potential of this powerful technology while avoiding its substantial risks, please check out my forthcoming book, Mastering AI: A Survival Guide to Our Superpowered Future. It’s out next week from Simon & Schuster and you can preorder your copy here. If you are in the U.K., the book will be out Aug. 1 and you can preorder here.

And if you want to gain a better understanding of how AI can transform your business and hear from some of Asia’s top business leaders about AI’s impact across industries, please join me at Fortune Brainstorm AI Singapore. The event takes place July 30-31 at the Ritz Carlton in Singapore. We’ve got Ola Electric’s CEO Bhavish Aggarwal discussing his effort to build an LLM for India, Alation CEO Satyen Sangani talking about AI’s impact on the digital transformation of Singapore’s GXS Bank, Grab CTO Sutten Thomas Pradatheth speaking on how quickly AI can be rolled out across the APAC region, Josephine Teo, Singapore’s minister for communications and information, talking about that island nation’s quest to be an AI superpower, and much, much more. You can apply to attend here. Just for Eye on AI readers, I’ve got a special code that will get you a 50% discount on the registration fee. It is BAI50JeremyK.

Correction, July 3: A news item below on Runway’s debut of its Gen 3 Alpha text-to-video AI model has been corrected. An earlier version of the item said Gen 3 Alpha was available for free. Users need to subscribe to use it. Runway charges $12 per month for its most basic subscription and more for premium versions.

AI IN THE NEWS

Amazon hires team from AI agent startup Adept. Amazon hired more than half the employees working at AI startup Adept, including its CEO and cofounder David Luan. Adept had been one of a clutch of startups trying to use LLMs to build a successful AI agent that could use business software to automate tasks such as building and analyzing spreadsheets and creating sophisticated slide decks. The deal was structured similarly to Microsoft’s non-acquisition of AI startup Inflection earlier this year, with most of the staff hired away and the Big Tech company making a large, lump-sum payment to the remaining startup to license its technology (and make whole its investors, who had pumped at least $400 million into Adept to date). Such deals have been seen as a possible way to escape antitrust scrutiny, although antitrust authorities are still probing the Microsoft-Inflection arrangement. Amazon says the Adept team will contribute to its own “AGI Autonomy” effort, which is thought to be working on agent-like AI for Amazon both to sell through its AWS cloud service and to use to power a new, more agent-like Alexa personal assistant, and perhaps other AI agents too. You can read more from CNBC here.

In Los Angeles, an AI chatbot for students falls flat. The school district had paid an AI startup called AllHere $6 million to develop an AI chatbot called Ed that could serve as a personal tutor and educational assistant for the 500,000 students in the LA public school system. But months after winning the contract, AllHere’s chief executive and founder left the company and it furloughed most of its staff, the New York Times reported. AllHere delivered a version of Ed that was tested with 14-year-olds but was then taken offline for further refinement that might not happen now that AllHere has all but ceased operating. The school district has said it still hopes to have the chatbot widely available in September. Meanwhile, it has been left with a website AllHere built that simply collates information from other ed tech apps the district uses and lacks the interactivity of a chatbot.

OpenAI’s ChatGPT caught making up links to investigative stories from media partners. That’s according to tests by journalism think tank and research group Nieman Lab, which wanted to see if OpenAI’s chatbot could correctly cite and link out to investigative stories from a number of publications that have signed deals to license their content to OpenAI. These include the Associated Press, the Wall Street Journal, the Financial Times, The Times (U.K.), Le Monde, El Pais, the Atlantic, the Verge, Vox, and Politico. Nieman Lab found the chatbot could not reliably link out to landmark stories from these publications. OpenAI told the journalism think tank that it has not yet rolled out a citation feature for ChatGPT.

Runway puts its latest text-to-video model Gen 3 into wide release. The sophisticated model, which can produce hyperrealistic and cinematic 10-second-long video clips from text prompts, is now available for anyone to use, the company said. Gen 3 is designed to compete with other text-to-video models, such as OpenAI’s Sora, which is still being tested with select users and has not been made generally available, and Kling, a model from Chinese company Kuaishou. Users need to purchase a subscription to Runway's services to use Gen 3 Alpha. Those start at $12 per month.

China is to develop 50 new AI standards by 2026. That’s according to a story from Reuters, which cites China’s industry ministry. The ministry says it will promulgate more than 50 national and industrial standards for AI deployment.

EYE ON AI RESEARCH

Missed connections. Tuhin Chakrabarty, who recently received his PhD in computer science from Columbia University, is fast developing a reputation for coming up with clever tests of large language models’ true abilities in the domain where they ought to perform best: language. In the past, he has tested LLMs’ abilities both to write short stories and to serve as copilots for human writers, and in both cases found the models lacking. Now he and a group of fellow researchers from Columbia and Barnard College are back with another paper that points to a surprising weakness in LLMs’ language skills.

They looked at whether LLMs could solve the New York Times’s Connections game, in which a player is presented with a jumbled grid of 16 words and has to sort them into four groups of four words that are related in some way. Often the groupings are tricky, turning on slang usages, alternative definitions, and homophones. The researchers then compared how the LLMs performed against the scores of both novice and expert human players. It turns out that even the best-performing LLM—which was OpenAI’s GPT-4o—could completely solve only 8% of the Connections puzzles, significantly worse than both novice and expert human players.

You can read the paper on the non-peer-reviewed research repository arxiv.org here.

FORTUNE ON AI

Can anyone beat Nvidia in AI? Analysts say it’s the wrong question —by Sharon Goldman

Hollywood tycoon Ari Emanuel blasts OpenAI’s Sam Altman after Elon Musk scared him about the future: ‘You’re the dog’ to an AI master —by Christiaan Hetzner

Exclusive: Leonardo DiCaprio-backed AI startup Qloo clinches $20 million investment from Bluestone Equity Partners —by Luisa Beltran

These are the nine AI startups that VCs wish founders would pitch them —by Allie Garfinkle

AI CALENDAR

July 15-17: Fortune Brainstorm Tech in Park City, Utah (register here)

July 21-27: International Conference on Machine Learning (ICML), Vienna, Austria

July 30-31: Fortune Brainstorm AI Singapore (register here)

Aug. 12-14: Ai4 2024 in Las Vegas

BRAIN FOOD

Should we use AI to evaluate students, employees, and managers? There’s an interesting Wall Street Journal article that looks at how teachers are increasingly embracing a range of AI tools designed to help them grade essays in subjects such as English and history. The teachers who are using these tools find they save them time. And the teachers are supposed to have the final say, looking over the AI grading software’s assessment and deciding whether they concur. But of course, that’s not necessarily how teachers will use these systems. The temptation will be to largely defer to the AI-generated grades. And that’s a problem, partly because when the Journal tested different AI grading software on the same paper—one that had received a 97% from the human teacher who originally graded it—the software returned a wide range of grades, from just 62% to 100%.

Some teachers the Wall Street Journal interviewed said they noticed that the AI grader was far tougher on students than they would tend to be. Others said the AI system seemed to call out both minor failings and major ones equally, which might shake students’ confidence. “They’re sixth-graders, I don’t want to make them cry,” one said, explaining why she was reluctant to use the AI grading software. 

Others pointed out that the AI could only judge writing in the paper itself—not assess the work in light of the student’s overall progression. It looked only at the end result and not at the process or effort a student might have made to produce that piece of work. As a result, some veteran teachers found it morally repugnant that some of their fellow educators would allow an AI system to stand in for their own professional judgment and, critically, their own empathy.

I think this last point is key. As AI becomes more capable and more ubiquitous, there will be a lot of areas where it will be tempting to deploy AI when we really shouldn’t. And one of those areas is, I think, cases in which we make high-consequence judgments about another human being. That would include legal proceedings, of course. But it would also include grades that impact a student’s further educational prospects. And it would include employee performance reviews—another area where I’ve heard some managers have begun turning to AI.

In many of these areas, proponents of AI software have argued that the software’s impartiality, the fact that it is never tired and never has a bad day, means it should replace sometimes fallible human judgment. But AI systems can be fallible and inconsistent too. What’s more, I think what we want in these cases is not actually impartiality. What we want is a fair application of human empathy. We want there to be the opportunity to appeal to the emotions of our evaluators, for them to be able to consider the whole of our circumstances, and to reflect on those circumstances based on their own life experiences. AI has no life experience and can never offer true empathy. The result may be impartial. But it may not be just.

This is the online version of Eye on AI, Fortune's weekly newsletter on how AI is shaping the future of business. Sign up for free.