It Might Get Loud: Inside Silicon Valley’s Battle to Own Voice Tech

Amazon, Apple, and Google are investing billions to make voice recognition the main way we communicate with the Internet. It will be the biggest technology shift since Steve Jobs launched the iPhone.

FOUR SHORT YEARS AGO, Amazon was merely a ferociously successful online retailer and the dominant provider of online web hosting for companies. It also sold its own line of consumer electronics devices, including the Kindle e-reader, a bold but understandably complementary outgrowth of its pioneering role as a next-generation bookseller. Today, thanks to the ubiquitous Amazon Echo smart speaker and its Alexa voice-recognition engine, Amazon has sparked nothing less than the biggest shift in personal computing and communications since Steve Jobs unveiled the iPhone.

It all seemed like such a novelty at first. In November 2014, Amazon debuted the Echo, a high-tech genie that uses artificial intelligence to listen to human queries, scan millions of words in an Internet-connected database, and provide answers from the profound to the mundane. Now, sales of some 47 million Echo devices later, Amazon responds to consumers in 80 countries, from Albania to Zambia, fielding an average of 130 million questions each day. Alexa, named for the ancient Egyptian library in Alexandria, can take musical requests, supply weather reports and sports scores, and remotely adjust a user’s thermostat. It can tell jokes; respond to trivia questions; and perform prosaic, even sophomoric, tricks. (Ask Alexa for a fart, if you must.)

Amazon didn’t invent voice-recognition technology, which has been around for decades. It wasn’t even the first tech giant to offer a mainstream voice application. Apple’s Siri and Google’s Assistant predated Alexa by a few years, and Microsoft introduced Cortana around the same time as Alexa’s launch. But with the widespread success of the Echo, Amazon has touched off a fevered race to dominate the market for “smart” home devices by potentially making those objects as important as personal computers or even smartphones. Just as Google’s search algorithm revolutionized the consumption of information and upended the advertising industry, A.I.-driven voice computing promises a similar transformation. “We wanted to remove friction for our customers,” says Rohit Prasad, Amazon’s head scientist for Alexa, “and the most natural means was voice. It’s not merely a search engine with a bunch of results that says, ‘Choose one.’ It tells you the answer.”

The powerful combination of A.I. with a new, voice-driven user experience makes this competition bigger than simply a battle for the hottest gadget offering come Christmastime—though it is that too. Google, Apple, Facebook, Microsoft, and others are all pouring money into competing products. In fact, Gene Munster of the investment firm Loup Ventures estimates that the tech giants are spending a combined 10% of their annual research-and-development budgets, more than $5 billion in total, on voice recognition. He calls the advent of voice technology a “monumental change” for computing, predicting that voice commands, not keyboards or phone screens, are fast becoming “the most common way we interact with the Internet.”

With the stakes so high, it’s no surprise the competition is fierce. Amazon holds an early lead, with 42% of the global market for connected speakers, according to research firm Canalys. Google is making itself heard too. Its Echo look-alike line of Google Home devices powered by its Google Assistant has a 34% share and recently has been outselling Amazon. The pricey and later-to-the-game Apple HomePod is a distant third. And in October, Facebook unveiled its line of Portal audio and video devices, which handle some but not all of the voice-recognition tasks offered by Facebook’s mega-cap competitors and, notably, are powered by Alexa.

The current market for connected speakers and similar gadgets is big and growing—but not necessarily the most dramatic voice-related opportunity for the tech titans. Global Market Insights, a research firm, pegs global 2017 smart-speaker sales at $4.5 billion, a number it projects will grow to $30 billion by 2024. The hardware revenues, however, are largely beside the point. Amazon, for example, has sold the Echo at breakeven or less. Last holiday season it offered the bare-bones Echo Dot for $29, which ABI Research reckons is less than the cost of the device’s parts. Instead, each major player has a strategy that in some way feeds its larger goal of locking in customers to its other goods and services. Amazon, for one, uses the Echo line to increase the value of its Amazon Prime subscription service. Google hopes voice searches will eventually boost the already massive trove of data that feeds its advertising franchise. With Siri, Apple sees a way to tie together its phones, computers, TV controllers, and even the software that automakers are tying into their onboard systems.
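The forecast quoted above, $4.5 billion in 2017 growing to $30 billion by 2024, implies a steep compound annual growth rate. The short Python sketch below works out that rate; the function name is illustrative, and the calculation assumes smooth year-over-year compounding, which the research firm does not claim.

```python
def implied_annual_growth(start_value, end_value, years):
    """Compound annual growth rate that carries start_value to end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Smart-speaker sales figures quoted in the article, in billions of dollars.
rate = implied_annual_growth(4.5, 30.0, 2024 - 2017)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption, sales would need to rise by roughly 31% every year for seven years to hit the projection.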

It’s too soon to predict a winner, what with all the investment and fast-moving innovations. But it’s safe to say the industry has coalesced around the notion that voice technology, enhanced by recent advancements in artificial intelligence, is the user interface of tomorrow. And it promises to have a democratizing impact on an industry that has separated novices from experts. “Voice enables all kinds of things,” says Nick Fox, a Google vice president who oversees product and design for the Google Assistant and Search. “It enables people who are less literate to use the system. It enables people who are driving. It enables people while cooking to hear a recipe. Every once in a while there is a tectonic shift in technology, and we think voice is one of those.”

For all that, voice recognition remains in its infancy. Its applications are rudimentary compared with where researchers expect them to go, and there’s a significant ick factor associated with voice. Legitimate concerns linger as to how much the tech companies are eavesdropping on their customers—and how much power they are accumulating in the form of data derived from the spoken information they are collecting. “With A.I. voice recognition, we’ve gone from the age of the biplane to the age of the jet plane,” says Mari Ostendorf, a professor of electrical engineering at the University of Washington and one of the world’s top scientists on speech and language technology. She notes that computers have gotten good at answering straightforward questions but still are relatively hopeless when it comes to actual dialogue. “It’s truly impressive what Big Tech has done in terms of how many words voice A.I. can now recognize and the number of commands it can understand. But we’re not in the rocket era yet.”

VOICE RECOGNITION HAS BEEN the next killer app for decades. In the 1950s, Bell Labs created a system called Audrey that could recognize the spoken digits one through nine. In the 1990s, PC users installed Dragon NaturallySpeaking, a program that could process simple speech without the speaker having to pause awkwardly after each word. But it wasn’t until Apple unleashed Siri on the iPhone in 2011 that consumers got a sense of what a voice-recognition engine tied to massive computing power could accomplish. Around the same time, Amazon, a company full of Star Trek aficionados and led by a true Trekkie in CEO Jeff Bezos, began dreaming about replicating the talking computer aboard the Starship Enterprise. “We imagined a future where you could interact with any service through voice,” says Amazon’s Prasad, who has published more than 100 scientific articles on conversational A.I. and other topics. The result was Alexa, a multifaceted device designed to let consumers communicate more easily with Amazon.

As voice recognition improves—which it does as computing power gets faster, cheaper, more ubiquitous, and thus more mainstream—Amazon, Google, Apple, and others can more easily build a seamless network where voice links their smart home devices with other systems. It’s possible for Apple CarPlay users, for example, to tell Siri on the drive home to slot the latest episode of Game of Thrones as “up next” on their Apple TV and to command their HomePod to play it once they’ve arrived. Two years ago, Google released its voice-enabled Home that ties together its music offerings, YouTube, and its latest Pixel phones and tablets. Each tech giant, in other words, sees voice as a tether to the myriad digital products it is creating.

The combatants, each wildly profitable and therefore able to fund ample research and marketing efforts, bring different assets to the table. Apple and Google, for example, own the two dominant mobile operating systems, iOS and Android, respectively. That means Siri and Google Assistant come preinstalled on nearly all new phones. Amazon, in contrast, needs to get consumers to install and then open the Alexa app on their iPhones or Android devices. “The extra step to open the Alexa voice app puts Amazon at a distinct disadvantage,” says Loup’s Munster, formerly a Wall Street analyst of computer companies. By contrast, all that’s required to activate Siri and the Google Assistant is to say their names.

That said, iOS and Android are open to third-party developers of all stripes, and Amazon is one of them, meaning that nothing is stopping developers on both platforms from writing Alexa programs. Bezos bragged in an earnings release earlier this year that “tens of thousands of developers across more than 150 countries” are building Alexa apps and incorporating them into non-Amazon devices. Indeed, partnerships are a key battleground for voice applications. Alexa is built into “soundbars” from Sonos, headphones from Jabra, and cars from BMW, Ford, and Toyota. Google boasts integrations with audio equipment makers Sony and Bang & Olufsen, August smart locks, and Philips LED lighting systems, and Apple has partnerships that allow its HomePod to work with First Alert Security systems and Honeywell smart thermostats. “The beauty of these partnerships,” says Google’s Fox, “is that they allow us to link voice into the whole smart-appliance ecosystem. I don’t have to open my phone and go to an app. I can just say to the device, ‘Show me who’s at my front door,’ and it will pop right up. It’s simplifying by unifying.”

Artificial intelligence has long been a staple of dystopian popular culture, notably in films such as The Terminator and The Matrix, where wickedly clever machines rise up and pose a threat to humankind. Thankfully, we’re not there yet, but advances in A.I. and the availability of cheap computing have made impressively futuristic applications a reality. Early voice-recognition programs were only as good as the programmers who wrote them. Now these apps keep getting better because they are connected through the Internet to data centers running complex mathematical models. Those models sift through huge amounts of data that companies have spent years compiling and learn to recognize different speech patterns. They can recognize vocabulary, regional accents, colloquialisms, and the context of conversations by analyzing, for example, recordings of call-center agents talking with customers or interactions with a digital assistant.

Pope: Heinz-Dieter Falkenstein/Getty Images; Edison: Bettmann/Getty Images; Audrey: Courtesy of Nokia Bell Labs; Telephone: Sheila Terry/Science Source; Shoebox: Courtesy of IBM Corporate Archives, © 1961 IBM Corporation; HAL: Kevin Bray/MGM/Photofest; Harpy: Raj Reddy/YouTube; Devices: Courtesy of Amazon, Apple, and Google

Voice-recognition systems rely as much on physics as on computer science. Speech creates vibrations in the air, which voice engines pick up as analog sound waves and then translate into a digital format. Computers can then analyze that digital data for meaning. Artificial intelligence turbocharges the process. The systems first figure out whether a sound is directed at them by detecting a customer-chosen “wake word” such as “Alexa.” They then use machine-learning models, trained on what millions of other customers have said to them before, to make highly accurate guesses as to what was said. “A voice-recognition system first recognizes the sound, and then it puts the words in context,” explains Johan Schalkwyk, an engineering vice president for the Google Assistant. “If I say, ‘What’s the weather in …,’ the A.I. knows that the next word is a country or a city. We have a 5-million-word English vocabulary in our database, and to recognize one word out of 5 million without context is a super hard problem. If the A.I. knows you’re asking about a city, then it’s only a one-in-30,000 task, which is much easier to get right.”
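Schalkwyk’s point about context can be illustrated with a toy example. The Python sketch below is not how the Google Assistant works; it assumes a made-up nine-word vocabulary, a hard-coded “the weather in” trigger, and spelling similarity (via the standard library’s difflib) as a stand-in for acoustic similarity. It shows only the general idea: the same ambiguous sound resolves differently once the candidate list shrinks to place names.

```python
import difflib

# Hypothetical toy vocabulary standing in for a multimillion-word database.
FULL_VOCABULARY = ["pairs", "parts", "paris", "parish", "berlin",
                   "burden", "london", "lending", "lemon"]
# Hypothetical subset the system falls back on when it expects a place name.
CITIES = ["paris", "berlin", "london"]


def best_match(heard, candidates):
    """Return the candidate spelled most like what was heard, or None."""
    matches = difflib.get_close_matches(heard, candidates, n=1, cutoff=0.0)
    return matches[0] if matches else None


def recognize(heard, previous_words):
    """Pick a word, narrowing the candidates when context expects a city."""
    expects_city = previous_words[-3:] == ["the", "weather", "in"]
    candidates = CITIES if expects_city else FULL_VOCABULARY
    return best_match(heard, candidates)


# The same garbled input is resolved differently depending on context.
print(recognize("pairs", ["two"]))
print(recognize("pairs", ["what's", "the", "weather", "in"]))
```

Going from 5 million candidates to 30,000, as in the quote, cuts the search space by a factor of roughly 167, which is why the narrowed task is so much easier to get right.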

Computing power allows the systems multiple opportunities to learn. For a user to ask Alexa to turn on the microwave (a real example), the voice engine first needs to understand the command. That means learning to decipher thick Southern accents (“MAH-cruhwave”), high-pitched kids’ voices, non-native speakers, and so on, while at the same time filtering out background noise like song lyrics playing on the radio. It then has to understand the many ways people might ask to use the microwave: “Reheat my food,” “Turn on my microwave,” “Nuke the food for two minutes.” Alexa and other voice assistants match questions with similar commands in the database, thereby “learning” that “reheat my food” is how a particular user is likely to ask in the future.
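The matching step described above can be sketched in a few lines of Python. This is a simplified illustration, not Amazon’s method: the intent names, the example phrases, the word-overlap (Jaccard) score, and the 0.3 threshold are all assumptions chosen to keep the example small, whereas production assistants rely on trained statistical models.

```python
import re

# Hypothetical example phrasings per command; real systems learn these from data.
INTENT_EXAMPLES = {
    "microwave_on": [
        "turn on my microwave",
        "reheat my food",
        "nuke the food for two minutes",
    ],
    "weather_report": [
        "what is the weather today",
        "will it rain tomorrow",
    ],
}


def tokens(text):
    """Lowercase a phrase and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Share of words two phrases have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_intent(utterance, examples, threshold=0.3):
    """Return the intent whose example phrase best overlaps the utterance."""
    heard = tokens(utterance)
    best_intent, best_score = None, 0.0
    for intent, phrases in examples.items():
        for phrase in phrases:
            score = jaccard(heard, tokens(phrase))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None


def learn(utterance, intent, examples):
    """Remember a new phrasing so similar requests match next time."""
    examples.setdefault(intent, []).append(utterance)


print(match_intent("please reheat my food", INTENT_EXAMPLES))
print(match_intent("warm up leftovers", INTENT_EXAMPLES))
learn("warm up my leftovers", "microwave_on", INTENT_EXAMPLES)
print(match_intent("warm up leftovers", INTENT_EXAMPLES))
```

An unfamiliar phrasing goes unmatched at first; once it is stored against the right command, similar requests resolve to that command, which is the “learning” the paragraph describes.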

The technology has taken off in part because it has gotten so proficient at translating human commands into action. Google’s Schalkwyk says his company’s voice engine now responds with 95% accuracy, up from only 80% in 2013—about the same so-so level of accuracy human listeners achieve. One of the great recent triumphs in the field has been teaching the engines to filter out nonspoken background noise, a distraction that can frustrate the keenest human ear. These systems reach this level, however, only when the question is simple, like, “What time is Mission: Impossible playing?” Ask the Google Assistant or Alexa for an opinion or try to have an extended back-and-forth conversation, and the machine is likely to give either a jokey preprogrammed answer or to simply demur: “Hmm, I don’t know that one.”

TO CONSUMERS, voice-driven gadgets are helpful and sometimes entertaining “assistants.” For the tech giants that make them—and keep them connected to the computers in their data centers—they’re tiny but extremely efficient data collectors. About 60% of Amazon Echo and Google Home users have at least one household accessory, such as a thermostat, security system, or appliance, connected to them, according to Consumer Intelligence Research Partners. A voice-powered home accessory can record endless facts about a user’s daily life. And the more data Amazon, Google, and Apple can accumulate, the better they can serve those consumers, whether through additional devices, subscription services, or advertising on behalf of other merchants.

The commercial opportunities are straightforward. A consumer who connects an Echo to his thermostat might be receptive to an offer to buy a smart lighting system. Creepy though it may sound to privacy advocates, the tech giants are sitting on top of a treasure trove of personal data that lets them market more efficiently to consumers.

As with their overall strategies, the tech giants have different approaches to the data they collect. Amazon says it uses data from Alexa to make the software smarter and more useful to its customers. The better Alexa becomes, the company claims, the more customers will see the value of its products and services, including its Prime membership program. Although Amazon is making a big push into advertising—the research firm eMarketer projects the company will pull in $4.61 billion from digital advertising in 2018—a spokesperson says it does not currently use Alexa data to sell ads. Google, counterintuitively, considering its giant ad business, also isn’t positioning voice as an ad opportunity—yet. Apple, which loudly plays up the virtue of its unwillingness to exploit customer data for commercial gain, claims to be approaching voice merely as a way to improve the experience of its users and to sell more of its expensive HomePods.

DESPITE ONE OF AMAZON’S early selling points, what people aren’t asking their devices to do is help them shop. Amazon won’t comment on how many Echo users shop with the device, but a recent survey of book buyers by consulting firm the Codex Group suggests that it’s still early days. It found that only 8% used the Echo to buy a book, while 13% used it to listen to audiobooks. “People are creatures of habit,” says Vincent Thielke, an analyst with research firm Canalys, which focuses on tech. “When you’re looking to buy a coffee cup, it’s hard to describe what you want to a smart speaker.”

Amazon does say it’s not overly fixated on the Echo as a shopping aid, especially given how the device ties in with the other services it offers through its Prime subscription. Still, it holds out hope the Amazon-optimized computers it has placed in customers’ homes will boost its retail business. “What is available for shopping is your buying history,” says Amazon’s Prasad, the natural-language-processing scientist. “If you want to buy double-A batteries, you don’t need to see them, and you don’t need to remember which ones. If you’ve never bought batteries before, we will suggest Amazon’s brand, of course.”

The potential to boost shopping remains far bigger than selling replacement batteries, especially because so many merchants will want to partner with, and take advantage of, the platforms associated with the tech giants. The research firm OC&C Strategy Consultants predicts that voice shopping sales from Echo, Google Home, and their ilk will reach $40 billion by 2022, up from $2 billion today. A critical evolution of the speakers helps explain the promise. Both Amazon and Google now offer smart home devices with screens, which make the gadgets feel more like a cross between small computers and television sets and thus better for online shopping. Amazon launched the $230 Echo Show in the spring of 2017. Like other Echo devices, the Show has Alexa embedded, but it also enables users to see images. That means shoppers can see the products they are ordering as well as their shopping lists, TV shows, music lyrics, feeds from security cameras, and photos from that vacation in Montana, all without pushing any buttons or manipulating a computer mouse.

For its part, Google has partnered with four consumer electronics manufacturers, some of which have recently started selling smart screens integrated with the Google Assistant. The Lenovo Smart Display, for example, looks a lot like Facebook’s new Portal and retails for $250, the same price as the JBL Link View. LG plans to launch the ThinQ View. In October, Google started selling its own version, the Home Hub, for $149, with a seven-inch screen.

In the long run, Google is betting that having a screen will make voice shopping easier. The search company doesn’t sell products directly like Amazon, but its Google Shopping site connects retailers to the Google search engine. Already it is empowering the Google Home device as a shopping tool. It has a partnership with Starbucks, for example, that enables a user to tell the Google Assistant to order “my usual,” and the order will be ready upon arrival. Last year, Google cemented a partnership with Walmart, the world’s largest retailer. Shoppers can link their existing Walmart online account to Google’s shopping site and simply ask Google Home to check whether a favorite pair of running shoes is in stock, reserve a flat-screen TV for same-day pickup, or find the nearest Walmart store.

The rise of vision-recognition tech—voice recognition’s A.I. sibling, long used for matching faces of criminals in a crowd—will make shopping on these devices even more convenient. In September, Amazon announced it was testing with Snapchat an app that enables shoppers to take a picture of a product or a bar code with Snapchat’s camera and then see an Amazon product page on the screen. It’s not hard to imagine that the next step for shoppers will be to use the camera embedded in the Echo Show to snap a picture of something they’d like to buy and then see onscreen the same or similar items along with prices, ratings, and whether they’re available for Prime two-day free shipping.

EXCITING AS THIS technology is, it may take nontechnophiles a bit of time to get used to speaking to machines. The tech giants aren’t the most trusted of companies right now, and they’ll need to convince consumers their devices aren’t eavesdropping for nefarious reasons. Smart speakers are supposed to click into listen mode only when they detect “wake words,” such as “Alexa,” or “Hey, Google.” Yet in May, Amazon mistakenly sent a conversation about hardwood floors that a Portland executive was having with his wife to one of his employees. Amazon publicly apologized for the snafu, saying it had “misinterpreted” the conversation.

The spoken word has the potential for errors far beyond that of typed commands. This can have commercial repercussions. Last year a 6-year-old Dallas girl was talking to Alexa about cookies and dollhouses, and days later, four pounds of cookies and a $170 dollhouse were delivered to her family’s door. Amazon says Alexa has parental controls that, if used, would have prevented the incident.

Still, widespread adoption is likely because of the growing convenience of a voice-connected world. With more than 100 million of these devices already installed and in listening mode, it’s only a matter of time before voice becomes the dominant way humans and machines communicate with each other, even if the conversation involves little more than scatological sounds and squeals of laughter.

Brian Dumaine is the author of a forthcoming book on Amazon to be published by Scribner.

This article originally appeared in the November 1, 2018 issue of Fortune.