By Shelley Evenson and Olof Schybergson, contributors
FORTUNE — Recently there’s been a lot of comparison between Apple’s Siri and Google’s Voice Search. Microsoft’s voice breakthroughs have also captured headlines. After decades of research and false starts, the competition in voice interfaces is now heating up thanks to its appearance on mobile devices, and the race is on to shape the definitive voice interface for the mass market.
But if the ballads to Siri’s limitations or sites dedicated to her often-hilarious interpretations of simple instructions are any indication, we still have a long way to go until a winner is declared. To succeed, the big players will need to conquer the human challenges to voice tech if they want to design a service that most people will happily incorporate into their daily routines.
For any advanced voice service, in addition to great voice recognition and interpretation, you need a compelling and simple interface that feels personal, context awareness that adds depth, and a very clever and fast backend that continuously learns the user’s intent. No one service is the ultimate answer – yet.
If this type of voice assistant did exist, it would have the potential to make voice interaction go from niche to mainstream. Here’s why.
The human component
Most people like to talk. But when faced with talking to machines, most of us get intimidated. Hand the biggest extrovert a microphone and they tend to clam up. Or just observe someone trying out a voice service for the first time. It simply doesn’t feel (or look) easy or natural.
So why don’t people like to talk to machines? Feedback (or the lack of it) is a big reason. When talking with another person, there are rich layers of feedback throughout the interaction – facial expressions, body language, tone of voice, and more. Constant real-time feedback is central in human communication, and both speaker and listener are active participants in the communication. With voice services, most of this feedback and interaction is stripped out.
Another reason that technical voice services failed to catch on earlier, even though they were common in computer programs, is that there’s simply less need to use voice on computers compared to mobile devices. When using a computer, your hands are already committed, the QWERTY text input is pretty efficient, and seeing text as you type it also confirms it’s correct. Voice input or output adds little value there. Smartphones offer the turning point for voice. When you’re on the move, chances are high that your hands could do some other useful things if you can use speech to interact with your mobile in order to find things or get stuff done.
The loudest voices
With Siri, Apple (AAPL) is attempting to conquer the feedback issue by designing a service that comes with a charming personality and a sense of humor. This embodiment of Apple’s voice service makes it recognizable, tangible, and almost human. Conversing with Siri somehow feels less strange than talking to a Google (GOOG) or Microsoft (MSFT) device, simply because we’re not used to talking to machines. The vision that Apple puts forward for Siri in its ads certainly helps to make talking to your device as if it were a person seem normal, natural, and of course, cool. This is a big contribution toward moving voice interaction into the mainstream, but the reality is that Siri still has her shortcomings.
Compare this approach to Google’s. Google is a temple to technology, and their services tend to be utilitarian viruses – reliable, efficient, technically impressive, and able to find their way into all corners of people’s lives. But they are not fun, quirky, or idiosyncratic – those are not Google traits. Google Voice Search has all the hallmark traits of other Google services. That makes it a great utility for finding things, but the barrier to engagement will still be higher for most people compared with Siri, simply because it’s less human.
There’s another key difference between Siri and Google’s Voice Search. Siri’s promise is to be an “assistant” that helps you get stuff done, not merely a search utility for finding information. This development might in fact come to define a major shift in how technology serves our needs, as the emphasis shifts from finding things to doing things for us. Your partner or best friend tends to help you reach your goals. This human coaching is valued and possible because these people know a lot about you. This knowledge also makes interacting with them pleasurable and rewarding. You don’t have to teach them the basics about your preferences all the time; they simply know. In a similar way, in order to become a valuable companion for you, Siri must have contextual smarts – that’s the really tricky part.
Siri is already deeply embedded in iOS, and with the help of information like device location and calendar appointments she aims to better understand the individual’s intent and personal context.
The next level of intelligence could be derived from the apps that reside on your iOS device. Just the app collection itself can give Siri some useful clues about your interests and habits, but she will become much smarter if she can get access to the data from some of the apps. The Citibank (C) app might tell her where you tend to spend money and how much, the NFL app will tell her what football team you support, Facebook (FB) and LinkedIn (LNKD) could tell her about your friends, job, and colleagues, and Spotify and Netflix (NFLX) will tell her what music and movies you like. Imagine how much better she could serve you if she knew all this.
Nuance Communications, the powerhouse in voice technologies (and the company that helps to power Siri), recently introduced Nina, its own voice solution that appears to be ready to tackle the app integration challenge on an enterprise level. Billed as a “virtual assistant for mobile customer service apps,” Nina promises to deliver a more compelling user experience through greater contextual awareness – and therefore a voice assistant that goes further than Siri to bridge the human-computer divide, making tasks like paying bills easy to do just by talking to your mobile phone.
Apple may not be too far behind in the race for better contextual awareness. Their recent patent applications show a clear intent to make Siri more integrated into the living room via Apple TV, and into our photos via iPhoto. These kinds of connections could make Siri more useful and ubiquitous.
However, the above is not quite reality yet, and compared to Google, Siri currently has an Achilles heel. Siri’s relatively weak backend fulfillment and accuracy often make her appear less like the savvy and smart woman you would want her to be, and more like a three-year-old child who has only just mastered basic language skills and knows little about the world.
Microsoft has long been a voice pioneer, and they are still actively driving progress in the domain. In a stunning demonstration of assistance in context, Microsoft’s Chief Research Officer Rick Rashid recently presented the company’s latest voice breakthrough by having the technology translate his spoken English into Chinese in near real-time and in his own voice. If deployed at scale, imagine how much international business would be transformed if we could use Microsoft technology to effortlessly speak in any language. It’s the nightmare of professional translators and proud polyglots, but the dream of many travellers.
While Microsoft was an early player in mobile voice, they seem to have been unable to effectively capitalize on their capabilities. They have lots of voice experience and technology, and through their Bing investments they can compete with Google’s backend accuracy. What’s missing in their voice services is a clear embodiment of the technology that makes this relatively foreign and abstract phenomenon understandable and approachable for normal people. If Microsoft had invested a small portion of their voice R&D spend in designing a delightful human interface for the service, they would now be a real contender for defining the dominant voice interface.
Personalization, context, and intent
The approaches that Apple, Google and Microsoft have taken in voice make the challenges of the interaction abundantly clear.
With voice, the potential for intuitive interactions grows exponentially as the system knows more about you. With more data, it can effectively anticipate your context and intent. Looking ahead, we’ll begin to see exciting new services that listen in the background and grant genie-like wishes, picking up on everything from a book casually mentioned to the address of a restaurant in San Francisco that reminds you of the one you loved in New York, and then prompting you accordingly for orders or reservations.
We can imagine services that enable personalized simultaneous translation in global negotiations, or voice intelligence that replaces pre-recorded messaging with real-time contextual information delivered with your voice – for example, your spouse hearing why you’re unavailable at the moment.
The more permission we give these systems to “listen in,” the better they will be at offering the responses we need.
Today’s voice technology is advanced and complex, but it’s also largely predictable and consistent. In comparison, people are complex. They’re all different from each other, and their behavior is heavily influenced by culture, expectations, and mood.
Among the big voice service contenders, Apple has a clear advantage with Siri, because they’ve understood the importance of the human interface. Google will need to soften up a bit and step down from the technology altar if they want to become a leader in voice services for the masses. Microsoft should make a big new push with voice – they currently have all the assets needed, but not the product that makes waves.
For the other companies who consider using voice technology in their services, the advice is simple: the challenge for voice service adoption is not about technology anymore – increasingly the technology is available, smart, and reliable. The real challenge is to make the technology work for people. That’s where design comes in.
Olof Schybergson co-founded Fjord in 2001, and has since led the company to become one of the world’s most successful service design consultancies working with clients including the BBC, Citibank, ESPN, Flickr, Foursquare, Harvard Medical School, Nokia, and Qualcomm, among others. Olof has years of experience collaborating with major brands to design breakthrough experiences that make complex systems simple and elegant. A pioneer in service design, Shelley (@shelleyke) recently joined Fjord as the Executive Director for Organizational Evolution. Previously, she was at Facebook as a Research Manager in Design and User Experience and a Principal User Experience Designer and Manager for Microsoft.