Under normal circumstances, Ignacio and I shouldn’t be able to shoot the breeze about the weather so easily.
I’m seated in a seventh-floor conference room in Manhattan. Ignacio is in Madrid, where, incidentally, the weather is much better. Ignacio is speaking in his native, regionally tinged Spanish. I’m speaking vanilla American English, the variety spoken by a suburban-born East Coast American several generations removed from his European heritage.
Yet our conversation, barring a few inconsequential hiccups, is seamless. We’re conversing over Skype Translator, a tool released in beta this week to roughly 50,000 Skype users that interprets and translates voice calls from English to Spanish and back in real time. (Support for dozens of other languages is in the pipeline.)
Skype Translator represents more than a decade of behind-the-scenes work at Microsoft Research. (Microsoft acquired Skype in 2011.) The tool aims to be something like the Universal Translator from Star Trek: one language goes in and another comes out, allowing two speakers who know nothing of each other’s tongues to interact in normal, if slightly stilted, conversation.
In developing such a platform, Microsoft Research (MSFT) has not only solved an extremely difficult computational problem that has for years dogged academics, researchers, and even DARPA, the research arm of the U.S. Department of Defense. It has also built a tool that, in the very near term, could change the way individuals and companies across the world interact. “Our goal is to have every human in the world able to use this tool on whatever device they have,” says Gurdeep Pall, corporate vice president for Skype. “And this is machine learning, so the more usage we get, the better it gets.”
It’s already quite good. When you initiate a Skype Translator call, you select a male or female voice to work as vocal proxy. (Microsoft Research has already demonstrated technology that can translate in the user’s own voice, but that feature will come in a later version of the product.) From there, the call functions as a regular Skype voice call.
But there’s a big difference. Between speakers, a third participant on the call—a Skype Translator bot—processes each speaker’s language through a multi-tiered, cloud-based application for speech recognition, translation, and synthesis. All of that complexity happens in the background, thanks to the tremendous computing power and software wizardry the bot can access in the cloud. For the user, the experience is much simpler: a mere half-second after one party finishes speaking, an audio translation of his or her speech plays for the recipient, with a running text transcript displayed alongside the video or voice call.
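The bot’s turn-by-turn flow can be pictured as three stages chained together. The sketch below is purely illustrative—the function names, the toy phrase table, and the fake "audio" input are this writer’s assumptions, not Microsoft’s actual APIs or models:

```python
# Hypothetical sketch of the Translator bot's three-stage pipeline.
# Real systems operate on audio streams and statistical models; here each
# stage is a stub so the hand-off between stages is easy to see.

def recognize_speech(audio):
    """Stage 1: speech recognition -- audio in, source-language text out.
    The 'audio' is stood in for by a string for illustration."""
    return audio.strip().lower()

# Toy English -> Spanish lookup standing in for the translation model.
PHRASE_TABLE = {
    "good morning": "buenos días",
    "how is the weather": "qué tal el tiempo",
}

def translate_text(text, table=PHRASE_TABLE):
    """Stage 2: machine translation -- map recognized text to the target language."""
    return table.get(text, text)  # fall back to the original if unknown

def synthesize_speech(text, voice="female"):
    """Stage 3: speech synthesis -- render the translation in the chosen proxy voice."""
    return f"[{voice} voice] {text}"

def translator_bot(audio, voice="female"):
    """Run one utterance through all three stages, as the bot does each turn.
    Returns the synthesized audio plus the transcript shown alongside the call."""
    recognized = recognize_speech(audio)
    translated = translate_text(recognized)
    return synthesize_speech(translated, voice), recognized

spoken, transcript = translator_bot("Good morning")
print(spoken)      # [female voice] buenos días
print(transcript)  # good morning
```

Keeping the stages separate is what lets each one improve independently—better acoustic models for stage 1, better translation models for stage 2—without rebuilding the whole pipeline.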
The translation isn’t perfect. The software still sometimes gets hung up on idiom and nuance, on problems created by tone, common mispronunciations, or the lazy way most of us enunciate the words of our mother tongues. But Skype Translator is right most of the time—so often that the average person can correct for errors using conversational context, and often enough that Microsoft felt comfortable releasing it as a consumer product. That represents a leap forward for machine translation specifically and machine learning more generally.
The latter aspect is the key to Skype Translator’s future success. The more users speak through Microsoft’s translation platform, the better it understands human language and the more accurate it becomes. Errors the software makes today will disappear as it logs more examples of natural human language: of the ways humans write and speak, and of how they phrase things differently across social media, e-mail, chat, and spoken conversation. “The way people interact turns out to be really interesting,” says Peter Lee, corporate vice president at Microsoft Research. “The whole process for us in research has been illuminating.”
That process has involved building on text-based translation and speech recognition platforms Microsoft had already developed, such as Cortana and Bing Translator, as well as returning to a technology known as “deep neural networks,” or DNNs, that was pushed aside several years ago as researchers found other methods of machine translation more promising, says Vikram Dendi, strategy director for Microsoft Research. When Microsoft returned to DNNs around 2009 and 2010, it found that computing had caught up to the vision of DNN-based speech recognition: the explosion in cloud computing power and advances in DNN technology itself had put the previously impossible within reach.
The result is a solution to a problem that has long vexed everyone from Fortune 500 companies to small businesses trying to expand across borders to soldiers operating in foreign conflict zones: how do I interact with the people I need to, right now, when we don’t share a shred of common language? Whether you’re the proprietor of a Spanish bed and breakfast trying to accommodate foreign travelers or an executive at a major multinational corporation working with peers in foreign countries, communication is currency.
That reality makes being first to market with a working, real-time translation product—one that will soon work across platforms—all the more important. Because the machine learning aspect of Skype Translator ensures the product will improve over time, getting it into consumers’ hands ahead of potential competitors—and there are many—gives Microsoft and Skype a considerable head start.
Skype’s ability to scale Translator over the next several months will be crucial. In beta, Skype Translator is only available on devices running Windows 8.1 or newer. Functionality for other operating systems, including Apple’s iOS, will come later. “Skype is there on nearly every relevant platform, and this will be an integral piece of Skype moving forward,” Pall says. The company already facilitates communication for 300 million people worldwide, having hosted some 2 billion minutes of calling. It’s not lost on Skype’s leadership that with Translator incorporated into its core product, the company could more deeply penetrate the market for real-time translation.
Microsoft declined to say how quickly it plans to support more languages or which specific languages it plans to support in the future. (The additions won’t take a decade, Pall says.) In the relatively near term, Microsoft and Skype could add voice translation support for the 40-plus languages it already supports for text translation. The move would solidify Skype in users’ minds as the go-to voice translation tool—the thing that lets Ignacio in Madrid and a journalist in New York converse like there’s no ocean, much less a language barrier, between them.
“We’re very excited about getting the stuff out,” Pall says. “We’re working very hard to make sure it doesn’t take years.”