Google’s DeepMind Claims Massive Progress in Synthesized Speech
Researchers at Google’s DeepMind artificial intelligence division claim to have come up with a way of producing much more natural-sounding synthesized speech, compared with the techniques that are currently in use.
Most existing text-to-speech (TTS) systems rely on an approach called concatenative TTS, in which the audio is generated by recombining fragments of recorded speech. There’s also a technique called parametric TTS that generates speech by passing information through a vocoder, but that sounds even less natural.
So DeepMind has come up with a new technique called WaveNet that learns from the audio it’s fed, and produces raw audio sample-by-sample. To give an idea of how detailed that is, we’re talking at least 16,000 samples per second.
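To put that sample rate in perspective, here is a quick back-of-the-envelope calculation in Python. The 16,000 figure comes from the article; the function name and the clip lengths are made up for illustration.

```python
SAMPLE_RATE_HZ = 16_000  # samples per second, the figure quoted above


def samples_needed(duration_seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Return how many individual samples a clip of this length contains."""
    return round(duration_seconds * sample_rate_hz)


# A one-second clip and a ten-second clip (illustrative durations).
print(samples_needed(1.0))   # 16000
print(samples_needed(10.0))  # 160000
```

Every one of those samples has to be produced individually, which is why even a short sentence involves a large number of steps.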
A WaveNet is a “neural network,” a statistical model loosely inspired by the brain, that is trained on real waveforms. When “speaking,” it predicts each new audio sample from the samples that came before it, building up the waveform piece by piece.
“Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio,” DeepMind’s researchers said in a post about their findings.
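The one-step-at-a-time idea can be sketched in a few lines of Python. This is a toy, not DeepMind's code: the "model" below is a hand-written stand-in for a trained network, and the 256 amplitude levels, the context length, and all of the names are assumptions chosen for illustration.

```python
import math
import random

NUM_LEVELS = 256  # assumption: each sample takes one of 256 amplitude levels
CONTEXT = 64      # assumption: how many past samples the toy model may look at


def toy_next_sample_distribution(history):
    """Stand-in for a trained network.

    Returns one probability per amplitude level. This toy version simply
    favors levels close to the most recent sample.
    """
    prev = history[-1] if history else NUM_LEVELS // 2
    weights = [math.exp(-((level - prev) ** 2) / 50.0) for level in range(NUM_LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, seed=0):
    """Build a waveform one sample at a time, feeding each output back in."""
    rng = random.Random(seed)
    audio = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(audio[-CONTEXT:])
        next_sample = rng.choices(range(NUM_LEVELS), weights=probs, k=1)[0]
        audio.append(next_sample)  # the new sample becomes part of the history
    return audio


clip = generate(200)
print(len(clip), min(clip), max(clip))
```

The loop runs once per sample, and each pass has to wait for the previous one to finish. With a real network in place of the toy function, that is where the heavy computational cost the researchers describe comes from.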
That post is well worth checking out, as it includes several clips of the same pieces of text, read out by different speech synthesis techniques. For both U.S. English and Mandarin Chinese, the WaveNet-generated audio is noticeably more realistic than that produced by concatenative TTS.
DeepMind claimed that in blind tests with human listeners, WaveNet narrowed the gap between the best existing synthesized speech and real human speech by more than 50%, though of course such tests are subjective.
DeepMind’s researchers said they would be able to add emotions and accents as inputs, to make the speech sound even more realistic.
Fascinatingly, WaveNets can generate speech without text—or at least, what the neural networks think speech should sound like. As the clips show, these are word-like sounds that mean nothing, and they’re rather creepy.
The same techniques can also be used to create non-speech audio. The post includes clips of the “music” generated by WaveNets that were trained on classical music. Again, it is a good approximation of the real thing, one that might pass for actual music if you’re not listening too closely.
All this requires an awful lot of computational power and isn’t being used in any real-world applications just yet. But as is usually the case with such things, it’s probably just a matter of time before computers are able to sound extremely human.