With the growing use of virtual personal assistants such as Google Assistant, Apple's Siri, Microsoft's Cortana, and Amazon's Alexa, AI speech itself, as well as the language it uses, is becoming more and more important. Currently, the speech generated by virtual assistants is actually real human speech, just chopped up and rearranged as necessary. This is known as concatenative speech synthesis, and it relies on every word or phrase being pre-recorded. Now, though, Google's DeepMind AI is generating its own voice. It still uses real human speech, but only to learn patterns and intonations. Using the same linguistic information, it then forms its own speech, and it can generate a variety of different voices from that same information. This approach is called parametric speech synthesis, and it is potentially much more flexible, since it isn't limited by the recordings it has available.
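To make the contrast concrete, here is a minimal sketch of the concatenative approach. Everything named here is hypothetical (the `synthesize` function, the `clips` directory, the per-word WAV files), but the core idea is the technique itself: look up a pre-recorded clip for each word and splice the raw audio together.

```python
import wave

def synthesize(sentence, clip_dir="clips"):
    """Concatenative synthesis sketch: join pre-recorded word clips.

    Assumes one WAV file per word, all recorded in the same format.
    A real system would also smooth the joins and handle prosody.
    """
    frames, params = [], None
    for word in sentence.lower().split():
        with wave.open(f"{clip_dir}/{word}.wav", "rb") as clip:
            if params is None:
                params = clip.getparams()  # take the format from the first clip
            frames.append(clip.readframes(clip.getnframes()))
    with wave.open("output.wav", "wb") as out:
        out.setparams(params)
        for chunk in frames:
            out.writeframes(chunk)

synthesize("hello world")
```

The limitation the article describes falls straight out of this design: if a word was never recorded, the lookup fails, and every voice requires a whole new set of recordings.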
The Google system, called WaveNet, has been tested on both English and Mandarin Chinese listeners and was judged significantly more realistic than the other speech generators used in the test. In blind tests using 100 test sentences, listeners gave WaveNet high realism ratings, although still lower than those of a genuine human voice. The system isn't even limited to creating speech: it has already demonstrated the ability to generate its own music by analyzing existing piano recordings. Given that ability, there's no reason it couldn't learn other instruments and styles of music as well, creating its own compositions.
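What lets one model handle both speech and piano is that WaveNet is autoregressive over raw audio: it builds the waveform one sample at a time, with each new sample conditioned on the samples before it. The toy loop below shows that shape of generation. The `predict_next_sample_distribution` function is a hypothetical stand-in for the trained network (here just random weights), though the 256 quantized amplitude levels and the 16 kHz rate do match the published WaveNet setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_next_sample_distribution(history):
    """Placeholder for the trained network.

    A real WaveNet is a deep stack of dilated convolutions that maps
    the recent waveform history to a distribution over the next 8-bit
    sample; here we just return random probabilities of that shape.
    """
    logits = rng.normal(size=256)       # 256 quantized amplitude levels
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Autoregressive generation: each new sample is drawn conditioned on
# everything generated so far, which is how the model reproduces the
# patterns and intonation it learned from recordings.
waveform = [128]                        # start at mid-amplitude
for _ in range(16000):                  # one second of audio at 16 kHz
    probs = predict_next_sample_distribution(waveform[-1024:])
    waveform.append(int(rng.choice(256, p=probs)))
```

Sampling one value at a time for 16,000 samples per second of audio also hints at why the approach is so computationally expensive.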
The Turing Test for machine intelligence is based on the text of a conversation rather than the speech itself, but perhaps we are close to the point where people will also struggle to tell the difference between a real human voice and a computer-generated one. Although WaveNet looks like promising technology, the processing power it requires means it won't be coming to consumer devices in the near future; for now it remains primarily an experimental project. Still, as an indicator of what lies ahead, our devices are likely to someday find their own voice.