Google's Tacotron 2 Text-To-Speech Tech Sounds Like A Human

Google recently wrapped up development of Tacotron 2, the next generation of its text-to-speech technology that the company has been refining for years, as revealed by a research paper authored by the firm’s engineers and scientists that’s presently awaiting peer review. The Mountain View, California-based tech giant managed to bring its solution to the point where it’s hardly distinguishable from an actual human voice, with the underlying principles of the platform remaining the same as the system continues to rely on two neural networks running in sequence.

The first network is tasked with turning ordinary text into a spectrogram, a visual representation of the audio frequencies associated with whatever piece of writing the tech is told to dictate. Its output is then fed into a second neural network called WaveNet, which was developed by Google’s British AI subsidiary DeepMind. WaveNet already powers the English and Japanese versions of the Google Assistant and is set to soon expand to other variants of the AI service, with Google touting it as a cutting-edge text-to-speech solution second to none. The main advantage WaveNet has over its alternatives is its ability to function as a standalone service that doesn’t require access to a large database of pre-recorded sounds. Instead, the technology generates its own audio based on the spectrograms fed to it by the first neural network backing Tacotron 2, with the end result of its efforts usually being indistinguishable from an actual human voice. The research paper linked below contains links leading to examples of the solution’s capabilities.
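
To make that two-stage flow concrete, here is a minimal Python sketch of the pipeline’s overall shape, with both networks replaced by toy stand-in functions; the function names, array shapes, and parameter values are illustrative assumptions rather than details of Google’s implementation.

```python
import numpy as np

def text_to_mel(text: str, n_mels: int = 80, frames_per_char: int = 5) -> np.ndarray:
    """Stand-in for Tacotron 2's first network, a sequence-to-sequence model
    that maps characters to a spectrogram. Here we fabricate an array of a
    plausible shape so the data flow between the two stages is visible."""
    n_frames = len(text) * frames_per_char
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_mels, n_frames))

def wavenet_vocoder(mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
    """Stand-in for the WaveNet vocoder: the real network generates raw audio
    samples conditioned on the spectrogram. Here we simply upsample the
    spectrogram's energy envelope onto a sine tone to show the conditioning."""
    envelope = np.repeat(np.abs(mel).mean(axis=0), hop_length)
    t = np.arange(envelope.size)
    return envelope * np.sin(2 * np.pi * 220.0 * t / 22050.0)

# The two networks run as a pipeline rather than in parallel:
# text in, spectrogram in the middle, waveform out.
mel = text_to_mel("Hello, world.")
audio = wavenet_vocoder(mel)
print(mel.shape, audio.shape)  # (80, 65) (16640,)
```

Generating audio from an intermediate spectrogram rather than stitching together canned clips is what lets the system speak arbitrary text it has never recorded.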

Besides the order of words, Tacotron 2’s enunciation is also informed by punctuation and capitalization, which lets the solution infer the importance of particular words and decide whether to stress them or pass over them quickly, much like an actual person would. The main drawback of the current technology is its lack of customization, as it only supports a single female voice in its most advanced state. Adding more voices to the equation would require extensive retraining efforts on DeepMind’s part, though such endeavors are likely to begin in the future.
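
Tacotron 2 learns those cues end to end from raw characters rather than from hand-written rules, but a toy rule-based sketch makes the idea concrete; the pause durations and names below are illustrative assumptions, not values from the paper.

```python
# Hypothetical illustration: map punctuation marks to pause lengths (in
# milliseconds) that a synthesizer might insert. Tacotron 2 infers such
# timing from data, so these specific values are assumptions.
PAUSE_MS = {",": 150, ";": 250, ".": 400, "!": 400, "?": 400}

def annotate_pauses(text: str) -> list[tuple[str, int]]:
    """Attach the pause implied by each word's trailing punctuation, so a
    downstream synthesizer could lengthen the gaps accordingly."""
    punctuation = "".join(PAUSE_MS)
    return [(word.rstrip(punctuation), PAUSE_MS.get(word[-1], 0))
            for word in text.split()]

print(annotate_pauses("Well, that settles it. Doesn't it?"))
# [('Well', 150), ('that', 0), ('settles', 0), ('it', 400),
#  ("Doesn't", 0), ('it', 400)]
```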