Microsoft has unveiled its latest text-to-speech AI model, VALL-E, which can replicate a person’s voice almost perfectly. The model needs only a three-second audio sample of a speaker to imitate them. Once it has heard a voice, it can synthesize audio of that person saying anything, while preserving the speaker’s emotional tone and acoustic environment.
How does it work?
VALL-E builds on EnCodec, a neural audio codec that Meta unveiled in October 2022. EnCodec lets VALL-E generate discrete audio codec codes from text and a short acoustic prompt. This differs from conventional text-to-speech systems, which typically synthesize speech by directly generating or manipulating waveforms; VALL-E instead treats speech synthesis as a language-modeling task over discrete tokens.
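To make the idea of "discrete audio codec codes" concrete, here is a minimal sketch, not VALL-E's or EnCodec's actual code: a neural codec compresses audio into a sequence of integer codes by matching short frames against a learned codebook. The codebook and frames below are toy values invented purely for illustration; real codecs use learned, high-dimensional codebooks with residual quantization.

```python
def quantize_frame(frame, codebook):
    """Return the index of the codebook vector closest to the frame."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist(frame, codebook[i]))

def encode(frames, codebook):
    """Map each audio frame to a discrete code index."""
    return [quantize_frame(f, codebook) for f in frames]

def decode(codes, codebook):
    """Reconstruct approximate frames from the discrete codes."""
    return [codebook[c] for c in codes]

# Toy 2-D "frames" and a three-entry codebook for illustration only.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]
frames = [(0.1, -0.1), (0.9, 1.1), (-0.8, 0.4)]

codes = encode(frames, codebook)
print(codes)  # a discrete token sequence, the kind of output a language model can predict
```

The point of the sketch is the interface: once audio becomes a token sequence, generating speech is a matter of predicting the next token from text plus a short prompt, which is what VALL-E's language-modeling approach does.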
Meta also built LibriLight, the audio library the team used to train VALL-E. It contains 60,000 hours of English-language speech from more than 7,000 speakers, mostly extracted from LibriVox public-domain audiobooks. VALL-E can also imitate the “acoustic environment” of the sample audio. For instance, if the sample comes from a phone call, the synthetic output will reproduce the acoustic and frequency characteristics of a phone call.
However, based on the paper published by the researchers, the model’s results are mixed: some samples sound machine-like, while others are surprisingly realistic. Because the model retains the emotional tone of the original recordings, the successful samples are all the more convincing.
The future potential of Microsoft VALL-E
Even with its limitations, VALL-E has huge potential, with practical uses in industries such as entertainment, education, and voice assistants. However, the team acknowledges the potential for misuse: the research paper notes that bad actors could use it to spoof voice identification or impersonate a person without their knowledge.
Microsoft has announced no plans to release a public version of VALL-E, but the research paper suggests that it is possible to build a model that can distinguish real speech from speech generated by VALL-E. “Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating,” the paper states.