As far as AI video generation has come, creating a convincing deepfake of someone’s likeness still requires quite a bit of source material, like headshots from various angles or video footage. Faking a voice is a different story: Microsoft researchers recently revealed a new AI tool that can simulate someone’s voice using just a three-second sample of them talking.
The new tool, a “neural codec language model” called VALL-E, is built on EnCodec, the audio compression technology Meta revealed late last year, which uses AI to compress better-than-CD-quality audio to data rates 10 times smaller than even MP3 files, without a noticeable loss in quality. Meta envisioned EnCodec as a way to improve the quality of phone calls in areas with spotty cellular coverage, or to reduce bandwidth demands for music streaming services, but Microsoft is leveraging the technology to make text-to-speech synthesis sound more realistic from a very limited source sample.
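For a rough sense of what EnCodec produces, here is a minimal sketch using the encodec Python package Meta released alongside the technology. It turns a short voice clip into the codec’s discrete codes, the kind of compact audio tokens a model like VALL-E works with; the file name and bitrate setting below are illustrative assumptions, not values taken from VALL-E itself.

```python
# Minimal sketch: compress a short voice clip into EnCodec's discrete codes.
# Requires: pip install encodec torchaudio
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the pretrained 24 kHz EnCodec model and pick a target bitrate (kbps).
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # assumed setting for this example

# Load a short clip and convert it to the model's sample rate and channel count.
wav, sr = torchaudio.load("three_second_sample.wav")  # hypothetical file
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # add a batch dimension

# Encode to discrete codebook indices: a small grid of integers per frame.
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([codebook for codebook, _ in encoded_frames], dim=-1)
print(codes.shape)  # (batch, num_codebooks, num_frames)
```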
Current text-to-speech systems can already produce very realistic-sounding voices, which is why smart assistants sound so authentic even though their verbal responses are generated on the fly. But they require high-quality, very clean training data, which is usually captured in a recording studio with professional equipment. Microsoft’s approach makes VALL-E capable of simulating almost anyone’s voice without them spending weeks in a studio. Instead, the tool was trained on Meta’s Libri-light dataset, which contains 60,000 hours of recorded English-language speech from over 7,000 unique speakers, “extracted and processed from LibriVox audiobooks,” all of which are in the public domain.
Microsoft has shared an extensive collection of VALL-E-generated samples so you can hear for yourself what its voice simulation is capable of, but the results are currently a mixed bag. The tool occasionally has trouble recreating accents, even subtle ones, as with source samples where the speaker sounds Irish, and its ability to change the emotion of a given phrase is sometimes laughable. But more often than not, the VALL-E-generated samples sound natural and warm, and are almost impossible to distinguish from the original speakers in the three-second source clips.
In its current form, trained on Libri-light, VALL-E is limited to simulating speech in English, and while its performance is not yet flawless, it will undoubtedly improve as its sample dataset is expanded further. However, improving VALL-E will be up to Microsoft’s researchers, as the team isn’t releasing the tool’s source code. In a recently released research paper detailing VALL-E’s development, its creators acknowledge the risks it poses:
“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”