Microsoft's VALL-E can faithfully reproduce a voice after listening to a three second recording

A team of researchers at Microsoft has demonstrated a new AI system that is capable of mimicking a person's voice after training with a recording just three seconds long. The team explains developing the new app in a paper published on the arXiv preprint server. They have also posted a webpage demonstrating the app's capabilities.

Artificial intelligence applications require training on massive amounts of data. But in this new endeavor, the team at Microsoft has shown that does not always have to be the case.

The new app was built using Meta's EnCodec audio compression technology, and was originally intended as a way to improve the quality of phone conversations. Subsequent work showed that it is capable of far more—not only can it mimic a voice, it can also simulate tone and even the acoustics of the environment in which the original recording was made.

Microsoft did not do away with the need for a massive data set, of course; instead, the researchers shifted where it was used. The app was taught to "listen" to a string of words and then to replicate its sound using Meta's Libri-light dataset, which has over 60,000 hours of recordings made by 7,000 people speaking in English.

The examples Microsoft has provided demonstrate that the system works much better for some voices than others, and it has trouble with accents. But because the app is still in its early stages, it is likely its functionality will improve over time.

Microsoft has not made the source code for VALL-E public and likely will not do so, noting that it could be used in less than responsible ways—hoax recordings of politicians, for example. When combined with deepfake video, the results could take "fake news" to new heights. Microsoft's example has shown what is possible; thus, it would seem likely that similar systems by others will appear soon.

More information: Chengyi Wang et al, Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, arXiv (2023). DOI: 10.48550/arxiv.2301.02111

Journal information: arXiv

Microsoft's VALL-E can faithfully reproduce a voice after listening to a three second recording

Identifying fake voice recordings

Emulating neurodegeneration and aging in artificial intelligence systems

New insights lead to better next-gen solar cells

Scientists pioneer new X-ray microscopy method for data analysis 'on the fly'

Microsoft claims that small, localized language models can be powerful as well

On the trail of deepfakes, researchers identify 'fingerprints' of AI-generated video

A new framework to generate human motions from language prompts

The world's largest 3D printer is at a university in Maine. It just unveiled an even bigger one

High-energy-density capacitors with 2D nanomaterials could significantly enhance energy storage

Study shows potential of super grids when hurricanes overshadow solar panels

Rubber-like stretchable energy storage device fabricated with laser precision

New tech could help traveling VR gamers experience 'ludicrous speed' without motion sickness

Why can't robots outrun animals?

Virtual sensors help aerial vehicles stay aloft when rotors fail

Going with the flow: Research dives into electrodes on energy storage batteries

Ultra-thin, flexible solar cells demonstrate their promise in a commercial quadcopter drone

Securing competitiveness of energy-intensive industries through relocation: The pulling power of renewables

New research demonstrates potential of thin-film electronics for flexible chip design

A simple 'twist' improves the engine of clean fuel generation

Microsoft's VALL-E can faithfully reproduce a voice after listening to a three second recording

Let us know if there is a problem with our content

Thank you for taking time to provide your feedback to the editors

Share article

E-MAIL THE STORY