January 11, 2023 report

Microsoft's VALL-E can faithfully reproduce a voice after listening to a three second recording

by Bob Yirka, Tech Xplore

A team of researchers at Microsoft has demonstrated a new AI system, called VALL-E, that can mimic a person's voice after hearing a recording just three seconds long. The team describes how they developed the system in a paper published on the arXiv preprint server, and has also posted a webpage demonstrating its capabilities.

Artificial intelligence applications typically require training on massive amounts of data. In this new endeavor, however, the team at Microsoft has shown that reproducing a particular person's voice does not necessarily require much of that person's speech.

The system was built using Meta's EnCodec audio compression technology, which was originally intended as a way to improve the quality of phone conversations. Microsoft's work shows that it is capable of far more: VALL-E can not only mimic a voice but also preserve the speaker's emotional tone and even the acoustics of the environment in which the original recording was made.
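The article does not describe how EnCodec works internally. Neural audio codecs of this kind, EnCodec included, compress sound using a technique called residual vector quantization: each short frame of audio is encoded as a vector, and a series of codebooks each pick the entry that best matches whatever the previous stages failed to capture, so every frame becomes a short list of integer codes. VALL-E models speech as sequences of such codes. The Python sketch below illustrates only the quantization idea. The function names and the tiny two-dimensional codebooks are hypothetical stand-ins for EnCodec's learned ones, and the defaults in `prompt_token_count` (75 frames per second, eight codebooks) are assumptions based on our reading of the VALL-E paper, not figures from this article.

```python
# Toy residual vector quantization (RVQ), the general technique neural codecs
# such as EnCodec use to turn audio frames into discrete codes.
# All function names and codebooks here are hypothetical illustrations.


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:  # strict '<' keeps the first entry on ties
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(frame, codebooks):
    """Quantize one frame: each stage encodes what earlier stages missed."""
    residual = list(frame)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry; the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Rebuild an approximate frame by summing the chosen entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def prompt_token_count(seconds, frame_rate_hz=75, num_codebooks=8):
    """Number of codes produced for a prompt of the given length.

    The defaults are assumptions (our reading of the VALL-E paper's EnCodec setup).
    """
    return int(seconds * frame_rate_hz) * num_codebooks


if __name__ == "__main__":
    # Two hypothetical 2-D codebooks: a coarse stage and a finer stage.
    codebooks = [
        [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
        [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]],
    ]
    codes = rvq_encode([1.5, 0.0], codebooks)
    print(codes)                         # [1, 1]
    print(rvq_decode(codes, codebooks))  # [1.5, 0.0]
    print(prompt_token_count(3))         # 1800
```

Under those assumed settings, a three-second recording would come to 1,800 codes, which gives a rough sense of how compact the prompt VALL-E works from is.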

Microsoft did not do away with the need for a massive data set, of course; instead, the researchers shifted where it was used. Using Meta's Libri-Light dataset, which contains more than 60,000 hours of English speech from over 7,000 speakers, they taught the system to "listen" to a string of words and then reproduce how it sounds.

The examples Microsoft has provided show that the system works much better for some voices than for others, and that it has trouble with accents. But because the system is still in its early stages, its performance is likely to improve over time.

Microsoft has not made the source code for VALL-E public and likely will not do so, noting that it could be misused; hoax recordings of politicians are one example. Combined with deepfake video, the results could take "fake news" to new heights. Now that Microsoft has shown what is possible, similar systems from others seem likely to appear soon.

More information: Chengyi Wang et al, Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, arXiv (2023). DOI: 10.48550/arxiv.2301.02111

Journal information: arXiv

Citation: Microsoft's VALL-E can faithfully reproduce a voice after listening to a three second recording (2023, January 11) retrieved 17 July 2024 from https://techxplore.com/news/2023-01-microsoft-vall-e-faithfully-voice.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
