May 29, 2019 weblog

Only few hundred training samples bring human-sounding speech in Microsoft TTS feat

by Nancy Cohen, Tech Xplore

Microsoft Research Asia has been drawing applause for pulling off text to speech requiring little training—and showing "incredibly" realistic results.

Kyle Wiggers in VentureBeat said text-to-speech algorithms are not new, and some are quite capable, but the Microsoft team's effort still has an edge.

Abdullah Matloob in Digital Information World: "Text-to-speech conversion is getting smart with time, but the drawback is that it will still take an excessive amount of training time and resources to build a natural-sounding product."

Looking for a way to shed the burdens of training time and resources while still producing natural-sounding output, Microsoft Research and Chinese researchers found another way to convert text to speech.

Fabienne Lang in Interesting Engineering: Their answer turns out to be an AI text-to-speech system that uses 200 voice samples (only 200) to create realistic-sounding speech to match transcriptions. Lang said, "This means approximately 20 minutes' worth."

That the requirement was only 200 audio clips and corresponding transcriptions impressed Wiggers in VentureBeat. He also noted that the researchers devised an AI system "that leverages unsupervised learning—a branch of machine learning that gleans knowledge from unlabeled, unclassified, and uncategorized test data."

Their paper is up on arXiv. "Almost Unsupervised Text to Speech and Automatic Speech Recognition" is by Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu. Author affiliations are Zhejiang University, Microsoft Research and Microsoft Search Technology Center (STC) Asia.

In their paper, the team said that the TTS AI utilizes two key components, a Transformer and denoising auto-encoder, to make it all work.

[Audio samples of the same sentence, labeled "200-Pairs Only" and "Yi Ren et al. method": "...especially as no more time is occupied or cost incurred in casting setting or printing beautiful letters..."]

"Through the transformers, Microsoft's text-to-speech AI was able to recognize speech or text as either input or output," said an article in Edgy by Rechelle Fuertes.

Tyler Lee in Ubergizmo provided a definition of transformer: "Transformers...are deep neural networks designed to emulate the neurons in our brain."

MathWorks had a definition for autoencoder. "An autoencoder is a type of artificial neural network used to learn efficient data (codings) in an unsupervised manner. The aim of an auto encoder is to learn a representation (encoding) for a set of data, denoising autoencoders is typically a type of autoencoders trained to ignore 'noise' in corrupted input samples."
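To make the "trained to ignore noise" idea concrete, here is a minimal sketch of how training pairs for a denoising auto-encoder could be built from text. It is an illustration only, not the researchers' code: the `<mask>` token, the function names and the masking probability are all assumptions made for this example.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, assumed for this sketch


def corrupt(tokens, mask_prob=0.3, rng=None):
    """Return a noisy copy of `tokens` with some items replaced by MASK."""
    rng = rng or random.Random()
    return [MASK if rng.random() < mask_prob else t for t in tokens]


def make_training_pair(tokens, mask_prob=0.3, rng=None):
    """Build one (noisy input, clean target) pair.

    A denoising auto-encoder is shown the corrupted sequence and is
    trained to reconstruct the original, uncorrupted sequence.
    """
    return corrupt(tokens, mask_prob, rng), list(tokens)


clean = "printing beautiful letters".split()
noisy, target = make_training_pair(clean, mask_prob=0.5, rng=random.Random(0))
print(noisy, "->", target)
```

The model itself is left out here. The point is only that the input is deliberately damaged while the target stays clean, which is what lets unpaired data serve as a training signal.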

Did results of their experiment show their idea is worth chasing? "Our method achieves 99.84% in terms of word level intelligible rate and 2.68 MOS for TTS, and 11.7% PER for ASR [automatic speech recognition] on LJSpeech dataset, by leveraging only 200 paired speech and text data (about 20 minutes audio), together with extra unpaired speech and text data."
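For readers unfamiliar with PER (phoneme error rate), it is conventionally computed as the edit distance between the recognized phoneme sequence and the reference sequence, divided by the reference length. The sketch below shows that standard calculation. The function names and the sample phoneme strings are illustrative assumptions, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance divided by the number of reference phonemes."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative phoneme sequences (made up for this example).
ref = ["HH", "AH", "L", "OW"]
hyp = ["HH", "AH", "L", "AW"]
print(phoneme_error_rate(ref, hyp))  # 0.25: one substitution in four phonemes
```

On this scale, the reported 11.7% PER corresponds to roughly 12 phoneme errors per 100 reference phonemes.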

Why this matters: This approach may make text to speech more accessible, said reports.

"Researchers are continually working to improve the system, and are hopeful that in the future, it will take even less work to generate lifelike discourse," said Lang.

The paper will be presented at the International Conference on Machine Learning in Long Beach, California, later this year, and the team plans to release the code in the coming weeks, said Wiggers.

Meanwhile, the researchers are not yet walking away from their work on converting between speech and text with little paired data.

"In this work, we have proposed the almost unsupervised method for text to speech and automatic speech recognition, which leverages only few paired speech and text data and extra unpaired data... For future work, we will push toward the limit of unsupervised learning by purely leveraging unpaired speech and text data, with the help of other pre-training methods."

More information: Almost Unsupervised Text to Speech and Automatic Speech Recognition: speechresearch.github.io/unsuper/

Citation: Only few hundred training samples bring human-sounding speech in Microsoft TTS feat (2019, May 29) retrieved 29 June 2024 from https://techxplore.com/news/2019-05-samples-human-sounding-speech-microsoft-tts.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
