Google offers update on its human-like text-to-speech system

A detailed look at Tacotron 2's model architecture. The lower half of the image describes the sequence-to-sequence model that maps a sequence of letters to a spectrogram. For technical details, please refer to the paper. Credit: Google

Google has offered interested tech enthusiasts an update on its Tacotron text-to-speech system in a blog post this week. In the post, the team describes how the system works and offers some audio samples, which Ruoming Pang and Jonathan Shen, authors of the post, claim are comparable to professional recordings as judged by a group of human listeners. The authors have also written a paper with the rest of their Google teammates describing their efforts, and have posted it to the arXiv preprint server.

For many years, scientists have been working to make computer-generated speech sound more human and less robotic. One part of that mission is developing text-to-speech (TTS) applications, as the authors note. Most people have heard the results of TTS systems, such as the automated systems used by many corporations to field customer calls. In this new effort, the group at Google has combined what it learned from its Tacotron and WaveNet projects to create Tacotron 2, a system that takes the science to a new level. In listening to the provided samples, it is quite difficult and sometimes impossible to tell whether a voice is a human or a TTS system voice.

In the following examples, one is generated by Tacotron 2, and one is the recording of a human, but which is which? “That girl did a video about Star Wars lipstick.” Credit: Google

To achieve this new level of accuracy, the team at Google used a sequence-to-sequence model optimized to work with TTS—it maps arrangements of letters to a series of features that describe the audio. The result is an 80-dimensional spectrogram. That spectrogram is then used as input to a second system that outputs a 24-kHz waveform using an architecture based on WaveNet. Both are neural networks trained using speech examples (from crowdsourcing applications such as Amazon's Mechanical Turk) and their corresponding transcripts. The new system is able to incorporate volume, pronunciation, intonation and speed, allowing for the creation of a much more human-like voice.
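The two-stage pipeline described above can be pictured in terms of the size of the data at each step. The following is only a rough sketch of that bookkeeping, not Google's implementation. The 80 channels and the 24-kHz sample rate come from the article; the 12.5-millisecond spacing between spectrogram frames, the `pipeline_shapes` helper and the `duration_s` parameter are assumptions made for illustration (the real system decides for itself how long the audio should be).

```python
from dataclasses import dataclass

SAMPLE_RATE_HZ = 24_000  # from the article: the output is a 24-kHz waveform
N_MEL_CHANNELS = 80      # from the article: an 80-dimensional spectrogram
FRAME_HOP_MS = 12.5      # ASSUMPTION: spacing between spectrogram frames


@dataclass(frozen=True)
class PipelineShapes:
    n_characters: int    # length of the input letter sequence
    n_frames: int        # spectrogram frames produced by the first network
    n_mel_channels: int  # values per spectrogram frame
    n_samples: int       # audio samples produced by the second network


def pipeline_shapes(text: str, duration_s: float) -> PipelineShapes:
    """Hypothetical helper: data sizes at each stage for a given utterance.

    duration_s is supplied by the caller purely for illustration.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    hop_samples = int(SAMPLE_RATE_HZ * FRAME_HOP_MS / 1000)  # samples per frame
    requested_samples = int(round(duration_s * SAMPLE_RATE_HZ))
    # Round up so the final partial frame is still counted.
    n_frames = -(-requested_samples // hop_samples)
    return PipelineShapes(
        n_characters=len(text),
        n_frames=n_frames,
        n_mel_channels=N_MEL_CHANNELS,
        n_samples=n_frames * hop_samples,
    )


if __name__ == "__main__":
    sentence = "That girl did a video about Star Wars lipstick."
    print(pipeline_shapes(sentence, duration_s=3.0))
```

Under these assumptions, a three-second utterance corresponds to 240 spectrogram frames of 80 values each, which the second network would expand into 72,000 audio samples.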

The team also notes that they are still working to improve the system, most notably to overcome problems with complex words and to make it work in real time. They would also like to add more emotion to the voice so that listeners could actually hear happiness or sadness, for example, or detect displeasure. Doing so would not only advance the science, but would also make interactions with digital assistants more intimate.

More information: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, arXiv:1712.05884 [cs.CL]

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
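For readers unfamiliar with the metric, a mean opinion score is simply the average of listener ratings on a scale from 1 (bad) to 5 (excellent). The sketch below shows that arithmetic with a rough confidence interval. The ratings are invented for illustration and are not data from the paper, and the `mean_opinion_score` helper and the normal-approximation interval are assumptions rather than the authors' evaluation procedure.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Hypothetical helper: expects listener ratings on the 1-to-5 scale.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    mos = mean(ratings)
    # Normal approximation; a small listener panel would call for a t-interval.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


if __name__ == "__main__":
    # Invented ratings for one audio sample, NOT taken from the paper.
    example_ratings = [4.5, 5, 4, 4.5, 5, 4.5, 4, 5]
    mos, half_width = mean_opinion_score(example_ratings)
    print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

The interval is the useful part: two systems whose scores differ by only a few hundredths of a point, as in the abstract, can only be told apart if enough ratings are collected to make the interval narrow.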

Journal information: arXiv

© 2017 Tech Xplore

Citation: Google offers update on its human-like text-to-speech system (2017, December 29) retrieved 25 August 2019 from

User comments

Dec 29, 2017
Google asked us to use full sentences in the regular Google search text field. It never was able to answer full sentences very well, and now they want us to ask a vocal question!

Try this search: When it is foggy outside is the humidity low or high?

The results do not have a simple answer like "it is high humidity." You must study the results to answer your own question.

Dec 30, 2017
@BendBob Maybe they want to refine their algorithms with all the data gathered? It would make sense.

What could be kewl would be relying on the community for search results.

Dec 30, 2017
Anyone know why they wanted it to be harder to differentiate from a real voice? What's the point of going 100 steps further toward more realistic sound than what an existing Alexa, Echo, or Home can achieve? I can't think of any other use except to dupe people like us.

Is this their only goal? Or is this some kind of innocent inner expression of creativity (art)? I find myself unable to relate to this research (or at least the one who wrote this article is too devoid of dreams & hope).

Dec 30, 2017
I fail to grasp how anyone can judge this as sounding human. Perhaps it sounds like the professionals, and they don't sound human because they are overtrained.

When a human who has taken the trouble to understand something reads it naturally (without trying to market), the inflections change dramatically. That can never be achieved with mere analysis of punctuation and grammar.

For a simplistic example, say "high or low" to yourself and then say "low or high". Most will say high with a higher voice regardless of its position. More complex examples require understanding at much higher levels - even environment such as whether we are in public or private.

The reason this is important becomes clear when you test the comprehension that listeners have of longer texts. It is very uncomfortable listening to longer texts when the tones don't properly communicate phrase divisions and connotations of the text. This shows clearly in the degree of immediate understanding by listeners.

Dec 31, 2017
"I fail to grasp how anyone can judge this as sounding human. Perhaps it sounds like the professionals, and they don't sound human because they are overtrained."

They just showed you. They gave an example of a human talking, and then the duplication from Google. It was very good for the most part.

I agree with you on the other stuff though. Having a computer, or just software, determine environment and setting is going to be a whole other game. Though I don't know if it will ever be necessary. Although, that hasn't stopped people from creating things we don't need in the past. Not likely to change now.

Jan 01, 2018
Proof! Proof that humans are learning to speak like computers. I would have considered both of them to be computer generated with a slightly better algorithm on the second sample.
