From text to sound: Google’s Tacotron 2
irina
Google researchers have presented Tacotron 2, a text-to-speech (TTS) system that produces natural-sounding artificial speech. The system requires neither manually engineered feature sets nor extensive libraries of speech fragments; instead, Tacotron 2 learns everything it needs from speech samples paired with text transcripts.
Predecessors
Tacotron 2 builds on the strengths of two earlier speech-generation projects: WaveNet and the original Tacotron.
WaveNet is a deep convolutional neural network (CNN) that generates raw audio waveforms, including human-like speech and music. Introduced in September 2016, WaveNet produced convincing speech samples, reproducing the tones and accents of the voice recordings it was given as input.
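WaveNet's core building block is a stack of dilated causal convolutions, which lets the network's receptive field grow exponentially with depth while keeping each output sample dependent only on past inputs. The sketch below illustrates that idea in plain NumPy; the kernel values and dilation schedule are illustrative, not the published configuration.

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """1-D causal convolution: output sample i depends only on x[i]
    and earlier samples spaced `dilation` steps apart."""
    pad = dilation * (len(kernel) - 1)
    xp = np.pad(x, (pad, 0))  # left-pad so no future samples leak in
    return np.array([
        sum(k * xp[i + pad - j * dilation] for j, k in enumerate(kernel))
        for i in range(len(x))
    ])

# Stacking layers with dilations 1, 2, 4, 8 doubles the reach at each
# layer: for a kernel of size 2 it covers 1 + (1+2+4+8) = 16 samples.
x = np.zeros(32)
x[0] = 1.0  # unit impulse
for d in (1, 2, 4, 8):
    x = causal_dilated_conv(x, kernel=[0.5, 0.5], dilation=d)
print(np.nonzero(x)[0])  # impulse energy spreads over samples 0..15
```

Running the snippet shows the impulse smeared across exactly 16 samples, which is how a few convolutional layers can model long stretches of audio context.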
The original Tacotron is a sequence-to-sequence architecture that produces high-quality audio from a sequence of characters. Instead of elaborate feature engineering, Tacotron employs a single neural network that learns directly from the input data. Beyond what WaveNet offered, Tacotron also captured high-level speech attributes such as intonation and prosody.
Improvements
The resulting Tacotron 2 combines a feature prediction network, which maps character sequences to mel spectrograms, with a WaveNet-based vocoder that synthesizes those spectrograms into waveforms.
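To make the two-stage structure concrete, here is a minimal sketch of the inference flow: text goes in, a mel spectrogram comes out of the first stage, and the vocoder turns it into audio. This is not Google's code; the `FeaturePredictionNet` and `WaveNetVocoder` classes, the frame count, and the `hop_length` value are placeholders invented for illustration, and only the overall text → mel spectrogram → waveform flow comes from the system description.

```python
import numpy as np

class FeaturePredictionNet:
    """Placeholder for the sequence-to-sequence stage that maps a
    character sequence to a mel spectrogram (a stand-in, not the real model)."""
    def __init__(self, n_mels=80):  # 80 mel channels, a common choice
        self.n_mels = n_mels

    def predict(self, text):
        # A trained model would attend over character embeddings and emit
        # one spectrogram frame per decoder step; here we only fake the shapes.
        n_frames = 5 * len(text)  # arbitrary frame count for illustration
        return np.random.rand(self.n_mels, n_frames)

class WaveNetVocoder:
    """Placeholder for the WaveNet vocoder conditioned on the predicted
    mel spectrogram, emitting raw audio samples."""
    def __init__(self, hop_length=256):  # samples per frame: an assumption
        self.hop_length = hop_length

    def synthesize(self, mel):
        n_samples = mel.shape[1] * self.hop_length
        return np.random.uniform(-1, 1, size=n_samples)  # fake waveform

# End-to-end flow: characters -> mel spectrogram -> waveform.
mel = FeaturePredictionNet().predict("Hello world.")
audio = WaveNetVocoder().synthesize(mel)
print(mel.shape, audio.shape)
```

The key design point this mirrors is the mel spectrogram as the hand-off between the two stages: the first network only has to predict a compact acoustic representation, while the vocoder handles the hard job of producing raw audio samples.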
As a result, Tacotron 2 produces impressive speech samples, some of which are barely distinguishable from natural human speech (samples are available on GitHub, check them out!).
Challenges
The system still has some difficulty pronouncing complex words, "and in extreme cases it can even randomly generate strange noises". Remaining challenges for the Tacotron team include generating audio in real time and further enriching the output (e.g., conveying emotions).
Links
The Google blog announcement is here.
Find out more about the system in the recent paper:
“Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”.