From text to sound: Google’s Tacotron 2
irina
Google researchers have presented Tacotron 2, a text-to-speech (TTS) system that produces natural-sounding artificial speech. The system requires neither manually engineered feature sets nor extensive libraries of speech fragments; instead, Tacotron 2 learns everything it needs from speech samples paired with text transcripts.
Predecessors
Tacotron 2 builds on the strengths of two earlier speech-generation projects: WaveNet and the original Tacotron.
WaveNet is a deep convolutional neural network (CNN) that generates raw audio waveforms, including human-like speech and music. Introduced in September 2016, WaveNet produced convincing speech samples, reproducing the tones and accents of the voice recordings it was given as input.
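WaveNet's core building block is a stack of dilated causal convolutions, which lets the network's receptive field grow exponentially with depth while keeping each output sample dependent only on past inputs. The sketch below illustrates that idea in plain NumPy; the kernel values and dilation schedule are illustrative, not the published configuration.

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """1-D causal convolution: output sample i depends only on x[i]
    and earlier samples spaced `dilation` steps apart."""
    pad = dilation * (len(kernel) - 1)
    xp = np.pad(x, (pad, 0))  # left-pad so no future samples leak in
    return np.array([
        sum(k * xp[i + pad - j * dilation] for j, k in enumerate(kernel))
        for i in range(len(x))
    ])

# Stacking layers with dilations 1, 2, 4, 8 doubles the reach at each
# layer: for a kernel of size 2 it covers 1 + (1+2+4+8) = 16 samples.
x = np.zeros(32)
x[0] = 1.0  # unit impulse
for d in (1, 2, 4, 8):
    x = causal_dilated_conv(x, kernel=[0.5, 0.5], dilation=d)
print(np.nonzero(x)[0])  # impulse energy spreads over samples 0..15
```

Running the snippet shows the impulse smeared across exactly 16 samples, which is how a few convolutional layers can model long stretches of audio context.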
The original Tacotron is a sequence-to-sequence architecture that produces high-quality audio from a sequence of characters. Instead of elaborate feature engineering, Tacotron employs a single neural network that learns directly from the input data. Beyond what WaveNet offered, Tacotron also captured high-level speech attributes such as intonation and prosody.
Improvements
The resulting Tacotron 2 combines a feature prediction network, which maps character sequences to mel spectrograms, with a WaveNet-based vocoder that synthesizes those spectrograms into waveforms.
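To make the two-stage structure concrete, here is a minimal sketch of the inference flow: text goes in, a mel spectrogram comes out of the first stage, and the vocoder turns it into audio. This is not Google's code; the `FeaturePredictionNet` and `WaveNetVocoder` classes, the frame count, and the `hop_length` value are placeholders invented for illustration, and only the overall text → mel spectrogram → waveform flow comes from the system description.

```python
import numpy as np

class FeaturePredictionNet:
    """Placeholder for the sequence-to-sequence stage that maps a
    character sequence to a mel spectrogram (a stand-in, not the real model)."""
    def __init__(self, n_mels=80):  # 80 mel channels, a common choice
        self.n_mels = n_mels

    def predict(self, text):
        # A trained model would attend over character embeddings and emit
        # one spectrogram frame per decoder step; here we only fake the shapes.
        n_frames = 5 * len(text)  # arbitrary frame count for illustration
        return np.random.rand(self.n_mels, n_frames)

class WaveNetVocoder:
    """Placeholder for the WaveNet vocoder conditioned on the predicted
    mel spectrogram, emitting raw audio samples."""
    def __init__(self, hop_length=256):  # samples per frame: an assumption
        self.hop_length = hop_length

    def synthesize(self, mel):
        n_samples = mel.shape[1] * self.hop_length
        return np.random.uniform(-1, 1, size=n_samples)  # fake waveform

# End-to-end flow: characters -> mel spectrogram -> waveform.
mel = FeaturePredictionNet().predict("Hello world.")
audio = WaveNetVocoder().synthesize(mel)
print(mel.shape, audio.shape)
```

The key design point this mirrors is the mel spectrogram as the hand-off between the two stages: the first network only has to predict a compact acoustic representation, while the vocoder handles the hard job of producing raw audio samples.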
As a result, Tacotron 2 produces impressive speech samples, some of which are barely distinguishable from natural human speech (samples are available on GitHub, check them out!).
Challenges
The system still has some difficulty pronouncing complex words, "and in extreme cases it can even randomly generate strange noises". Remaining challenges for the Tacotron team include generating audio in real time and further enriching the output (e.g., conveying emotions).
Links
The Google blog announcement is here.
Find out more about the system in the recent paper:
“Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”.