Text to Speech In Three Languages

Overview of the components of the multilingual text-to-speech engine

Human voices retain their distinctive character no matter what language they’re speaking. New research lets computers imitate human voices across a number of languages.

What’s new: Researchers at Google built a multilingual text-to-speech engine that can mimic a person’s voice in English, Spanish, and Mandarin. You can hear examples of its output here.

Key insight: Some languages are written in alphabetic letters (like English), others in characters (like Mandarin). The difference can have a dramatic impact on the choice of architecture and training. Yu Zhang and colleagues found that processing phonemes — the basic sounds that make up language — simplifies the task by eliminating the need to learn pronunciation rules that vary among languages.

How it works: The model embeds phonemes in a vector space before decoding those vectors into spectrograms for a WaveNet speech synthesizer. Using phonemes enables the model to find similarities in speech among different languages, so the system requires less training data per language to achieve good results.
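The phoneme conversion step can be illustrated with a toy grapheme-to-phoneme lookup. This is only a sketch: the lexicon, language codes, and phone symbols below are hypothetical stand-ins, while production systems use large pronunciation dictionaries or trained grapheme-to-phoneme models.

```python
# Hypothetical mini-lexicon mapping (language, word) to a phoneme sequence.
# Different scripts end up in one shared phoneme inventory.
LEXICON = {
    ("en", "hello"): ["HH", "AH", "L", "OW"],
    ("es", "hola"):  ["O", "L", "A"],
    ("zh", "你好"):   ["n", "i", "x", "au"],
}

def to_phonemes(lang: str, word: str) -> list[str]:
    """Map a written word to its phonemic spelling."""
    return LEXICON[(lang, word)]

print(to_phonemes("en", "hello"))  # ['HH', 'AH', 'L', 'OW']
```

Once every language is expressed as phonemes, the downstream model no longer needs to learn each language's spelling-to-sound rules.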

  • The first step is to translate text input in various languages to phonemic spelling.
  • Phonemes are embedded in vector space using the architecture of the earlier Tacotron 2, a state-of-the-art, single-language, text-to-speech model.
  • In the training data, an individual speaker and the words spoken are strongly correlated. To disentangle them, a separate speaker classifier is trained adversarially against Tacotron 2 to judge whether a particular phoneme embedding came from a particular speaker. Adversarial training allows phoneme embeddings to encode spoken sounds without information about a particular speaker such as vocal pitch or accent.
  • Separate models are trained to create two additional embeddings. A speaker embedding encodes an individual's vocal tone and manner. A language embedding encodes a given language's distinctive accents.
  • A decoder takes speaker, language, and phoneme embeddings to produce a spectrogram for the WaveNet synthesizer, which renders the final audio output.
  • By switching speaker and accent embeddings to a different set, the model can generate different voices and accents.
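The adversarial step in the third bullet can be sketched with a single pair of opposing gradient updates. Everything here is a simplified stand-in (toy sizes, a linear classifier, plain cross-entropy): the classifier descends on the loss to predict the speaker, while the phoneme embeddings ascend on the same loss, pushing speaker information out of them.

```python
import numpy as np

rng = np.random.default_rng(1)

emb = rng.normal(size=(4, 8))      # 4 phoneme embeddings, width 8 (toy sizes)
clf = rng.normal(size=(8, 2))      # linear classifier over 2 speakers
labels = np.array([0, 0, 1, 1])    # which speaker produced each phoneme

logits = emb @ clf
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
grad_logits = probs.copy()
grad_logits[np.arange(4), labels] -= 1.0  # d(cross-entropy)/d(logits)

lr = 0.1
clf_new = clf - lr * emb.T @ grad_logits  # classifier descends: predict speaker
emb_new = emb + lr * grad_logits @ clf.T  # embeddings ascend: hide the speaker
```

In practice this tug-of-war is usually implemented with a gradient-reversal layer so both updates happen in one backward pass.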
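The last two bullets — combining the three embeddings and swapping voices — can be sketched as follows. The lookup tables and the decoder stub are hypothetical placeholders for the trained models; a real decoder would emit a spectrogram for WaveNet rather than a stack of concatenated vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding width (illustrative)

# Separate lookup tables, standing in for the trained embedding models.
phoneme_emb = {p: rng.normal(size=DIM) for p in ["HH", "AH", "L", "OW"]}
speaker_emb = {s: rng.normal(size=DIM) for s in ["speaker_a", "speaker_b"]}
language_emb = {l: rng.normal(size=DIM) for l in ["en", "es", "zh"]}

def decode(phonemes, speaker, language):
    """Stub decoder: combine phoneme, speaker, and language embeddings
    per frame. A real decoder would produce a spectrogram for WaveNet."""
    frames = [np.concatenate([phoneme_emb[p],
                              speaker_emb[speaker],
                              language_emb[language]])
              for p in phonemes]
    return np.stack(frames)  # shape: (num_phonemes, 3 * DIM)

# Same phonemes, different voice: only the speaker embedding is swapped.
spec_a = decode(["HH", "AH", "L", "OW"], "speaker_a", "en")
spec_b = decode(["HH", "AH", "L", "OW"], "speaker_b", "en")
print(spec_a.shape)  # (4, 24)
```

Because speaker and language are separate inputs, changing the voice or accent is just a matter of substituting one embedding while leaving the phonemes untouched.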

Results: In a subjective assessment of naturalness, the multilingual model’s performance in three languages roughly matches Tacotron 2’s performance on one.

Why it matters: Training a text-to-speech model typically requires plenty of well-pronounced and labeled training data, a rare commodity. The new approach takes advantage of plentiful data in popular languages to improve performance in less common tongues. That could extend the reach of popular services like Siri or Google Assistant (which work in 20 and 30 languages respectively out of 4,000 spoken worldwide) to a far wider swath of humanity.

We're thinking: A universal voice translator — speak into the black box, and out comes your voice in any language — is a longstanding sci-fi dream. This research suggests that it may not be so far away from reality.

