Most speech-to-speech translation systems use text as an intermediate mode. So how do you build an automated translator for a language that has no standard written form? A new approach trained neural networks to translate a primarily oral language.
What’s new: Peng-Jen Chen, Kevin Tran, Yilin Yang and teammates at Meta described a system that translates speech between English and Hokkien, which is spoken by millions of people in East Asia.
Key insight: Few people know how to translate between English and Hokkien, which makes it hard to assemble a dataset sufficient for training an English-Hokkien translation model. However, a fair number of people can translate between Mandarin and English and between Mandarin and Hokkien. By translating from English to Mandarin and from Mandarin to Hokkien, it’s possible to build a database of English-Hokkien speech pairs.
The dataset: The authors collected a corpus of English, Mandarin, and Hokkien data. They employed human translators to translate the corpus. They used the translated corpus to synthesize further data.
- The initial corpus comprised (a) videos of Hokkien dramas with subtitles (5.8 hours of which were manually translated from Mandarin text into English text and speech), (b) an existing dataset of Hokkien speech (manually translated into English text and 4.6 hours of English speech), and (c) an existing dataset of English-to-Mandarin speech and text (manually translated into 86 hours of Hokkien speech).
- To synthesize additional English-to-Hokkien speech pairs, the authors used an existing trained model to translate English text with matching speech into Mandarin text. Then, using the Hokkien dramas, they trained a text-to-speech transformer to translate Mandarin text to Hokkien speech. This process yielded 1,500 hours of corresponding English-Hokkien speech.
- They used a similar process to synthesize additional Hokkien-to-English speech pairs (starting with the Hokkien dramas). This process yielded 8,000 hours of corresponding Hokkien-to-English speech.
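The English-to-Hokkien synthesis step pivots through Mandarin text, and a few lines of Python make the data flow concrete. This is a minimal sketch, not the authors' code: `SpeechPair`, `synthesize_en_to_hok_pairs`, and the two callables `en_to_zh_text` and `zh_text_to_hok_speech` are hypothetical names standing in for the existing English-to-Mandarin model and the Mandarin-text-to-Hokkien-speech model described above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class SpeechPair:
    """One synthesized training example (hypothetical container)."""
    english_audio: str  # identifier or path of the English speech clip
    hokkien_audio: str  # identifier or path of the synthesized Hokkien clip


def synthesize_en_to_hok_pairs(
    english_corpus: Iterable[Tuple[str, str]],
    en_to_zh_text: Callable[[str], str],
    zh_text_to_hok_speech: Callable[[str], str],
) -> List[SpeechPair]:
    """Build English-Hokkien speech pairs by pivoting through Mandarin text.

    english_corpus yields (english_audio, english_text) tuples.
    Both callables are assumed stand-ins for trained models.
    """
    pairs = []
    for english_audio, english_text in english_corpus:
        # Step 1: English text -> Mandarin text (existing translation model).
        mandarin_text = en_to_zh_text(english_text)
        # Step 2: Mandarin text -> Hokkien speech (model trained on the dramas).
        hokkien_audio = zh_text_to_hok_speech(mandarin_text)
        pairs.append(SpeechPair(english_audio, hokkien_audio))
    return pairs
```

Passing the models in as callables keeps the sketch independent of any particular toolkit. The Hokkien-to-English direction would follow the same pattern with its own pair of models.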
The translators: Separate speech-to-speech systems with identical architectures translate from Hokkien to English and English to Hokkien, using Mandarin text as a stepping stone between the target languages.
- HuBERT encoders and HiFi-GAN decoders learned to convert English and Hokkien speech to tokens and back.
- Given English or Hokkien speech, separate wav2vec 2.0 transformers learned to convert it into tokens.
- Given English or Hokkien tokens, separate mBART decoders learned to turn them into Mandarin or English text, respectively.
- Given the resulting text, two transformer layers learned to translate it into Hokkien or English speech tokens.
- At inference, the HiFi-GAN decoder converts those tokens into speech.
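At inference, the stages above chain into a single pass from input speech to output speech. The sketch below shows only that composition. The function `translate_speech` and its parameter names are illustrative assumptions, and each stage is a plain callable standing in for a trained model (wav2vec 2.0 encoder, mBART text decoder, the two-layer transformer, and the HiFi-GAN vocoder); it is not the authors' implementation.

```python
from typing import Callable, List, Sequence


def translate_speech(
    audio: Sequence[float],
    speech_encoder: Callable[[Sequence[float]], List[int]],
    text_decoder: Callable[[List[int]], str],
    unit_decoder: Callable[[str], List[int]],
    vocoder: Callable[[List[int]], List[float]],
) -> List[float]:
    """Run one speech-to-speech translation as a chain of four stages.

    All four callables are hypothetical stand-ins for trained models.
    """
    # Source speech -> tokens (the wav2vec 2.0 stage in the description).
    source_tokens = speech_encoder(audio)
    # Tokens -> intermediate text (the mBART decoder stage).
    intermediate_text = text_decoder(source_tokens)
    # Text -> target-language speech tokens (the two-transformer-layer stage).
    target_units = unit_decoder(intermediate_text)
    # Speech tokens -> waveform (the HiFi-GAN stage).
    return vocoder(target_units)
```

Because each stage consumes only the previous stage's output, an error made early (say, in the intermediate text) propagates to everything downstream.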
Results: The authors compared their system to a baseline of their own design that translated directly between the spoken languages using an encoder-decoder. They evaluated the systems according to ASR-BLEU, which transcribes the translated speech to text and measures its overlap with a reference text (higher is better). To render Hokkien speech as text for comparison, they developed a separate model that translated Hokkien speech into a phonetic script called Tâi-lô. Converting English to Hokkien, their system achieved 7.3 ASR-BLEU, whereas the baseline achieved 6 ASR-BLEU. Converting Hokkien to English, their system achieved 12.5 ASR-BLEU, whereas the baseline achieved 8.1 ASR-BLEU. Without the augmented data, both their system and the baseline scored worse by 6 ASR-BLEU to 9 ASR-BLEU.
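To make the metric concrete, here is a rough sketch of how an ASR-BLEU score can be computed: transcribe each translated utterance, then score the transcripts against reference text with BLEU. The BLEU below is a simplified, unsmoothed corpus-level version with whitespace tokenization, so its numbers would not match a standard toolkit exactly. The `transcribe` callable is a hypothetical stand-in for a speech recognizer (for Hokkien, the Tâi-lô model mentioned above), and the function names are ours.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: Sequence[str], references: Sequence[str],
                max_n: int = 4) -> float:
    """Simplified corpus BLEU on a 0-100 scale (no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty discourages overly short hypotheses.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


def asr_bleu(speech_outputs: Sequence[object], references: Sequence[str],
             transcribe: Callable[[object], str]) -> float:
    """Transcribe translated speech, then score the transcripts with BLEU."""
    transcripts = [transcribe(clip) for clip in speech_outputs]
    return corpus_bleu(transcripts, references)
```

Note that the score depends on the recognizer as well as the translator: transcription mistakes lower ASR-BLEU even when the spoken translation is correct.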
Why it matters: Forty percent of the world’s languages have no standard written form, which means they’re left out of current translation systems. This method provides a blueprint for machine translation of other primarily oral languages.
Yes, but: Hokkien is spoken in several dialects, some of which are mutually unintelligible. So, while this system presumably serves most Hokkien speakers, it doesn’t serve all of them yet.
We’re thinking: The next step is to hook up the Hokkien-English model to existing translators for other languages. Is it good enough? ASR-BLEU scores in the 7-to-12 range are low compared to scores for, say, English-German, which are around 30. And, because translation errors compound from one language to the next, the more intermediate steps required to reach the target language, the lower the final translation quality. One way or another, we want to hear Hokkien speakers talking to everyone!