Sentence pairs that have equivalent meanings in different languages — typically used to train machine translation systems — have been available in sufficient quantities for only around 100 languages. New work doubled that number and produced a more capable model.
What’s new: Marta R. Costa-jussà and colleagues at Meta, Johns Hopkins, and UC Berkeley developed an automated process for scraping multilingual sentence pairs from the web. They released No Language Left Behind (NLLB-200), a machine translation model that handles 200 languages. They also released the models, code, and data used to build it.
Key insight: The web is full of text in various languages, including sentences that have the same meaning in different languages. For instance, unrelated pages in different languages may say the equivalent of, “Manchester United defeated Melbourne in yesterday’s match,” or “A long time ago in a galaxy far, far away.” An automated system can recognize such parallel sentences by learning to produce similar representations of sentences that have similar meaning regardless of their language. A teacher/student arrangement — with a multilingual teacher trained on languages with plentiful data to produce embeddings, and a separate monolingual student for each language scraped from the web — can align representations produced by the students.
How they built the dataset: The authors identified languages in text scraped from the web, trained a teacher model on pre-existing multilingual data, and used it to train a student model to produce similar representations for similar meanings in the web text.
- The authors trained fasttext, a linear classifier, to classify text according to its language. They trained it on publicly available datasets and their own corpus of 6,000 human-translated sentence pairs in 39 languages (released with this paper).
- Fasttext classified the language of individual sentences and full paragraphs in web-text corpora such as Common Crawl and ParaCrawl. The authors discarded sentences if their classification didn’t match that of the paragraph and removed sentences in languages for which they already had a lot of parallel data. After deleting duplicates, they had 43.7 billion sentences, each labeled as one of 148 languages.
- They trained a separate transformer — a student — on each language (or several similar languages) to produce similar representations for sentences with similar meanings. To do this, they trained a Bidirectional LSTM — the teacher — to translate between the 93 languages in the OPUS dataset. This model learned similar representations of equivalent sentences in different languages. Using publicly available datasets of parallel sentences, the teacher received a sentence in one language (usually English) while a student received the equivalent sentence in its designated language(s). The students learned to maximize the cosine similarity between the teacher’s and students’ representations. Simultaneously, the students were trained to fill in missing words of sentences in their designated language(s).
- The authors discarded sentence pairs if their representations’ cosine similarities were too different, leaving 1.1 billion parallel sentence pairs. Combined with pre-existing datasets, the parallel sentences represented 202 languages.
How they built the translator: NLLB-200 is a transformer encoder-decoder that comprises 54.5 billion parameters.
- In every fourth transformer layer (made up of a self-attention sublayer and a fully connected sublayer), the authors exchanged the fully connected sublayer with a Sparsely Gated Mixture-of-Experts (MoE) sublayer that activated only a subnetwork of neurons for each input. This enabled the network to learn to activate different portions depending on the language, which may have helped to prevent learning about languages that had many examples from interfering with learning about languages that had few.
- Training proceeded in two stages. In the first stage, NLLB-200 filled in missing words in sentences and translated between pairs of sentences in different languages. In the second, it trained only on translations. In both stages, the paired sentences included human-translated sentence pairs, sentences scraped from the web and paired automatically, and back translations in which the model converted its own translations back to the original language.
Results: The authors’ NLLB-200 model achieved 24.0 average spBLEU across all 202 languages, while the earlier DeltaLM achieved a 101-language average 16.7 spBLEU (which measures the overlap of word fragments between machine translations and ground truth, higher is better). A sparse NLLB-200 that used MoE rather than fully connected layers generally performed better than a dense version. For example, evaluated on Akan, a language spoken in Ghana for which little training data was available, the sparse model scored 36.2 chrF, while a dense version scored 35.6 chrF (which measures overlapping groups of consecutive characters between machine translations and ground truth, higher is better). NLLB-200 performed inconsistently compared to bilingual models: It achieved 36.2 chrF compared to an English-to-Akan model’s 16.8 chrF, but 51.4 chrF compared to an English-to-Gujarati model’s 51.7 chrF. A possible explanation: Languages that are dissimilar to other languages in the training data may not benefit as much from multilingual training.
Why it matters: Faced with an apparent scarcity of data, the authors extracted it from the web. The data didn’t need to be perfect: To compensate for flaws such as typographical and grammatical errors, the model learned to convert its own translations — of flawed sentences but presumably many more correct ones — into good sentences.
We’re thinking: University of Texas machine learning professor Raymond Mooney said, “You can’t cram the meaning of a whole %&!$# sentence into a single $&!#* vector.” Apparently these researchers did it!