Found in Translation: Apple's method to identify a language from a few words

Published

Jun 17, 2020

Language models can’t correct your misspellings or suggest the next word in a text without knowing what language you’re using. For instance, if you type “tac-,” are you aiming for “taco,” a hand-held meal in Spanish, or “taca,” a crown in Turkish? Apple developed a way to head off such cross-lingual confusion.

What’s new: It’s fairly easy to identify a language given a few hundred words, but only we-need-to-discuss-our-relationship texts are that long. Apple developed a way to tell, for example, Italian from Turkish based on SMS-length sequences of words.

Key insight: Methods for identifying languages in longer text passages take advantage of well-studied statistical patterns among words. Detecting languages in a handful of words requires finding analogous patterns among letters.
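To make "patterns among letters" concrete, here is a minimal sketch that counts character n-grams, the kind of letter-level statistic an n-gram approach relies on. The function name `char_ngrams` is ours and purely illustrative; the biLSTM described in this article learns its own character features rather than counting fixed n-grams.

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Count overlapping character n-grams in a lowercased string."""
    text = text.lower()
    # Slide a window of width n across the text, one character at a time.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Letter pairs differ by language even in very short inputs.
italian = char_ngrams("ciao bella")
turkish = char_ngrams("merhaba")
```

Comparing such counts against per-language profiles is one simple way to guess a language from a short string.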

How it works: The system comprises only a lightweight biLSTM and a softmax layer. This architecture requires half the memory of previous methods.

A separate model narrows the possibilities by classifying the character set: Do the characters belong to the Latin script? Cyrillic? Hanzi? For instance, many European languages and Turkish use the Latin alphabet, while Japanese and some Chinese languages use Hanzi.
The biLSTM reads the input characters in both directions to squeeze out as much information as possible.
The softmax layer then predicts the language based on the features the biLSTM extracts.
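The steps above can be sketched in plain Python. This is a toy under stated assumptions, not Apple's implementation: the script detector is a rule-based stand-in (the article describes a separate model), the `CANDIDATES` mapping, class names, hashed one-hot character encoding, and layer sizes are all hypothetical, and the weights are random, so the output probabilities are meaningless until the model is trained.

```python
import math
import random
import unicodedata

def detect_script(text):
    """Rule-based stand-in for the script classifier: majority vote over
    the first word of each letter's Unicode name (LATIN, CYRILLIC, CJK...)."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "UNKNOWN"

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class LSTMCell:
    def __init__(self, input_size, hidden_size, rng):
        self.n = hidden_size
        cols = input_size + hidden_size + 1  # +1 column for the bias
        # Rows are stacked per gate: input, forget, output, candidate.
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                  for _ in range(4 * hidden_size)]

    def step(self, x, h, c):
        z = x + h + [1.0]  # concatenate input, previous hidden state, bias
        pre = [sum(wi * zi for wi, zi in zip(row, z)) for row in self.w]
        n = self.n
        i = [sigmoid(v) for v in pre[:n]]
        f = [sigmoid(v) for v in pre[n:2 * n]]
        o = [sigmoid(v) for v in pre[2 * n:3 * n]]
        g = [math.tanh(v) for v in pre[3 * n:]]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h, c

    def run(self, xs):
        """Return the final hidden state after reading the whole sequence."""
        h, c = [0.0] * self.n, [0.0] * self.n
        for x in xs:
            h, c = self.step(x, h, c)
        return h

class BiLSTMLanguageID:
    def __init__(self, languages, vocab=64, hidden=8, seed=0):
        rng = random.Random(seed)
        self.languages, self.vocab = languages, vocab
        self.fwd = LSTMCell(vocab, hidden, rng)  # reads left to right
        self.bwd = LSTMCell(vocab, hidden, rng)  # reads right to left
        self.out = [[rng.uniform(-0.1, 0.1) for _ in range(2 * hidden)]
                    for _ in languages]

    def one_hot(self, ch):
        v = [0.0] * self.vocab
        v[ord(ch) % self.vocab] = 1.0  # hash each character into a bucket
        return v

    def predict_proba(self, text):
        xs = [self.one_hot(ch) for ch in text]
        # Concatenate the final states of both reading directions.
        features = self.fwd.run(xs) + self.bwd.run(xs[::-1])
        scores = [sum(w * f for w, f in zip(row, features)) for row in self.out]
        return dict(zip(self.languages, softmax(scores)))

# Hypothetical mapping from script to candidate languages.
CANDIDATES = {"LATIN": ["it", "tr", "es"], "CYRILLIC": ["ru", "uk"]}

text = "ci vediamo domani"
script = detect_script(text)                # narrows the candidates
model = BiLSTMLanguageID(CANDIDATES[script])
probs = model.predict_proba(text)           # untrained, so roughly uniform
```

Narrowing by script first keeps the softmax small: the classifier only has to choose among languages that share the detected character set.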

Results: The system can spot languages in 50 characters as accurately as methods that require lots of text. Compared with Apple’s previous method based on an n-gram approach, the system improves average class accuracy on Latin scripts from 78.6 percent to 85.7 percent.
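The article does not define "average class accuracy." Assuming it means the mean of per-language accuracies (so that rare languages count as much as common ones), a minimal sketch looks like this. The function name and the toy labels are illustrative, not data from the paper.

```python
from collections import defaultdict

def average_class_accuracy(true_labels, predicted_labels):
    """Mean of per-class accuracies: each language counts equally,
    regardless of how many test examples it has."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    per_class = [correct[c] / total[c] for c in total]
    return sum(per_class) / len(per_class)

# Toy labels: always guessing "it" gets 4 of 5 examples right (80 percent
# plain accuracy) but misses Turkish completely.
truth = ["it", "it", "it", "it", "tr"]
guess = ["it", "it", "it", "it", "it"]
score = average_class_accuracy(truth, guess)  # (1.0 + 0.0) / 2 = 0.5
```

Under this reading, a system cannot score well simply by favoring the most common language.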

Why it matters: Mobile devices don’t yet have the horsepower to run a state-of-the-art multilingual language model. Until they do, they’ll need to determine which single-language model to call.

We’re thinking: Humans are sending more and more texts that look like this: [emoji]. We hope NLP systems don’t go [emoji].