More than 900 indigenous languages are spoken across the Americas, nearly half of all tongues in use worldwide. A website tracks the growing number of resources available for natural language processing researchers interested in studying, learning from, and saving these fading languages.
What’s happening: Naki collects NLP efforts involving indigenous American languages.
What’s inside: Researchers at the National Autonomous University of Mexico noticed a distinct rise in NLP papers focused on Native American languages over the past five years. They organized all the papers and research tools they could find.
- The researchers found NLP tools for 35 languages.
- North American languages receive the most research attention, despite having fewer speakers on average than those in Mesoamerica and South America.
- Native American tongues offer a diversity of dialects within an individual language. That makes it difficult for NLP models to develop standardized dictionaries and syntaxes.
Behind the news: Language families are linguistic groupings with similar origins and closely related syntax and definitions, such as Indo-European, or Sino-Tibetan. The Americas are home to more than 70 such groups, according to some researchers.
Why it matters: The resources collected on Naki are part of a growing effort to apply NLP to less common languages. The effort poses fundamental research problems. Like other rare tongues, Native American languages suffer from small written datasets — while NLP is very data-hungry — as well as broad variation from speaker to speaker and high complexity. For example, like Mandarin, many languages from Central Mexico shift vocal pitch to give identical words different meanings. NLP could benefit immeasurably by solving these problems.
We’re thinking: While it’s valuable to study rare languages for their own sake, there’s a huge opportunity in giving people who rely on them access to capabilities that much of the world takes for granted: voice recognition, speech to text, automatic translation, and the like. The usual techniques won’t get us there, but working with these languages could lead researchers to the necessary breakthroughs.