Chatbot for Minority Languages: Startup Two AI launches SUTRA, a multilingual model for South Asian markets

Published Jun 26, 2024

An AI startup that aims to crack markets in southern Asia launched a multilingual competitor to GPT-4.

What’s new: The company known as Two AI offers SUTRA, a low-cost language model built to be proficient in more than 30 languages, including underserved South Asian languages like Gujarati, Marathi, Tamil, and Telugu. The company also launched ChatSUTRA, a free-to-use web chatbot based on the model.

How it works: SUTRA comprises two mixture-of-experts transformers: a concept model and an encoder-decoder for translation. The accompanying paper gives some technical details, but others, including a description of how the system fits together, are either absent or ambiguous.

  • The concept model learned to predict the next token. The training dataset included publicly available datasets in a small number of languages for which abundant data is available, including English.
  • Concurrently, the translation model learned to translate among many languages from a dataset of 100 million human- and machine-translated conversations. This model learned to map concepts to similar embeddings across all languages in the dataset.
  • The authors combined the two models, so the translation model’s encoder fed the concept model, which in turn fed the translation model’s decoder, and further trained them together. More explicitly, during this stage of training and at inference, the translation model’s encoder receives text and produces an initial embedding. The concept model processes the embedding and delivers its output to the translation model’s decoder, which produces the resulting text. 
  • SUTRA is available via an API in versions that are designated Pro (highest-performing), Light (lowest-latency), and Online (internet-connected). SUTRA-Pro and SUTRA-Online cost $1 per 1 million tokens for input and output. SUTRA-Light costs $0.75 per 1 million tokens. 
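
The data flow described in the bullets above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not Two AI's implementation: the three functions are hypothetical stand-ins built from simple list operations rather than mixture-of-experts transformers, and their names are invented here. Only the order of composition (translation encoder, then concept model, then translation decoder) comes from the description.

```python
from typing import List

Embedding = List[float]


def translation_encoder(text: str) -> Embedding:
    """Hypothetical stand-in for the translation model's encoder."""
    # Toy encoding: one number per character. A real encoder is a transformer.
    return [float(ord(ch)) for ch in text]


def concept_model(embedding: Embedding) -> Embedding:
    """Hypothetical stand-in for the concept model."""
    # Toy transformation: reverse the sequence, just so the step is visible.
    return list(reversed(embedding))


def translation_decoder(embedding: Embedding) -> str:
    """Hypothetical stand-in for the translation model's decoder."""
    # Toy decoding: turn each number back into a character.
    return "".join(chr(int(value)) for value in embedding)


def sutra_like_pipeline(text: str) -> str:
    """Compose the three stages in the order the article describes."""
    initial_embedding = translation_encoder(text)   # text -> initial embedding
    processed = concept_model(initial_embedding)    # embedding -> concept output
    return translation_decoder(processed)           # concept output -> text


if __name__ == "__main__":
    print(sutra_like_pipeline("abc"))
```

The point of the sketch is the wiring: the concept model never sees raw text, only the encoder's embedding, and the decoder only sees the concept model's output.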

Results: On multilingual MMLU (a machine-translated version of multiple-choice questions that cover a wide variety of disciplines), SUTRA outperformed GPT-4 in four of the 11 languages for which the developer reported results: Gujarati, Marathi, Tamil, and Telugu. Moreover, SUTRA’s tokenizer is highly efficient, which makes the model fast and cost-effective. In key languages, it compares favorably to the tokenizer used with GPT-3.5 and GPT-4, and it narrowly outperforms GPT-4o’s improved tokenizer, according to Two AI’s tokenizer comparison space on Hugging Face. In languages such as Hindi and Korean, which are written in non-Latin scripts and in which GPT-4 performs better on MMLU, SUTRA’s tokenizer generates fewer than half as many tokens as the one used with GPT-3.5 and GPT-4, and slightly fewer than GPT-4o’s tokenizer.
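
To see why tokenizer efficiency affects cost, consider a rough arithmetic sketch in Python. The per-token prices are the ones quoted above; the token counts are hypothetical numbers chosen only for illustration, and the helper names are invented here. The sketch assumes a flat price per token, so it shows only that producing fewer tokens for the same text lowers the bill at a given rate.

```python
# Prices quoted in the article, in US dollars per 1 million tokens.
PRICE_PER_MILLION_TOKENS = {
    "SUTRA-Pro": 1.00,
    "SUTRA-Online": 1.00,
    "SUTRA-Light": 0.75,
}


def cost_usd(model: str, tokens: int) -> float:
    """Cost of processing `tokens` tokens at the model's flat per-token rate."""
    return PRICE_PER_MILLION_TOKENS[model] * tokens / 1_000_000


def tokens_saved_fraction(baseline_tokens: int, efficient_tokens: int) -> float:
    """Fraction of tokens avoided by a more efficient tokenizer on the same text."""
    return 1 - efficient_tokens / baseline_tokens


if __name__ == "__main__":
    # Hypothetical counts: the same text split by a less efficient tokenizer
    # and by one that produces fewer than half as many tokens.
    baseline_tokens = 2_000_000
    efficient_tokens = 900_000

    print(cost_usd("SUTRA-Pro", baseline_tokens))
    print(cost_usd("SUTRA-Pro", efficient_tokens))
    print(tokens_saved_fraction(baseline_tokens, efficient_tokens))
```

Because generation time also grows with the number of tokens produced, the same ratio gives a rough sense of the speed benefit, though real latency depends on many other factors.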

Yes, but: Multilingual MMLU tests only 11 of SUTRA’s 33 languages, making it difficult to fully evaluate the model’s multilingual performance. 

Behind the news: Two AI was founded in 2021 by Pranav Mistry, former president and CEO of Samsung Technology & Advanced Research Labs. The startup has offices in California, South Korea, and India. In 2022, it raised $20 million in seed funding from Indian telecommunications firm Jio and South Korean internet firm Naver. Mistry aims to focus on predominantly non-English-speaking markets such as India, South Korea, Japan, and the Middle East, he told Analytics India.

Why it matters: Many top models work in a variety of languages, but from a practical standpoint, multilingual models remain a frontier in natural language processing. Although SUTRA doesn’t match GPT-4 in all the languages reported, its low price and comparatively high performance may make it appealing in South Asian markets, especially rural areas where people are less likely to speak English. The languages in which SUTRA excels are spoken by tens of millions of people, and they’re the most widely spoken languages in their respective regions. Users in these places have yet to experience GPT-4-level performance in their native tongues.

We’re thinking: Can a newcomer like Two AI compete with OpenAI? If SUTRA continues to improve, or if it can maintain its cost-effective service, it may yet carve out a niche.
