AI Models Score Up to 3x Worse in Non-English Languages, Research Shows

1.52 billion people speak English. AI works pretty well for them. For the 97 million Vietnamese speakers, it works noticeably worse. For the 1.5 million speakers of Nahuatl, an indigenous language in Mexico, it barely works at all.

That's the core finding from research by Tuka Alhanai at New York University Abu Dhabi and Mohammad Ghassemi at Michigan State University, presented at the AAAI Conference on Artificial Intelligence and recently highlighted by The Economist. Their benchmarking work confirms what many non-English speakers already suspected: the most capable AI models in the world are built primarily for English, and the gap is not small.

The Training Data Problem

The root cause is straightforward. Large language models learn from text, and most of the internet's text is in English. Languages with less online presence - what researchers call "low-resource languages" - simply don't have enough quality training data. Models trained mostly on English text can technically respond in Vietnamese or Arabic, but their accuracy, nuance, and cultural awareness drop sharply.

This isn't just a matter of awkward phrasing. A separate MIT study published in February 2026 put hard numbers on the problem. Researchers tested GPT-4, Claude 3 Opus, and Meta's Llama 3 across different user profiles, varying education level, English proficiency, and country of origin. Claude 3 Opus refused nearly 11% of questions from less-educated, non-native English speakers, compared to 3.6% for the control group of educated native speakers. That's a 3x refusal gap based on who's asking, not what they're asking.
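To make the "3x" figure concrete, here is a small sketch of the arithmetic. It uses only the two percentages quoted above; "nearly 11%" is treated as 0.11, which is an assumption, since the exact figure is not given here. The variable names are illustrative.

```python
# Refusal rates as quoted in the article. "Nearly 11%" is rounded to 0.11
# here as an assumption; the exact figure is not stated in the text.
refusal_rate_target = 0.11    # less-educated, non-native English speakers
refusal_rate_control = 0.036  # educated native speakers (control group)

# Relative gap: how many times more often the target group is refused.
ratio = refusal_rate_target / refusal_rate_control

# Absolute gap, in percentage points.
gap_points = (refusal_rate_target - refusal_rate_control) * 100

print(f"relative gap: about {ratio:.1f}x")
print(f"absolute gap: about {gap_points:.1f} percentage points")
```

The relative figure is the one that makes headlines, but the absolute gap of roughly seven percentage points is what a user would actually notice: about seven extra refusals per hundred questions.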

The MIT team, led by researcher Jad Kabbara, also found that Claude 3 Opus responded with patronizing or condescending language 43.7% of the time for less-educated users, versus under 1% for highly educated users. Geographic bias showed up too: Claude 3 Opus performed significantly worse for Iranian users, refusing questions about topics like nuclear power and historical events.

Who Pays the Price

The people who would benefit most from AI are often the ones getting the worst experience. A farmer in rural Vietnam trying to use an AI chatbot for crop advice gets less reliable information than an English-speaking office worker asking the same question. A student in Mexico practicing with AI gets dumbed-down or refused responses that their American counterpart never encounters.

As Alhanai's research puts it: "The people with the most to gain are the least able to use these tools."

This matters for anyone building products on top of these models, too. If you're running a multilingual customer support bot or deploying AI tools across international teams, the assumption that "it works in English, so it works everywhere" is wrong. Performance degrades in ways that standard testing often misses because most benchmarks are English-first.
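One way to catch this kind of degradation is to break evaluation results out by language instead of reporting a single overall score. The sketch below assumes you already have graded test results as (language, correct or not) pairs. The function names, the 10-point threshold, and the sample records are all hypothetical and invented for illustration; they are not measurements from any of the studies mentioned here.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: fraction correct} from (language, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for language, is_correct in records:
        totals[language] += 1
        correct[language] += int(is_correct)
    return {lang: correct[lang] / totals[lang] for lang in totals}


def flag_gaps(scores, baseline="en", max_drop=0.10):
    """List languages whose accuracy trails the baseline by more than max_drop."""
    base = scores[baseline]
    return sorted(lang for lang, acc in scores.items() if base - acc > max_drop)


# Placeholder records, invented for illustration only (not measured results).
records = [
    ("en", True), ("en", True), ("en", True), ("en", False),
    ("vi", True), ("vi", True), ("vi", False), ("vi", False),
    ("nah", True), ("nah", False), ("nah", False), ("nah", False),
]

scores = accuracy_by_language(records)
print(scores)             # per-language accuracy
print(flag_gaps(scores))  # languages that trail English by more than 10 points
```

An aggregate score over these twelve placeholder records would be 50%, which hides the fact that one language sits at 75% and another at 25%. Reporting per language, against an explicit baseline, is what surfaces the gap.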

What's Actually Being Done

The major labs are aware of the gap. OpenAI, Anthropic, and Google have all expanded multilingual training data in recent model generations, and performance in major world languages like Spanish, French, and Mandarin has improved. But the long tail of languages - the ones spoken by millions but underrepresented online - remains largely unaddressed.

The honest reality: fixing this is expensive and slow. Quality training data in Nahuatl or Yoruba doesn't exist at the scale needed, and generating it requires native-speaker expertise that's hard to source. Until the economics change, non-English speakers will continue getting a meaningfully worse product from every major AI provider.