AI tools give measurably worse answers to users who aren't fluent in English or who have less formal education, according to research published in AAAI - one of the top peer-reviewed venues in AI research. The effect isn't a minor statistical artifact. It's a consistent accuracy gap that falls hardest on users who are already at an information disadvantage.
What the Research Found
The study examined how the response quality of LLMs (large language models - the AI systems behind ChatGPT, Claude, Gemini, and similar tools) varied based on how questions were phrased. Users whose English showed markers of lower proficiency - simpler vocabulary, grammatical patterns common in non-native speech, shorter or more fragmented sentences - received less accurate and less truthful answers than users asking equivalent questions in fluent native English.
Education level had a similar effect. Users with less formal schooling, who tend to communicate differently than highly educated users, received systematically worse outputs even when asking about the same topics.
The pattern creates a troubling paradox. The people who arguably stand to gain the most from AI assistance - those with less access to formal education, non-native English speakers, users in countries where English instruction is limited - are getting the least reliable answers from these tools.
Why the Gap Exists
The explanation sits in how these models are trained. LLMs learn from enormous amounts of internet text: academic papers, Wikipedia, news articles, software documentation, books. That corpus skews heavily toward educated, English-first writers. The models develop an implicit picture of what clear, well-formed questions look like based on what they were trained on.
When a model encounters writing patterns that don't match that picture - non-standard grammar, different sentence structures, vocabulary typical of lower-proficiency English - its ability to correctly interpret intent and generate accurate responses degrades. This isn't an intentional design choice. It's a structural consequence of training on data that doesn't represent the full range of how people actually communicate.
The two effects also compound each other. Lower-proficiency English and less formal education tend to co-occur in the same populations globally, so the same users get hit from both directions.
What This Means for Deployed AI Products
For teams deploying AI tools to general consumer audiences or international markets, this is a product quality problem, not just an academic finding. A customer service bot that consistently gives wrong information to Spanish speakers or Hindi speakers is a materially different product from the one tested with native English-speaking QA teams.
Demographic accuracy testing - checking how output quality varies across different user populations - isn't standard practice in most AI deployments. The major AI providers don't publish accuracy metrics broken down by user language proficiency or education level in their standard model evaluations.
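As a rough illustration of what demographic accuracy testing could look like in practice, the Python sketch below computes accuracy per user group and the gap relative to a baseline group. It is a minimal sketch, not the study's method: the group labels ("fluent", "lower_proficiency"), the function names, and the sample records are all hypothetical, and it assumes each answer has already been graded as correct or incorrect.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return accuracy per group from (group, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}


def accuracy_gap(per_group, baseline):
    """Return baseline accuracy minus each other group's accuracy."""
    base = per_group[baseline]
    return {g: base - acc for g, acc in per_group.items() if g != baseline}


# Hypothetical graded results, made up purely for illustration.
sample_records = [
    ("fluent", True), ("fluent", True), ("fluent", True), ("fluent", False),
    ("lower_proficiency", True), ("lower_proficiency", False),
    ("lower_proficiency", True), ("lower_proficiency", False),
]

per_group = accuracy_by_group(sample_records)
gaps = accuracy_gap(per_group, baseline="fluent")
print(per_group)
print(gaps)
```

A positive gap means the group received less accurate answers than the baseline. A real evaluation would need many more graded answers per group, and equivalent questions across groups, before any gap could be treated as meaningful.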
OpenAI has reported that ChatGPT has over 300 million weekly users globally, with a significant share operating in their second or third language. At that scale, even a modest accuracy gap for lower-proficiency users represents an enormous number of wrong or misleading answers delivered daily.
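To make the scale argument concrete, the back-of-the-envelope Python sketch below estimates how many additional wrong answers per day a given accuracy gap would produce. Only the 300 million weekly users figure comes from the text above. The affected share, the queries per user, and the size of the gap are placeholder assumptions chosen for illustration, not measured values.

```python
def extra_wrong_answers_per_day(weekly_users, affected_share,
                                queries_per_user_per_week, accuracy_gap):
    """Estimate extra wrong answers per day caused by an accuracy gap."""
    affected_queries_per_week = (
        weekly_users * affected_share * queries_per_user_per_week
    )
    return affected_queries_per_week * accuracy_gap / 7


daily = extra_wrong_answers_per_day(
    weekly_users=300_000_000,      # figure quoted in the article
    affected_share=0.2,            # ASSUMPTION: share of affected users
    queries_per_user_per_week=10,  # ASSUMPTION: queries per user per week
    accuracy_gap=0.02,             # ASSUMPTION: 2-point accuracy gap
)
print(f"{daily:,.0f} extra wrong answers per day (hypothetical inputs)")
```

The result scales linearly with each input, so the estimate is only as good as the assumptions plugged into it.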
There's no clean fix available. Training on more linguistically diverse data helps but introduces its own tradeoffs. Prompt engineering techniques can improve outputs for lower-proficiency inputs, but they require users to know they need to apply them. Some researchers are pushing for evaluation benchmarks that specifically test performance across proficiency levels rather than assuming a single fluent-English user profile. Until that becomes standard, users on the wrong side of this gap have no reliable way to know when they're getting a worse answer.