
50 LLMs Took 45 Psychology Tests. The Results Aren't Personality.

May 7, 2026

Researchers gave 45 psychological questionnaires - the kind used to measure personality traits, cognitive styles, and mental health markers in humans - to 50 different large language models. The conclusion: what these tests appear to capture in AI has nothing to do with what they measure in people.

Psychometric tests like the Big Five personality inventory work by detecting consistent patterns in how people answer questions about their own behavior and preferences. The assumption is that those patterns reflect real, stable underlying traits. When researchers apply the same tests to LLMs (large language models - the AI systems that generate text by predicting likely next words based on patterns in training data), the models produce responses that look superficially like personality. The question the researchers asked was: what do those responses actually reflect?

What the Scores Actually Measure

The answer is training choices. Fine-tuning is the additional training phase that follows a model's initial training, where developers shape its behavior - typically toward being helpful, cautious, and non-confrontational. When GPT-series models consistently score high on agreeableness or openness to experience, they are reproducing what that phase reinforced: being cooperative, polite, and helpful. The score is a fine-tuning artifact, not a trait.

The researchers found significant variation between models, but that variation tracked training methodology, not anything analogous to human psychological differences. Two models built from the same base architecture but fine-tuned differently produced measurably different "personalities." The tests are measuring the fine-tuning process.

Human psychometric tests assume the test-taker has experienced the situations being asked about and is reporting on real behavior. An LLM hasn't experienced anything. When it answers "I tend to plan carefully before starting a project," it's generating a contextually plausible response, not describing behavior.
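To make concrete what a questionnaire "score" is, here is a minimal sketch of how Likert-style self-report items are commonly scored: each answer is a rating from 1 to 5, reverse-keyed items are flipped, and ratings are averaged per trait. Apart from the planning statement quoted above, the items, the trait assignments, and the `score_responses` helper are hypothetical illustrations, not taken from the study or from any published inventory.

```python
from statistics import mean

# Hypothetical items for illustration only; not from any published inventory.
# Each item maps to (trait, reverse_keyed). A reverse-keyed item counts
# against the trait, so its rating is flipped before averaging.
ITEMS = {
    "I tend to plan carefully before starting a project.": ("conscientiousness", False),
    "I often leave tasks unfinished.": ("conscientiousness", True),
    "I go out of my way to help others.": ("agreeableness", False),
    "I find it easy to criticize people.": ("agreeableness", True),
}

SCALE_MIN, SCALE_MAX = 1, 5  # 1 = strongly disagree, 5 = strongly agree


def score_responses(responses):
    """Average the ratings per trait, flipping reverse-keyed items."""
    by_trait = {}
    for item, rating in responses.items():
        trait, reverse_keyed = ITEMS[item]
        if not SCALE_MIN <= rating <= SCALE_MAX:
            raise ValueError(f"rating {rating} is outside the 1-5 scale")
        value = SCALE_MIN + SCALE_MAX - rating if reverse_keyed else rating
        by_trait.setdefault(trait, []).append(value)
    return {trait: mean(values) for trait, values in by_trait.items()}
```

The arithmetic only aggregates the ratings it is given. It produces the same number whether those ratings came from a person reporting on real behavior or from a model generating plausible text, which is why the score alone cannot show that a trait lies behind it.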

The Practical Problem With the Personality Frame

Claude and ChatGPT both exhibit what feels like consistent character - Claude tends toward thorough, cautious responses; ChatGPT tends toward directness. Those tendencies are real in the sense that they affect outputs reliably and predictably. The mistake is treating them as "personality" in the human sense, with the implication of stable preferences and coherent values that hold across contexts.

That framing creates practical problems. It leads people to trust models in ways tied to how agreeable they seem rather than how accurate they are. A model trained to be cooperative and conscientious isn't more reliable - it's just been trained to present that way, which can actually make overconfident hallucinations harder to spot.

Behavioral tendencies in AI tools matter and are worth understanding. The psychology metaphor borrowed from human self-report research just isn't the right frame for what those tendencies actually are.
