
Stanford Finds AI Vision Models Hallucinate Images 60-100% of the Time

March 30, 2026

60%. That's how often frontier AI models will confidently describe an image they were never shown. Push them a little harder with prompting, and that number climbs to 90-100%.

A new Stanford paper called MIRAGE, led by researchers including computer vision pioneer Fei-Fei Li, tested what happens when you send visual questions to multimodal AI models (models that process both text and images) but deliberately leave the image out. The models don't say "I don't see an image." They invent one.

GPT-5, GPT-5.1, GPT-5.2, Gemini 3 Pro, Gemini 2.5 Pro, Claude Opus 4.5, and Claude Sonnet 4.5 were all tested. Every single one fabricated detailed descriptions of images that were never sent - inventing specific car license plates, medical diagnoses, expiration dates, and brain scan findings out of thin air.

The Benchmark Problem Is Worse Than the Hallucinations

The fabrication itself is alarming, but the bigger finding is what it reveals about how we measure AI vision capabilities. The researchers created a test called Phantom-0: 200 visual questions pulled from established benchmarks, sent without any images attached.
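To make the idea of an image-free probe concrete, here is a minimal sketch of how a fabrication rate could be tallied. It is not the paper's code: the `ask_model` callable, the function names, and the keyword list for detecting an acknowledgement are all hypothetical, and a real study would likely need a far more careful judge than substring matching.

```python
from typing import Callable, Iterable

# Hypothetical phrases signalling that the model noticed the image is missing.
ABSTAIN_MARKERS = (
    "no image",
    "don't see an image",
    "do not see an image",
    "cannot see",
)


def acknowledges_missing_image(response: str) -> bool:
    """Return True if the response mentions that no image was provided."""
    text = response.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def fabrication_rate(questions: Iterable[str], ask_model: Callable[[str], str]) -> float:
    """Share of image-free questions answered without flagging the missing image."""
    questions = list(questions)
    if not questions:
        raise ValueError("need at least one question")
    fabricated = sum(
        1 for q in questions if not acknowledges_missing_image(ask_model(q))
    )
    return fabricated / len(questions)


# Stand-in for a real model call: it abstains on one question and invents the rest.
def stub_model(question: str) -> str:
    if question == "q1":
        return "I don't see an image attached."
    return "The license plate reads ABC-123."


rate = fabrication_rate(["q1", "q2", "q3", "q4"], stub_model)
print(f"{rate:.0%}")  # 75%
```

Swapping `stub_model` for a real client would be the only change needed to run the same tally against an actual model, assuming that client takes a question string and returns text.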

When they used this approach to audit three major benchmarks, the results were brutal:

MicroVQA: 77% removed (only 240 of 1,042 questions survived the cleanup)
MedXpertQA-MM: 74.3% removed
MMMU-Pro: 75.3% removed
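The MicroVQA figure can be checked directly from the two counts reported above. The short sketch below does that arithmetic; the function name is just for illustration.

```python
def removed_share(kept: int, total: int) -> float:
    """Fraction of benchmark questions removed, given how many survived."""
    if total <= 0 or not 0 <= kept <= total:
        raise ValueError("expected 0 <= kept <= total and total > 0")
    return 1 - kept / total


# MicroVQA: 240 of 1,042 questions survived the cleanup.
share = removed_share(kept=240, total=1042)
print(f"{share:.1%}")  # 77.0%
```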

After removing questions that models could answer without visual input, accuracy scores collapsed. GPT-5.1 dropped from 61.5% to 15.4% on MicroVQA. Gemini 3 Pro fell from 68.8% to 23.2% on the same benchmark. By this measure, the leaderboard scores that labs publish to demonstrate visual understanding are dramatically inflated.

A Text-Only Model Beat Radiologists

The most striking result: the researchers trained a 3-billion-parameter text-only model (Qwen 2.5, which has zero ability to process images) on chest X-ray question-answer pairs with the X-ray images completely removed. This blind model outperformed every frontier multimodal model on the held-out test set and beat the average radiologist score by more than 10 percentage points.

That suggests published benchmark results for medical AI vision may reflect pattern matching on question text rather than actual image understanding.

Real Consequences for Medical AI

The models showed a consistent bias toward severe diagnoses. When fabricating medical findings, they favored conditions like STEMI heart attacks, melanoma, and carcinoma over normal results - even though normal findings are far more common in clinical data.

A model that "reads" a chest X-ray it never received and confidently reports a serious finding is not a theoretical risk. Medical AI tools built on these models are already being tested in clinical settings, and their published accuracy numbers appear to be built on compromised benchmarks.

The researchers also found something telling about how models behave: when explicitly told "the image has been removed, take your best guess," performance dropped significantly. If the models were only guessing from the question text, being told to guess should not have hurt them. So the hallucination isn't simple pattern matching. The models construct what the paper calls a "false epistemic frame," in which they process the question as if an image were present.

The proposed fix, called B-Clean, is a decontamination framework that strips out questions answerable without visual input. The researchers are also calling for private benchmarks that eliminate the textual cues enabling non-visual inference. Given that up to 77% of the questions in the audited benchmarks were compromised, the AI vision leaderboards are due for a serious correction.
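The filtering idea can be sketched in a few lines. This is an illustration under stated assumptions, not the B-Clean implementation: the article does not spell out the paper's exact removal criteria, so the sketch assumes a question is dropped whenever the model answered it correctly with no image attached. The `ScoredQuestion` type and both function names are hypothetical, and the data is made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScoredQuestion:
    question_id: str
    correct_with_image: bool     # model was right when shown the image
    correct_without_image: bool  # model was right with the image withheld


def strip_text_answerable(items: List[ScoredQuestion]) -> List[ScoredQuestion]:
    """Keep only questions the model could not answer without the image (assumed rule)."""
    return [item for item in items if not item.correct_without_image]


def accuracy(items: List[ScoredQuestion]) -> float:
    """With-image accuracy over the given questions."""
    if not items:
        raise ValueError("no questions to score")
    return sum(item.correct_with_image for item in items) / len(items)


# Made-up results for four questions.
results = [
    ScoredQuestion("q1", True, True),
    ScoredQuestion("q2", True, True),
    ScoredQuestion("q3", True, False),
    ScoredQuestion("q4", False, False),
]

cleaned = strip_text_answerable(results)
print(f"raw: {accuracy(results):.0%}, cleaned: {accuracy(cleaned):.0%}")
# raw: 75%, cleaned: 50%
```

Even in this toy case the score falls once text-answerable questions are removed, which is the same direction as the drops reported for GPT-5.1 and Gemini 3 Pro.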
