60%. That's how often frontier AI models will confidently describe an image they were never shown. Push them a little harder with prompting, and that number climbs to 90-100%.
A new Stanford paper called MIRAGE, led by researchers including computer vision pioneer Fei-Fei Li, tested what happens when you send visual questions to multimodal AI models (models that process both text and images) but deliberately leave the image out. The models don't say "I don't see an image." They invent one.
GPT-5, GPT-5.1, GPT-5.2, Gemini 3 Pro, Gemini 2.5 Pro, Claude Opus 4.5, and Claude Sonnet 4.5 were all tested. Every single one fabricated detailed descriptions of images that were never sent, inventing specific car license plates, medical diagnoses, expiration dates, and brain scan findings out of thin air.
The Benchmark Problem Is Worse Than the Hallucinations
The fabrication itself is alarming, but the bigger finding is what it reveals about how we measure AI vision capabilities. The researchers created a test called Phantom-0: 200 visual questions pulled from established benchmarks, sent without any images attached.
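To make the idea concrete, here is a minimal sketch of an image-free probe in Python. It is not the paper's code: `ask_model`, `stub_model`, and the abstention phrases are hypothetical stand-ins, and a real audit would need a far more careful way to judge whether a reply admits the image is missing.

```python
from typing import Callable, Iterable

# Hypothetical phrases that suggest the model noticed the missing image.
# This keyword list is an assumption for illustration, not the paper's method.
ABSTAIN_MARKERS = (
    "no image",
    "don't see an image",
    "do not see an image",
    "image was not provided",
)


def is_abstention(answer: str) -> bool:
    """Return True if the reply appears to acknowledge that no image was sent."""
    text = answer.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def fabrication_rate(questions: Iterable[str], ask_model: Callable[[str], str]) -> float:
    """Share of image-free questions answered as if an image were present."""
    answers = [ask_model(q) for q in questions]
    if not answers:
        raise ValueError("need at least one question")
    fabricated = sum(1 for a in answers if not is_abstention(a))
    return fabricated / len(answers)


def stub_model(question: str) -> str:
    """Stand-in for a real model call; invents a plate, otherwise abstains."""
    if "plate" in question:
        return "The license plate reads ABC-1234."
    return "I don't see an image attached."


if __name__ == "__main__":
    probes = ["What does the license plate say?", "What does the X-ray show?"]
    print(fabrication_rate(probes, stub_model))
```

Swapping `stub_model` for a call to a real multimodal API, with the text question only and no image, would turn this into a rough version of the probe described above.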
When they used this approach to audit three major benchmarks, the results were brutal:
- MicroVQA: 77% removed, because the questions could be answered without seeing an image. Only 240 of 1,042 questions survived the cleanup.
- MedXpertQA-MM: 74.3% removed
- MMMU-Pro: 75.3% removed
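The MicroVQA figure can be checked from the two counts reported above. The short Python snippet below does the arithmetic; the function name `removed_share` is just an illustrative choice.

```python
def removed_share(total: int, surviving: int) -> float:
    """Fraction of benchmark questions dropped during cleanup."""
    if total <= 0 or not 0 <= surviving <= total:
        raise ValueError("counts must satisfy 0 <= surviving <= total and total > 0")
    return 1 - surviving / total


if __name__ == "__main__":
    # MicroVQA counts as reported: 240 of 1,042 questions survived.
    print(f"{removed_share(1042, 240):.1%}")  # prints 77.0%
```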
After removing questions that models could answer without visual input, accuracy scores collapsed. GPT-5.1 dropped from 61.5% to 15.4% on MicroVQA. Gemini 3 Pro fell from 68.8% to 23.2% on the same test. The leaderboard scores that labs publish to demonstrate visual understanding are dramatically inflated.
A Text-Only Model Beat Radiologists
The most striking result: the researchers trained a 3-billion-parameter text-only model (Qwen 2.5, which has zero ability to process images) on chest X-ray question-answer pairs with the X-ray images completely removed. This blind model outperformed every frontier multimodal model on the held-out test set and beat the average radiologist score by more than 10 percentage points.
That means published benchmark results for medical AI vision may reflect pattern matching on question text, not actual image understanding.
Real Consequences for Medical AI
The models showed a consistent bias toward severe diagnoses. When fabricating medical findings, they favored conditions like STEMI heart attacks, melanoma, and carcinoma over normal results, even though normal findings are far more common in clinical data.
A model that "reads" a chest X-ray it never received and confidently reports a serious finding is not a theoretical risk. Medical AI tools built on these models are already being tested in clinical settings, and their published accuracy numbers appear to be built on compromised benchmarks.
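The researchers also found something telling about how models behave: when explicitly told "the image has been removed, take your best guess," performance dropped significantly compared with when the models were left to assume an image was there. If the models were only guessing from cues in the question text, that instruction should have made little difference. So the hallucination isn't simple pattern matching. The models construct what the paper calls a "false epistemic frame" where they genuinely process the question as if an image is present.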
The proposed fix, called B-Clean, is a decontamination framework that strips out questions answerable without visual input. The researchers are calling for private benchmarks that eliminate the textual cues enabling non-visual inference. Given that 74% to 77% of the questions in the three audited benchmarks turned out to be answerable without the image, the AI vision leaderboards are due for a serious correction.
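The core idea of this kind of decontamination can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the B-Clean implementation: the `Item` record, its field names, and the toy data are all hypothetical, and the rule used here (drop any question a model gets right with the image withheld) is the simplest possible reading of "answerable without visual input."

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Item:
    qid: str
    gold: str           # correct answer label
    with_image: str     # model's answer when the image is attached
    without_image: str  # same model's answer when the image is withheld


def decontaminate(items: List[Item]) -> List[Item]:
    """Keep only questions the model fails to answer correctly without the image."""
    return [it for it in items if it.without_image != it.gold]


def accuracy(items: List[Item]) -> float:
    """Share of questions answered correctly with the image attached."""
    if not items:
        return 0.0
    return sum(it.with_image == it.gold for it in items) / len(items)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy = [
        Item("q1", "A", "A", "A"),  # answerable blind, so it is removed
        Item("q2", "B", "B", "B"),  # answerable blind, so it is removed
        Item("q3", "C", "C", "A"),  # needs the image, answered correctly
        Item("q4", "D", "A", "B"),  # needs the image, answered incorrectly
    ]
    clean = decontaminate(toy)
    print(accuracy(toy), accuracy(clean))  # 0.75 0.5
```

Even on this toy set, the score falls once the blind-answerable questions are gone, which is the same direction of change the audited benchmarks showed.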