
A Researcher Used AI for Whale Science. It Fabricated the Dataset.

March 25, 2026

Three rounds of AI-assisted research. Three different failures. And each one was harder to catch than the last.

Researcher Ryan Endacott set out to investigate a genuine scientific question: why do pilot whales mass-strand at specific beaches like Farewell Spit in New Zealand and Cape Cod, while nearby coastlines remain unaffected? He used AI models to generate hypotheses, gather data, and run statistical analyses. The project ran on a Claude Max subscription at roughly $200 per month, pulling from public data APIs. What he documented is a detailed case study in how AI can be confidently, precisely wrong.

Round One: Invented Measurements and Fake Citations

The first AI proposed that magnetic field gradients create navigational traps for whales. Plausible-sounding hypothesis. The problem: the supporting data was fabricated.

The model reported Farewell Spit's magnetic field at roughly 52,586 nanotesla (nT, a unit of magnetic field strength). The actual value is approximately 56,270 nT, a gap of about 3,700 nT. Coordinates for Tasmania were wrong by 104 kilometers. Coordinates for Matagorda-Padre Island in Texas were off by 99 kilometers. The AI also cited "NOAA WMM-2010" as a source - a citation that looked legitimate but pointed to nothing useful. The numbers even contradicted each other: values within the same data structure didn't add up.

None of these errors would be obvious to someone who hadn't independently checked every data point. The outputs looked like clean, properly structured scientific data.
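The independent checking described here can be sketched in a few lines of Python. This is a minimal illustration, not the method Endacott used: the function names and the 500 nT tolerance are assumptions made for the example, and the coordinates passed to the distance function are placeholders, not the real site locations. Only the two field-strength figures come from the article.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def check_value(reported, reference, tolerance):
    """Return (absolute error, within tolerance?) for a value against a trusted reference."""
    error = abs(reported - reference)
    return error, error <= tolerance


# Field strengths quoted in the article, in nanotesla.
# The 500 nT tolerance is an arbitrary threshold chosen for illustration.
error_nt, ok = check_value(reported=52_586, reference=56_270, tolerance=500)
print(f"Farewell Spit field error: {error_nt} nT, within tolerance: {ok}")

# Placeholder points one degree of latitude apart (NOT the real sites),
# showing how a reported coordinate would be compared with a reference one.
print(f"Coordinate offset: {haversine_km(0.0, 0.0, 1.0, 0.0):.1f} km")
```

The point of the sketch is that each reported number is compared against a source the model did not produce. A check that only inspects the model's own output for formatting or plausibility would have passed all of the fabricated values.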

Round Two: Real Data, Wrong Conclusion

Endacott switched to Claude Opus 4.6 and Claude Code for the correction phase. This time, the AI installed the ppigrf library, which computes field values from the official IGRF-14 geomagnetic coefficients (the International Geomagnetic Reference Field, the standard reference model of Earth's main magnetic field), and verified data across 15 sites. It tested eight hypotheses with proper statistical methods and found no correlation between magnetic gradients and strandings - the t-statistic was -0.007, essentially zero.
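For readers unfamiliar with the number being quoted, here is a rough sketch of how a two-sample t-statistic is computed. The article does not say which test the AI ran, so Welch's unequal-variance form is an assumption, and the sample values below are made up purely to exercise the function.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t-statistic: difference in means divided by its standard error."""
    var_a, var_b = variance(sample_a), variance(sample_b)  # sample variances
    std_err = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / std_err


# Made-up illustrative values, not data from the study.
stranding_sites = [2.0, 4.0, 6.0]
control_sites = [1.0, 2.0, 3.0]

print(f"t = {welch_t(stranding_sites, control_sites):.3f}")
```

A value near zero, like the -0.007 reported, means the two group means are almost indistinguishable relative to their spread. A large value only says the groups differ; it says nothing about whether the groups were a sensible comparison in the first place.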

The AI then analyzed 20 years of satellite data covering sea surface temperature, chlorophyll levels, and wind patterns. It produced a temporal risk model with a statistically significant t-statistic of 8.09. Real progress, apparently.

Round Three: The Control Group Had No Whales

Then Endacott noticed the fatal flaw. The AI had selected control sites - beaches used for comparison where strandings don't happen - that had no pilot whale populations at all. The Dutch Wadden Sea is only 3 meters deep; pilot whales need at least 500 meters and prefer depths over 1,000 meters. Matagorda-Padre Island and Banc d'Arguin in Mauritania have no documented pilot whale presence whatsoever.

This is like studying why certain restaurants get food poisoning outbreaks by comparing them to buildings that don't serve food. The statistical significance was meaningless because the experimental design was broken at its foundation.
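The missing design check can be expressed as a simple eligibility filter on control sites. This is a hedged sketch: the `Site` type and `rejection_reasons` function are hypothetical names invented for the example, and the only facts used are those stated in the article (the 500-meter minimum depth, the 3-meter Wadden Sea, and the absence of documented pilot whales at the control sites). Depths the article does not give are left as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_DEPTH_M = 500.0  # minimum depth the article gives for pilot whales


@dataclass
class Site:
    name: str
    depth_m: Optional[float]   # None when the article gives no figure
    species_documented: bool   # is pilot whale presence documented here?


def rejection_reasons(site: Site) -> List[str]:
    """List the reasons a site cannot serve as a control; empty means eligible."""
    reasons = []
    if not site.species_documented:
        reasons.append("no documented pilot whale presence")
    if site.depth_m is not None and site.depth_m < MIN_DEPTH_M:
        reasons.append(f"depth {site.depth_m:g} m is below {MIN_DEPTH_M:g} m")
    return reasons


# The three control sites named in the article.
candidates = [
    Site("Dutch Wadden Sea", 3.0, False),
    Site("Matagorda-Padre Island", None, False),
    Site("Banc d'Arguin", None, False),
]

for site in candidates:
    reasons = rejection_reasons(site)
    status = "eligible" if not reasons else "rejected: " + "; ".join(reasons)
    print(f"{site.name}: {status}")
```

The filter itself is trivial. What the AI lacked was the knowledge that such a filter was needed, which is the domain expertise the researcher supplied.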

This is the failure mode that should concern anyone using AI for analytical work. Round one produced fabrications that point-by-point checking against real sources would catch. Round two produced legitimate data with a hidden structural flaw that took actual domain expertise to identify. Each iteration looked more credible while still being fundamentally wrong.

The practical finding, after all three rounds, was modest: stranding months tend to be windier than usual, with lower chlorophyll levels that might indicate prey concentrating near shore. Getting there required a researcher who knew enough about whale biology and oceanography to catch errors that the AI presented with full confidence.

For anyone using AI as a research assistant - including market research, competitive analysis, or data-driven content work - the lesson is specific: AI doesn't just hallucinate when it lacks data. It produces fabrications that are structurally identical to real data, complete with plausible precision and proper-looking citations. The more capable the model, the harder these errors are to spot without independent verification of every claim.

