Can you hand Claude Code a messy real-world dataset and tell it to build a complete data pipeline? Data engineer Robin Moffatt tried exactly that, tasking Claude Code with building a full dbt project (a popular framework for transforming data in warehouses) against UK Environment Agency flood monitoring data. The results land in a familiar spot for anyone using AI coding tools on production-grade work: impressively functional on the surface, dangerously unreliable underneath.
What Claude Code Actually Built
The scope was ambitious. Moffatt asked Claude Code to create staging models, dimensional tables, SCD Type 2 snapshots (a method for tracking how records change over time), historical backfills from CSV archives, documentation, and tests. Claude delivered a plausible data model with correct key relationships and incremental fact table loads, and it even handled pipe-delimited values in the messy source data. When builds failed, Claude debugged and fixed the issues on its own, without being prompted.
That is genuinely impressive for an AI coding assistant working on a domain-specific framework.
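For readers who have not met SCD Type 2 before, here is a minimal in-memory sketch of the idea. It is not how dbt implements snapshots, and the names (Version, apply_snapshot, tracked) and the sample values are made up for illustration. The point it shows: a new version of a record is only created when one of the tracked columns changes, so the choice of tracked columns decides which changes get recorded at all.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Version:
    key: str
    attrs: Dict[str, object]
    valid_from: str
    valid_to: Optional[str] = None  # None marks the current version


def apply_snapshot(
    history: List[Version],
    key: str,
    attrs: Dict[str, object],
    as_of: str,
    tracked: Sequence[str],
) -> None:
    """Close the current version and open a new one if a tracked column changed."""
    current = next(
        (v for v in history if v.key == key and v.valid_to is None), None
    )
    if current is not None:
        changed = any(current.attrs.get(c) != attrs.get(c) for c in tracked)
        if not changed:
            return  # changes in untracked columns are ignored
        current.valid_to = as_of
    history.append(Version(key, dict(attrs), as_of))


# Illustrative data only: station key and values are invented.
history: List[Version] = []
tracked = ["label"]
apply_snapshot(history, "S1", {"label": "Old Mill", "stageScale": "a"}, "2025-01-01", tracked)
apply_snapshot(history, "S1", {"label": "Old Mill", "stageScale": "b"}, "2025-02-01", tracked)
apply_snapshot(history, "S1", {"label": "New Mill", "stageScale": "b"}, "2025-03-01", tracked)
```

In this run the February change to stageScale never becomes its own version, because stageScale is not in the tracked list. Only the March change to label closes the first version and opens a second.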
The Silent Failures
But here is where it gets uncomfortable for anyone considering letting AI build their data infrastructure unsupervised.
The Python script Claude wrote to ingest data pulled back only 1,493 rows when the API actually contained roughly 5,458 stations. No error. No warning. Just missing data. As Moffatt put it: "Silent data gaps are worse than absent features because you can't trust the output."
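One way to turn a gap like this into a loud failure is to page through the source until it is exhausted and then compare the row count with an independently known total. The sketch below is an assumption-laden illustration, not Moffatt's script: fetch_page is a hypothetical callable standing in for the real API request, the page size and tolerance are arbitrary, and the fake source only mimics the station count quoted above.

```python
from typing import Callable, Dict, List

Page = List[Dict[str, object]]


def fetch_all(fetch_page: Callable[[int, int], Page], page_size: int = 500) -> Page:
    """Request pages until a short or empty page signals the end of the data."""
    rows: Page = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page)
        if len(page) < page_size:
            return rows
        offset += page_size


def assert_complete(rows: Page, expected: int, tolerance: float = 0.01) -> None:
    """Raise instead of quietly loading a partial extract."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    shortfall = (expected - len(rows)) / expected
    if shortfall > tolerance:
        raise RuntimeError(
            f"extract has {len(rows)} rows, expected about {expected}"
        )


def fake_fetch_page(offset: int, limit: int) -> Page:
    """Hypothetical stand-in for the API: serves 5,458 fake station records."""
    total = 5458
    return [{"id": i} for i in range(offset, min(offset + limit, total))]


stations = fetch_all(fake_fetch_page)
assert_complete(stations, expected=5458)
```

Calling assert_complete with only the first 1,493 rows raises a RuntimeError, which is the behavior the original script lacked. The expected total has to come from somewhere other than the extract itself, such as a count reported by the source or a figure from someone who knows the data.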
Claude also quietly dropped relevant columns like gridReference, datumOffset, and stageScale from the model without mentioning it. The SCD logic only partially covered the columns that should trigger a new version of a record. And there were code-level issues like duplicate fields and missing pagination handling in the API calls.
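Dropped columns can be surfaced with a similarly small check: collect every field that appears in the source records and subtract the columns the model exposes. This is a sketch under assumptions. The function name dropped_fields, the station_id key, and the sample values are hypothetical, and it compares names exactly, so a model that renames fields (for example to snake_case) would need a mapping first.

```python
from typing import Dict, Iterable, List, Set


def dropped_fields(
    source_records: Iterable[Dict[str, object]], model_columns: Set[str]
) -> List[str]:
    """Return source fields that never made it into the model, sorted by name."""
    seen: Set[str] = set()
    for record in source_records:
        seen.update(record.keys())
    return sorted(seen - model_columns)


# Illustrative records: only the three dropped field names come from the article.
sample = [
    {"station_id": "S1", "gridReference": "AB123456", "datumOffset": 1.5},
    {"station_id": "S2", "stageScale": "example"},
]
model_columns = {"station_id"}

missing = dropped_fields(sample, model_columns)
print(missing)  # ['datumOffset', 'gridReference', 'stageScale']
```

A non-empty result does not have to fail the build. It does force someone to decide, on the record, that each omitted field is meant to be omitted.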
None of these would show up as a failed build. The pipeline would run, the dashboards would populate, and nobody would know the data was incomplete until someone with domain knowledge looked closely.
Prompts Matter More Than Models
Moffatt tested multiple Claude models (Sonnet 4.5 and Opus 4.6) with different prompt strategies and found something practitioners should internalize: "The prompt and the skills matter more than the model." Giving Claude access to dbt-specific reference materials and writing more precise prompts had a bigger impact on output quality than switching to a more powerful model.
His conclusion is probably the healthiest framing of AI coding tools available right now: DE + AI > DE, where DE stands for data engineer. A data engineer using Claude Code is more productive than one without it. But Claude Code without a data engineer is a pipeline full of silent data loss waiting to hit production.
The tool is a productivity multiplier, not a replacement. It is excellent at iteration, boilerplate, and getting a first draft built fast. The human's job shifts from writing every line to reviewing every assumption, which, honestly, is a harder skill than most people realize.