Three out of four core LLM capabilities are production-ready. The fourth - reasoning - is, as one AI developer bluntly put it, "an absolute minefield of hallucinations." That gap explains why most general-purpose AI agents still fall apart in real-world use.
The argument, laid out in a detailed technical post that's gaining traction among developers, breaks down what text-based LLMs can actually do into four pillars: Natural Language Understanding (parsing what you say), Natural Language Generation (writing coherent responses), Tool Calling (triggering external actions like API calls or database queries), and Reasoning (drawing logical conclusions across multiple steps). The first three are mature and reliable. The last one is not.
The Steam Engine Analogy
The author compares the current state of AI to the earliest steam engines - "bulky, stationary, and only good for pumping water out of coal mines." We are not at the locomotive stage. We are not even close. And yet the industry keeps trying to build locomotives.
This framing lands because it matches what most people experience when they try to use AI agents for complex, multi-step work. The agent starts strong, makes a few tool calls, then veers off course when it needs to chain together logical steps without direct prompting. Manus, the buzzy AI agent platform, gets called out specifically as "a textbook example" of hype meeting reality - impressive demos that crumble under real deployment conditions.
The result is what the post describes as a "cottage industry" of AI agents: thousands of custom-built, narrowly scoped bots that each solve one specific problem. "It's like every household having its own loom, weaving its own cloth, and never using anyone else's."
A Practical Workaround, Not a Fix
The prescription is simple and deliberately conservative: build production agents using only NLU, NLG, and tool calling. Skip reasoning entirely. Design your agent workflows so that the logic lives in your code, not in the LLM's head. Let the model do what it's good at - understanding language, generating responses, and calling tools when asked - while your application handles the actual decision-making.
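One way to picture that split is the minimal Python sketch below. It is an illustration under stated assumptions, not the post author's implementation: `fake_nlu` and `fake_nlg` are hypothetical stand-ins for LLM calls, and the order data, the refund limit, and all function names are invented for the example. The model-shaped functions only parse and phrase language, while every decision is made by ordinary code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Intent:
    name: str      # "refund" or "order_status"
    order_id: str  # entity pulled out of the user's message


def fake_nlu(message: str) -> Intent:
    """Hypothetical stand-in for an LLM call that only parses language (NLU)."""
    words = message.lower().replace("?", "").split()
    name = "refund" if "refund" in words else "order_status"
    order_id = next((w for w in words if w.startswith("#")), "")
    return Intent(name=name, order_id=order_id)


def fake_nlg(facts: Dict[str, str]) -> str:
    """Hypothetical stand-in for an LLM call that only phrases a reply (NLG)."""
    return f"Order {facts['order_id']}: {facts['outcome']}."


# Invented example data standing in for a real database or API.
ORDERS = {
    "#1001": {"status": "shipped", "total": 40.0},
    "#1002": {"status": "delivered", "total": 900.0},
}
REFUND_LIMIT = 100.0  # assumed business rule, kept in code rather than in a prompt


def lookup_order(order_id: str) -> Optional[dict]:
    """Tool: a plain function the application calls on the model's behalf."""
    return ORDERS.get(order_id)


def handle_refund(order: dict) -> str:
    # The decision is deterministic: the model never "reasons" about it.
    if order["total"] <= REFUND_LIMIT:
        return "refund approved"
    return "refund escalated to a human agent"


def handle_status(order: dict) -> str:
    return f"status is {order['status']}"


# Routing table: the application, not the model, picks the next step.
HANDLERS: Dict[str, Callable[[dict], str]] = {
    "refund": handle_refund,
    "order_status": handle_status,
}


def run_agent(message: str) -> str:
    intent = fake_nlu(message)             # 1. understand the request
    order = lookup_order(intent.order_id)  # 2. call a tool
    if order is None:
        return "Sorry, I could not find that order."
    outcome = HANDLERS[intent.name](order)  # 3. decide in code
    return fake_nlg({"order_id": intent.order_id, "outcome": outcome})  # 4. respond


if __name__ == "__main__":
    print(run_agent("I want a refund for #1001"))
    print(run_agent("Refund #1002 please"))
    print(run_agent("Where is #1001?"))
```

In a real system the two stand-ins would presumably be replaced by model calls with tightly constrained outputs, but the routing table and the refund rule would stay in application code, where they can be unit-tested like any other logic.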
This is a real trade-off. You lose the magic of an agent that can "think through" problems on its own. You gain reliability. For anyone shipping an AI-powered product to paying customers, that trade-off makes sense right now.
The counterargument is obvious: reasoning is improving fast. OpenAI's o-series models, Anthropic's extended thinking in Claude, and Google's Gemini reasoning modes all represent meaningful progress. But "improving fast" and "production-ready" are different things, and the gap between them is exactly where most agent failures happen.
For teams building AI tools today, this is worth internalizing. The most reliable AI agents are not the ones that reason the hardest - they are the ones that reason the least, delegating logic to deterministic code wherever possible.