Two weeks running a production multi-agent system on local hardware instead of Claude. One RTX 3090, Qwen3.6-27B running through Ollama. Here's what actually happened.
The setup: a multi-agent orchestrator with a lead model, a manager layer, and several sub-agents - the kind of system that breaks complex tasks into parallel workstreams, coordinates outputs, and routes decisions back through a central reasoning model. Claude had been serving as the reasoning layer - the component that decides what each sub-agent should work on and evaluates whether their results are good enough. Replacing it with a local model means all computation happens on local hardware instead of Anthropic's servers: no API costs, no data leaving the machine.
Qwen3.6-27B is a 27-billion-parameter model from Alibaba's Qwen team (parameters are the numerical values that encode the model's reasoning ability - more parameters generally means more capability, but also more memory required). Running it at Q6_K quantization - a compression technique that shrinks memory usage while preserving most accuracy - takes roughly 22GB of VRAM, which barely fits on a 3090's 24GB. The effective context window (how much text the model can read at once) was 32k tokens, or roughly 24,000 words.
Where the Local Model Held Its Own
For well-defined, structured tasks - summarizing documents, extracting data from text, following explicit formatting instructions - Qwen3.6-27B performed comparably to Claude. On shorter tasks, local inference (the actual AI computation) was faster, avoiding the network round-trip to an external API. Running locally also meant parallel sub-agent calls didn't hit rate limits, which matters when an orchestrator fires off multiple requests at the same time.
Where It Broke Down
High-level orchestration is where the gap was clearest. When a sub-agent returned unexpected output or a task needed its plan revised mid-execution, the local model made worse calls than Claude. Reasoning under ambiguity - figuring out what to do when the situation doesn't match the original plan - is exactly where models at this size tend to struggle.
Structured JSON output was also inconsistent. Multi-agent systems rely on formatted data packets to pass results between components, and Claude is notably reliable at sticking to strict output schemas. Qwen3.6-27B drifted from them often enough to require extra validation logic. On a single 3090, generation speed was adequate for sequential tasks but became a bottleneck when the orchestrator needed fast turnaround across multiple simultaneous agents.
The Honest Assessment
Qwen3.6-27B is a real option for simpler multi-agent pipelines where data privacy matters and tasks are well-structured. For complex orchestration that depends on reliable reasoning and consistent output formatting, the gap with Claude is real - not enormous, but meaningful enough to require engineering workarounds.
The more useful takeaway: many AI workflow tasks don't actually require frontier-model reasoning. A local model on consumer hardware can handle a significant portion of the work. Knowing which portion is worth identifying before committing to ongoing API spend.