Running the Same Prompt Through 6 AI Tools Beats Trusting Any Single One

What if the best way to use AI isn't picking the right model, but running all of them at once and comparing notes?

That's the argument from Peter van Onselen, a principal engineer who spent months running what he calls an "architectural review council" - feeding identical prompts into Claude, ChatGPT, Gemini, Codex, OpenCode, and GitHub Copilot, then using a separate synthesis agent to compare their outputs. He applied the technique three times: debugging tangled third-party integration issues dating back to July 2024, onboarding external contractors to an unfamiliar codebase, and reviewing architectural proposals spanning four team repositories.

Despite examining the same code and documentation, each agent noticed different things. That's the whole point - no single tool catches everything, but findings that several agents reach independently build real confidence, and the disagreements flag exactly where you need to look closer.
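The overlap-versus-disagreement idea can be sketched as a simple tally. This is only an illustration, not Van Onselen's implementation: it assumes each agent's findings have already been reduced to short comparable labels (in the article that comparison is done by a synthesis agent), and the function name `triangulate` and the `quorum` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def triangulate(
    findings: Dict[str, Set[str]], quorum: int = 2
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split findings into corroborated ones and ones needing a closer look.

    findings maps an agent name to the set of finding labels it reported.
    A finding is "corroborated" when at least `quorum` agents reported it.
    """
    seen_by: Dict[str, Set[str]] = defaultdict(set)
    for agent, items in findings.items():
        for item in items:
            seen_by[item].add(agent)

    corroborated = {f: sorted(a) for f, a in seen_by.items() if len(a) >= quorum}
    needs_review = {f: sorted(a) for f, a in seen_by.items() if len(a) < quorum}
    return corroborated, needs_review
```

A finding reported by only one agent is not necessarily wrong; it is simply the place to start the manual check.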

The Harness Matters More Than the Model

Van Onselen's most practical finding will surprise people who spend hours debating Claude vs. GPT vs. Gemini. The scaffolding around the model - how you structure prompts, gather context, and format outputs - produced bigger quality differences than which model was running underneath. OpenCode delivered the best structured results regardless of whether it was running Opus 4.6 or GPT-5.4 under the hood.

Each tool developed what he describes as a "personality." Codex acted like a "grumpy pragmatist" - minimal output, laser-focused on the task. Claude was thorough but scattered and conversational. These aren't quirks to work around; they're features when you're triangulating across multiple perspectives.

The Process

The workflow is straightforward: gather context from documentation (Confluence, Slack, Google Drive, repos), structure one comprehensive prompt, run it through each agent independently, save the structured outputs, then feed everything into a synthesis agent that aggregates and compares. The final step is manual validation - reviewing the actual code and consulting other engineers.
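The fan-out and synthesis steps might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the actual harness: each agent is modelled as a plain Python callable that takes a prompt and returns text (how you invoke Claude, Codex, and the others is left out), and the names `run_council` and `build_synthesis_prompt`, the JSON file layout, and the synthesis wording are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, Dict

# Hypothetical agent interface: prompt text in, response text out.
Agent = Callable[[str], str]


def run_council(prompt: str, agents: Dict[str, Agent], out_dir: Path) -> Dict[str, str]:
    """Run one prompt through every agent independently and save each output."""
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs: Dict[str, str] = {}
    for name, agent in agents.items():
        # Each agent sees only the original prompt, never another agent's answer.
        outputs[name] = agent(prompt)
        record = {"agent": name, "prompt": prompt, "output": outputs[name]}
        (out_dir / f"{name}.json").write_text(json.dumps(record, indent=2))
    return outputs


def build_synthesis_prompt(prompt: str, outputs: Dict[str, str]) -> str:
    """Assemble the saved outputs into one prompt for a separate synthesis agent."""
    sections = [f"## {name}\n{text}" for name, text in sorted(outputs.items())]
    return (
        "Compare the following independent reviews of the same prompt.\n"
        "List points of agreement, points of disagreement, "
        "and claims made by only one reviewer.\n\n"
        f"# Original prompt\n{prompt}\n\n# Reviews\n" + "\n\n".join(sections)
    )
```

Keeping the per-agent runs isolated is the design choice that matters here: if one agent could read another's answer, agreement between them would no longer count as independent confirmation. The manual validation step is deliberately not represented in code.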

This isn't a "set it and forget it" automation. Van Onselen is clear that working with AI at this intensity is "cognitively expensive in ways that people underestimate." You're not removing the thinking - you're amplifying it. But he also says it "let me do things I could not have done otherwise," which is a more honest framing than most AI productivity claims.

Who Should Try This

This approach makes the most sense for high-stakes decisions where hallucinations are dangerous - architectural reviews, security audits, complex debugging. Running six agents for a routine code review would be overkill. But when you need confidence that your AI-assisted analysis is actually correct, comparing independent outputs is among the most reliable validation methods currently available. The cost is time and cognitive load; the payoff is catching the blind spots that any single model will inevitably have.