What Happened
A security researcher tested OpenAI's GPT-5.4 with Strix, an open-source autonomous AI agent for web penetration testing, against three Hack The Box machines - intentionally vulnerable systems used for security training. The results were underwhelming.
GPT-5.4 completed one machine quickly but failed on the other two. In both failure cases, the model identified initial attack vectors and generated reports about vulnerabilities but never actually followed through on exploitation. It found the doors but didn't walk through them.
This is a notable regression from GPT-5.3 Codex, which completed all three machines in previous testing. Going from three of three to one of three was unexpected, since 5.4 is the newer model.
The likely explanation comes down to model design. GPT-5.4 is a general frontier model optimized for professional tasks, favoring clean, efficient responses with fewer iterations. GPT-5.3 Codex was tuned specifically for long-horizon agentic and coding tasks, which is exactly what autonomous pen testing requires: multi-step chains where each action depends on the results of the last.
Why It Matters
This is a real-world demonstration of something the AI tools community keeps running into: newer doesn't always mean better for your specific use case. Model selection for agentic workflows can't be based on version numbers or general benchmarks alone.
The failure pattern is instructive. GPT-5.4 didn't crash or produce garbage. It did the analysis correctly. It just stopped short of completing the multi-step exploitation chain. That points to a planning and persistence problem rather than a knowledge gap: the model identified the right vulnerabilities but did not carry out the long sequence of dependent actions needed to exploit them.
This matters for anyone building autonomous workflows - not just in security, but in coding agents, research assistants, or any tool that needs to chain 10+ steps together without human intervention. The model powering your agent needs to be tuned for that kind of sustained, multi-step reasoning. A model optimized for single-turn professional responses may actually perform worse than an older model built for agentic work.
Our Take
This result should make anyone using AI agents pause and think about model selection more carefully. The assumption that the newest model is always the best model does not hold. GPT-5.3 Codex outperformed GPT-5.4 on these agentic tasks, most likely because it was built for agentic tasks. That's not a flaw in 5.4 - it's a feature mismatch.
We see the same pattern across the industry. Claude's Opus models handle deep analysis differently than Sonnet. Cursor works differently depending on which model backs it. The model-agent fit matters as much as raw model capability.
The practical lesson: if you're running autonomous AI workflows, benchmark your specific use case against multiple models before committing. Don't just grab the latest release and assume it'll be better. Test with your actual tasks, measure completion rates on multi-step chains, and pick the model that finishes the job - not the one with the highest version number.
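One way to measure completion rates on multi-step chains is sketched below. This is a minimal illustration, not the researcher's harness: the `RunResult` record, the model labels, the task names, and the step counts are all hypothetical placeholders, and in practice the records would come from your own agent runs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One agent run. All field values used below are hypothetical."""
    model: str            # label for the model backing the agent
    task: str             # name of the multi-step task
    steps_completed: int  # steps the agent actually finished
    steps_total: int      # steps required to finish the task


def completion_rates(results):
    """Return {model: fraction of attempted tasks finished end to end}."""
    finished = defaultdict(int)
    attempted = defaultdict(int)
    for r in results:
        attempted[r.model] += 1
        # A task only counts if every step in the chain was completed;
        # partial progress (analysis without follow-through) scores zero.
        if r.steps_completed >= r.steps_total:
            finished[r.model] += 1
    return {m: finished[m] / attempted[m] for m in attempted}


def best_model(results):
    """Pick the model with the highest end-to-end completion rate."""
    rates = completion_rates(results)
    return max(rates, key=rates.get)


# Placeholder data for illustration only, not measured results.
SAMPLE_RUNS = [
    RunResult("model-a", "task-1", 12, 12),
    RunResult("model-a", "task-2", 9, 9),
    RunResult("model-a", "task-3", 15, 15),
    RunResult("model-b", "task-1", 12, 12),
    RunResult("model-b", "task-2", 3, 9),
    RunResult("model-b", "task-3", 4, 15),
]

if __name__ == "__main__":
    for model, rate in sorted(completion_rates(SAMPLE_RUNS).items()):
        print(f"{model}: {rate:.0%} of tasks completed")
    print("pick:", best_model(SAMPLE_RUNS))
```

Scoring only fully completed chains is a deliberate choice here: it penalizes exactly the failure described above, where an agent produces a correct analysis but never finishes the job. With only a handful of tasks per model the rates are noisy, so treat them as a rough signal and rerun on more tasks before committing.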