+612.7% improvement on a data extraction task. +217% on a medical benchmark. +54.7% on software engineering problems. Those are the numbers TensorZero is publishing for Autopilot, its automated system for optimizing LLM applications without manual prompt engineering.
How It Works
Autopilot sits on top of TensorZero's open-source LLMOps platform (11,100 GitHub stars) and runs a four-step loop:
- Analyze historical performance data from your existing LLM calls
- Build evaluations: automated judges that can score output quality
- Experiment with different prompt variations and model choices
- Deploy the best variants through A/B tests in production
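The four steps above can be sketched as a toy loop. This is only an illustration under stated assumptions, not TensorZero's implementation or API: `Variant`, `fake_llm`, `judge`, and `run_experiment` are hypothetical names, the history is made-up data, and the "model" is an offline stand-in so the example runs without network access.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Variant:
    """A candidate configuration: a prompt paired with a model name (hypothetical)."""
    name: str
    prompt: str
    model: str


# Step 1 (analyze): historical calls with known-good answers (made-up data).
HISTORY = [
    {"input": "Invoice total: $40", "expected": "40"},
    {"input": "Amount due is $15 by Friday", "expected": "15"},
]


def fake_llm(variant: Variant, text: str) -> str:
    """Offline stand-in for a model call; not a real LLM client."""
    after_dollar = text.split("$", 1)[-1]
    if "digits only" in variant.prompt:
        return "".join(ch for ch in after_dollar if ch.isdigit())
    return after_dollar


def judge(output: str, expected: str) -> float:
    """Step 2 (evaluate): exact-match scoring; real judges are usually richer."""
    return 1.0 if output.strip() == expected else 0.0


def run_experiment(variants, history):
    """Step 3 (experiment): mean judge score per variant over the history."""
    return {
        v.name: mean(judge(fake_llm(v, row["input"]), row["expected"]) for row in history)
        for v in variants
    }


VARIANTS = [
    Variant("baseline", "Extract the amount.", "model-a"),
    Variant("digits_only", "Extract the amount, digits only.", "model-a"),
]

scores = run_experiment(VARIANTS, HISTORY)
# Step 4 (deploy): the top scorer becomes the candidate for a production A/B test.
winner = max(VARIANTS, key=lambda v: scores[v.name])
print(scores, winner.name)
```

In this toy setup the stricter prompt wins because the baseline lets trailing text leak into one answer. A real system would confirm the offline result with live traffic before rolling it out fully.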
The system doesn't fine-tune models (retraining a model on your own data, which is expensive and slow). Instead, it optimizes in two ways: targeted prompt changes that address specific failure modes, and swapping between models such as GPT-5 mini, Claude Haiku 4.5, and Gemini 3 Flash depending on which performs best for each task.
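Per-task model selection of this kind can be pictured as a lookup over evaluation scores. The sketch below is an assumption about the general idea, not Autopilot's actual logic: the scores are invented for illustration and say nothing about how these models really compare, the task names are hypothetical, and the model labels are plain strings rather than official API identifiers.

```python
# Invented evaluation scores per task and model, for illustration only.
EVAL_SCORES = {
    "data_extraction": {"GPT-5 mini": 0.71, "Claude Haiku 4.5": 0.78, "Gemini 3 Flash": 0.74},
    "medical_qa": {"GPT-5 mini": 0.66, "Claude Haiku 4.5": 0.63, "Gemini 3 Flash": 0.69},
    "code_repair": {"GPT-5 mini": 0.58, "Claude Haiku 4.5": 0.55, "Gemini 3 Flash": 0.52},
}


def best_model_per_task(eval_scores: dict[str, dict[str, float]]) -> dict[str, str]:
    """Return, for each task, the model label with the highest score."""
    return {
        task: max(by_model, key=by_model.get)
        for task, by_model in eval_scores.items()
    }


ROUTING = best_model_per_task(EVAL_SCORES)
for task, model in ROUTING.items():
    print(f"{task}: {model}")
```

The point of the table is that no single model has to win everywhere; the routing is decided per task from measured results.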
For example, on medical benchmarks, Autopilot discovered that encoding domain-specific knowledge like FHIR patterns (a healthcare data standard) directly into prompts dramatically improved accuracy. On data extraction tasks, it found prompt structures that reduced hallucinated entities.
The Pitch and the Caveats
The benchmark numbers are large, but context matters. The improvements are measured against baseline configurations: a generic prompt with a default model. Because the gains are reported as percentages over that baseline, a weak starting point inflates them, and any team that has already spent weeks hand-tuning its prompts would likely see smaller gains. The real value proposition is for teams running LLM features in production who don't have dedicated prompt engineers iterating on every call.
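To see why the baseline matters so much, here is a small worked example. It assumes the published percentages are relative gains over the baseline score; the absolute scores below are hypothetical, since the figures quoted here do not include them.

```python
def relative_improvement(baseline: float, optimized: float) -> float:
    """Percentage gain of `optimized` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (optimized - baseline) / baseline * 100.0


# Hypothetical scores: the same final score measured against two baselines.
weak_baseline = relative_improvement(0.10, 0.7127)   # generic prompt, default model
tuned_baseline = relative_improvement(0.60, 0.7127)  # prompt already hand-tuned

print(round(weak_baseline, 1))   # 612.7
print(round(tuned_baseline, 1))  # 18.8
```

The same final score reads as roughly +613% against a weak starting point and under +19% against a tuned one, which is why the headline figures should be read alongside the baseline they were measured from.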
TensorZero is positioning this as continuous optimization: Autopilot watches your production data, identifies where the model struggles, and ships improvements automatically. That's appealing if you're running dozens of LLM-powered features and can't manually tune each one.
A self-serve product is coming, and a waitlist is currently open. No pricing has been announced. Given that the underlying platform is open source, the Autopilot automation layer is likely where TensorZero plans to build its business.