Most AI agent bugs don't look like bad output. They look like this: you tweak a prompt, the agent's final answer seems fine, but it silently stopped reading config files before deploying, or swapped npm test for npm run build. The output passes a glance check. The behavior is broken.
TracePact is a new open-source testing framework built specifically for this problem. It records a known-good agent run as a "cassette" (a snapshot of every tool call, in order, with arguments), then diffs future runs against that baseline.
The diff output is concrete:
read_file (seq 0) (removed) - the agent stopped reading a file it used to read
bash.cmd: "npm test" -> "npm run build" - it changed which command it runs
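To make the idea concrete, here is a minimal sketch of how a baseline-versus-current diff over tool calls could work. It is not TracePact's implementation: the `ToolCall` and `Drift` shapes, the `diffToolCalls` and `formatDrift` names, and the greedy alignment strategy are all assumptions made for illustration, and the real cassette format may differ.

```typescript
// Hypothetical shapes for illustration; not TracePact's actual cassette format.
interface ToolCall {
  tool: string;
  args: Record<string, string>;
}

type Drift =
  | { kind: "removed" | "added"; seq: number; tool: string }
  | {
      kind: "arg_changed";
      seq: number;
      tool: string;
      arg: string;
      before: string | undefined;
      after: string | undefined;
    };

// Greedy alignment: walk the baseline in order and match each call to the
// next call of the same tool in the current run. This is a simple heuristic,
// not an optimal sequence diff.
function diffToolCalls(baseline: ToolCall[], current: ToolCall[]): Drift[] {
  const drifts: Drift[] = [];
  let cursor = 0; // index of the first unmatched call in the current run

  baseline.forEach((expected, seq) => {
    const idx = current.findIndex((c, k) => k >= cursor && c.tool === expected.tool);
    if (idx === -1) {
      drifts.push({ kind: "removed", seq, tool: expected.tool });
      return;
    }
    // Calls skipped over while searching did not exist in the baseline.
    current.slice(cursor, idx).forEach((c, offset) => {
      drifts.push({ kind: "added", seq: cursor + offset, tool: c.tool });
    });
    // Compare arguments of the matched pair, key by key.
    const actual = current[idx];
    const keys = Array.from(
      new Set(Object.keys(expected.args).concat(Object.keys(actual.args)))
    );
    for (const arg of keys) {
      const before: string | undefined = expected.args[arg];
      const after: string | undefined = actual.args[arg];
      if (before !== after) {
        drifts.push({ kind: "arg_changed", seq, tool: expected.tool, arg, before, after });
      }
    }
    cursor = idx + 1;
  });

  // Anything left in the current run is new behavior.
  current.slice(cursor).forEach((c, offset) => {
    drifts.push({ kind: "added", seq: cursor + offset, tool: c.tool });
  });
  return drifts;
}

function formatDrift(d: Drift): string {
  if (d.kind === "arg_changed") {
    const show = (v: string | undefined) => (v === undefined ? "(unset)" : JSON.stringify(v));
    return `${d.tool}.${d.arg}: ${show(d.before)} -> ${show(d.after)}`;
  }
  return `${d.tool} (seq ${d.seq}) (${d.kind})`;
}

const baselineRun: ToolCall[] = [
  { tool: "read_file", args: { path: "config.json" } },
  { tool: "bash", args: { cmd: "npm test" } },
];
const currentRun: ToolCall[] = [{ tool: "bash", args: { cmd: "npm run build" } }];

for (const line of diffToolCalls(baselineRun, currentRun).map(formatDrift)) {
  console.log(line);
}
// read_file (seq 0) (removed)
// bash.cmd: "npm test" -> "npm run build"
```

The example data reproduces the two drifts shown in the diff above. A production differ would likely use a proper sequence alignment so that reordered calls are not reported as an add plus a remove.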
The tool supports assertions that feel familiar if you've written unit tests: toHaveCalledToolsInOrder(['read_file', 'write_file']), toHaveToolCallCount('read_file', 2), toNotHaveCalledTool('bash'). It also supports MCP (Model Context Protocol) tracing, so you can verify which MCP servers and tools your agent invokes.
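For readers who want to see what such checks amount to, the following is a standalone sketch of the logic behind order, count, and absence assertions over a recorded trace. The helper names (`calledInOrder`, `callCount`, `neverCalled`) and the `TracedCall` shape are hypothetical and deliberately differ from TracePact's matchers. One assumption worth flagging: the order check here accepts other calls interleaved between the expected ones, and the library's actual semantics may be stricter.

```typescript
// Hypothetical trace shape; TracePact's real types may differ.
interface TracedCall {
  tool: string;
  args: Record<string, unknown>;
}

// True if `expected` appears as an ordered subsequence of the trace.
// Other calls may be interleaved (an assumption of this sketch).
function calledInOrder(trace: TracedCall[], expected: string[]): boolean {
  let next = 0; // index of the next expected tool name to find
  for (const call of trace) {
    if (next < expected.length && call.tool === expected[next]) {
      next++;
    }
  }
  return next === expected.length;
}

// Number of times a given tool was called.
function callCount(trace: TracedCall[], tool: string): number {
  return trace.filter((call) => call.tool === tool).length;
}

// True if the tool never appears in the trace.
function neverCalled(trace: TracedCall[], tool: string): boolean {
  return callCount(trace, tool) === 0;
}

const recordedTrace: TracedCall[] = [
  { tool: "read_file", args: { path: "config.json" } },
  { tool: "read_file", args: { path: "package.json" } },
  { tool: "write_file", args: { path: "deploy.yaml" } },
];

console.log(calledInOrder(recordedTrace, ["read_file", "write_file"])); // true
console.log(callCount(recordedTrace, "read_file")); // 2
console.log(neverCalled(recordedTrace, "bash")); // true
```

In a test runner these predicates would typically sit behind custom matchers so that failures report which call was missing or unexpected, rather than a bare boolean.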
The CI integration is the practical selling point. Record a baseline once with npx tracepact run --live --record, then replay it in CI with npx tracepact run --replay ./cassettes; the replay makes no API calls. Set --fail-on warn to break your build when behavioral drift is detected.
TracePact is built in TypeScript with packages for Vitest integration, a CLI, and a Promptfoo adapter. It's MIT-licensed and available on npm as @tracepact/core.
This fills a real gap. Existing LLM evaluation tools focus on output quality: does the response sound right? TracePact focuses on behavioral correctness: did the agent do the right things in the right order? For coding agents, ops automation, and workflow tools, where the sequence of actions matters as much as the final output, an output-only check can pass while the behavior has regressed.