
Skyvern's Open-Source MCP Server Lets Claude QA Its Own Code Changes

A PR that passes type checks but ships a broken button is the kind of bug that makes you question automation entirely. Skyvern, the browser automation company, built a system that lets Claude Code catch those bugs itself before a human ever has to.

The setup is an MCP server (a standardized way for AI models to use external tools) packed with 33 browser automation tools covering navigation, form filling, data extraction, and credential management. When a developer finishes a code change, two Claude Code skills kick in: /qa for local development and /smoke-test for CI pipelines.
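To make "a standardized way to use external tools" concrete, here is a minimal sketch of what a single tool invocation looks like on the wire. MCP messages are JSON-RPC 2.0, and a tool is invoked with the `tools/call` method. The tool name `example_navigate` and its `url` argument are hypothetical placeholders for illustration, not names taken from Skyvern's server.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(
    1, "example_navigate", {"url": "http://localhost:3000"}
)
print(json.dumps(request, indent=2))
```

In practice the MCP client inside Claude Code builds and sends these messages itself; the sketch only shows the shape of the request a browser tool would receive.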

How the QA Loop Works

The process runs through five phases. Claude analyzes the git diff to understand what changed, classifies the modifications as frontend, backend, or mixed, determines a validation strategy, then executes full QA by actually driving a browser against the local dev server. It clicks buttons, fills forms, checks layouts at different breakpoints, and files a report with screenshots that can be attached directly to a PR.
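The classification step can be pictured with a short sketch. This is not Skyvern's implementation (their skill is driven by a prompt, not by code like this); it is an assumed heuristic that labels a change set by file extension. The suffix lists and the function names `changed_files` and `classify_change` are illustrative choices.

```python
import subprocess
from pathlib import PurePosixPath

# Assumed extension lists; a real project would tune these.
FRONTEND_SUFFIXES = {".tsx", ".jsx", ".ts", ".js", ".css", ".scss", ".html", ".vue"}
BACKEND_SUFFIXES = {".py", ".go", ".rs", ".java", ".sql"}


def changed_files(base="main"):
    """Return paths that differ from `base`, using `git diff --name-only`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def classify_change(paths):
    """Label a change set as 'frontend', 'backend', 'mixed', or 'unknown'."""
    kinds = set()
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        if suffix in FRONTEND_SUFFIXES:
            kinds.add("frontend")
        elif suffix in BACKEND_SUFFIXES:
            kinds.add("backend")
    if len(kinds) == 2:
        return "mixed"
    if len(kinds) == 1:
        return kinds.pop()
    return "unknown"  # e.g. docs-only changes


print(classify_change(["src/App.tsx", "api/views.py"]))
```

A label like this would then feed the strategy decision: for example, a frontend or mixed change warrants a browser pass, while an unknown change might need none.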

This is not "ask an LLM if the code looks right." The model drives a real browser against the running application, catching things like unresponsive buttons, form overflow bugs, and layout inconsistencies that static analysis would never flag.

The Numbers

Skyvern reports that its one-shot PR success rate (PRs that pass review without being sent back for fixes) jumped from roughly 30% to 70%, and that QA loop time was cut in half. Those are meaningful improvements for a team shipping code daily.

The entire implementation is open source, with the prompt clocking in at about 700 lines. You can install it with pip install skyvern and run skyvern setup claude-code to wire it into your workflow.

Practical Limits

This works best for UI-heavy applications where visual and interactive bugs are the main failure mode. Backend-only services won't benefit much. And like any AI-driven testing, it catches the obvious regressions but probably misses the subtle business logic errors that require domain knowledge. Still, cutting the feedback loop between "code complete" and "actually works" is exactly where AI coding tools need to go next.