Alibaba's Qwen 3.6-Plus Beats Claude 4.5 Opus on Terminal Coding Benchmarks

Alibaba's Qwen team quietly dropped Qwen 3.6-Plus as a preview on OpenRouter on March 31, then made it official on April 2. The headline number: it scores 61.6 on Terminal-Bench 2.0 (a test of how well a model can use a command-line terminal to solve real coding problems), beating Claude 4.5 Opus at 59.3. That is the first time a Qwen model has topped Anthropic's flagship on an agentic coding benchmark.

Before anyone crowns a new champion, the full picture is more nuanced.

The Benchmark Scorecard

On SWE-bench Verified (the standard test for fixing real GitHub issues), Qwen 3.6-Plus scores 78.8 versus Claude 4.5 Opus at 80.9. Close but still trailing. On multilingual coding, Gemini 3 Pro leads both at 77.5 versus Qwen's 73.8.

Where Qwen 3.6-Plus clearly leads is document parsing: 91.2 on OmniDocBench v1.5, topping all competitors. It also takes first on RealWorldQA (a visual understanding test) at 85.4 and leads the QwenWebBench Elo rating at 1502.

So the picture is: best-in-class for document understanding, ahead of Claude 4.5 Opus on terminal-style coding, competitive but not quite leading on general software engineering tasks, and behind on multilingual code.
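To make the margins concrete, here is a small sketch that tabulates the coding scores quoted above and prints the gap for each benchmark. The numbers are the ones in this article; the variable and function names are purely illustrative.

```python
# (benchmark, Qwen 3.6-Plus score, competitor, competitor score), as quoted in the article.
SCORES = [
    ("Terminal-Bench 2.0", 61.6, "Claude 4.5 Opus", 59.3),
    ("SWE-bench Verified", 78.8, "Claude 4.5 Opus", 80.9),
    ("Multilingual coding", 73.8, "Gemini 3 Pro", 77.5),
]


def margins(scores):
    """Return Qwen's lead (positive) or deficit (negative) per benchmark, in points."""
    return {name: round(qwen - other, 1) for name, qwen, _, other in scores}


result = margins(SCORES)
for name, qwen, rival, other in SCORES:
    print(f"{name}: Qwen {qwen} vs {rival} {other} ({result[name]:+.1f})")
```

Every gap is under four points in either direction, which is worth keeping in mind before reading too much into any single result.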

Specs and Pricing

The model has a 1 million token context window (enough to process roughly 2,500 pages of text in a single prompt) and can output up to 65,536 tokens. It is natively multimodal, meaning it processes images, documents, screenshots, and video alongside text without needing a separate vision model.

The architecture uses a hybrid of linear attention and sparse mixture-of-experts routing (a technique where the model activates only a fraction of its total parameters for each query, keeping costs lower). Alibaba has not disclosed the full parameter count.
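For readers unfamiliar with mixture-of-experts routing, the sketch below shows the general idea of top-k gating: score every expert, run only the k best, and blend their outputs. This is a generic toy illustration, not Alibaba's implementation. The expert functions, gate scores, and the choice of k=2 are all assumptions made up for the example, and in a real model the gate scores would come from a learned router rather than a fixed list.

```python
import math


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts and return (weighted output, expert indices used)."""
    routes = top_k_route(gate_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, [i for i, _ in routes]


# Four hypothetical "experts", each just scaling its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

y, used = moe_forward(3.0, experts, gate_logits, k=2)
print(f"experts used: {used}, output: {y}")
```

Only two of the four experts do any work for this input, which is where the cost saving comes from: compute per token scales with the active parameters, not the total.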

Pricing on Alibaba's Bailian platform is roughly $0.29 per million input tokens and $1.65 per million output tokens. It is currently free as a preview on OpenRouter. For comparison, Claude 4.5 Opus runs $15 per million input tokens through Anthropic's API, which would make Qwen roughly 50x cheaper on input if the Bailian prices hold at general availability.
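The "roughly 50x" figure can be checked with a few lines of arithmetic. The sketch below uses only the per-million-token prices quoted above. The workload size (200,000 input tokens and 20,000 output tokens) is a hypothetical example, not a measured figure.

```python
# Prices in USD per million tokens, as quoted in the article.
QWEN_INPUT_PER_M = 0.29
QWEN_OUTPUT_PER_M = 1.65
CLAUDE_INPUT_PER_M = 15.00


def cost_usd(input_tokens, output_tokens, input_per_m, output_per_m):
    """Total cost of one job given token counts and per-million-token prices."""
    return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m


# How many times cheaper Qwen is on input tokens alone.
input_ratio = CLAUDE_INPUT_PER_M / QWEN_INPUT_PER_M

# Hypothetical agentic coding session: 200k tokens in, 20k tokens out.
qwen_cost = cost_usd(200_000, 20_000, QWEN_INPUT_PER_M, QWEN_OUTPUT_PER_M)

print(f"input price ratio: {input_ratio:.1f}x")
print(f"Qwen cost for the sample session: ${qwen_cost:.3f}")
```

The ratio works out to about 51.7x on input, and the sample session costs about nine cents at the Bailian rates.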

Community Voting on Open-Weight Sizes

The Qwen team is running a poll asking which model sizes the community wants released as open-weight versions. Options range from a massive 235 billion parameter (22 billion active) variant down to a tiny 0.6 billion parameter model. Previous Qwen generations (3.0 and 3.5) shipped open-weight models under Apache 2.0 at multiple sizes, and 3.6 appears to be heading in the same direction.

That matters because Qwen 3.6-Plus is currently closed-source and API-only. Open-weight releases at the 8B or 14B size would let developers run competitive coding models on their own hardware. Given the benchmark results, a local Qwen 3.6 model that runs on a single GPU could become a serious option for developers who want agentic coding without API costs.

Qwen 3.6-Plus is compatible with third-party coding tools including Claude Code and Cline, so you can swap it into existing workflows. At these prices and performance levels, it is worth testing against whatever model you are currently using for code generation.