Tools Notable

Cloudflare Workers AI Adds Large Model Support, Launches with Kimi K2.5

March 20, 2026 2 min read

For two years, Cloudflare's Workers AI focused on smaller models. That changes now. The company announced support for large, frontier-scale open-source models on its inference platform, with Moonshot AI's Kimi K2.5 as the first available.

Kimi K2.5 is an open-source model built for agent workflows. It supports a 256k token context window (enough to process roughly a 600-page book in a single request), multi-turn tool calling, vision inputs, and structured output. Cloudflare built custom inference kernels for it on their proprietary Infire engine, using techniques like tensor parallelization and disaggregated prefill, where different processing stages run on separate machines to speed up throughput.

The Cost Argument

Cloudflare is leading with price. In an internal case study, they ran a security review agent that processed 7 billion tokens daily. Using proprietary models, that workload would have cost around $2.4 million per year. Running Kimi K2.5 on Workers AI cut that by 77%. The agent also caught more than 15 confirmed security issues in a single codebase, so this is not a case of saving money by downgrading quality.

Cached tokens get a discount over standard input tokens, and a new x-session-affinity header lets developers route requests to the same instance, which improves cache hit rates and reduces time-to-first-token.

Async API for Agent Workloads

The bigger infrastructure change is a redesigned async API. Cloudflare switched from a push-based to a pull-based queuing system that processes requests as soon as capacity opens up, rather than queuing them behind other traffic. Internal testing shows most async requests complete within 5 minutes. This matters for non-real-time agent tasks like code scanning or batch processing, where you do not need a response in milliseconds but you do need it to actually complete without hitting capacity errors.

The trade-off with serverless inference is real, though. You are sharing GPU capacity with everyone else on the platform. For latency-sensitive, high-volume production workloads, dedicated inference will still win. Cloudflare is positioning this for developers who want to build and scale agents without managing GPU infrastructure, not for teams already running optimized inference clusters.

Kimi K2.5 is available now through Workers AI, the Agents SDK starter, and Cloudflare's model playground. Cloudflare has not said which large models come next, but opening the platform to frontier-scale models signals they are serious about competing with dedicated inference providers like Together AI and Fireworks.

The Cost Argument

Async API for Agent Workloads

More from today

Anthropic Launches Claude Dispatch: Assign Tasks From Your Phone, Get Results on Desktop

Your CLAUDE.md File Is Too Long and Claude Is Ignoring It

WordPress.com Ships AI Agents That Can Write and Publish Posts Autonomously

Cookie Preferences