Every tool call an AI agent makes carries hidden overhead: a fresh HTTP connection opens, authentication headers get sent, and the client waits on the server before any actual work begins. For simple chatbots, that overhead is invisible. For agentic workflows that make dozens of sequential API calls - checking a file, running a command, reading the output, iterating - it compounds into a noticeable share of total run time.
OpenAI published a technical breakdown of how they tackled this in the Codex agent loop, using two techniques: WebSockets and connection-scoped caching.
What WebSockets Do Here
A WebSocket is a persistent, two-way connection between a client and a server. Unlike the standard HTTP request-response pattern - in which a client may open a connection, exchange data, and close it again for each call - a WebSocket stays open. The client and server can exchange messages without re-establishing the connection each time.
For an agent making many sequential calls, the math is simple: fewer connection setups means less cumulative latency. Codex, OpenAI's coding agent, operates by reading files, running tests, checking results, and repeating. A persistent connection means each step in that loop doesn't pay the setup cost of a fresh HTTP request.
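The arithmetic behind that claim can be sketched in a few lines of Python. This is a back-of-the-envelope model, not a measurement: the function name `total_latency_ms` and the figures for call count, setup cost, and per-request time are all hypothetical values chosen for illustration, and none of them come from OpenAI's post.

```python
def total_latency_ms(calls: int, setup_ms: float, request_ms: float,
                     persistent: bool) -> float:
    """Cumulative latency for `calls` sequential tool calls.

    With a persistent connection the setup cost is paid once;
    otherwise it is paid on every call.
    """
    setups = min(calls, 1) if persistent else calls
    return setups * setup_ms + calls * request_ms


# Hypothetical numbers, for illustration only.
CALLS = 40          # sequential steps in one agent run
SETUP_MS = 100.0    # assumed connection setup cost per fresh connection
REQUEST_MS = 300.0  # assumed time spent on the request itself

fresh_each_time = total_latency_ms(CALLS, SETUP_MS, REQUEST_MS, persistent=False)
one_connection = total_latency_ms(CALLS, SETUP_MS, REQUEST_MS, persistent=True)

print(f"fresh connection per call: {fresh_each_time:.0f} ms")
print(f"single persistent connection: {one_connection:.0f} ms")
print(f"saved: {fresh_each_time - one_connection:.0f} ms")
```

Under these assumed numbers the saving grows linearly with the number of calls, which is why a single-turn request barely notices it while a long agent loop does.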
What Connection-Scoped Caching Changes
When you call a language model, you typically send context along with the request: system instructions, prior conversation history, tool definitions. With standard connections, each call re-sends that context and the model processes it from scratch. Connection-scoped caching means the API retains that context for the duration of a WebSocket session.
Subsequent calls within the same connection can skip re-processing the parts that haven't changed. For Codex, which operates within a consistent environment across many steps, this reduces both cost and latency on every call after the first.
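To make the idea concrete, here is a toy simulation of a cache whose lifetime is tied to one connection. It is a sketch under stated assumptions, not OpenAI's implementation: the class `ToySession`, its `call` method, and the notion of counting "context items processed" are all hypothetical stand-ins, and no real API is being called.

```python
from dataclasses import dataclass, field


@dataclass
class ToySession:
    """Hypothetical stand-in for one open connection.

    The cache lives on the session object, so it disappears when
    the session (the connection) goes away.
    """
    cached: list = field(default_factory=list)

    def call(self, context: list) -> int:
        """Return how many context items must be processed for this call."""
        shared = 0
        for old, new in zip(self.cached, context):
            if old != new:
                break  # everything after the first change is reprocessed
            shared += 1
        self.cached = list(context)
        return len(context) - shared


# Simulate an agent loop whose context grows by one item per step.
context = ["system instructions", "tool definitions"]
session = ToySession()
session_costs = []
stateless_costs = []
for step in range(3):
    context.append(f"result of step {step}")
    session_costs.append(session.call(context))
    stateless_costs.append(len(context))  # no cache: reprocess everything

print("with connection-scoped cache:", session_costs)
print("without cache:", stateless_costs)
```

In this toy model the first call pays for the full context and each later call pays only for what was appended, while the uncached version pays for the whole, growing context every time. A real system will differ in how it detects unchanged context and in what it actually saves.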
Multi-Step Agents Gain the Most
If you're building on the Responses API with single-turn interactions, this engineering doesn't change much for you. But if your use case involves agent loops - anything where the model takes sequential actions using tools - connection setup and repeated context processing are paid on every step, so the infrastructure choices here matter.
The broader lesson from OpenAI's post is that perceived latency in agentic applications often has less to do with model speed and more to do with connection management. Developers building tools like Claude Code face the same fundamental tradeoff, and this kind of connection architecture will matter more as agents handle longer, more complex tasks.
The write-up is worth reading as a reference implementation for anyone building multi-step pipelines on the Responses API, even outside of Codex specifically.