What comes between GPT-5 and GPT-6? According to leaked information posted by developers tracking OpenAI's API infrastructure, the answer may be GPT-5.5, a model internally codenamed "Spud."
The leaked details describe GPT-5.5 as an omnimodal model - meaning it handles text, images, audio, and video as both input and output within a single unified system, not as separate specialized modes bolted together. GPT-4o currently accepts multiple input types but doesn't generate video natively and doesn't produce audio with the same consistency as its text output. The leak suggests GPT-5.5 would close those gaps. Beyond modality expansion, the leak points to improved reasoning performance and reduced response latency compared to GPT-5. No pricing or release timeline was included.
What Unverified Leaks Are Worth
OpenAI has not confirmed GPT-5.5's existence, and "Spud" has not appeared in any official communication. Infrastructure artifacts - API endpoint patterns, internal identifiers found in network traffic - don't map cleanly to shipped products, and OpenAI has adjusted its internal model roadmap significantly in the past.
That said, the general direction described in the leak aligns with where the major AI labs are heading. Google's Gemini 2.0 Flash natively generates images and audio. Meta's roadmap includes unified generation across modalities. A more unified multimodal output model from OpenAI isn't a stretch - it's the obvious next step.
For API developers and ChatGPT users, there's nothing actionable here. If and when GPT-5.5 ships, the relevant factors will be pricing, context window size (how much text and media you can feed in at once), and measurable performance differences - not the leaked codename.