The best text-to-speech models have been locked behind API paywalls from Google and OpenAI. Fish Audio is trying to change that with S2, a new open-source TTS model that the company says outperforms both on independent speech quality benchmarks.
S2's standout feature is emotion control through natural language tags. Instead of picking from a dropdown of preset moods, you write directives like [whispers sweetly] or [laughing nervously] directly in your text, and the model adjusts its delivery accordingly. That level of fine-grained control has typically required expensive studio voice direction or proprietary tools with limited customization.
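To make the tag format concrete, here is a minimal Python sketch that splits a tagged script into (directive, text) pairs, which can be handy for proofreading a script before sending it to any model. It does not call S2 and says nothing about how S2 interprets tags internally. The function name `split_directives` and the rule that a directive applies to the text up to the next tag are assumptions made for illustration.

```python
import re

# Matches a bracketed directive such as [whispers sweetly].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")

def split_directives(script):
    """Split a tagged script into (directive, text) pairs.

    Assumption for illustration only: a directive covers the text that
    follows it, up to the next directive. Text before the first tag gets
    a directive of None. A directive with no text after it is dropped.
    """
    segments = []
    current = None
    pos = 0
    for match in TAG_PATTERN.finditer(script):
        text = script[pos:match.start()].strip()
        if text:
            segments.append((current, text))
        current = match.group(1).strip()
        pos = match.end()
    tail = script[pos:].strip()
    if tail:
        segments.append((current, tail))
    return segments

script = "[whispers sweetly] I saved you a seat. [laughing nervously] Is that okay?"
for directive, text in split_directives(script):
    print(f"{directive!r}: {text}")
```

A check like this catches unbalanced or misplaced brackets early, before any audio is generated.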
The Specs
- Latency: 100 ms time-to-first-audio, fast enough for real-time applications like voice assistants or live narration
- Languages: 80+ supported
- Multi-speaker: Generates dialogue between multiple voices in a single pass, with no need to stitch separate audio clips together
- Benchmarks: Fish Audio claims top scores on the Audio Turing Test (a test measuring whether listeners can distinguish AI speech from human speech) and EmergentTTS-Eval, beating closed-source competitors
Who Should Care
For content creators producing podcasts, audiobooks, or video narration, the emotion tagging alone is a big deal. Current open-source TTS options like Coqui or Bark give you decent quality but limited expressive control. S2 is positioning itself as the model that closes the gap between "sounds human" and "sounds like a human having a specific emotion."
The multi-speaker dialogue feature is particularly useful for anyone generating conversational content. Producing a two-person podcast intro or a product demo with distinct voices currently means running multiple TTS passes and editing them together. Doing it in one shot saves real production time.
Benchmark claims always deserve some skepticism until independent testers confirm them, and "beating OpenAI" on a specific evaluation does not mean it sounds better in every scenario. Production audio quality depends heavily on the specific voice, language, and use case. Still, the model weights are available on Hugging Face, so anyone can run their own comparisons.
With 100 ms time-to-first-audio and open weights, S2 is immediately useful for developers who want to build voice features into apps without paying per-character API fees. For non-technical users, the real question is how quickly tools and interfaces get built on top of it.
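Developers who want to check the latency figure on their own hardware can time how long a streaming call takes to yield its first audio chunk. The Python sketch below shows one way to do that under stated assumptions: `fake_tts_stream` is a hypothetical stand-in that sleeps and yields dummy bytes, not S2's actual interface, and would be swapped for whatever streaming generator a real integration exposes.

```python
import time

def time_to_first_chunk(chunks):
    """Return (seconds until the first chunk arrives, the first chunk).

    Raises StopIteration if the stream yields nothing.
    """
    start = time.perf_counter()
    first = next(iter(chunks))
    return time.perf_counter() - start, first

def fake_tts_stream(delay_s=0.05, n_chunks=3):
    """Hypothetical stand-in for a streaming TTS call; yields dummy audio bytes."""
    for _ in range(n_chunks):
        time.sleep(delay_s)  # simulates model and network delay per chunk
        yield b"\x00" * 320

latency_s, chunk = time_to_first_chunk(fake_tts_stream())
print(f"time to first audio: {latency_s * 1000:.0f} ms, {len(chunk)} bytes")
```

Measured this way, the number includes everything between the request and the first playable audio, so results will vary with hardware, voice, and input length.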