Sumi: Open-Source Voice-to-Text Tool Built for AI Coding Workflows

Running three or four AI coding agents at the same time creates a surprising bottleneck: typing fast enough to keep up with all of them. That problem led a developer in Taiwan to build Sumi, an open-source voice-to-text tool that handles both speech recognition and text cleanup locally on your machine.

Sumi uses a two-stage pipeline. First, it transcribes your speech using either Whisper (the open-source model from OpenAI, with seven model size options) or Qwen3-ASR, a newer speech recognition model from Alibaba. The developer wrote the entire inference pipeline (the code that runs the AI model to get results) in pure Rust for speed, and quantized the Qwen3-ASR model (stored its weights at lower numeric precision so it uses less memory, with little loss of accuracy). According to the developer, Qwen3-ASR handles accented speech and regional dialects better than Whisper does.
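To make the idea of quantization concrete, here is a minimal sketch of one common scheme, symmetric 8-bit quantization, written in Python for readability. The article does not say which scheme Sumi applies to Qwen3-ASR, so the function names and the choice of int8 are illustrative assumptions, not a description of Sumi's code.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Illustrative only; not taken from Sumi's source.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One shared scale factor converts between float and integer values.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.003, 0.5]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# The rounding error per weight is bounded by half the scale factor.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of the four bytes of a 32-bit float, which is where the memory saving comes from; the price is the small rounding error measured by `max_error`.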

The second stage is what matters most in day-to-day use. After transcription, a local language model cleans up the raw text by fixing grammar, removing filler words, and polishing the output into something you can paste directly into a chat or code editor. Both stages run on your hardware, so nothing leaves your machine.
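The shape of that two-stage flow can be sketched in a few lines of Python. This is not Sumi's code: the transcriber is a hypothetical stub that returns a fixed string, and the cleanup step is a simple rule-based filter standing in for the local language model, used only to show how the stages compose.

```python
import re
from typing import Callable

# Hypothetical filler list; a real cleanup model handles far more than this.
FILLERS = re.compile(r"\b(?:um+|uh+|you know)\b,?\s*", re.IGNORECASE)


def rule_based_cleanup(raw: str) -> str:
    """Stand-in for stage two: drop fillers, tidy spacing, add punctuation."""
    text = FILLERS.sub("", raw)
    text = re.sub(r"\s+", " ", text).strip()
    if text:
        text = text[0].upper() + text[1:]
        if text[-1] not in ".?!":
            text += "."
    return text


def dictate(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    cleanup: Callable[[str], str],
) -> str:
    """Run stage one (speech to raw text), then stage two (raw to clean text)."""
    return cleanup(transcribe(audio))


def fake_transcribe(audio: bytes) -> str:
    """Stub for stage one; a real version would run a speech model locally."""
    return "um so rename the uh parse function to parse_config"


print(dictate(b"", fake_transcribe, rule_based_cleanup))
# prints: So rename the parse function to parse_config.
```

Passing the two stages in as plain functions mirrors the article's point that either one can be swapped, for example choosing a different speech model, without touching the other.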

Who This Is Actually For

The obvious audience is developers who talk to AI coding assistants like Claude Code or Cursor. But the local-first approach matters for anyone handling sensitive information. Lawyers dictating case notes, medical professionals, and anyone else working with proprietary data can get voice input without sending audio to a cloud API.

The tradeoff is setup complexity. You need a machine with enough power to run these models locally, and Rust-based tooling is still less polished than commercial alternatives like Wispr Flow or Dragon NaturallySpeaking. But for power users who want full control over their voice input pipeline and already have a capable GPU, Sumi fills a gap that commercial tools have not addressed: voice-to-text designed specifically for the workflow of talking to AI agents all day.

Sumi is available on GitHub under an open-source license.