
dlgo: Pure Go LLM Inference Engine Hits 48 Tokens/Sec With Zero Dependencies


What Happened

Developer Mohd Ali released dlgo, a deep learning inference engine written entirely in Go that runs quantized LLM models on CPU with zero external dependencies. The project supports LLaMA, Qwen 2/3/3.5, Gemma 2/3, Phi-2/4, SmolLM2, Mistral architectures, and Whisper for speech-to-text.

Reported performance on a single CPU thread with Q4_K_M quantization: LLaMA 3.2 1B runs at roughly 31 tokens per second, Qwen models typically hit 30 to 40 tokens per second, and Gemma models manage 12 to 18 tokens per second. The headline figure of 48 tokens per second comes from the fastest Qwen configurations. Whisper transcription runs at approximately 1x realtime.

Installation is a single command: go get github.com/computerex/dlgo. After that, dlgo is imported like any other Go package. It loads GGUF model files directly, supports 25+ quantization formats, and includes streaming output, multi-turn chat, and configurable sampling (temperature, top-K, top-P, repetition penalties). Optional SIMD acceleration via CGo (AVX2/FMA) is available but not required.

The project is Apache 2.0 licensed, currently at 5 commits with a single contributor, and is 93.3% Go with 6.7% C for the optional SIMD paths.

Why It Matters

Go developers who want local LLM inference have historically had two options: shell out to Python, or link against llama.cpp through CGo bindings. Both add complexity to builds, deployment, and dependency management. dlgo removes that friction entirely.

This matters most for Go-based developer tools, CLI applications, and backend services where adding Python or C++ dependencies creates real operational overhead. Typical scenarios include an internal CLI that uses a small local model for code suggestions, a Go microservice that does quick text classification without hitting an API, or a Go application that adds speech-to-text without pulling in external binaries.

The performance is not competitive with llama.cpp on equivalent hardware. But for small models (under 2B parameters) running inference on modest tasks, 30+ tokens per second on CPU is usable for many applications.

Our Take

This is a solid engineering project that fills a genuine gap. The Go ecosystem has been underserved for local AI inference compared to Python and C++. Having a pure Go option with zero dependencies means you can add LLM capabilities to a Go project the same way you would add any other library: no Docker containers, no Python virtual environments, no build system gymnastics.

The practical sweet spot is small models for specific tasks: text classification, simple chat, summarization, and speech-to-text via Whisper. You are not going to run a 70B model through this. But pairing a Qwen 0.5B or SmolLM2 360M with a Go application for targeted inference tasks is now trivially easy.

The main concern is sustainability. One contributor, five commits, no releases. If you are building something production-critical on this, you are betting on a solo maintainer. But as a reference implementation and for side projects, dlgo is worth watching. The architecture is clean, the model support is broad, and the "just go get it" developer experience is exactly what was missing.