Owners of AMD's datacenter GPUs who run local language models just got a speed improvement. Build b9387 of llama.cpp speeds up prompt processing on AMD's datacenter-grade CDNA architecture chips.
llama.cpp is one of the most widely used open-source tools for running AI language models on your own hardware - it lets you run models like Llama or Mistral locally without sending data to external servers. Prompt processing (PP) is the step where the software reads and encodes your input text before generation starts. Faster PP means shorter delays at the start of a conversation or when you paste in a long document.
What Changed and Who It Affects
The improvement uses MFMA (Matrix Fused Multiply-Add) instructions - specialized matrix math operations built into AMD's CDNA chip architecture. Instead of routing these calculations through a general-purpose compute path, the software now uses dedicated silicon built specifically for this type of matrix multiplication. The result is faster prompt encoding on supported hardware.
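As a rough illustration of the arithmetic only (this is not llama.cpp's kernel code, and it says nothing about how the hardware schedules the work), an MFMA-style operation computes D = A x B + C on a small tile of numbers. The function name mfma_tile and the 2x2 tile size are made up for this sketch; real MFMA instructions work on fixed, larger tile shapes.

```python
def mfma_tile(a, b, c):
    """Return D = A x B + C for small tiles given as lists of rows.

    Hypothetical helper for illustration; it mimics the math of a
    matrix fused multiply-add, not the hardware instruction itself.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [row[:] for row in c]  # start from a copy of the accumulator tile C
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                # multiply one pair of elements and add it to the running sum
                d[i][j] += a[i][k] * b[k][j]
    return d


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[0.5, 0.5], [0.5, 0.5]]
result = mfma_tile(a, b, c)
print(result)  # [[19.5, 22.5], [43.5, 50.5]]
```

On a general-purpose path, every multiply and add in the inner loop is issued as ordinary work. A matrix instruction handles the whole tile at once, which is why matrix-heavy steps like prompt processing benefit.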
The catch: MFMA only exists on CDNA architecture, meaning AMD's MI100, MI200, and MI300 series cards. These are datacenter and workstation-grade chips - not consumer hardware. If you're running llama.cpp on an RX 7900 XTX or a similar consumer AMD GPU, b9387 does not change your performance.
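If you are unsure which camp your card falls into, the ROCm architecture name (the gfx target that tools such as rocminfo report) is a quick tell. The sketch below is a hypothetical helper, not part of llama.cpp: the lookup table is an assumption covering only the three CDNA families named above, it is not exhaustive, and it does not confirm which targets b9387 actually covers.

```python
# Assumed mapping from ROCm gfx targets to the CDNA families named in the text.
# Illustrative only; not an exhaustive or official list.
CDNA_TARGETS = {
    "gfx908": "MI100 (CDNA)",
    "gfx90a": "MI200 series (CDNA 2)",
    "gfx942": "MI300 series (CDNA 3)",
}


def cdna_generation(gfx_target):
    """Return a CDNA family label for a gfx target string, or None.

    Feature suffixes such as ':sramecc+:xnack-' are ignored.
    """
    base = gfx_target.split(":")[0].strip().lower()
    return CDNA_TARGETS.get(base)


print(cdna_generation("gfx90a:sramecc+:xnack-"))  # MI200 series (CDNA 2)
print(cdna_generation("gfx1100"))                 # None (RX 7900 XTX, consumer RDNA 3)
```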
For the portion of the local LLM community running MI-series hardware in self-hosted or enterprise setups, this narrows a real gap. ROCm - AMD's GPU computing platform, roughly analogous to NVIDIA's CUDA - has historically had weaker llama.cpp optimization than CUDA. Updates like this incrementally narrow that gap on the hardware tier where AMD is most competitive: high-density inference servers and research workstations.
Community benchmark results are still coming in. If you're running MI-series hardware, the full changelog is on the llama.cpp releases page.