TL;DR: The llama.cpp b9499 release standardizes quantization and refactors FlashAttention for the WebGPU backend.
Summary: Release b9499 of llama.cpp introduces a major refactor of FlashAttention in the ggml-webgpu backend. The refactor standardizes and abstracts the quantization logic shared by flash_attn and mul_mat, and splits key/value quantization. The release also reduces the Metal backend heartbeat interval and fixes Gemma 4 unified FPE issues.
Why it matters: WebGPU is the GPU API exposed to browsers, so a shared quantization path and a refactored FlashAttention in ggml-webgpu bear directly on running quantized models locally in the browser. These changes may help developers build more responsive and memory-efficient client-side AI applications.
Source: github.com