llama.cpp b9489 Reserves CUDA Memory for Quantized KV-Cache at Startup

OpenSource Research

TL;DR: llama.cpp release b9489 reserves CUDA memory for quantized KV-caches at startup, with the aim of preventing out-of-memory errors at runtime.

Summary: Release b9489 of llama.cpp reserves CUDA memory for quantized KV-caches at startup. Reserving the space up front helps avoid fragmentation and allocation failures while the model is running inference. For CPU execution, the release also includes an optimization to the fast Walsh-Hadamard transform (FWHT) that accounts for the Arm SVE vector width at runtime.
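For readers unfamiliar with the FWHT, here is a minimal scalar sketch of the transform in Python. It is not the llama.cpp implementation and says nothing about how the SVE code path is written; the function name `fwht` is chosen for illustration only. It shows the butterfly structure that a vectorized version would process several elements at a time, which is why the hardware vector width matters.

```python
def fwht(values):
    """Return the unnormalized Walsh-Hadamard transform of a sequence.

    The length must be a power of two. Illustrative sketch only.
    """
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    h = 1
    while h < n:
        # Each pass combines elements that are h apart (one butterfly stage).
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))
```

The inner loop over `i` performs independent add/subtract pairs, so a SIMD implementation can handle as many pairs per instruction as the vector register holds.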

Why it matters: Because the cache space is claimed once at startup, developers deploying quantized models on NVIDIA hardware should see a more stable and predictable GPU memory profile. Builders running local LLMs can test this version to check whether it behaves more reliably under long-context workloads.
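To get a feel for how much memory a KV-cache can need at long context, the following Python sketch estimates the size of the K and V tensors for a few ggml cache types. The block layouts (2 bytes per f16 element, 34 bytes per 32 elements for q8_0, 18 bytes per 32 elements for q4_0) follow the ggml formats; the model shape in the example is hypothetical, and the result covers tensor data only, not whatever additional buffers a given release may reserve.

```python
# (bytes per block, elements per block) for a few ggml cache types.
BLOCK_LAYOUT = {
    "f16": (2, 1),
    "q8_0": (34, 32),  # 32 int8 values plus one fp16 scale
    "q4_0": (18, 32),  # 32 4-bit values plus one fp16 scale
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate K+V tensor bytes, assuming K and V use the same cache type."""
    block_bytes, block_elems = BLOCK_LAYOUT[cache_type]
    elems_per_token = n_kv_heads * head_dim
    if elems_per_token % block_elems:
        raise ValueError("row size must be a multiple of the block size")
    row_bytes = elems_per_token // block_elems * block_bytes
    # Factor 2: one K row and one V row per token per layer.
    return 2 * n_layers * n_ctx * row_bytes


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)
    for cache_type in BLOCK_LAYOUT:
        gib = kv_cache_bytes(cache_type=cache_type, **shape) / 2**30
        print(f"{cache_type}: {gib:.3f} GiB")
```

For this assumed shape the estimate falls from 4 GiB with f16 to 2.125 GiB with q8_0 and 1.125 GiB with q4_0, and it grows linearly with the context length.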

Source: github.com