Building llama.cpp with OpenBLAS alongside Vulkan reportedly increases VRAM context capacity

OpenSource

TL;DR: In one user's report, compiling llama.cpp with both the OpenBLAS and Vulkan backends let about 28.6% more context fit in VRAM than a Vulkan-only build.

Summary: A developer reported that building llama.cpp with both the Vulkan and OpenBLAS backends allowed a significantly larger context to fit in VRAM than a Vulkan-only build. In their test with a Qwen 27B model, the context that fit grew from approximately 87,808 tokens to over 112,896 tokens. Neither the mechanism nor whether this is a bug or expected behavior has been confirmed.
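As a quick check on the size of the reported gain, the arithmetic can be sketched as follows. The two token counts are the ones quoted in the summary; the variable names are illustrative only, and the result describes a single user's report, not a general benchmark.

```python
# Token counts as quoted in the summary (one user's report).
vulkan_only_tokens = 87_808
vulkan_openblas_tokens = 112_896

# Absolute and relative difference between the two builds.
extra_tokens = vulkan_openblas_tokens - vulkan_only_tokens
relative_gain = extra_tokens / vulkan_only_tokens

print(f"extra context: {extra_tokens} tokens")   # 25088 tokens
print(f"relative gain: {relative_gain:.1%}")     # 28.6%
```

Because the summary gives the second figure as "over 112,896", the relative gain computed here is best read as a lower bound for that report.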

Why it matters: For developers running local LLMs on resource-constrained hardware, combining backends might offer a way to squeeze out larger context windows. Builders should test multi-backend compilations of llama.cpp to see if these memory savings replicate on their hardware.
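For readers who want to try the comparison, here is a minimal sketch that prints CMake commands for the two build variants. It assumes a recent llama.cpp checkout where the backends are enabled with the `GGML_VULKAN`, `GGML_BLAS`, and `GGML_BLAS_VENDOR` options (older versions used differently prefixed names), and the build directory names are hypothetical. The script only prints the commands; it does not run them.

```python
import shlex


def configure_command(build_dir, with_openblas):
    """Return a CMake configure command for a llama.cpp checkout.

    Assumption: option names follow recent llama.cpp build docs.
    """
    cmd = ["cmake", "-B", build_dir, "-DGGML_VULKAN=ON"]
    if with_openblas:
        cmd += ["-DGGML_BLAS=ON", "-DGGML_BLAS_VENDOR=OpenBLAS"]
    return cmd


def build_command(build_dir):
    """Return the CMake command that compiles a configured build directory."""
    return ["cmake", "--build", build_dir, "--config", "Release"]


# Hypothetical build directory names, one per variant.
builds = {
    "build-vulkan": False,
    "build-vulkan-openblas": True,
}

for build_dir, with_openblas in builds.items():
    print(shlex.join(configure_command(build_dir, with_openblas)))
    print(shlex.join(build_command(build_dir)))
```

Running the printed commands from the llama.cpp source root should produce two separate builds. Loading the same model with identical settings in each, and noting the largest context size that still fits in VRAM, would show whether the reported difference replicates on a given machine.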

Source: r/localllama


Original:

I've noticed a weird difference when building llama.cpp with the Vulkan and OpenBLAS backends vs. building with the Vulkan backend only. It seems like llama.cpp can fit significantly more context in V