llama.cpp adds Gemma 4 MTP support

AI-Agents LocalAI Tools

TL;DR: A new pull request in llama.cpp adds support for Gemma 4 Multi-Token Prediction (MTP), which is intended to speed up inference for local deployments.

Summary: Pull Request #23398 in the ggml-org/llama.cpp repository adds support for Gemma 4 Multi-Token Prediction (MTP). Early testers running Gemma-4-31B QAT models report that the PR lets them load assistant heads to accelerate token prediction, although some GGUF key names (such as those for the next-token prediction parameters) had to be adjusted to make this work.

Why it matters: With the PR branch, AI builders can run Gemma 4 models locally using MTP-based speculative decoding. Developers who want lower latency on consumer GPUs should test the branch with Gemma 4 assistant heads and measure the speedup on their own hardware.
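For readers new to the idea, the draft-and-verify loop behind speculative decoding can be sketched in a few lines of Python. This is a toy illustration and not llama.cpp's implementation: `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are simple functions over integer tokens, and acceptance is plain greedy matching. In the MTP setting, the assistant heads would play the role of the cheap draft model.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # hypothetical "model": context -> next token


def greedy_decode(target_next: NextTokenFn, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: one target-model step per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    verify_passes = 0  # each pass stands in for one batched target forward pass
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft proposes draft_len tokens autoregressively.
        draft: List[Token] = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # 2. The target checks the proposals position by position. A real engine
        #    would score all positions in a single batched pass; here we loop.
        verify_passes += 1
        accepted: List[Token] = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed == expected:
                accepted.append(proposed)
            else:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
        else:
            # Every draft token matched, so the target's next token comes for free.
            accepted.append(target_next(tokens + accepted))

        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], verify_passes


def toy_target(context: List[Token]) -> Token:
    """Toy target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[Token]) -> Token:
    """Toy draft model: agrees with the target except right after token 2."""
    return 0 if context[-1] == 2 else (context[-1] + 1) % 10
```

In this sketch, `speculative_decode(toy_target, toy_draft, [0], 8)` produces the same tokens as `greedy_decode(toy_target, [0], 8)` while needing fewer verification passes than generated tokens. That is the source of any speedup: the output is unchanged, and the gain depends on how often the draft's guesses are accepted.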

Source: r/localllama