Llama.cpp Adds Gemma 4 Multi-Token Prediction Support

LocalAI OpenSource Architecture

TL;DR: Llama.cpp has added support for Gemma 4 Multi-Token Prediction, enabling faster local model inference.

Summary: The b9551 release of llama.cpp introduces native support for Gemma 4's Multi-Token Prediction (MTP) architecture. Developers can now run MTP-capable Gemma 4 models locally with higher generation throughput. The release also includes an optimization that avoids copying key-value cells in the KV cache.

Why it matters: Multi-token prediction lets a model propose several tokens per step instead of one, which can speed up local LLM inference without degrading generation quality. Builders can use this update to run faster Gemma 4 workloads on edge hardware.
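
To illustrate why predicting several tokens at once can raise throughput without changing the output, here is a toy sketch of a draft-and-verify loop, one common way multi-token prediction heads are used. It is not llama.cpp code, and the source does not say that b9551 works exactly this way; `target_next`, `draft_next_k`, and the arithmetic inside them are hypothetical stand-ins for a real model. The sketch is also simplified: real implementations score all drafted positions in one batched forward pass.

```python
from typing import List, Tuple

VOCAB = 11  # toy vocabulary size (assumption for illustration)

def target_next(tokens: List[int]) -> int:
    """Hypothetical stand-in for the main model's greedy next token."""
    return (tokens[-1] * 3 + 1) % VOCAB

def draft_next_k(tokens: List[int], k: int) -> List[int]:
    """Hypothetical stand-in for MTP heads: guesses k tokens, sometimes wrongly."""
    ctx = list(tokens)
    guesses = []
    for _ in range(k):
        guess = target_next(ctx)
        if len(ctx) % 4 == 0:
            guess = (guess + 1) % VOCAB  # inject an occasional wrong guess
        guesses.append(guess)
        ctx.append(guess)
    return guesses

def greedy_decode(prompt: List[int], n: int) -> List[int]:
    """Baseline: one model step per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens

def draft_and_verify(prompt: List[int], n: int, k: int = 3) -> Tuple[List[int], int]:
    """Generate n tokens, counting verification passes instead of single steps."""
    tokens = list(prompt)
    target_len = len(prompt) + n
    passes = 0
    while len(tokens) < target_len:
        draft = draft_next_k(tokens, k)
        passes += 1  # one pass checks every drafted position
        for guess in draft:
            correct = target_next(tokens)
            tokens.append(correct)  # always keep the main model's token
            if guess != correct or len(tokens) == target_len:
                break  # stop at the first rejected guess
    return tokens, passes

if __name__ == "__main__":
    baseline = greedy_decode([1], 12)
    fast, passes = draft_and_verify([1], 12)
    print(fast == baseline, passes)
```

Because every kept token comes from the main model's own prediction, the output matches plain greedy decoding; the saving comes from needing fewer passes than tokens whenever drafted guesses are accepted.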

Source: github.com