Gemma 4 QAT MTP Assistant Heads Released on HuggingFace — Intelligence Feed

TL;DR: Google's Gemma 4 Quantization-Aware Training (QAT) models now have converted draft heads on HuggingFace for accelerated speculative decoding.

Summary: A developer has converted and uploaded three draft heads matching the official Gemma 4 QAT Q4_0 models to HuggingFace under boxwrench/gemma-4-qat-mtp-assistant-heads. With these assistant heads acting as the draft model for speculative decoding via multi-token prediction (MTP), a 12B QAT model reportedly ran at 120 tokens per second on a consumer GPU. Separately, a crash affecting parallel execution (PARALLEL=2) was identified, fixed in llama.cpp forks, and upstreamed.
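
For readers unfamiliar with why a draft head speeds things up, here is a minimal sketch of greedy draft-then-verify speculative decoding. It is a toy illustration only: `toy_target`, `toy_draft`, and `speculative_decode` are hypothetical names, the "models" are simple integer functions, and nothing here reflects the llama.cpp API or the uploaded Gemma heads. The point it shows is that the cheap draft proposes several tokens, the large model checks them in one pass, and the final output matches what the large model would have produced alone.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the large (target) model."""
    return (ctx[-1] * 3 + 1) % 17


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for a draft head: usually agrees with the target."""
    guess = toy_target(ctx)
    return guess if ctx[-1] % 5 else (guess + 1) % 17  # wrong when last token % 5 == 0


def greedy_decode(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       max_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify decoding. Returns (tokens, verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The draft cheaply proposes k tokens.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target verifies them. A real engine does this in one batched
        #    forward pass; the loop below only simulates that check.
        passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # all accepted: bonus token
        out.extend(accepted)
    return out[:len(prompt) + max_new], passes


if __name__ == "__main__":
    tokens, passes = speculative_decode(toy_target, toy_draft, [1], max_new=20)
    print("same output as plain greedy:", tokens == greedy_decode(toy_target, [1], 20))
    print("verification passes for 20 tokens:", passes)
```

The speedup in a real system depends on how often the draft's guesses are accepted and on how cheap the draft is relative to the target, so throughput figures will vary with model, prompt, and hardware.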

Why it matters: Indie developers can now run highly accurate quantized Gemma 4 models at interactive speeds on consumer-grade hardware. To benchmark speculative decoding yourself, watch for llama.cpp updates that include the PARALLEL=2 fix.

Source: r/localllama