TL;DR: Gemma 4 12B Quantization-Aware Training (QAT) models can achieve 120 tokens per second on a single 12GB GPU using speculative decoding and Unsloth's quants.
Summary: Google released Quantization-Aware Training (QAT) variants of their Gemma 4 models. By running Unsloth's GGUF quantization of Gemma 4 12B with speculative decoding (Multi-Token Prediction) using Google's QAT assistant draft model, developers can run high-speed inference locally on consumer-grade 12GB GPUs. The setup utilizes llama.cpp for conversions and achieves 120 tokens per second.
Why it matters: It demonstrates that highly accurate 12B parameter models can run at production-level speeds on consumer hardware. AI builders should explore using QAT-optimized draft models and GGUF speculative decoding to drastically lower local LLM latency and hosting costs.
Source: r/localllama