Gemma 4 12B QAT achieves 120 tok/s on 12GB VRAM

AI-Agents Tools

TL;DR: Gemma 4 12B Quantization-Aware Training (QAT) models can achieve 120 tokens per second on a single 12GB GPU using speculative decoding and Unsloth's quants.

Summary: Google released Quantization-Aware Training (QAT) variants of its Gemma 4 models. Running Unsloth's GGUF quantization of Gemma 4 12B with speculative decoding (Multi-Token Prediction), using Google's QAT assistant model as the draft model, allows high-speed local inference on consumer-grade 12GB GPUs. The setup uses llama.cpp for conversions and reportedly reaches 120 tokens per second.
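For readers new to speculative decoding, here is a minimal sketch of the greedy variant. It is not llama.cpp's implementation and does not load Gemma. The names speculative_decode, greedy_decode, toy_target and toy_draft are hypothetical, and both "models" are toy arithmetic functions standing in for a large target model and a small draft model. The idea is that the draft cheaply proposes a few tokens, the target checks them, and the longest agreeing prefix is kept, so the output matches plain greedy decoding while needing fewer target passes.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def greedy_decode(target: NextToken, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (tokens, verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes up to draft_len tokens.
        proposal: List[Token] = []
        for _ in range(min(draft_len, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target verifies the proposal. A real engine scores all
        #    positions in one batched forward pass; this sketch calls the
        #    toy target per position but counts it as a single pass.
        passes += 1
        accepted: List[Token] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # the target's token is always kept
            if expected != guess:
                break  # first mismatch: keep the correction, drop the rest
        else:
            # Every guess matched, so the same pass yields one bonus token.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens, passes


# Toy stand-ins (assumptions for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 50


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the target except at every third context length.
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


if __name__ == "__main__":
    out, passes = speculative_decode(toy_target, toy_draft, [7], 20)
    print(out == greedy_decode(toy_target, [7], 20), passes)
```

The speedup in practice depends on how often the draft agrees with the target and on how cheap the draft is relative to the target, which is why a draft model trained to match the main model matters.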

Why it matters: The result suggests that a quantized 12B-parameter model can run at roughly 120 tokens per second on consumer hardware. AI builders should explore QAT-optimized draft models and GGUF speculative decoding to lower local LLM latency and hosting costs.

Source: r/localllama