TL;DR: Google has released Gemma 4 12B, a compact model with multimodal input and tool use built in, so no separate processing modules are needed.
Summary: The new Gemma 4 12B model handles vision and audio input natively and has built-in tool use support. It is efficient for its size: a Q4 quantization of the model runs in approximately 8GB of RAM. Early hands-on reports indicate that its performance is competitive with significantly larger 27B and 31B parameter models.
Why it matters: It brings fast, high-quality multimodal capabilities to consumer-grade local hardware. Developers building on-device applications that need vision, audio, or function calling should test the Q4 quantization.
Source: r/LocalLLaMA
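
As a rough sanity check on the approximately 8GB figure in the summary, the sketch below estimates the memory footprint of a 4-bit quantized 12B-parameter model. Everything here is an assumption for illustration: the function name is hypothetical, the 12 billion parameter count is taken from the model name, the effective bits per weight (4.0 to 4.5) is a typical range for Q4-style formats rather than a measured value for this model, and the overhead allowance for KV cache and runtime buffers is a placeholder, not a benchmark.

```python
def estimate_quantized_memory_gb(
    params_billion: float,
    bits_per_weight: float,
    overhead_gb: float = 0.0,
) -> float:
    """Back-of-the-envelope memory estimate in decimal GB (hypothetical helper)."""
    # Weights: parameter count times bits per weight, converted to bytes.
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    # Overhead stands in for KV cache, activations, and runtime buffers.
    return weight_bytes / 1e9 + overhead_gb


# Assumed: exactly 4 bits per weight, weights only.
weights_only_gb = estimate_quantized_memory_gb(12, 4.0)

# Assumed: 4.5 effective bits per weight (quantization scales add a little),
# plus an illustrative 1.25 GB allowance for context and runtime overhead.
with_overhead_gb = estimate_quantized_memory_gb(12, 4.5, overhead_gb=1.25)

print(f"weights only:  {weights_only_gb:.2f} GB")
print(f"with overhead: {with_overhead_gb:.2f} GB")
```

Under these assumptions the weights alone come to about 6 GB, and a modest overhead allowance brings the total close to the reported figure. Actual usage depends on the specific quantization format, context length, and runtime.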