TL;DR: New Gemma 4 variants have been released, introducing an encoder-free multimodal architecture that simplifies deployment.
Summary: The update adds Gemma 4 Unified+, Gemma 4 MTP, and Gemma 4 12B Unified models. The 12B Unified model projects raw inputs directly into the language model's embedding space through lightweight linear pipelines instead of using dedicated encoder towers. This results in a simpler architecture while maintaining strong multimodal capabilities.
Why it matters: Eliminating dedicated encoder towers reduces model complexity and simplifies the multimodal inference pipeline for local deployments. Developers should evaluate these models for resource-constrained environments where standard multimodal architectures are too heavy.
Source: r/localllama