TL;DR: Hugging Face Transformers v5.12.0 adds native support for MiniMax-M3-VL, a sparse mixture-of-experts vision-language model with roughly 428 billion total parameters.
Summary: The v5.12.0 release adds MiniMax-M3-VL, which pairs a CLIP-style vision tower with the MiniMax-M3 text backbone. The model uses MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism built on Grouped Query Attention, alongside a mixed dense/sparse mixture-of-experts (MoE) decoder with SwiGLU-OAI gated experts. It has approximately 428 billion total parameters, of which about 23 billion are active per token, and it targets ultra-long-context multimodal reasoning.
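To put the two parameter counts in perspective, the short sketch below works out the active fraction and the memory needed for the weights alone. It uses only the approximate figures quoted above. The helper names and the 2-bytes-per-parameter (bf16) storage are assumptions made for illustration, not details taken from the release.

```python
# Approximate figures quoted in the summary, in billions of parameters.
TOTAL_PARAMS_B = 428
ACTIVE_PARAMS_B = 23


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the weights that take part in one token's forward pass."""
    return active_b / total_b


def weight_memory_gb(params_b: float, bytes_per_param: float) -> float:
    """Memory for the weights only, in GB (1 GB = 1e9 bytes).

    Ignores the KV cache, activations, and framework overhead.
    """
    return params_b * bytes_per_param


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"active fraction: {fraction:.1%}")  # 5.4%

# Assumption: weights stored in bf16, i.e. 2 bytes per parameter.
print(f"all weights:    {weight_memory_gb(TOTAL_PARAMS_B, 2):.0f} GB")   # 856 GB
print(f"active weights: {weight_memory_gb(ACTIVE_PARAMS_B, 2):.0f} GB")  # 46 GB
```

Under these assumptions, sparsity cuts per-token compute to roughly one twentieth of the weights, but every expert still has to be stored somewhere, so the storage footprint follows the total count rather than the active count.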
Why it matters: This integration makes a very large sparse MoE model, built for ultra-long-context multimodal understanding, available through the standard Transformers library, with only a small fraction of its parameters active per token. Builders should test MiniMax-M3-VL on repository-scale code reasoning and complex agentic memory tasks.
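Before testing, it helps to confirm that the installed library is at least the release named in this item. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the version parsing is deliberately simplistic: it ignores pre-release suffixes, so a `5.12.0.dev0` build is treated as 5.12.0 even though it may not contain the model.

```python
from importlib.metadata import PackageNotFoundError, version

# Release named in this item as the one that adds MiniMax-M3-VL.
MIN_RELEASE = (5, 12, 0)


def release_tuple(version_string: str) -> tuple:
    """Turn '5.12.0' or '5.12.0.dev0' into (5, 12, 0).

    Only the leading digits of the first three dot-separated fields are
    kept, so pre-release and local suffixes are ignored.
    """
    fields = []
    for field in version_string.split(".")[:3]:
        digits = ""
        for ch in field:
            if not ch.isdigit():
                break
            digits += ch
        fields.append(int(digits) if digits else 0)
    while len(fields) < 3:
        fields.append(0)  # pad short versions such as '5.13'
    return tuple(fields)


def meets_minimum(version_string: str, minimum: tuple = MIN_RELEASE) -> bool:
    """True when the version is numerically at or above the minimum."""
    return release_tuple(version_string) >= minimum


def installed_transformers_version():
    """Installed transformers version string, or None if not installed."""
    try:
        return version("transformers")
    except PackageNotFoundError:
        return None


installed = installed_transformers_version()
if installed is None:
    print("transformers is not installed")
elif meets_minimum(installed):
    print(f"transformers {installed} is at or above v5.12.0")
else:
    print(f"transformers {installed} predates v5.12.0; upgrade first")
```

Comparing integer tuples rather than raw strings matters here: as strings, "5.9.0" sorts after "5.12.0", which would give the wrong answer.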
Source: github.com