Introducing Gemma 4 12B: a unified, encoder-free multimodal model
Summary
Google DeepMind introduced Gemma 4 12B, a multimodal AI model that processes text, images, and audio together and is designed to run on laptops with 16 GB of memory. Its encoder-free architecture takes images and audio directly, without separate encoder networks to translate them first. According to the announcement, this gives performance comparable to larger models while reducing memory usage and latency. The model supports native audio input and includes Multi-Token Prediction drafters, which speed up response generation.
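To put the 16 GB figure in context, the sketch below estimates the memory needed for model weights alone at a few storage precisions. It assumes that "12B" means roughly 12 billion parameters, which the summary does not state. The precisions listed are common illustrative choices, not formats confirmed for this model, and the names `PARAMS`, `BYTES_PER_PARAM`, `weight_gb`, and `fits` are hypothetical.

```python
# Assumption: "12B" denotes about 12 billion parameters (not stated in the summary).
PARAMS = 12_000_000_000

# Bytes per parameter for some common storage precisions.
# Illustrative only: the summary does not say which precision the model uses.
BYTES_PER_PARAM = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weight_gb(params: int, bytes_per_param: float) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def fits(params: int, bytes_per_param: float, budget_gb: float = 16.0) -> bool:
    """True if the weights alone fit in the budget.

    Ignores activations, caches, and everything else running on the machine.
    """
    return weight_gb(params, bytes_per_param) <= budget_gb


if __name__ == "__main__":
    for name, width in BYTES_PER_PARAM.items():
        size = weight_gb(PARAMS, width)
        print(f"{name}: {size:.1f} GB, fits in 16 GB: {fits(PARAMS, width)}")
```

Under these assumptions, 16-bit weights alone would exceed a 16 GB budget, while 8-bit or 4-bit weights would leave headroom. This is weights-only arithmetic and says nothing about how the released model is actually stored or served.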
Original source: https://deepmind.google/blog/introducing-gemma-4-12b-a-unified-encoder-free-multimodal-model/
First tracked: June 9, 2026 at 02:00 PM
Classified by LLM (prompt v3) · confidence: 95%