Google releases Gemma 4 under Apache 2.0 — and that license change may matter more than benchmarks
Our take

For the past two years, enterprises evaluating open-weight models have faced an awkward trade-off. Google's Gemma line consistently delivered strong performance, but its custom license — with usage restrictions and terms Google could update at will — pushed many teams toward Mistral or Alibaba's Qwen instead. Legal review added friction. Compliance teams flagged edge cases. And capable as Gemma 3 was, "open" with asterisks isn't the same as open.
Gemma 4 eliminates that friction entirely. Google DeepMind's newest open model family ships under a standard Apache 2.0 license — the same permissive terms used by Qwen, Mistral, Arcee, and most of the open-weight ecosystem.
No custom clauses, no "Harmful Use" carve-outs that required legal interpretation, no restrictions on redistribution or commercial deployment. For enterprise teams that had been waiting for Google to play on the same licensing terms as the rest of the field, the wait is over.
The timing is notable. As some Chinese AI labs (most notably Alibaba’s latest Qwen models, Qwen3.5 Omni and Qwen 3.6 Plus) have begun pulling back from fully open releases for their latest models, Google is moving in the opposite direction — opening up its most capable Gemma release yet while explicitly stating the architecture draws from its commercial Gemini 3 research.
Four models, two tiers: Edge to workstation in a single family
Gemma 4 arrives as four distinct models organized into two deployment tiers. The "workstation" tier includes a 31B-parameter dense model and a 26B A4B Mixture-of-Experts model — both supporting text and image input with 256K-token context windows. The "edge" tier consists of the E2B and E4B, compact models designed for phones, embedded devices, and laptops, supporting text, image, and audio with 128K-token context windows.
The naming convention takes some unpacking. The "E" prefix denotes "effective parameters" — the E2B has 2.3 billion effective parameters but 5.1 billion total, because each decoder layer carries its own small embedding table through a technique Google calls Per-Layer Embeddings (PLE). These tables are large on disk but cheap to compute, which is why the model runs like a 2B while technically weighing more.
The "A" in 26B A4B stands for "active parameters" — only 3.8 billion of the MoE model's 25.2 billion total parameters activate during inference, meaning it delivers roughly 26B-class intelligence with compute costs comparable to a 4B model.
For IT leaders sizing GPU requirements, this translates directly to deployment flexibility. The MoE model can run on consumer-grade GPUs and should appear quickly in tools like Ollama and LM Studio. The 31B dense model requires more headroom — think an NVIDIA H100 or RTX 6000 Pro for unquantized inference — but Google is also shipping Quantization-Aware Training (QAT) checkpoints to maintain quality at lower precision. On Google Cloud, both workstation models can now run in a fully serverless configuration via Cloud Run with NVIDIA RTX Pro 6000 GPUs, spinning down to zero when idle.
The MoE bet: 128 small experts to save on inference costs
The architectural choices inside the 26B A4B model deserve particular attention from teams evaluating inference economics. Rather than following the pattern of recent large MoE models that use a handful of big experts, Google went with 128 small experts, activating eight per token plus one shared always-on expert. The result is a model that benchmarks competitively with dense models in the 27B–31B range while running at roughly the speed of a 4B model during inference.
This is not just a benchmark curiosity — it directly affects serving costs. A model that delivers 27B-class reasoning at 4B-class throughput means fewer GPUs, lower latency, and cheaper per-token inference in production. For organizations running coding assistants, document processing pipelines, or multi-turn agentic workflows, the MoE variant may be the most practical choice in the family.
Both workstation models use a hybrid attention mechanism that interleaves local sliding window attention with full global attention, with the final layer always global. This design enables the 256K context window while keeping memory consumption manageable — an important consideration for teams processing long documents, codebases, or multi-turn agent conversations.
Native multimodality: Vision, audio, and function calling baked in from scratch
Previous generations of open models typically treated multimodality as an add-on. Vision encoders were bolted onto text backbones. Audio required an external ASR pipeline like Whisper. Function calling relied on prompt engineering and hoping the model cooperated. Gemma 4 integrates all of these capabilities at the architecture level.
All four models handle variable aspect-ratio image input with configurable visual token budgets — a meaningful improvement over Gemma 3n's older vision encoder, which struggled with OCR and document understanding. The new encoder supports budgets from 70 to 1,120 tokens per image, letting developers trade off detail against compute depending on the task.
Lower budgets work for classification and captioning; higher budgets handle OCR, document parsing, and fine-grained visual analysis. Multi-image and video input (processed as frame sequences) are supported natively, enabling visual reasoning across multiple documents or screenshots.
The two edge models add native audio processing — automatic speech recognition and speech-to-translated-text, all on-device. The audio encoder has been compressed to 305 million parameters, down from 681 million in Gemma 3n, while the frame duration dropped from 160ms to 40ms for more responsive transcription. For teams building voice-first applications that need to keep data local — think healthcare, field service, or multilingual customer interaction — running ASR, translation, reasoning, and function calling in a single model on a phone or edge device is a genuine architectural simplification.
Function calling is also native across all four models, drawing on research from Google's FunctionGemma release late last year. Unlike previous approaches that relied on instruction-following to coax models into structured tool use, Gemma 4's function calling was trained into the model from the ground up — optimized for multi-turn agentic flows with multiple tools. This shows up in agentic benchmarks, but more importantly, it reduces the prompt engineering overhead that enterprise teams typically invest when building tool-using agents.
Benchmarks in context: Where Gemma 4 lands in a crowded field
The benchmark numbers tell a clear story of generational improvement. The 31B dense model scores 89.2% on AIME 2026 (a rigorous mathematical reasoning test), 80.0% on LiveCodeBench v6, and hits a Codeforces ELO of 2,150 — numbers that would have been frontier-class from proprietary models not long ago. On vision, MMMU Pro reaches 76.9% and MATH-Vision hits 85.6%.
For comparison, Gemma 3 27B scored 20.8% on AIME and 29.1% on LiveCodeBench without thinking mode.
The MoE model tracks closely: 88.3% on AIME 2026, 77.1% on LiveCodeBench, and 82.3% on GPQA Diamond — a graduate-level science reasoning benchmark. The performance gap between the MoE and dense variants is modest given the significant inference cost advantage of the MoE architecture.
The edge models punch above their weight class. The E4B hits 42.5% on AIME 2026 and 52.0% on LiveCodeBench — strong for a model that runs on a T4 GPU. The E2B, smaller still, manages 37.5% and 44.0% respectively. Both significantly outperform Gemma 3 27B (without thinking) on most benchmarks despite being a fraction of the size, thanks to the built-in reasoning capability.
These numbers need to be read against an increasingly competitive open-weight landscape. Qwen 3.5, GLM-5, and Kimi K2.5 all compete aggressively in this parameter range, and the field moves fast. What distinguishes Gemma 4 is less any single benchmark and more the combination: strong reasoning, native multimodality across text, vision, and audio, function calling, 256K context, and a genuinely permissive license — all in a single model family with deployment options from edge devices to cloud serverless.
What enterprise teams should watch next
Google is releasing both pre-trained base models and instruction-tuned variants, which matters for organizations planning to fine-tune for specific domains. The Gemma base models have historically been strong foundations for custom training, and the Apache 2.0 license now removes any ambiguity about whether fine-tuned derivatives can be deployed commercially.
The serverless deployment option via Cloud Run with GPU support is worth watching for teams that need inference capacity that scales to zero. Paying only for actual compute during inference — rather than maintaining always-on GPU instances — could meaningfully change the economics of deploying open models in production, particularly for internal tools and lower-traffic applications.
Google has hinted that this may not be the complete Gemma 4 family, with additional model sizes likely to follow. But the combination available today — workstation-class reasoning models and edge-class multimodal models, all under Apache 2.0, all drawing from Gemini 3 research — represents the most complete open model release Google has shipped. For enterprise teams that had been waiting for Google's open models to compete on licensing terms as well as performance, the evaluation can finally begin without a call to legal first.
Read on the original site
Open the publisher's page for the full experience
Related Articles
- Google Opens Gemma 4 Under Apache 2.0 with Multimodal and Agentic CapabilitiesGoogle has announced the release of Gemma 4, a series of open-weight AI models, including variants with 2B, 4B, 26B, and 31B parameters, under the Apache 2.0 license. Key features include enhanced video and image processing, audio input on smaller models, and extended context windows up to 256K tokens. By Hien Luu
- Open source Xiaomi MiMo-V2.5 and V2.5-Pro are among the most efficient (and affordable) at agentic 'claw' tasksXiaomi, the Chinese firm best known for its smartphones and electric vehicles, has lately been shipping some incredibly affordable and high-powered open source AI large language models. The trend continued today with the release of Xiaomi MiMo-V2.5 and Xiaomi MiMo-V2.5-Pro, both available under the permissive, enterprise-friendly MIT License, making them suitable for use in production in commercial applications. Enterprises and individual/independent developers can now download either of the models (and more Xiaomi open source options) directly from Hugging Face, modify them as needed, and run them locally or on virtual private clouds as they see fit. The most notable attribute of these models besides the open source licensing is that, according to Xiaomi's published benchmarks, they are among the most efficient available for agentic "claw" tasks, that is, powering systems such as OpenClaw, NanoClaw and Hermes Agent, in which users can communicate with them directly over third-party messaging apps and have the agents go off and complete tasks on the human user's behalf, such as making and publishing marketing content, running accounts, organizing email and scheduling, etc. As Xiaomi's ClawEval benchmark chart shows, both MiMo-V2.5 and the Pro version in particular appear near the top left of the chart, indicating high performance in completing the benchmarked claw tasks while using the fewest amount of tokens — saving the human user money, especially in a world where more and more services such as Microsoft's GitHub Copilot are moving to usage-based billing (charging the human behind the agents for each token used rather than imposing rate limits like Anthropic or providing an "all-you-can-eat" buffet-style subscription like OpenAI). In fact, the Pro model leads the open-source field with a 63.8% success rate, consuming only ~70K tokens per trajectory. This is roughly 40–60% fewer tokens than those required by Anthropic Claude Opus 4.6, Google Gemini 3.1 Pro, and OpenAI GPT-5.4 to achieve comparable results. By combining a massive 310B-parameter architecture with a highly efficient "active" footprint and a native 1-million-token context window, Xiaomi MiMo is challenging the dominance of closed-source frontier models from Google and OpenAI, especially when it comes to the latest and greatest craze in enterprise AI deployments — agentic tasks and "claws" similar to OpenClaw. A two-pronged pincer Xiaomi has released two distinct versions of the model to serve different ends of the development spectrum: MiMo-V2.5 (the "Omni" multimodal specialist) and MiMo-V2.5-Pro (the "Agent" specialist). While the base model provides native multimodality, the MiMo-V2.5-Pro is specifically engineered for "long-horizon coherence" and complex software engineering. On the GDPVal-AA (Elo) benchmark, the Pro model achieved a score of 1581, surpassing competitors like Kimi K2.6 and GLM 5.1. Xiaomi researchers further released data on several high-complexity tasks performed autonomously by V2.5-Pro: SysY Compiler in Rust: The model implemented a complete compiler from scratch—including lexer, parser, and RISC-V assembly backend—in 4.3 hours. Spanning 672 tool calls, the model achieved a perfect 233/233 score on hidden test suites, a task that typically takes a computer science major several weeks. Full-Featured Video Editor: Over 11.5 hours and 1,868 tool calls, the model produced an 8,192-line desktop application featuring multi-track timelines and an export pipeline. Analog EDA Optimization: In a graduate-level engineering task, the model optimized a Flipped-Voltage-Follower (FVF-LDO) regulator in the TSMC 180nm process. By iterating through an ngspice simulation loop, the model improved metrics like line regulation by 22x over its initial attempt. These experiments highlight a "harness awareness" in V2.5-Pro, where the model actively manages its own memory and shapes its context to sustain coherence over thousands of sequential tool calls. Over the API, Xiaomi is pricing the models at competitive rates for both domestic (Chinese) and international markets (like the U.S.). For overseas developers, the high-performance MiMo-V2.5-Pro is priced at $1.00 per million input tokens (for a cache miss) and $3.00 for output within context windows up to 256K. For ultra-long context tasks between 256K and 1M tokens, the cost doubles to $2.00 for input and $6.00 for output, though the architecture’s caching capabilities offer significant relief, reducing input costs to as little as $0.20 to $0.40 per million tokens upon a cache hit. Domestically, these rates are mirrored in yuan, with the Pro model starting at ¥7.00 per million input tokens for standard context and reaching ¥14.00 for the extended 1M range. Meanwhile, the base model starts at just $0.40 USD for overseas input per million tokens and $2.00 per million output, putting it among the more affordable third of leading LLMs globally (see our chart below): Model Input Output Total Cost Source Grok 4.1 Fast $0.20 $0.50 $0.70 xAI MiniMax M2.7 $0.30 $1.20 $1.50 MiniMax MiMo-V2.5 Flash $0.10 $0.30 $0.40 Xiaomi MiMo Gemini 3 Flash $0.50 $3.00 $3.50 Google Kimi-K2.5 $0.60 $3.00 $3.60 Moonshot MiMo-V2.5 $0.40 $2.00 $2.40 Xiaomi MiMo MiMo-V2-Pro (≤256K) $1.00 $3.00 $4.00 Xiaomi MiMo GLM-5 $1.00 $3.20 $4.20 Z.ai GLM-5-Turbo $1.20 $4.00 $5.20 Z.ai DeepSeek V4 Pro $1.74 $3.48 $5.22 DeepSeek GLM-5.1 $1.40 $4.40 $5.80 Z.ai Claude Haiku 4.5 $1.00 $5.00 $6.00 Anthropic Qwen3-Max $1.20 $6.00 $7.20 Alibaba Cloud Gemini 3 Pro $2.00 $12.00 $14.00 Google GPT-5.2 $1.75 $14.00 $15.75 OpenAI GPT-5.4 $2.50 $15.00 $17.50 OpenAI Claude Sonnet 4.5 $3.00 $15.00 $18.00 Anthropic Claude Opus 4.7 $5.00 $25.00 $30.00 Anthropic GPT-5.5 $5.00 $30.00 $35.00 OpenAI GPT-5.4 Pro $30.00 $180.00 $210.00 OpenAI To lower the barrier for agentic development further, Xiaomi has made cache writing free of charge for a limited time across all models, alongside a total fee waiver for the entire MiMo-V2.5-TTS suite, which includes its specialized voice cloning and design features. This pricing logic is clearly designed to accelerate the transition from simple chat applications to persistent, long-horizon agents that can operate at a fraction of the cost of legacy frontier models. Xiaomi has also introduced an overhauled version of its subscription offerings, called the "Token Plan," now available in four levels: The Lite "Starter Pack" provides 720 million credits for $63.36 USD per year Standard tier offers 2.4 billion credits for $168.96 per year A Pro tier provides 8.4 billion credits for $528.00 per year (designed for enterprise use cases) Max —aimed at high-intensity coding enthusiasts—delivers 19.2 billion credits for $1,056.00 per year Beyond credit allotments, all plans include preferential API rates, a 20% discount for off-peak calls, and "Day-0" support for popular coding scaffolds like Cursor, Zed, and Claude Code. However, both through the API and via the Token Plan, accessing the Xiaomi models from China may present barriers or additional compliance and regulatory risks to U.S.-based enterprise customers. As such, the best bet for U.S. enterprises concerned about relying on Chinese tech but wanting to take advantage of the low cost and open source models is likely setting up their own virtual private clouds or local servers, downloading the model weights, and running the models domestically. MoE architecture but divergent training regimens for V2.5 and V2.5-Pro At the heart of MiMo-V2.5 is a Sparse Mixture-of-Experts (MoE) architecture. While the model boasts a total of 310 billion parameters, only 15 billion are "active" during any given inference cycle. Meanwhile, V2.5-Pro is 1.02 trilion-parameter Mixture-of-Experts model with 42 billion active parameters. In either case, the design functions much like a specialized research hospital: while the facility has hundreds of doctors (parameters), only the specific specialists required for a particular case (query) are called into the room. This massive increase in parameter volume for the Pro version provides the "neural capacity" required for the deep, multi-step reasoning found in complex software engineering and long-horizon tasks, as though even more specialists are available in an even larger hospital. According to Xiaomi's blog post, the regular V2.5 follows a rigorous five-stage evolution: Text Pre-training: Building a massive language backbone on 48 trillion tokens. Projector Warmup: Aligning in-house audio and visual encoders with the language core. Multimodal Pre-training: Scaling across high-quality cross-modal data. Agentic Post-training: Progressively extending the context window from 32K to 1M tokens. RL and MOPD: Utilizing Reinforcement Learning and Multimodal Preference Optimization (MOPD) to sharpen real-world reasoning and perception. The backbone utilizes a hybrid sliding-window attention architecture, inherited from MiMo-V2-Flash, which optimizes how the model "remembers" long-range information. This technical foundation enables MiMo-V2.5 to see, hear, and reason natively, rather than relying on external "plug-in" tools for visual or auditory processing. Conversely, the training of MiMo-V2.5-Pro prioritizes "action space" over sensory perception. Instead of sensory alignment, the Pro model’s training focus shifts toward scaling post-training compute. This process is designed to instill "harness awareness," where the model is specifically trained to manage its own memory and context within autonomous agent scaffolds like Claude Code or OpenCode. While the base V2.5 model is trained to reason across modalities, the Pro version is trained to sustain coherence across more than a thousand sequential tool calls. The standard V2.5 model balances local and global attention to maintain multimodal perception. The Pro model, however, utilizes an increased hybrid attention ratio—evolving from the 5:1 ratio of previous generations to a more aggressive 7:1 ratio. This allows the Pro model to "skim" the vast majority of its context while applying high-density attention to the specific 15% of data most relevant to its current objective, a critical feature for debugging large repositories or optimizing graduate-level circuits. Finally, while both models undergo Reinforcement Learning (RL) and Multimodal Preference Optimization (MOPD), the objectives of these stages differ. For MiMo-V2.5, the RL stage is used to sharpen perception and multimodal reasoning. For MiMo-V2.5-Pro, RL is focused on instruction following within agentic scenarios, ensuring the model adheres to subtle requirements embedded deep within ultra-long contexts and recovers gracefully from errors during autonomous execution. This results in the Pro model's "self-correcting" discipline, as seen in its ability to diagnose and fix regressions during the 4.3-hour SysY compiler build. Full MIT License is perfect for enterprise use cases In a move that distinguishes it from many "open" models that include restrictive "Acceptable Use" policies, Xiaomi has released MiMo-V2.5 under the MIT License.The MIT License is the gold standard of permissive software licensing. For developers and enterprises, this means: No Authorization Required: Companies can deploy the model commercially without seeking explicit permission from Xiaomi. Continued Training: Developers are free to fine-tune the model on proprietary data and even release those derivative weights. Unrestricted Commercial Use: There are no revenue caps or user-base limits that often plague "community" licenses. By choosing MIT over a custom "open weights" license, Xiaomi is positioning MiMo as the foundational infrastructure for the next generation of AI agents, effectively inviting the global developer community to treat the model as a public utility. Xiaomi's background: from smartphones and EVs to Chinese open source AI darling Xiaomi’s pivot toward frontier AI agents is the logical culmination of a decade spent building one of the world's most dense hardware-software flywheels. Founded in 2010 as a smartphone disruptor, the Beijing-based company has executed a high-stakes transition into a vertically integrated powerhouse defined by its "Human x Car x Home" strategy. This ecosystem now encompasses over 823 million connectable smart devices unified under the HyperOS architecture. The company’s 2024 entry into the automotive sector with the SU7 and the subsequent high-performance YU7 SUV served as a proof of concept for this integration, positioning Xiaomi as a direct competitor to global luxury marques. By investing 200 billion yuan ($29B USD) into foundational R&D for chips and operating systems, Xiaomi has moved beyond consumer electronics assembly; it has become an architect of the "action space," using its massive hardware footprint as the primary testing ground for the agentic intelligence found in the MiMo-V2.5 series. Ecosystem support The release has been met with immediate "Day-0" support from the broader AI ecosystem. The MiMo team announced that SGLang and vLLM—two of the most popular high-throughput inference engines—supported the V2.5 series at launch. This was made possible through hardware partnerships with AWS, AMD, T-HEAD, and Enflame, ensuring the model can run efficiently on everything from cloud-based H100s to domestic Chinese accelerators. Fuli Luo, the project lead at Xiaomi MiMo and a former key member of the DeepSeek team, underscored the philosophy behind the release on X (formerly Twitter): "A model's value isn't measured by rankings alone — it's measured by the problems it solves. Let's build with MiMo now!" To kickstart this building phase, Luo announced a 100-trillion free token grant for builders and creators. This massive incentive is designed to lower the barrier to entry for developers who want to experiment with the 1M context window without immediate financial risk. The economic realignment: open source vs. metered proprietary The launch arrives at a critical juncture for AI economics. The shift toward usage-based billing marks the definitive end of the "all-you-can-eat" buffet era for AI services, a trend underscored by GitHub’s announcement today that its AI coding assistant Github Copilot will transition all plans to metered, token-based credits. As seat-based predictability gives way to consumption-driven costs, premium agentic workflows—which can consume millions of tokens in a single reasoning session—are becoming increasingly difficult for enterprises to budget. User sentiment has turned predictably cynical, with developers lamenting that they will "get less, but pay the same price" as subscriptions convert into finite allotments. This pricing evolution significantly enhances the strategic appeal of the MiMo series. By releasing under a permissive MIT License, Xiaomi allows organizations to bypass the escalating "SaaS tax" and reclaim financial predictability through private deployment. Crucially, Xiaomi has eliminated the "context tax" for its API. The 1-million-token context window is now billed at the standard rate—1 token = 1 credit for V2.5 and 2 credits for the Pro version—with no additional multiplier. This stands in stark contrast to the industry-wide move toward session-based caps, positioning MiMo as a refuge for cost-sensitive, high-volume development. Analysis for enterprises The launch of MiMo-V2.5 is more than just a weight drop; it is a declaration of independence for the open-source community. By matching Claude Sonnet 4.6 in multimodal agentic work and Gemini 3 Pro in video understanding, Xiaomi has proven that the gap between "closed-door" labs and open research is effectively closed. With the MIT license as a catalyst and a 100T token grant as fuel, the coming months will likely see a surge in specialized, agentic applications built on the MiMo backbone. Confirming the project's ambitious trajectory, the team noted they are already training the next generation, focusing on "deeper reasoning" and "richer real-world grounding". For now, MiMo-V2.5 stands as a testament to the power of sparse architectures and permissive licensing in the race toward functional AGI.
- [P] Gemma 4 running on NVIDIA B200 and AMD MI355X from the same inference stack, 15% throughput gain over vLLM on BlackwellGoogle DeepMind dropped Gemma 4 today: Gemma 4 31B: dense, 256K context, redesigned architecture targeting efficiency and long-context quality Gemma 4 26B A4B: MoE, 26B total / 4B active per forward pass, 256K context Both are natively multimodal (text, image, video, dynamic resolution). We got both running on MAX on launch day across NVIDIA B200 and AMD MI355X from the same stack. On B200 we're seeing 15% higher output throughput vs. vLLM (happy to share more on methodology if useful). Free playground if you want to test without spinning anything up: https://www.modular.com/#playground submitted by /u/carolinedfrasca [link] [comments]