Google DeepMind released Gemma 4 on 2 April 2026 under a fully permissive Apache 2.0 license — a policy change that matters more than any benchmark number in the announcement. For enterprise teams that evaluated earlier Gemma versions and cleared the technical bar only to fail legal review, the license shift is the story. A capable, multimodal, self-hosted model with no royalties, no usage caps, and no vendor approval requirements is a different procurement conversation than anything Google has offered before.
What Changed From Gemma 3 to Gemma 4
Gemma 1, 2, and 3 all shipped under Google’s custom Gemma Terms of Use. That license prohibited certain commercial deployments, required Google’s approval for high-volume commercial use, and complicated redistribution in third-party products — three conditions that regularly triggered fail verdicts from enterprise legal and procurement teams, regardless of how well the model performed technically.
Gemma 4 replaces that entirely with Apache 2.0. Apache 2.0 permits commercial use, modification, distribution, and sublicensing without royalties or usage caps. Enterprise procurement teams already have approved templates for it — the same templates they use for Mistral AI models and most of the open-source toolchain. The friction disappears at the legal layer, not the engineering one.
The Model Family
Gemma 4 ships in four variants. E2B and E4B are Mixture-of-Experts (MoE) architectures at 2B and 4B parameters respectively, targeting edge and on-device use cases. The 26B MoE is the primary candidate for self-hosted agentic pipelines. The 31B Dense is the largest model in the family; per its published model card, Google positions it for server-side workloads that demand maximum reasoning depth.
All four models are natively multimodal, accepting text, images, audio, and video inputs with a 256K context window. For product teams currently running separate model integrations for different input modalities, the ability to route all of them to a single self-hosted endpoint has real operational weight. It reduces the number of inference services to maintain, monitor, and version.
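To make the consolidation concrete, here is a minimal sketch of routing all modalities to one endpoint. It assumes the self-hosted server exposes an OpenAI-compatible chat API (as servers like vLLM commonly do); the model name "gemma-4-26b" and the payload shapes are illustrative assumptions, not official identifiers.

```python
import base64

def build_payload(text, image_bytes=None, audio_bytes=None):
    """Build one chat request carrying any mix of modalities.

    Hypothetical sketch: assumes an OpenAI-compatible multimodal
    message format on the self-hosted endpoint.
    """
    content = [{"type": "text", "text": text}]
    if image_bytes is not None:
        b64 = base64.b64encode(image_bytes).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    if audio_bytes is not None:
        b64 = base64.b64encode(audio_bytes).decode()
        content.append({"type": "input_audio",
                        "input_audio": {"data": b64, "format": "wav"}})
    # "gemma-4-26b" is an assumed deployment name, not an official ID.
    return {"model": "gemma-4-26b",
            "messages": [{"role": "user", "content": content}]}

payload = build_payload("Describe this image.", image_bytes=b"\x89PNG...")
print(len(payload["messages"][0]["content"]))  # prints 2: text part + image part
```

The operational point is that every modality flows through one request builder, one endpoint, and one monitoring surface, instead of one integration per input type.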
Google DeepMind states that Gemma 4 is built from the same research base as Gemini 3, which positions it as the most technically capable generation of the open-weight line. Independent third-party benchmark comparisons against Llama 4 and Mistral Large at comparable parameter counts have not yet been published as of this writing — performance claims are currently from Google’s own evaluations.
What This Changes for Infrastructure Teams
For CTOs evaluating AI infrastructure, the more consequential frame is the shift in unit economics. Frontier-class reasoning at the 26B MoE scale, self-hosted with no per-call cost and no API dependency, changes the calculus for agentic workflows that run at high call volumes.
An agentic pipeline that makes thousands of model calls per complex task — planning, tool use, reflection, verification — can become expensive quickly when priced at API rates. A locally-run 26B MoE model, on hardware the team already owns or rents, converts that variable cost to a fixed infrastructure cost. This economics shift assumes the team already has or can economically acquire compatible GPU hardware — teams needing to procure new A100/H100-class infrastructure specifically for Gemma 4 should model total cost of ownership before assuming API parity. The Hugging Face welcome post covers quantisation options and deployment patterns for both server and on-device targets, which is the starting point for teams scoping hardware requirements.
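The break-even logic above can be sketched in a few lines. Every number below is an assumption for illustration — not a published price, throughput figure, or benchmark — and should be replaced with the team's own quotes and measured throughput.

```python
# Back-of-envelope API-vs-local cost comparison. All constants are
# assumed placeholder values, not real prices or measured throughput.
API_COST_PER_1M_TOKENS = 3.00   # USD, blended input/output rate (assumed)
GPU_COST_PER_HOUR = 2.50        # USD, one rented A100-class card (assumed)
LOCAL_TOKENS_PER_SECOND = 400   # sustained self-hosted throughput (assumed)

def monthly_cost_api(tokens_per_month):
    # Variable cost: scales linearly with call volume.
    return tokens_per_month / 1e6 * API_COST_PER_1M_TOKENS

def monthly_cost_local(tokens_per_month):
    # Approximates fixed infrastructure cost as GPU-hours consumed.
    hours = tokens_per_month / LOCAL_TOKENS_PER_SECOND / 3600
    return hours * GPU_COST_PER_HOUR

for tokens in (1e8, 1e9, 1e10):
    api, local = monthly_cost_api(tokens), monthly_cost_local(tokens)
    print(f"{tokens:.0e} tokens/mo: API ${api:,.0f} vs local ${local:,.0f}")
```

Under these assumed numbers the gap widens with volume, which is the agentic-pipeline case: thousands of planning, tool-use, and verification calls per task push token counts into the range where the fixed-cost curve wins — provided, as noted above, the hardware is already available or cheap to rent.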
For platform builders and independent software vendors (ISVs), Apache 2.0 also removes the redistribution problem. Embedding Gemma 4 into a product shipped to customers no longer requires navigating Google’s approval process or advising customers to review a non-standard license. This matters most for smaller teams whose legal resources are thin. Under the prior license, redistributing Gemma in a commercial product required negotiating separate terms with Google — a process that stalled deals and added legal overhead that Apache 2.0 eliminates by making the terms uniform and pre-cleared.
What to Watch
The 26B MoE variant will draw the most scrutiny over the next quarter. The practical question is how it benchmarks independently against comparable open-weight models once third-party evaluations appear — Google’s own numbers are a starting point, not a verdict.
The distinction between the MoE and Dense variants at deployment is also worth monitoring. MoE architectures activate only a subset of parameters per inference pass, which can yield lower latency and better hardware efficiency at equivalent parameter counts, but the specifics of memory footprint and batching behaviour for Gemma 4’s architecture have not been detailed in primary sources yet.
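One detail worth front-loading in that monitoring: MoE efficiency gains apply to compute per token, not to memory. A rough weight-footprint sketch, using assumed figures since Gemma 4's expert counts and active-parameter fractions have not been published:

```python
# Illustrative weight-memory arithmetic. Parameter counts come from the
# announcement; quantisation levels and the framing are assumptions, and
# this ignores KV cache and activation memory entirely.
def weight_memory_gb(total_params_b, bits_per_weight):
    """Memory to hold the weights alone, in GB."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

# An MoE model must keep ALL expert weights resident even though only a
# subset is active per token, so footprint tracks total, not active, params.
for name, params in [("26B MoE (total)", 26), ("31B Dense", 31)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit: {weight_memory_gb(params, bits):.1f} GB")
```

The takeaway: at 16-bit the 26B MoE still needs roughly 52 GB for weights alone, so MoE sparsity buys latency and throughput rather than a smaller memory bill — one reason the primary-source details on batching and footprint matter before committing hardware.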
Longer term, the Apache 2.0 move should be read as a competitive signal. Meta's Llama releases were among the factors that pushed open-weight licensing toward a market standard, alongside Mistral's Apache 2.0 releases — Google's shift confirms that norm is now established. The open-weight model ecosystem is consolidating around permissive licenses, and the practical effect is that enterprises evaluating self-hosted AI now have multiple frontier-class options that clear procurement without custom legal work. The build-vs-buy decision on AI infrastructure is getting easier to make, and faster to execute.
This article was produced with AI assistance and reviewed by the editorial team.
Further reading: enterprise LLM architecture decisions | AI evaluation in production