Custom Silicon Is Winning Specific Workloads. What That Means for the AI Supply Chain.

3 min read

Key Claim

Every major hyperscaler is now running a significant fraction of AI inference on custom silicon — Google TPU v5, Meta MTIA, Microsoft Maia, Amazon Trainium and Inferentia — reducing NVIDIA dependency for specific workloads and reshaping the AI supply chain from a single-vendor model to a tiered hardware market.

Key Takeaways

  • Google runs over 90% of its AI inference workloads on TPUs, not NVIDIA GPUs — and has done so for years
  • Meta’s MTIA v2 handles the inference load for its recommendation systems — among the highest-volume AI workloads in the world by query count
  • Microsoft’s Maia 100 chip, co-designed with OpenAI, is in production in Azure data centres handling GPT-4 class inference
  • Custom silicon reduces per-query inference cost by 30–60% for the specific workloads it was designed for — but requires 3–5 years and $500M–$1B+ in development investment

NVIDIA’s dominance of the AI hardware market is real and not in question for training workloads and general-purpose inference. H100 and B200 GPUs are the default choice for organisations building AI capability without the scale or specialisation to justify custom silicon. But at hyperscaler scale — where inference runs billions of queries per day across defined workload categories — custom silicon has already won significant territory. Understanding where and why is relevant for any organisation thinking about AI infrastructure at multi-year timescales.

Google: The Template That Everyone Followed

Google’s TPU program, which began in 2013 and has now reached its fifth generation, is the longest-running and most mature custom AI silicon effort in the industry. TPU v5, deployed through 2024 and 2025, delivers peak performance of 459 teraflops per chip for bfloat16 matrix operations — the numerical format dominant in large model training and inference. Google has disclosed that the majority of its AI inference workload, including Search, Maps, and YouTube recommendation, runs on TPUs rather than NVIDIA hardware.
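
To put that peak figure in context, here is a rough back-of-envelope conversion from peak bf16 throughput to an inference token rate. The model size, utilisation factor, and the roughly-two-FLOPs-per-parameter-per-token rule of thumb are illustrative assumptions, not figures from Google's documentation:

```python
# Back-of-envelope: peak bf16 TFLOPS -> approximate decode tokens/sec per chip.
# Every input below is an illustrative assumption, not a vendor figure.

PEAK_TFLOPS_BF16 = 459               # per-chip peak quoted above
MODEL_PARAMS = 70e9                  # hypothetical 70B-parameter dense model
FLOPS_PER_TOKEN = 2 * MODEL_PARAMS   # ~2 FLOPs per parameter per generated token
UTILISATION = 0.4                    # assumed fraction of peak achieved while serving

effective_flops = PEAK_TFLOPS_BF16 * 1e12 * UTILISATION
tokens_per_second = effective_flops / FLOPS_PER_TOKEN
print(f"~{tokens_per_second:,.0f} tokens/sec per chip (rough estimate)")
```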

The economic logic is straightforward at Google’s scale. A chip designed specifically for transformer inference, without the general-purpose compute overhead that makes NVIDIA GPUs versatile but expensive-per-operation, can deliver the same throughput at lower cost and power consumption. Google’s advantage is not raw performance — H100s match or exceed TPU v5 on several benchmarks — but cost-per-inference-query at the volumes Google operates.
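
A minimal sketch of that cost-per-query comparison, with all prices, power draws, and throughput figures invented purely for illustration (neither Google nor NVIDIA publishes per-query costs):

```python
# Amortised cost per inference query: general-purpose GPU vs purpose-built ASIC.
# Every number below is a placeholder chosen for illustration only.

def cost_per_query(chip_cost_usd, lifetime_years, power_watts,
                   electricity_usd_per_kwh, queries_per_second):
    """Hardware amortisation plus energy, divided by lifetime query count."""
    hours = lifetime_years * 365 * 24
    total_queries = queries_per_second * hours * 3600
    energy_cost = power_watts / 1000 * hours * electricity_usd_per_kwh
    return (chip_cost_usd + energy_cost) / total_queries

gpu  = cost_per_query(30_000, 4, 700, 0.08, queries_per_second=800)
asic = cost_per_query(15_000, 4, 400, 0.08, queries_per_second=800)
print(f"GPU:  ${gpu:.2e}/query   ASIC: ${asic:.2e}/query   "
      f"saving: {1 - asic / gpu:.0%}")
```

With these made-up inputs the saving lands around 50%, in the middle of the 30–60% range cited above; the structural point is that chip cost and power dominate the denominator, and both are exactly what a purpose-built design attacks.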

Meta: Recommendation at World Scale

Meta’s MTIA (Meta Training and Inference Accelerator) program produced its second-generation chip in 2024, and it handles the inference load for Meta’s recommendation systems across Facebook, Instagram, and WhatsApp. Recommendation models are among the highest-volume AI workloads in existence — Meta serves billions of users, with each feed refresh triggering thousands of model calls. MTIA v2 is designed specifically for this workload: sparse embedding lookups, relatively small model sizes, but extreme throughput requirements.
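
To see why throughput rather than model size is the binding constraint, here is a back-of-envelope estimate of aggregate model calls per second; the user count, refresh rate, and candidates scored per refresh are assumptions for illustration, not Meta-reported figures:

```python
# Back-of-envelope: aggregate recommendation-model calls per second.
# All inputs are illustrative assumptions, not figures reported by Meta.

daily_active_users = 3e9               # assumed order of magnitude across apps
feed_refreshes_per_user_day = 10       # assumed
candidates_scored_per_refresh = 1_000  # assumed items ranked per refresh

calls_per_day = (daily_active_users * feed_refreshes_per_user_day
                 * candidates_scored_per_refresh)
print(f"~{calls_per_day / 86_400:,.0f} model calls per second (order of magnitude)")
```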

Meta has been explicit that MTIA is not a general-purpose AI chip and not intended to compete with NVIDIA for training. It is purpose-built for one workload category, at a scale where the economics of custom silicon development justify the investment. The lesson is not that custom silicon is universally superior — it is that at sufficient scale and workload specificity, purpose-built hardware is economically rational.
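
A minimal breakeven sketch using the ranges from the key takeaways (30–60% per-query savings, $500M–$1B+ development cost); the baseline per-query cost and daily query volume are assumptions invented for illustration:

```python
# Breakeven: queries needed before custom-silicon development cost pays back.
# Baseline cost per query and daily volume are illustrative assumptions.

development_cost_usd = 750e6        # midpoint of the $500M-$1B+ range above
baseline_cost_per_query = 0.002     # assumed GPU-served cost per query ($)
savings_fraction = 0.45             # midpoint of the 30-60% range above
queries_per_day = 2e9               # "billions of queries per day"

saving_per_query = baseline_cost_per_query * savings_fraction
breakeven_queries = development_cost_usd / saving_per_query
years = breakeven_queries / queries_per_day / 365
print(f"Breakeven after ~{breakeven_queries:.1e} queries (~{years:.1f} years)")
```

At enterprise volumes — say a few million queries per day — the same arithmetic stretches to centuries, which is why the calculation only closes at hyperscaler scale.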

Microsoft and the OpenAI Partnership

Microsoft’s Maia 100 chip, announced in late 2023 and in Azure production through 2024–2025, represents a different strategic logic. Microsoft’s primary AI hardware investment has been in NVIDIA — it holds significant long-term supply agreements for H100 and B200 capacity. Maia is not a replacement strategy; it is a hedge and a specialisation tool. Co-designed with OpenAI for the specific numerical characteristics of GPT-4 class models, Maia handles a subset of Azure’s OpenAI inference workload at lower cost than equivalent NVIDIA hardware for that specific task profile.

Amazon’s Trainium 2 and Inferentia 3, deployed across AWS, follow a similar logic — custom silicon for the specific inference characteristics of AWS’s highest-volume workloads, while NVIDIA hardware handles the general-purpose training and less-optimised inference cases.

What This Means for the AI Supply Chain

The aggregate effect of hyperscaler custom silicon investment is a structural bifurcation in the AI hardware market. NVIDIA retains dominance in training, in the enterprise market (which lacks the scale for custom silicon economics), and in general-purpose inference. But for the highest-volume inference workloads in the world — those operated by Google, Meta, Microsoft, and Amazon — NVIDIA’s market share is lower than headline figures suggest, and declining at the margin.

For organisations outside the hyperscaler tier, the practical implication is indirect: custom silicon investment by the major cloud providers improves the cost structure of AI APIs over time. The inference calls you make via Google Vertex AI or Azure OpenAI may increasingly be served by custom hardware, reducing the cost passed through to API customers. The NVIDIA supply constraint that dominated 2023–2024 was partly a consequence of this bifurcated market not yet having been built at scale. That is changing.

Source Trail

Google TPU v5 technical documentation · Meta MTIA v2 engineering blog (2024) · Microsoft Maia 100 announcement and Azure deployment updates · AWS Trainium 2 / Inferentia 3 product documentation · SemiAnalysis custom silicon market analysis Q1 2026

Arjun Mehta, AI infrastructure and semiconductors correspondent at Next Waves Insight

About Arjun Mehta

Arjun Mehta covers AI compute infrastructure, semiconductor supply chains, and the hardware economics driving the next wave of AI. He has a background in electrical engineering and spent five years in process integration at a leading semiconductor foundry before moving into technology analysis. He tracks arXiv pre-prints, IEEE publications, and foundry filings to surface developments before they reach the mainstream press.
