Microsoft details Maia 200 inference accelerator

Microsoft has detailed Maia 200, its latest inference accelerator silicon. Built on TSMC 3nm, the chip pairs FP8/FP4 tensor cores with 216GB HBM3e, targeting higher token-throughput per watt in Azure-scale deployments.


In Brief:

  • Maia 200 is Microsoft’s new inference-focused accelerator, fabricated on TSMC’s 3nm process.
  • Key hardware claims include FP8/FP4 tensor cores, 216GB HBM3e at 7 TB/s, and 272MB on-chip SRAM.
  • Microsoft says initial deployment is in Azure’s US Central region, with a Maia SDK preview offering PyTorch integration and a Triton compiler.

Microsoft has unveiled technical detail on Maia 200, its latest in-house AI inference accelerator, and the specifications read like a direct response to the realities of token generation at scale: memory bandwidth, data movement, and system-level networking matter at least as much as peak compute. Maia 200 is built on TSMC’s 3nm process and is designed around low-precision inference workloads, with native FP8 and FP4 tensor processing.

On-package memory is the headline. Microsoft says the redesigned memory system delivers 216GB of HBM3e at 7 TB/s, backed by 272MB of on-chip SRAM, with data movement engines intended to keep large models fed without starving compute. The company also states each chip contains more than 140 billion transistors, and frames performance in practical low-precision terms: over 10 petaFLOPS in FP4 and over 5 petaFLOPS in FP8, within a 750W SoC TDP envelope.
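
Those figures are easy to sanity-check. The back-of-envelope sketch below, in Python, uses only the numbers Microsoft quotes plus a standard roofline-style ratio; the 100B-parameter model is a hypothetical chosen for illustration, not anything Microsoft has stated:

```python
# Back-of-envelope roofline check using only the figures Microsoft quotes.
# The ratio arithmetic is standard; the model size is a hypothetical.

fp4_flops = 10e15        # claimed peak FP4 throughput, FLOPS
fp8_flops = 5e15         # claimed peak FP8 throughput, FLOPS
hbm_bandwidth = 7e12     # claimed HBM3e bandwidth, bytes/s

# Arithmetic intensity needed to saturate compute (FLOPs per byte moved).
# Below this ratio a kernel is memory-bound, not compute-bound.
ridge_fp4 = fp4_flops / hbm_bandwidth   # ~1429 FLOPs/byte
ridge_fp8 = fp8_flops / hbm_bandwidth   # ~714 FLOPs/byte

# Decode-phase LLM inference streams the full weight set per token and does
# roughly 2 FLOPs per weight read, i.e. ~2 FLOPs/byte at 1 byte per FP8
# parameter -- far below the ridge point, so bandwidth sets the token rate.
weights_bytes = 100e9    # hypothetical 100B-parameter model in FP8
tokens_per_s = hbm_bandwidth / weights_bytes
print(f"FP4 ridge point: {ridge_fp4:.0f} FLOPs/byte")
print(f"Bandwidth-bound decode ceiling: ~{tokens_per_s:.0f} tokens/s per replica")
```

At small batch sizes the tensor cores would sit almost entirely idle, which is why the 7 TB/s figure, the SRAM, and the data movement engines are the parts Microsoft leads with rather than peak FLOPS.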

Those are big numbers, but the more telling engineering choices sit in the system design. Microsoft describes a two-tier scale-up network built on standard Ethernet, with a custom transport layer and an integrated NIC intended to avoid reliance on proprietary fabrics. It claims 2.8 TB/s of dedicated bidirectional scale-up bandwidth per accelerator and predictable collective operations across clusters of up to 6,144 accelerators, with four accelerators fully connected inside each tray to keep local traffic local.
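
The bandwidth claim can be turned into a rough latency floor for collectives. A hedged sketch follows, assuming a plain ring all-reduce and an even split of the 2.8 TB/s bidirectional figure; the 1GB payload and the intermediate cluster size are illustrative assumptions, not Microsoft's numbers:

```python
# Bandwidth-only lower bound on a ring all-reduce over the claimed fabric.
# Only the 2.8 TB/s figure comes from Microsoft; everything else is assumed.

def ring_allreduce_seconds(tensor_bytes: float, n: int, link_bytes_per_s: float) -> float:
    """Bandwidth term only: each rank moves 2*(n-1)/n of the tensor."""
    traffic = 2 * (n - 1) / n * tensor_bytes
    return traffic / link_bytes_per_s

link = 2.8e12 / 2            # assume 1.4 TB/s each way of the bidirectional figure
payload = 1e9                # hypothetical 1GB gradient/activation exchange
for n in (4, 64, 6144):      # tray scale, an assumed mid-size pod, claimed max cluster
    t = ring_allreduce_seconds(payload, n, link)
    print(f"n={n:>5}: >= {t*1e6:.0f} us (no latency or congestion term)")
```

Note that the bandwidth term flattens out quickly as n grows, since 2*(n-1)/n approaches 2; at cluster scale it is latency and congestion that dominate, which is presumably why Microsoft emphasises predictable collectives rather than raw link speed.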

Deployment detail matters for engineers trying to map availability to real design programmes. Microsoft says Maia 200 is already deployed in its US Central datacentre region near Des Moines, Iowa, with the US West 3 region near Phoenix, Arizona, next, and more regions to follow. Alongside the hardware, Microsoft is previewing a Maia SDK that it says includes PyTorch integration, a Triton compiler, an optimised kernel library, and access to a low-level programming language, with the aim of making model porting across heterogeneous accelerator fleets less painful.
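
Microsoft has not published the Maia SDK's API surface in this announcement, so the example below is not Maia code. It is a minimal kernel in the open-source Triton language, shown only to indicate the programming model a "Triton compiler" backend would typically consume; every identifier comes from the public triton and torch packages, and nothing here is Maia-specific:

```python
# Minimal open-source Triton kernel: elementwise add over two tensors.
# Shown as a reference for the programming model only; not Maia-specific.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)               # one program instance per block
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # element indices for this block
    mask = offs < n_elements                  # guard the ragged tail block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Tensors must already live on the accelerator device.
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)            # enough blocks to cover n
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```

The portability pitch rests on exactly this layering: if kernels are expressed in Triton and models stay in PyTorch, retargeting to a new accelerator becomes a compiler and kernel-library problem rather than an application rewrite.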

Maia 200 is another sign that hyperscalers are no longer content to optimise around one vendor’s roadmap, and that the power, cooling, and networking constraints of AI infrastructure are now shaping silicon microarchitecture in plain view.

If Microsoft’s Ethernet-based scale-up approach delivers the utilisation it’s promising, it will pressure the wider ecosystem — not only on raw performance, but on how quickly a platform can be made usable by software teams who have no appetite for bespoke toolchains that die after one generation.

