AI Labs · Research · 22 min read

The Rise of the NPU

An architectural deep dive into the new frontier of AI acceleration — from systolic arrays and spatial dataflow to the GPU vs TPU vs NPU debate reshaping mobile, automotive, and data-center inference.

Published · 11 June 2026
Division · OCXLY AI Labs
Audience · Engineers, Architects, PMs
Citations · 24

Key takeaways

NPUs are purpose-built for the matrix and convolution math behind deep learning, optimising performance-per-watt rather than raw versatility.
Systolic arrays (Google TPU) maximise throughput; spatial dataflow tiles (AMD XDNA™) trade some efficiency for flexibility.
Unified memory, near-memory compute, and virtual-address support are easing the data-movement bottleneck inside SoCs.
Software fragmentation — no common ISA, vendor-specific SDKs — is the single biggest barrier to NPU portability.
Edge AI is exploding: NPUs now power smartphones, ADAS, IIoT, and increasingly hyperscale inference (Maia 200, Gaudi 3, Inferentia).
TOPS alone is misleading — MLPerf and MLPerf Power give a far more honest read on real workloads.
What's next: dedicated attention/softmax accelerators, in-memory compute, on-chip TEEs, and MLIR-based unification.

Neural Processing Units (NPUs) represent a specialised class of hardware accelerators designed to expedite the computationally intensive tasks inherent in artificial intelligence, particularly deep-learning models^[1]. As AI workloads have grown in complexity and scale, the demand for efficient execution has driven innovation well beyond traditional CPUs.

GPUs established themselves as the dominant force in accelerating neural-network computations through massively parallel, SIMD-style (Single Instruction, Multiple Data) execution. NPUs take a narrower approach, tailoring their design specifically to the matrix multiplications and convolutions that define modern AI models^[2].

The defining characteristic of an NPU is its specialisation: it is engineered to execute tensor operations with significantly higher performance-per-watt than general-purpose processors. That efficiency makes NPUs indispensable for power-constrained environments — phones, laptops, edge devices — while their high throughput also positions them as viable alternatives to GPUs for large-scale inference in the data center^[3].

The NPU ecosystem is diverse, ranging from highly optimised systolic arrays in Google's TPUs to flexible, programmable dataflow architectures like AMD's XDNA™. Unlike CPUs and GPUs, which benefit from decades of stable ISAs and mature programming models, NPUs depend on proprietary SDKs and compiler toolchains to translate models from TensorFlow, PyTorch, and ONNX into hardware-specific code^[4]. The following sections examine the architectural principles, real-world applications, market dynamics, and comparative standing of NPUs within the broader AI-acceleration landscape.

Architectural foundations of NPU design

The compute fabric

An NPU's architecture is fundamentally defined by its mission: accelerate the linear algebra that constitutes the majority of work in deep neural networks. Where CPUs prioritise complex logic and branching, and GPUs offer broad SIMD parallelism, NPUs streamline their design to excel at a narrower set of tasks — primarily multiply-accumulate (MAC) operations on large matrices^[5].

The most influential pattern for this purpose is the systolic array: a grid of simple processing elements (PEs), each containing a MAC unit, arranged so that data flows through the array in a synchronised, pipelined fashion. As weights and activations stream through the grid, each PE multiplies, accumulates, and passes data onward to its neighbours. Because operands are reused as they move from PE to PE, the array needs far fewer of the on-chip and off-chip memory accesses that dominate latency and energy^[6]. Google's TPUs are the canonical example, leveraging a massive static systolic array as their central compute engine^[7].
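
To make the data movement concrete, here is a minimal cycle-by-cycle simulation of a systolic array in plain Python. It is a sketch under stated assumptions, not a model of any shipping TPU: it uses an output-stationary layout (each PE keeps its own accumulator while operands flow past), whereas real designs may keep weights stationary instead. The function name `systolic_matmul` and the register names are illustrative.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    PE (i, j) owns the accumulator for output element (i, j). Rows of `a`
    enter from the left edge and columns of `b` from the top edge, each
    skewed by one cycle per row/column so matching operands meet in the PE.
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]      # one accumulator per PE
    a_reg = [[0] * m for _ in range(n)]    # operand travelling left -> right
    b_reg = [[0] * m for _ in range(n)]    # operand travelling top -> bottom

    total_cycles = n + m + k_dim - 2       # time for the last operands to arrive
    for t in range(total_cycles):
        # Sweep from the bottom-right corner so every PE reads the value its
        # neighbour held in the previous cycle, before that neighbour updates.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:                 # left edge: feed from memory, skewed by row
                    k = t - i
                    a_in = a[i][k] if 0 <= k < k_dim else 0
                else:                      # interior: take from the left neighbour
                    a_in = a_reg[i][j - 1]
                if i == 0:                 # top edge: feed from memory, skewed by column
                    k = t - j
                    b_in = b[k][j] if 0 <= k < k_dim else 0
                else:                      # interior: take from the neighbour above
                    b_in = b_reg[i - 1][j]
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                acc[i][j] += a_in * b_in   # the MAC operation
    return acc


result = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# result == [[19, 22], [43, 50]]
```

Note that each input value is read from memory only once, at the array edge; every later use comes from a neighbouring PE's register, which is the reuse the paragraph above describes.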

From rigid arrays to spatial dataflow

In response to the inflexibility of purely systolic designs, newer NPUs are exploring spatial dataflow. AMD's XDNA™ is the prime example: instead of a single monolithic array, it employs a tiled array of smaller, more versatile AI Engines, each combining vector and scalar processors to allow richer control flow^[8]. This programmable fabric lets developers implement dataflow patterns tailored to the structure of each neural-network layer, aiming to combine TPU-level efficiency with GPU-level versatility.

At a lower level, adaptive precision techniques allow the array to shift between INT8, FP8, FP4, and higher-precision modes on the fly — saving power where possible and reserving accuracy only where needed^[9].
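
As a rough illustration of what a drop to INT8 involves, the sketch below applies symmetric per-tensor quantisation to a handful of weights. This is a generic textbook scheme chosen for simplicity, not the method of any particular NPU; hardware that switches precision on the fly will use its own formats and scaling rules, and FP8/FP4 are not shown. The helper names and sample weights are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0   # size of one integer step
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and the shared scale."""
    return [x * scale for x in q]


weights = [0.82, -1.27, 0.05, 0.4]              # illustrative values only
q, scale = quantize_int8(weights)               # q == [82, -127, 5, 40]
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per value is bounded by half a quantisation step (`scale / 2`), which is the accuracy being traded for narrower, cheaper arithmetic.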

Memory hierarchy and system integration

Modern NPUs are usually integrated into larger SoCs alongside CPUs and GPUs. A notable trend is the unified memory pool shared among all three units, as in AMD's Ryzen AI Max+ series, which avoids expensive copies between isolated memory spaces^[10]. Near-memory designs like NXP's eIQ Neutron place compute units close to the memory banks they read from, while NeuMMU-style address translation lets the NPU operate on virtual addresses, abstracting physical layout from the developer. Yet the memory wall, the widening gap between compute throughput and memory bandwidth, remains the biggest bottleneck in AI hardware design^[11].
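
One way to see why the memory wall matters is to count operations per byte moved for a matrix multiplication. The sketch below makes a deliberately simple assumption: each matrix crosses the memory interface exactly once, with no caching effects. The function name and the shapes are illustrative, not taken from any specific model.

```python
def matmul_ops_per_byte(m, n, k, bytes_per_element):
    """Operations per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumes A, B, and C are each transferred once; real traffic depends on
    tiling and on-chip buffer sizes.
    """
    ops = 2 * m * n * k                                   # multiply + add per MAC
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return ops / bytes_moved


# Large square matmul with 1-byte (INT8) elements: lots of reuse per byte.
large_square = matmul_ops_per_byte(4096, 4096, 4096, 1)

# A single-row input against the same weight matrix, as when a model
# processes one token at a time: almost no reuse of the weights.
single_row = matmul_ops_per_byte(1, 4096, 4096, 1)
```

Under these assumptions the large case performs thousands of operations per byte, while the single-row case performs about two, so its speed is set almost entirely by memory bandwidth rather than by the MAC array.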

The software gap

Unlike CPUs and GPUs, NPUs lack a widely adopted open ISA. Each vendor ships its own SDK or runtime (Qualcomm's Hexagon SDK, MediaTek's NeuroPilot, Rockchip's RKNN, Apple's Core ML, AMD's XDNA toolchain) to translate framework models into executable code^[12]. This fragmentation hurts portability, which is why emerging projects like MLIR-AIR aim to provide a standardised intermediate representation that maps efficiently across diverse NPU fabrics^[13]. The lesson: an NPU's real-world performance is inseparable from the quality of its compiler and runtime.

Real-world applications and market penetration

Smartphones and AI PCs

The smartphone was the first mass-market domain to embrace NPUs. Apple's introduction of the Neural Engine in the A11 Bionic in 2017 was the pivotal moment — accelerating Face ID, Siri, and later Apple Intelligence with on-device privacy^[14]. Today, Qualcomm's Snapdragon 8 Gen 3 can run Stable Diffusion entirely offline in under a second^[15]. On the PC side, AMD's Ryzen AI PRO and Ryzen AI MAX+ families integrate XDNA™ 2 NPUs offering up to 60 TOPS for real-time translation, background blur, and on-device generative features. AI-advanced PCs are projected to exceed half of global shipments by 2026.

Automotive and ADAS

Advanced driver-assistance systems require immense compute for real-time perception, sensor fusion, and decision-making. Qualcomm's Snapdragon Ride platform delivers 36–100 TOPS, and industry projections suggest a single vehicle may require more than 5,000 TOPS by 2030 to enable full autonomy^[16]. NPUs are embedded directly into automotive SoCs to process camera, LiDAR, and radar streams in milliseconds.

Industrial IoT and the edge

The embedded-AI market is forecast to grow from $13.49 B in 2026 to $48.90 B by 2034 (CAGR 17.5%), and the broader Edge-AI market from $24.91 B in 2025 to $118.69 B by 2033 (CAGR 21.7%)^[17]. NVIDIA's Jetson platform already captures 39% of edge-AI revenue, with the Jetson AGX Orin module delivering up to 275 TOPS. MediaTek's Genio targets drones, robots, and commercial IoT.
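
The growth rates above can be sanity-checked with the standard compound annual growth rate formula. The snippet is a small worked calculation using only the endpoint figures quoted in the paragraph; the function name is an assumption.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


embedded_ai = cagr(13.49, 48.90, 2034 - 2026)   # embedded-AI market, $B
edge_ai = cagr(24.91, 118.69, 2033 - 2025)      # edge-AI market, $B
```

The embedded-AI figures reproduce the quoted rate of about 17.5%. The edge-AI endpoints give roughly 21.5% over 2025 to 2033, slightly under the quoted 21.7%. One possible explanation is that the published rate was measured from a different base year, but that is an inference, not something the source states.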

Cloud and data center

GPUs still dominate training, but their power draw is a liability for always-on inference. Microsoft's Maia 200, co-designed with OpenAI, is explicitly positioned for inference efficiency with native FP8/FP4 tensor cores^[18]. Intel's Gaudi 3 has reported LLM-inference parity (and in some tests, up to a 30% lead) versus NVIDIA's H100, while AWS Inferentia continues to optimise cost-per-inference inside EC2^[19].

35–70%

Power reduction reported by NPU-based servers versus equivalent GPU inference workloads, while matching or exceeding throughput.^[20]

Comparative analysis: GPU vs TPU vs NPU

GPUs — originally built for graphics — became the deep-learning default thanks to massive SIMD parallelism and the maturity of CUDA. Their strength is versatility and ecosystem; their weakness is power efficiency for the narrow linear-algebra patterns that dominate neural networks.

TPUs represent radical specialisation. A custom ASIC organised around a large static systolic array, the TPU excels at high-throughput matrix multiplication — particularly inference — and scales effectively within Google's supercomputers. The trade-off is tight coupling to Google's stack and reduced flexibility outside that environment^[21].

NPUs occupy the middle ground, defined less by a single architecture than by a focus on performance-per-watt inside SoCs. NPU-based inference servers have been reported to match or exceed GPU throughput while consuming 35–70% less power^[20]. Modern designs like XDNA™ further blur the line, offering tiled, programmable AI Engines that approach GPU-style flexibility without surrendering NPU efficiency.

GPU

Versatile SIMD, mature CUDA ecosystem, dominant for training and broad inference. Moderate power efficiency. Examples: NVIDIA H100, RTX series.

TPU

Specialised systolic-array ASIC. Peak efficiency on matrix ops, tightly coupled to Google's stack. Examples: TPU v1–v5.

NPU

SoC-integrated, optimised for performance-per-watt across mobile, edge, and increasingly inference at scale. Examples: Apple Neural Engine, AMD XDNA, Qualcomm Hexagon.

Ultimately, the choice is workload-driven. For novel-architecture training, the GPU still wins on flexibility. For hyperscale, stable production inference, TPUs and modern NPUs compete on cost-per-inference. For nearly all on-device AI — unlocking a phone, driving an assistance system, running an LLM at the edge — the NPU's joint optimisation of latency, throughput, and power is the only viable answer.

Metrics, benchmarks, and emerging trends

Beyond TOPS

TOPS (trillion operations per second) is the most quoted NPU metric — AMD Ryzen AI MAX+ 395 advertises 50+ peak AI TOPS; Jetson AGX Orin reaches 275 TOPS^[22]. But TOPS ignores memory bandwidth, latency, power, and the efficiency of the compiler stack. Two NPUs at identical TOPS can deliver vastly different real-world performance.
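
A short calculation shows how two parts with the same headline TOPS can diverge. Peak TOPS is conventionally derived from the number of MAC units and the clock rate, counting each MAC as two operations. Sustained throughput is then capped by whichever is lower: that compute ceiling or what memory bandwidth can feed (a roofline-style bound). All numbers below describe hypothetical chips, not real products.

```python
def peak_tops(mac_units, clock_ghz):
    """Peak tera-operations per second, counting each MAC as two operations."""
    return 2 * mac_units * clock_ghz * 1e9 / 1e12


def attainable_tops(peak, bandwidth_gb_s, ops_per_byte):
    """Roofline-style bound: the lower of the compute and memory ceilings."""
    memory_ceiling = bandwidth_gb_s * 1e9 * ops_per_byte / 1e12
    return min(peak, memory_ceiling)


# Two hypothetical NPUs with the same MAC array and clock, hence the same
# advertised peak, but different memory bandwidth.
peak = peak_tops(mac_units=16384, clock_ghz=1.5)                        # 49.152 TOPS
chip_a = attainable_tops(peak, bandwidth_gb_s=256, ops_per_byte=200)    # compute-bound
chip_b = attainable_tops(peak, bandwidth_gb_s=64, ops_per_byte=200)     # memory-bound
```

On a workload that performs 200 operations per byte fetched, the first chip reaches its full peak while the second tops out at 12.8 TOPS, about a quarter of the advertised figure. Compiler quality enters through `ops_per_byte`: better tiling and operator fusion raise it.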

The honest assessment comes from MLPerf — standardised training and inference benchmarks across representative models — and from MLPerf Power, which captures energy consumed per unit of work^[23]. As data-center electricity costs rise and sustainability metrics tighten, performance-per-watt is becoming as important as peak throughput.
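
The underlying arithmetic of an energy-aware metric is simple, even though the MLPerf Power methodology itself is considerably more detailed than this. The sketch below shows the basic ratios only. The function names are assumptions, and the example plugs in the 35–70% power-reduction range quoted earlier in this article.

```python
def energy_per_inference_joules(avg_power_watts, inferences_per_second):
    """Energy spent on one inference at a steady average power draw."""
    return avg_power_watts / inferences_per_second


def perf_per_watt_gain(power_reduction, throughput_ratio=1.0):
    """Performance-per-watt relative to a baseline.

    power_reduction: fraction of baseline power saved (0.35 means 35% less).
    throughput_ratio: new throughput divided by baseline throughput.
    """
    return throughput_ratio / (1.0 - power_reduction)


# A hypothetical server drawing 300 W at 1,500 inferences per second.
joules = energy_per_inference_joules(300, 1500)   # 0.2 J per inference

# Equal throughput at 35% and 70% less power.
low_gain = perf_per_watt_gain(0.35)               # about 1.54x
high_gain = perf_per_watt_gain(0.70)              # about 3.33x
```

At equal throughput, a 35–70% power reduction therefore corresponds to roughly 1.5x to 3.3x better performance-per-watt.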

What's next

Dedicated operator accelerators. Most NPUs are optimised for dense matrix multiplication, but transformer-era workloads spend significant time in attention and softmax. Specialised hardware for these specific layers is emerging as a meaningful performance lever^[24].
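
To show which parts of attention fall outside dense matrix multiplication, here is a minimal single-query scaled dot-product attention in plain Python. It is an illustrative sketch rather than an optimised kernel. The dot products map naturally onto MAC arrays, while the softmax step needs a max reduction, exponentials, and a division, which a MAC grid does not provide directly.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(scores)                                   # reduction, not a MAC
    exps = [math.exp(s - peak) for s in scores]          # exponentials
    total = sum(exps)
    return [e / total for e in exps]                     # division


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)   # MAC-friendly
        for key in keys
    ]
    weights = softmax(scores)
    width = len(values[0])
    return [
        sum(w * v[i] for w, v in zip(weights, values))          # MAC-friendly
        for i in range(width)
    ]


out = attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])
# Both keys match the query equally, so the output is the mean of the values.
```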

In-memory compute. Performing operations directly inside memory arrays attacks the memory wall head-on. Eliminating data movement promises substantial latency and energy gains for next-generation accelerators.

Trusted execution environments (TEEs) on the NPU. As AI handles increasingly sensitive data, isolating computation and model weights inside protected hardware regions becomes essential for privacy, IP protection, and regulatory compliance.

Software unification. The lack of a universal NPU ISA remains the field's biggest friction point. Higher-level abstractions like MLIR-AIR are the most promising path toward a portable NPU programming model, and any vendor that ships truly developer-friendly tooling stands to gain disproportionate market share.

The bigger picture

NPUs are no longer a niche mobile accelerator — they are the foundation of an AI-hardware tier defined by joint optimisation of throughput, latency, and watts. The next decade will not be a GPU-vs-TPU-vs-NPU zero-sum battle; each architecture serves a distinct workload zone, and the most interesting designs are the ones that borrow from all three.

For builders: pick the accelerator by workload, not by brand. For investors: the gravity is moving toward inference efficiency and developer experience. For everyone else: the AI you use on-device tomorrow will almost certainly be running on an NPU — and the architectural choices made today will shape what that experience feels like.

References

[1] Hardware for Deep Learning Acceleration. Wiley Online Library. advanced.onlinelibrary.wiley.com
[2] Architecture of AI Accelerators. ResearchGate. researchgate.net
[3] Performance and Efficiency Gains of NPU-Based Servers over GPU-Based Servers. MDPI Systems, 13(9), 797. mdpi.com
[4] PyTorchSim: A Comprehensive, Fast, and Accurate NPU Simulation Framework. ACM. dl.acm.org
[5] Explained: CPU, GPU, TPU, NPU, LPU AI Hardware Architectures. LinkedIn / Avi Chawla.
[6] Systolic Array Data Flows for Efficient Matrix Multiplication in Deep Learning. arXiv 2410.22595. arxiv.org
[7] Google TPUv3 Lead on Building Better AI Chips with Systolic Arrays. LinkedIn.
[8] AMD XDNA™ Architecture. amd.com/en/technologies/xdna
[9] Designing an Elegant and Reliable BFP-Based NPU. arXiv 2604.10494. arxiv.org
[10] AMD Ryzen™ AI MAX+ 395 Processor. amd.com
[11] Strix: Re-thinking NPU Reliability from a System Perspective. arXiv 2604.10484. arxiv.org
[12] Qualcomm Hexagon NPU SDK / Rockchip RKNN SDK: vendor toolchains for on-device inference deployment.
[13] MLIR-AIR: a unifying intermediate representation for spatial NPU programming.
[14] Deploying Transformers on the Apple Neural Engine. Apple Machine Learning Research. machinelearning.apple.com
[15] Qualcomm Snapdragon 8 Gen 3: On-Device Generative AI. Counterpoint Research. counterpointresearch.com
[16] Intel® AI for Enterprise Inference and automotive autonomy projections.
[17] Edge AI Market Size, Share & Trends. Grand View Research. grandviewresearch.com
[18] Maia 200: The AI accelerator built for inference. Microsoft Blog. blogs.microsoft.com
[19] AI Accelerators for Large Language Model Inference. arXiv 2506.00008. arxiv.org
[20] Unlocking the AMD Neural Processing Unit for ML Training. arXiv 2504.03083. arxiv.org
[21] A Survey on Deep Learning Hardware Accelerators. arXiv 2306.15552. arxiv.org
[22] Exploring Edge AI Performance with NVIDIA Jetson Orin NX. Dell Community.
[23] Benchmarking TPU, GPU, and CPU Platforms for Deep Learning. arXiv 1907.10701. ar5iv.labs.arxiv.org
[24] Lightweight and Energy-Efficient Deep Learning Accelerator for Real-Time Object Detection on Edge Devices. ResearchGate.