An architectural deep dive into the new frontier of AI acceleration — from systolic arrays and spatial dataflow to the GPU vs TPU vs NPU debate reshaping mobile, automotive, and data-center inference.
Published · 11 June 2026Division · OCXLY AI LabsAudience · Engineers, Architects, PMsCitations · 24
Reading mode
Neural Processing Units (NPUs) represent a specialised class of hardware accelerators designed to expedite the computationally intensive tasks inherent in artificial intelligence, particularly deep-learning models[1]. As AI workloads have grown in complexity and scale, the demand for efficient execution has driven innovation well beyond traditional CPUs.
While GPUs established themselves as the dominant force in accelerating neural-network computations through massively parallel Single Instruction, Multiple Data (SIMD) architectures, NPUs offer a more focused approach by tailoring their design specifically to the matrix multiplications and convolutions that define modern AI models[2].
The defining characteristic of an NPU is its specialisation: it is engineered to execute tensor operations with significantly higher performance-per-watt than general-purpose processors. That efficiency makes NPUs indispensable for power-constrained environments — phones, laptops, edge devices — while their high throughput also positions them as viable alternatives to GPUs for large-scale inference in the data center[3].
The NPU ecosystem is diverse, ranging from highly optimised systolic arrays in Google's TPUs to flexible, programmable dataflow architectures like AMD's XDNA™. Unlike CPUs and GPUs, which benefit from decades of standardised ISAs, NPUs depend on proprietary SDKs and compiler toolchains to translate models from TensorFlow, PyTorch, and ONNX into hardware-specific code[4]. The following sections examine the architectural principles, real-world applications, market dynamics, and comparative standing of NPUs within the broader AI-acceleration landscape.
01
Architectural foundations of NPU design
The compute fabric
An NPU's architecture is fundamentally defined by its mission: accelerate the linear algebra that constitutes the majority of work in deep neural networks. Where CPUs prioritise complex logic and branching, and GPUs offer broad SIMD parallelism, NPUs streamline their design to excel at a narrower set of tasks — primarily multiply-accumulate (MAC) operations on large matrices[5].
The most influential pattern for this purpose is the systolic array: a grid of simple processing elements, each containing a MAC unit, arranged so that data flows through the array in a synchronised, pipelined fashion. As weights and activations stream through the grid, each PE multiplies, accumulates, and propagates the result onward — drastically reducing the on-chip and off-chip memory accesses that dominate latency and energy[6]. Google's TPUs are the canonical example, leveraging a massive static systolic array as their central compute engine[7].
From rigid arrays to spatial dataflow
In response to the inflexibility of purely systolic designs, newer NPUs are exploring spatial dataflow. AMD's XDNA™ is the prime example: instead of a single monolithic array, it employs a tiled array of smaller, more versatile AI Engines, each combining vector and scalar processors to allow richer control flow[8]. This programmable fabric lets developers implement dataflow patterns tailored to the structure of each neural-network layer, aiming to combine TPU-level efficiency with GPU-level versatility.
At a lower level, adaptive precision techniques allow the array to shift between INT8, FP8, FP4, and higher-precision modes on the fly — saving power where possible and reserving accuracy only where needed[9].
Memory hierarchy and system integration
Modern NPUs are usually integrated into larger SoCs alongside CPUs and GPUs. A notable trend is the unified memory pool — shared between all three units, as in AMD's Ryzen AI Max+ series — which eliminates expensive copies between isolated memory spaces[10]. Near-memory designs like NXP's eIQ Neutron place compute units directly inside memory banks, while NeuMMU-style address translation lets the NPU operate on virtual addresses, abstracting physical layout from the developer. Yet the memory wall — the widening gap between compute and bandwidth — remains the biggest bottleneck in AI hardware design[11].
The software gap
Unlike CPUs and GPUs, NPUs lack a widely adopted open ISA. Each vendor ships its own SDK — Qualcomm Hexagon, MediaTek RKNN, Apple CoreML, AMD XDNA toolchain — to translate framework models into executable code[12]. This fragmentation hurts portability, and emerging projects like MLIR-AIR aim to provide a standardised intermediate representation that maps efficiently across diverse NPU fabrics[13]. The lesson: an NPU's real-world performance is inseparable from the quality of its compiler and runtime.
02
Real-world applications and market penetration
Smartphones and AI PCs
The smartphone was the first mass-market domain to embrace NPUs. Apple's introduction of the Neural Engine in the A11 Bionic in 2017 was the pivotal moment — accelerating Face ID, Siri, and later Apple Intelligence with on-device privacy[14]. Today, Qualcomm's Snapdragon 8 Gen 3 can run Stable Diffusion entirely offline in under a second[15]. On the PC side, AMD's Ryzen AI PRO and Ryzen AI MAX+ families integrate XDNA™ 2 NPUs offering up to 60 TOPS for real-time translation, background blur, and on-device generative features. AI-advanced PCs are projected to exceed half of global shipments by 2026.
Automotive and ADAS
Advanced driver-assistance systems require immense compute for real-time perception, sensor fusion, and decision-making. Qualcomm's Snapdragon Ride platform delivers 36–100 TOPS, and industry projections suggest a single vehicle may require more than 5,000 TOPS by 2030 to enable full autonomy[16]. NPUs are embedded directly into automotive SoCs to process camera, LiDAR, and radar streams in milliseconds.
Industrial IoT and the edge
The embedded-AI market is forecast to grow from $13.49 B in 2026 to $48.90 B by 2034 (CAGR 17.5%), and the broader Edge-AI market from $24.91 B in 2025 to $118.69 B by 2033 (CAGR 21.7%)[17]. NVIDIA's Jetson platform already captures 39% of edge-AI revenue, with the Jetson AGX Orin module delivering up to 275 TOPS. MediaTek's Genio targets drones, robots, and commercial IoT.
Cloud and data center
GPUs still dominate training, but their power draw is a liability for always-on inference. Microsoft's Maia 200, co-designed with OpenAI, is explicitly positioned for inference efficiency with native FP8/FP4 tensor cores[18]. Intel's Gaudi 3 has reported LLM-inference parity (and in some tests, up to a 30% lead) versus NVIDIA's H100, while AWS Inferentia continues to optimise cost-per-inference inside EC2[19].
35–70%
Power reduction reported by NPU-based servers versus equivalent GPU inference workloads, while matching or exceeding throughput.[20]
03
Comparative analysis: GPU vs TPU vs NPU
GPUs — originally built for graphics — became the deep-learning default thanks to massive SIMD parallelism and the maturity of CUDA. Their strength is versatility and ecosystem; their weakness is power efficiency for the narrow linear-algebra patterns that dominate neural networks.
TPUs represent radical specialisation. A custom ASIC organised around a large static systolic array, the TPU excels at high-throughput matrix multiplication — particularly inference — and scales effectively within Google's supercomputers. The trade-off is tight coupling to Google's stack and reduced flexibility outside that environment[21].
NPUs occupy the middle ground, defined less by a single architecture than by a focus on performance-per-watt inside SoCs. NPU-based inference servers consistently match or exceed GPU throughput while consuming 35–70% less power[20]. Modern designs like XDNA™ further blur the line, offering tiled, programmable AI Engines that approach GPU-style flexibility without surrendering NPU efficiency.
GPU
Versatile SIMD, mature CUDA ecosystem, dominant for training and broad inference. Moderate power efficiency. Examples: NVIDIA H100, RTX series.
TPU
Specialised systolic-array ASIC. Peak efficiency on matrix ops, tightly coupled to Google's stack. Examples: TPU v1–v5.
NPU
SoC-integrated, optimised for performance-per-watt across mobile, edge, and increasingly inference at scale. Examples: Apple Neural Engine, AMD XDNA, Qualcomm Hexagon.
Ultimately, the choice is workload-driven. For novel-architecture training, the GPU still wins on flexibility. For hyperscale, stable production inference, TPUs and modern NPUs compete on cost-per-inference. For nearly all on-device AI — unlocking a phone, driving an assistance system, running an LLM at the edge — the NPU's joint optimisation of latency, throughput, and power is the only viable answer.
04
Metrics, benchmarks, and emerging trends
Beyond TOPS
TOPS (trillion operations per second) is the most quoted NPU metric — AMD Ryzen AI MAX+ 395 advertises 50+ peak AI TOPS; Jetson AGX Orin reaches 275 TOPS[22]. But TOPS ignores memory bandwidth, latency, power, and the efficiency of the compiler stack. Two NPUs at identical TOPS can deliver vastly different real-world performance.
The honest assessment comes from MLPerf — standardised training and inference benchmarks across representative models — and from MLPerf Power, which captures energy consumed per unit of work[23]. As data-center electricity costs rise and sustainability metrics tighten, performance-per-watt is becoming as important as peak throughput.
What's next
Dedicated operator accelerators. Most NPUs are optimised for dense matrix multiplication, but transformer-era workloads spend significant time in attention and softmax. Specialised hardware for these specific layers is emerging as a meaningful performance lever[24].
In-memory compute. Performing operations directly inside memory arrays attacks the memory wall head-on. Eliminating data movement promises substantial latency and energy gains for next-generation accelerators.
Trusted execution environments (TEEs) on the NPU. As AI handles increasingly sensitive data, isolating computation and model weights inside protected hardware regions becomes essential for privacy, IP protection, and regulatory compliance.
Software unification. The lack of a universal NPU ISA remains the field's biggest friction point. Higher-level abstractions like MLIR-AIR are the most promising path toward a portable NPU programming model, and any vendor that ships truly developer-friendly tooling stands to gain disproportionate market share.
·
The bigger picture
NPUs are no longer a niche mobile accelerator — they are the foundation of an AI-hardware tier defined by joint optimisation of throughput, latency, and watts. The next decade will not be a GPU-vs-TPU-vs-NPU zero-sum battle; each architecture serves a distinct workload zone, and the most interesting designs are the ones that borrow from all three.
For builders: pick the accelerator by workload, not by brand. For investors: the gravity is moving toward inference efficiency and developer experience. For everyone else: the AI you use on-device tomorrow will almost certainly be running on an NPU — and the architectural choices made today will shape what that experience feels like.
Maia 200: The AI accelerator built for inference. Microsoft Blog. blogs.microsoft.com
AI Accelerators for Large Language Model Inference. arXiv 2506.00008. arxiv.org
Unlocking the AMD Neural Processing Unit for ML Training. arXiv 2504.03083. arxiv.org
A Survey on Deep Learning Hardware Accelerators. arXiv 2306.15552. arxiv.org
Exploring Edge AI Performance with NVIDIA Jetson Orin NX. Dell Community.
Benchmarking TPU, GPU, and CPU Platforms for Deep Learning. arXiv 1907.10701. ar5iv.labs.arxiv.org
Lightweight and Energy-Efficient Deep Learning Accelerator for Real-Time Object Detection on Edge Devices. ResearchGate.
We value your privacy
We use cookies to enhance your browsing experience, analyze site traffic, and personalize content. Essential cookies are always on. You can accept all, reject non-essential, or customize your choices.
Cookie Policy
Privacy Preferences
Choose which categories of cookies and tracking you allow. You can change these settings at any time from the footer.
Strictly Necessary Always Active
Required for the site to function — page navigation, secure areas, and remembering your consent choice. Cannot be disabled.
Analytics & Performance
Anonymous Google Analytics data so we can understand how the site is used and improve it. No personal identifiers are stored.
Functional
Remembers preferences such as language and region to give a more personalized experience on return visits.
Marketing & Personalization
Used to measure ad effectiveness and show content relevant to your interests. Linked to the CCPA "Do Not Sell or Share" right — disable this to opt out.