EulerNPU – NPU Inference Composition & Simulation Stack

Three NPUs Running on a Real FPGA

Same compiler · same board (QMTECH XC7Z020, Zynq-7000) — three different NPUs synthesized and validated on real hardware

KWS (Keyword Spotting)

DS-CNN + 16-layer GRU + FC, quantized to INT8. The full-graph NPU IP (kws_npu_top) is synthesized at 90 MHz and uses 45% of DSP and 24% of BRAM.

Accuracy	11/11 (CPU↔NPU 100% match)
Speedup	8.07× vs CPU (3.17 ms/inference)
Streaming	200 frames in 5.166 s · 38.7 inf/s
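The streaming figure follows from the frame count and the wall-clock time. A quick arithmetic check, using only the numbers in the table above (the variable names are illustrative):

```python
# Figures taken from the KWS table above.
frames = 200
elapsed_s = 5.166

throughput = frames / elapsed_s              # inferences per second
per_frame_ms = 1000.0 * elapsed_s / frames   # wall-clock time per streamed frame

print(f"{throughput:.1f} inf/s, {per_frame_ms:.2f} ms per frame")
```

The per-frame wall-clock time covers the whole streaming loop, so it is expected to be larger than the 3.17 ms inference figure.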

xsdb 6-stage boot + UART output: 11/11 accuracy, 8.07× speedup, per-keyword confidence.

CWRU Bearing Fault Detection

Vibration FFT → 3×Conv1D + MaxPool + GAP + 2×FC, quantized to INT8. The bearing_npu_top IP runs at 90 MHz and uses 86% of DSP and 12% of BRAM.

Accuracy	40/40 (CPU↔NPU 100% match)
Speedup	11.13× vs CPU (~4 ms/inference, 252 inf/s)
Real-time monitor	200 windows · 100% recall · 0 false alarms
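For readers who want to reproduce the monitor metrics on their own data, the sketch below shows one common way to compute recall and the false-alarm count. It simplifies the task to binary labels (1 = fault, 0 = normal), and both the function name and the sample stream are hypothetical, not taken from the EulerNPU demo.

```python
import numpy as np

def monitor_metrics(y_true, y_pred):
    """Return (recall, false_alarms) for binary fault labels (1 = fault, 0 = normal)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    true_pos = int(np.sum(y_true & y_pred))       # faults that were flagged
    false_neg = int(np.sum(y_true & ~y_pred))     # faults that were missed
    false_alarms = int(np.sum(~y_true & y_pred))  # normal windows flagged as faults
    faults = true_pos + false_neg
    recall = true_pos / faults if faults else float("nan")
    return recall, false_alarms

# Hypothetical 200-window stream: 50 fault windows, all detected, no false alarms.
truth = np.array([0] * 150 + [1] * 50)
pred = truth.copy()
recall, false_alarms = monitor_metrics(truth, pred)
print(recall, false_alarms)
```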

Demo 1 — Accuracy validation. Per-class classification table over 40 test samples (CPU/NPU cycles, 11.13× speedup).

Demo 2 — Real-time monitor. 200-window stream (fault class, confidence; 100% recall, 0 false alarms).

nanoGPT LLM (FFN-on-NPU)

nanoGPT 10.77M (D=384, 6 layers, TinyShakespeare). Only the FFN (fc1 → gelu → fc2) is offloaded to the NPU; the rest runs on ARM — a hybrid L2 setup that lets an LLM run on XC7Z020.

Validation	CPU↔NPU text 5/5 bit-identical
Resources	DSP 9% · BRAM 2% · LUT 17% (FFN only)
Significance	Proves partial LLM inference is feasible on a Zynq-7000-class board
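As a reference for what the offloaded block computes, here is a minimal NumPy sketch of fc1 → gelu → fc2. It assumes the usual nanoGPT 4× hidden expansion and the tanh approximation of GELU; neither is stated in the text above. The weights are random placeholders and nothing here is the NPU implementation.

```python
import math
import numpy as np

D = 384          # embedding width, from the text above
HIDDEN = 4 * D   # assumption: the usual nanoGPT 4x expansion

def gelu(x):
    # tanh approximation of GELU (an assumption; the deployed variant is not stated)
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x**3)))

def ffn(x, w1, b1, w2, b2):
    """fc1 -> gelu -> fc2 for activations x of shape (tokens, D)."""
    return gelu(x @ w1 + b1) @ w2 + b2

# Random placeholder weights, only to exercise the shapes.
rng = np.random.default_rng(0)
w1 = rng.standard_normal((D, HIDDEN)).astype(np.float32) * 0.02
b1 = np.zeros(HIDDEN, dtype=np.float32)
w2 = rng.standard_normal((HIDDEN, D)).astype(np.float32) * 0.02
b2 = np.zeros(D, dtype=np.float32)

x = rng.standard_normal((4, D)).astype(np.float32)  # 4 tokens
y = ffn(x, w1, b1, w2, b2)
print(y.shape)
```

Under the 4× assumption the FFN holds 8·D² of the roughly 12·D² weights in each transformer block, which is one reason it is a natural candidate for offload.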

Demo 1 — Bit-identical match. 5 prompts × 16 characters generated — verified 5/5 CPU↔NPU text identity.

Demo 2 — Live stream. 50-character ROMEO prompt generation — FFN-on-NPU hybrid live inference.

Note: All three projects are real-hardware results, synthesized directly from the same EulerNPU compiler output and measured on the FPGA board (not simulation). Board: QMTECH XC7Z020 CLG484-1 (Zynq-7000), PL clock 90–100 MHz, ARM Cortex-A9 PS. The next step in this same flow is ASIC synthesis.

Proven path

After proving on FPGA, the same flow goes to ASIC

Done · Public: FPGA Verification (KWS · Fault Detection · LLM, measured on Zynq-7000)

→ Same compiler, target switch

→ Long-term: Sovereign NPU (AI inference chip sovereignty, edge · on-device)

Core Features

138 operators, 10 DTypes, from spec.yaml to FPGA inference

138 operators (17 groups, A–Q)

All operations needed for NPU inference are organized into 17 groups. Coverage extends to recent architectures, including Efficient Attention (FlashAttention · GQA), Vision Encoder, MoE/Sparse, Diffusion, and Speculative Decoding.

Core Math	MatMul, Add, Mul, Div, Sqrt and other basic math ops
Activation	ReLU, GELU, SiLU, Sigmoid, Softmax, etc.
Normalization	LayerNorm, RMSNorm, BatchNorm, GroupNorm
Conv/Vision	Conv2D, DepthwiseConv, Pool, Resize, Patch
Sequence/Attention	ScaledDotProduct, MultiHeadAttention, RoPE, ALiBi
Efficient Attention NEW	FlashAttention, SlidingWindowAttention, MultiQueryAttention (GQA)
MoE/Sparse	TopKRouter, ExpertDispatch, LoadBalanceLoss
Recurrent	LSTM, GRU, SRU
Graph	Concat, Split, Reshape, Transpose, Gather, Scatter
Multimodal	CrossAttention, VisionProjection, AudioMel
Vision Encoder NEW	PatchEmbed, ClsTokenPrepend, ImageNorm
Diffusion NEW	TimestepEmbed, NoiseSample, DDIMStep, CFGScale, FlowMatchStep
Speculative Decoding NEW	TokenAcceptance, DraftVerify, PrefixCacheLookup/Store
Quantization	Quantize, Dequantize, FakeQuantize, PackInt4/UnpackInt4
Mamba/SSM	SelectiveScan, Discretize, SSMConv
Cache Compress	KVCacheCompress, SlidingWindow, H2O
Autonomy	PointCloud, BEVProject, TrajectoryPredict
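To illustrate what a PackInt4/UnpackInt4 pair typically does, the sketch below packs two signed 4-bit values into each byte and recovers them. The nibble order (low nibble first) and the function names are assumptions made for illustration; the actual EulerNPU storage layout is not described here.

```python
import numpy as np

def pack_int4(values):
    """Pack signed int4 values (range -8..7) two per byte, low nibble first."""
    v = np.asarray(values, dtype=np.int8)
    if v.size % 2:
        raise ValueError("need an even number of values")
    if v.size and (v.min() < -8 or v.max() > 7):
        raise ValueError("values must fit in signed 4 bits")
    nib = (v & 0x0F).astype(np.uint8)       # keep the two's-complement low nibble
    return nib[0::2] | (nib[1::2] << 4)     # even index -> low nibble, odd -> high

def unpack_int4(packed):
    """Inverse of pack_int4: returns an int8 array with sign-extended values."""
    p = np.asarray(packed, dtype=np.uint8)
    out = np.empty(p.size * 2, dtype=np.int8)
    out[0::2] = (p & 0x0F).astype(np.int8)
    out[1::2] = (p >> 4).astype(np.int8)
    out[out > 7] -= 16                      # nibbles 8..15 encode -8..-1
    return out

values = np.array([-8, -1, 0, 7, 3, -4], dtype=np.int8)
packed = pack_int4(values)
restored = unpack_int4(packed)
print(packed.size, restored.tolist())
```

Packing halves the storage for int4 weights compared with one value per byte, which is the usual motivation for such operators.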

10-DType System

Data types are classified into three tiers by precision and performance requirements.

Tier 0 (required)	fp32, int32 — supported by every operator
Tier 1 (recommended)	fp16, bf16, int8, uint8 — supported by most operators
Tier 2 (extended)	int16, int4, fp8_e4m3, fp8_e5m2 — specific operators

Execution backends (4)

cpu_ref	Host NumPy reference (runs immediately on the host, no hardware required)
npu_sim	Functional simulation + execution trace + per-operator cycle/MAC/latency estimates
zynq_ps	Zynq ARM PS execution
zynq_pl_stub	FPGA PL offload analysis/emulation
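To give a feel for what a per-operator cycle/MAC/latency estimate looks like, here is a generic ideal-case calculation for a MatMul. It is not the npu_sim cost model: the function name is hypothetical, the 16 MACs per cycle is a made-up figure, and the example shape assumes a single token passing through a 384 → 1536 layer.

```python
def matmul_estimate(m, k, n, macs_per_cycle, clock_hz):
    """Ideal-case estimate for an (m x k) @ (k x n) MatMul with full MAC utilization."""
    macs = m * k * n
    cycles = -(-macs // macs_per_cycle)   # ceiling division
    latency_s = cycles / clock_hz
    return macs, cycles, latency_s

# Hypothetical hardware: 16 MACs per cycle at a 90 MHz PL clock.
macs, cycles, latency_s = matmul_estimate(1, 384, 1536, macs_per_cycle=16, clock_hz=90e6)
print(macs, cycles, f"{latency_s * 1e3:.4f} ms")
```

Real estimates also have to account for memory transfers and pipeline stalls, so an ideal-case number like this is a lower bound rather than a prediction.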

FPGA board profiles

Zynq-7000	XC7Z020, AXI-Lite MMIO transport
Zynq UltraScale+	ZU3EG, ZU9EG (INT4 / high-performance target)

Design Principles

Deterministic, reproducible, and auditable inference at every step

Declarative Specification

All inference graphs are defined in spec.yaml — human-readable, version-controllable, and diffable. No hidden state or implicit configuration.

Bit-Exact Reproducibility

Simulation results are bit-exact across runs. The same spec.yaml always produces the same .npuart artifact and the same inference outputs.
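One way to check the reproducibility claim yourself is to compile the same spec twice into two output files and compare their hashes. The helper below is a generic sketch using only the Python standard library; the function names are hypothetical and it makes no assumption about the .npuart format.

```python
import hashlib

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_artifact(path_a, path_b):
    """True if the two files are byte-for-byte identical (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, after running the compile step twice with different output names, `same_artifact("a.npuart", "b.npuart")` should return True if the build is deterministic.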

Hardware-First Validation

Board smoke tests verify hardware compatibility before deployment. Calibration and profiling check that real-world performance matches simulation.

Compilation Pipeline

A 4-stage pipeline from spec.yaml to FPGA inference

Pipeline flow

spec.yaml (operator graph definition)
    |
    v
[1] Validator --- operator/dtype/shape checks, graph integrity
    |
    v
[2] Compiler  --- operator fusion, memory layout, scheduling
    |
    v
[3] .npuart   --- serialized execution artifact (operators + weights + metadata)
    |
    v
[4] Runtime   --- CPU reference or Zynq-7000 / UltraScale+ FPGA execution
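The "graph integrity" part of the Validator stage can be pictured as two checks: every operator reads only tensors that exist, and the graph has no cycles. The sketch below does this for a toy graph held in a plain dict. The dict layout and the function name are hypothetical; this is neither the spec.yaml schema nor the actual validator.

```python
from graphlib import CycleError, TopologicalSorter

def check_graph(nodes, graph_inputs):
    """nodes maps output tensor -> (op name, [input tensors]).

    Returns one valid execution order, or raises ValueError.
    """
    known = set(graph_inputs) | set(nodes)
    for out, (op, inputs) in nodes.items():
        missing = [name for name in inputs if name not in known]
        if missing:
            raise ValueError(f"{out} ({op}) reads undefined tensors: {missing}")
    # Only edges between operators matter for ordering; graph inputs are always ready.
    deps = {out: [name for name in inputs if name in nodes]
            for out, (_, inputs) in nodes.items()}
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        raise ValueError(f"graph has a cycle: {exc.args[1]}") from exc

# Toy FFN-like graph built from operator names listed on this page.
toy = {
    "h": ("MatMul", ["x", "w1"]),
    "a": ("GELU", ["h"]),
    "y": ("MatMul", ["a", "w2"]),
}
order = check_graph(toy, graph_inputs=["x", "w1", "w2"])
print(order)
```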

FPGA deployment pipeline

Step 1	Write spec.yaml and check it with `eulernpu validate`
Step 2	Generate the .npuart artifact with `eulernpu compile`
Step 3	Run a functional host simulation with cycle/MAC/latency estimates using `eulernpu sim`
Step 4	Verify the FPGA board connection with `eulernpu board smoke`, then run it with `eulernpu run`
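The four steps can be chained from a small driver script. In the sketch below, only the `validate` and `compile` invocations match the commands shown in the Installation section; the argument forms for `sim`, `board smoke`, and `run` are assumptions and should be checked against the CLI's own help output. The helper names are hypothetical, and the script only prints the commands unless `dry_run=False` is passed.

```python
import subprocess

def deployment_steps(spec="spec.yaml", artifact="model.npuart"):
    """Return the argv lists for the deployment steps, in order."""
    return [
        ["eulernpu", "validate", spec],                  # matches the Installation section
        ["eulernpu", "compile", spec, "-o", artifact],   # matches the Installation section
        ["eulernpu", "sim", artifact],                   # assumed argument form
        ["eulernpu", "board", "smoke"],                  # assumed: no extra arguments
        ["eulernpu", "run", artifact],                   # assumed argument form
    ]

def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in steps:
        print("$", " ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step

steps = deployment_steps()
run_steps(steps)  # dry run: prints the commands without executing them
```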

Additional tools

calibrate	Collect quantization calibration data
compress-cache	Apply KV-cache compression settings
benchmark	Latency/throughput benchmarks
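As background for the calibrate tool, the sketch below shows one common calibration scheme: record the largest absolute value seen over a set of calibration batches, then use it to derive a symmetric INT8 scale. This is a generic textbook approach with hypothetical function names; the scheme that `eulernpu calibrate` actually uses is not described here.

```python
import numpy as np

def calibrate_absmax(batches):
    """Largest absolute value seen over all calibration batches."""
    return max(float(np.max(np.abs(b))) for b in batches)

def quantize_int8(x, absmax):
    """Symmetric per-tensor quantization to int8 in the range [-127, 127]."""
    scale = absmax / 127.0 if absmax > 0 else 1.0
    x = np.asarray(x, dtype=np.float64)
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to real values."""
    return q.astype(np.float64) * scale

# Hypothetical calibration data: three random batches.
rng = np.random.default_rng(0)
batches = [rng.standard_normal(256) for _ in range(3)]
absmax = calibrate_absmax(batches)
q, scale = quantize_int8(batches[0], absmax)
max_err = float(np.max(np.abs(dequantize(q, scale) - batches[0])))
print(f"scale={scale:.5f}, max round-trip error={max_err:.5f}")
```

For values inside the calibrated range, the round-trip error is bounded by half the scale, which is why a tight calibration range gives better accuracy.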

CLI Reference

A single entry point, eulernpu, with 15 subcommands that cover the entire workflow (--lang ko|en|zh|ja|es is supported)

Command	Description
`eulernpu info`	Show platform, supported operators, and dtype information
`eulernpu validate`	Validate the spec.yaml operator graph (JSON-Schema + 23 semantic rules)
`eulernpu migrate-spec` NEW	Auto-migrate specs from 0.4 → 0.5
`eulernpu compile`	Compile spec.yaml into a .npuart artifact
`eulernpu run`	Run a .npuart artifact on the cpu_ref, npu_sim, or Zynq (zynq_ps / zynq_pl_stub) backends
`eulernpu sim`	Functional simulation + cycle/MAC/latency estimates
`eulernpu generate` NEW	Autoregressive token generation (KV cache)
`eulernpu quantize` NEW	INT8/INT4 weight quantization (`--weight-bits 4`)
`eulernpu profile`	Profile per-operator execution time and memory usage
`eulernpu explain`	Visualize the PL offload + memory plan and graph schedule
`eulernpu board smoke`	Verify FPGA board connectivity and basic operation
`eulernpu calibrate`	Collect and apply quantization calibration data
`eulernpu benchmark`	Run latency/throughput benchmarks
`eulernpu replay`	Replay a saved execution trace
`eulernpu compress-cache`	Apply and validate KV-cache compression settings

Tutorials

Step-by-step guides to get started with EulerNPU quickly

Tutorials coming soon.

Installation & Getting Started

Install EulerNPU and compile your first inference graph

Installation

# From the root of a local clone of the repository
pip install -e ".[dev]"

# Validate and compile
eulernpu validate spec.yaml
eulernpu compile spec.yaml -o model.npuart

Requirements

Python 3.10+, NumPy

Optional: ONNX import, Zynq-7000 / UltraScale+ boards (FPGA target)

GitHub

eulerwa/eulernpu

Start NPU Inference Development with EulerNPU

From spec.yaml to hardware deployment, in a single CLI.

Get Started on GitHub · Contact Us