NPU Inference Composition & Simulation Stack
An inference-first NPU full stack for edge AI. It composes an operator graph (spec.yaml) from 138 operators across 17 groups, then validates and compiles it into a .npuart artifact. Inference runs on a host CPU reference (cpu_ref), a functional NPU simulator (npu_sim), and Zynq-7000 (XC7Z020) and Zynq UltraScale+ (ZU3EG/ZU9EG) FPGAs. INT4/INT8 quantization, profiling, cache compression, and calibration are all driven from a single CLI. Design the chip and the model in YAML, and check latency, memory, and power budgets without silicon.
Open Source
A real terminal demo of a humanoid/robot-brain SoC running inference in the EulerNPU simulator
The reference chip, Project 3 (Embodied Multimodal AI SoC), is a robot-brain chip that fuses vision, LiDAR, language, and proprioceptive state in parallel on a single NPU and generates motor actions with a Diffusion Policy (16-step horizon, 14 DOF). It exercises all 138 operators in a single pipeline and applies INT4 MoE weights. The demo walks through a 5-stage subsystem compile log → parallel pipeline execution → results table → 14-joint trajectory → FPGA-readiness checklist, then runs the pipeline twice to confirm bit-exact determinism.
| Item | Specification |
|---|---|
| Target | Zynq UltraScale+ ZU9EG FPGA @ 300 MHz |
| Budget | Latency ≤ 150 ms · Power ≤ 3 W (INT4) |
| Subsystems | Vision (ViT) · LiDAR BEV (PointPillar) · Language (GQA) · State (Mamba SSM) · Action (MoE + Diffusion Policy) |
| Verification | cpu_ref ↔ npu_sim numerical match (L∞ < 1e-4) · bit-exact determinism |
※ Project 3 is a scaled-down, synthetic reference model for validating 138-operator coverage and FPGA readiness. The figures are stated against the ZU9EG budget (≤150 ms / ≤3 W).
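The numerical-match criterion in the verification row (L∞ < 1e-4 between cpu_ref and npu_sim) can be illustrated with a minimal sketch. It assumes both backends' outputs have already been flattened into plain Python lists; the function names and the sample values are hypothetical and are not part of EulerNPU.

```python
TOLERANCE = 1e-4  # threshold quoted in the verification row


def linf_distance(a, b):
    """Largest absolute element-wise difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def outputs_match(ref_out, sim_out, tol=TOLERANCE):
    """True when the two outputs agree within the L-infinity tolerance."""
    return linf_distance(ref_out, sim_out) < tol


# Hypothetical flattened outputs, used only for illustration.
ref_out = [0.125, -0.5, 0.75]
sim_out = [0.12504, -0.5, 0.74998]
print(outputs_match(ref_out, sim_out))
```

The L∞ norm is a strict choice: a single outlier element fails the check, even if the average error is tiny.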
138 operators, 10 DTypes, from spec.yaml to FPGA inference
The operators needed for NPU inference, organized into 17 groups. Spec 0.5.0 adds the Efficient Attention, Vision Encoder, Diffusion, and Speculative Decoding groups.
| Group | Operators |
|---|---|
| Core Math | MatMul, Add, Mul, Div, Sqrt and other basic math ops |
| Activation | ReLU, GELU, SiLU, Sigmoid, Softmax, etc. |
| Normalization | LayerNorm, RMSNorm, BatchNorm, GroupNorm |
| Conv/Vision | Conv2D, DepthwiseConv, Pool, Resize, Patch |
| Sequence/Attention | ScaledDotProduct, MultiHeadAttention, RoPE, ALiBi |
| Efficient Attention NEW | FlashAttention, SlidingWindowAttention, MultiQueryAttention (GQA) |
| MoE/Sparse | TopKRouter, ExpertDispatch, LoadBalanceLoss |
| Recurrent | LSTM, GRU, SRU |
| Graph | Concat, Split, Reshape, Transpose, Gather, Scatter |
| Multimodal | CrossAttention, VisionProjection, AudioMel |
| Vision Encoder NEW | PatchEmbed, ClsTokenPrepend, ImageNorm |
| Diffusion NEW | TimestepEmbed, NoiseSample, DDIMStep, CFGScale, FlowMatchStep |
| Speculative Decoding NEW | TokenAcceptance, DraftVerify, PrefixCacheLookup/Store |
| Quantization | Quantize, Dequantize, FakeQuantize, PackInt4/UnpackInt4 |
| Mamba/SSM | SelectiveScan, Discretize, SSMConv |
| Cache Compress | KVCacheCompress, SlidingWindow, H2O |
| Autonomy | PointCloud, BEVProject, TrajectoryPredict |
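The Quantization group lists PackInt4/UnpackInt4. As a rough illustration of what INT4 packing involves, the sketch below stores two signed 4-bit values per byte. The nibble order (low nibble first) and the zero padding for odd-length input are assumptions made for this example; the source does not describe the layout the actual operators use.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first (assumed layout)."""
    for v in values:
        if not -8 <= v <= 7:
            raise ValueError(f"{v} does not fit in signed int4")
    nibbles = [v & 0xF for v in values]  # two's-complement nibble
    if len(nibbles) % 2:
        nibbles.append(0)  # pad odd-length input with a zero nibble
    return bytes(lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2]))


def unpack_int4(packed, count):
    """Recover `count` signed 4-bit integers from bytes written by pack_int4."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)  # sign-extend
    return out[:count]


weights = [-8, 7, 3, -1, 0]
packed = pack_int4(weights)
assert unpack_int4(packed, len(weights)) == weights
print(len(weights), "values ->", len(packed), "bytes")
```

Packing halves weight storage relative to INT8, at the cost of an unpack step before the values can be used in arithmetic.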
The 10 DTypes are classified into three tiers by precision and performance requirements.
| Tier | DTypes | Operator support |
|---|---|---|
| Tier 0 (required) | fp32, int32 | Every operator |
| Tier 1 (recommended) | fp16, bf16, int8, uint8 | Most operators |
| Tier 2 (extended) | int16, int4, fp8_e4m3, fp8_e5m2 | Specific operators |
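The tier table can be encoded as a small lookup, which is handy for checking whether a graph only uses dtypes that every operator supports. This is an illustrative sketch: the dtype names and tier numbers come from the table, while `DTYPE_TIERS`, `tier_of`, and `portable_everywhere` are hypothetical helpers, not EulerNPU APIs.

```python
# Tier membership copied from the table above.
DTYPE_TIERS = {
    0: ("fp32", "int32"),
    1: ("fp16", "bf16", "int8", "uint8"),
    2: ("int16", "int4", "fp8_e4m3", "fp8_e5m2"),
}


def tier_of(dtype):
    """Return the tier number for a dtype name, or raise KeyError if unknown."""
    for tier, names in DTYPE_TIERS.items():
        if dtype in names:
            return tier
    raise KeyError(f"unknown dtype: {dtype}")


def portable_everywhere(dtypes):
    """True if every dtype is Tier 0, i.e. supported by every operator."""
    return all(tier_of(d) == 0 for d in dtypes)


print(tier_of("int4"), portable_everywhere(["fp32", "bf16"]))
```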
| Backend | Description |
|---|---|
| cpu_ref | Host NumPy reference (runs instantly, no dependencies) |
| npu_sim | Functional simulation + execution trace + per-operator cycle/MAC/latency estimates |
| zynq_ps | Zynq ARM PS execution |
| zynq_pl_stub | FPGA PL offload analysis/emulation |

| FPGA family | Devices |
|---|---|
| Zynq-7000 | XC7Z020, AXI-Lite MMIO transport |
| Zynq UltraScale+ | ZU3EG, ZU9EG (INT4 / high-performance target) |
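The per-operator cycle/MAC/latency estimates listed for npu_sim can be reasoned about with simple arithmetic. The sketch below is a back-of-the-envelope model, not the simulator's method: it assumes a fully utilised MAC array and ignores memory stalls. The 300 MHz clock is the ZU9EG figure quoted for Project 3; the array width of 256 MACs per cycle and the matrix shape are hypothetical.

```python
import math


def matmul_macs(m, k, n):
    """Multiply-accumulates for an (m x k) @ (k x n) product."""
    return m * k * n


def estimate_latency_ms(macs, macs_per_cycle, clock_hz):
    """Ideal-case estimate: full MAC utilisation, no memory stalls."""
    cycles = math.ceil(macs / macs_per_cycle)
    return cycles, cycles / clock_hz * 1e3


CLOCK_HZ = 300_000_000  # 300 MHz, the target clock quoted for Project 3
MACS_PER_CYCLE = 256    # hypothetical array width, not a figure from the source

macs = matmul_macs(196, 768, 768)  # hypothetical projection-layer shape
cycles, latency_ms = estimate_latency_ms(macs, MACS_PER_CYCLE, CLOCK_HZ)
print(f"{macs} MACs, {cycles} cycles, {latency_ms:.3f} ms")
```

Summing such estimates over a graph gives a lower bound to compare against a latency budget; real numbers will be higher once data movement is included.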
Deterministic, reproducible, and auditable inference at every step
All inference graphs are defined in spec.yaml — human-readable, version-controllable, and diffable. No hidden state or implicit configuration.
Simulation results are bit-exact across runs. The same spec.yaml always produces the same .npuart artifact and the same inference outputs.
Board-smoke tests verify hardware compatibility before deployment. Calibration and profiling ensure real-world performance matches simulation.
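One way to check the bit-exact claim is to hash the raw bytes of two runs and compare digests. The sketch below assumes outputs are available as lists of floats and encodes them as little-endian float32 before hashing; `fake_inference` is a purely hypothetical stand-in for a real run, and none of these helpers are EulerNPU APIs.

```python
import hashlib
import struct


def output_digest(values):
    """SHA-256 over the little-endian float32 encoding of the outputs."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return hashlib.sha256(raw).hexdigest()


def is_bit_exact(run_a, run_b):
    """True only when both runs produce identical bytes."""
    return output_digest(run_a) == output_digest(run_b)


def fake_inference(seed):
    """Hypothetical stand-in for a run: a deterministic function of its input."""
    return [((seed * 31 + i * 17) % 97) / 97.0 for i in range(14)]


first, second = fake_inference(7), fake_inference(7)
print(is_bit_exact(first, second))
```

Comparing bytes is stricter than comparing numbers: 0.0 and -0.0 are numerically equal but produce different digests, which is the behaviour a bit-exact check should have.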
A 4-stage pipeline from spec.yaml to FPGA inference
| Step | Action |
|---|---|
| Step 1 | Write spec.yaml and check it with eulernpu validate |
| Step 2 | Generate the .npuart artifact with eulernpu compile |
| Step 3 | Run a functional host simulation with cycle/MAC/latency estimates using eulernpu sim |
| Step 4 | Verify the FPGA board connection with eulernpu board smoke, then run the artifact with eulernpu run |

| Supporting command | Purpose |
|---|---|
| calibrate | Collect quantization calibration data |
| compress-cache | Apply KV-cache compression settings |
| benchmark | Latency/throughput benchmarks |
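The four steps can be driven from a script. The sketch below only builds the command lines in order and prints them; it does not execute anything. The subcommand names come from the steps above, but the source does not list their arguments, so none are included here. `PIPELINE` and `plan` are hypothetical helpers, and any arguments passed through `extra_args` are the caller's responsibility.

```python
# Subcommand order taken from the four steps; arguments are intentionally omitted
# because the source does not document them.
PIPELINE = [
    ("Step 1", ["eulernpu", "validate"]),
    ("Step 2", ["eulernpu", "compile"]),
    ("Step 3", ["eulernpu", "sim"]),
    ("Step 4", ["eulernpu", "board", "smoke"]),
    ("Step 4", ["eulernpu", "run"]),
]


def plan(extra_args=None):
    """Return (step, argv) pairs in order without running them.

    extra_args maps a subcommand tuple, e.g. ("validate",), to a list of
    user-supplied arguments appended to that command.
    """
    extra_args = extra_args or {}
    return [
        (step, argv + list(extra_args.get(tuple(argv[1:]), [])))
        for step, argv in PIPELINE
    ]


for step, argv in plan():
    print(f"{step}: {' '.join(argv)}")
```

To execute the plan, each `argv` could be passed to `subprocess.run(argv, check=True)`, which raises on the first failing step so later stages never run against an invalid spec or artifact.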
Single entry point eulernpu — 15 subcommands cover the entire workflow (--lang ko|en|zh|ja|es supported)
| Command | Description |
|---|---|
| eulernpu info | Show platform, supported operators, and dtype information |
| eulernpu validate | Validate the spec.yaml operator graph (JSON-Schema + 23 semantic rules) |
| eulernpu migrate-spec NEW | Auto-migrate specs from 0.4 → 0.5 |
| eulernpu compile | Compile spec.yaml into a .npuart artifact |
| eulernpu run | Run a .npuart artifact on the cpu_ref/npu_sim/zynq backends |
| eulernpu sim | Functional simulation + cycle/MAC/latency estimates |
| eulernpu generate NEW | Autoregressive token generation (KV cache) |
| eulernpu quantize NEW | INT8/INT4 weight quantization (--weight-bits 4) |
| eulernpu profile | Profile per-operator execution time and memory usage |
| eulernpu explain | Visualize the PL offload + memory plan and graph schedule |
| eulernpu board smoke | Verify FPGA board connectivity and basic operation |
| eulernpu calibrate | Collect and apply quantization calibration data |
| eulernpu benchmark | Run latency/throughput benchmarks |
| eulernpu replay | Replay a saved execution trace |
| eulernpu compress-cache | Apply and validate KV-cache compression settings |
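To give a feel for what INT8/INT4 weight quantization trades off, here is a minimal per-tensor symmetric quantization sketch. It is a generic textbook scheme, not a description of how eulernpu quantize works internally; the function names and sample weights are hypothetical, and `bits` plays the role that --weight-bits suggests.

```python
def quantize_symmetric(weights, bits):
    """Per-tensor symmetric quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the symmetric range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate real values."""
    return [v * scale for v in q]


weights = [0.7, -0.35, 0.1, 0.0]  # hypothetical weights
for bits in (8, 4):
    q, scale = quantize_symmetric(weights, bits)
    err = max(abs(w - d) for w, d in zip(weights, dequantize(q, scale)))
    print(f"INT{bits}: q={q} max error={err:.4f}")
```

The rounding error per weight is bounded by half the scale, so INT4 (15 usable levels) is far coarser than INT8 (255 levels); calibration data helps pick ranges that keep that error acceptable.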
Step-by-step guides to get started with EulerNPU quickly
Tutorials coming soon.
Install EulerNPU and compile your first inference graph
Requirements: Python 3.10+, NumPy
Optional: ONNX import, Zynq-7000 / UltraScale+ boards (FPGA target)
From spec.yaml to hardware deployment, in a single CLI.