EulerNPU

NPU Inference Composition & Simulation Stack

An inference-first, full-stack NPU toolchain for edge AI. It composes an operator graph (spec.yaml) from 138 operators across 17 groups, then validates and compiles it into a .npuart artifact. Inference runs on a host CPU reference (cpu_ref), a functional NPU simulator (npu_sim), and Zynq-7000 (XC7Z020) or Zynq UltraScale+ (ZU3EG/ZU9EG) FPGAs. INT4/INT8 quantization, profiling, cache compression, and calibration are all driven from a single CLI. Design the chip and the model in YAML, and estimate latency, memory, and power without silicon.

Open Source

Demo — Project 3: Embodied Multimodal AI SoC

A real terminal demo of a humanoid/robot-brain SoC running inference in the EulerNPU simulator

The reference chip, Project 3 (Embodied Multimodal AI SoC), is a robot-brain chip that fuses vision, LiDAR, language, and proprioceptive state in parallel on a single NPU and generates motor actions with a Diffusion Policy (16-step horizon · 14-DOF). It exercises all 138 operators in a single pipeline and applies INT4 MoE weights. The demo walks through five stages: subsystem compile log → parallel pipeline execution → results table → 14-joint trajectory → FPGA-readiness checklist. It then runs the pipeline twice to confirm bit-exact determinism.

Target: Zynq UltraScale+ ZU9EG FPGA @ 300 MHz
Budget: Latency ≤ 150 ms · Power ≤ 3 W (INT4)
Subsystems: Vision (ViT) · LiDAR BEV (PointPillar) · Language (GQA) · State (Mamba SSM) · Action (MoE + Diffusion Policy)
Verification: cpu_ref ↔ npu_sim numerical match (L∞ < 1e-4) · bit-exact determinism
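
The numerical-match criterion listed above (L∞ < 1e-4 between cpu_ref and npu_sim) can be sketched in a few lines of Python. This is a minimal illustration, not EulerNPU's own checker: the function names and sample values are hypothetical, and the outputs are assumed to be flattened into plain lists of floats.

```python
def linf_distance(a, b):
    """Largest absolute element-wise difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def outputs_match(ref, sim, tol=1e-4):
    """True when the L-infinity distance is strictly below the tolerance."""
    return linf_distance(ref, sim) < tol


# Illustrative values only; real outputs would come from the two backends.
cpu_ref_out = [0.12500, -0.50000, 0.33333]
npu_sim_out = [0.12504, -0.49995, 0.33330]
print(outputs_match(cpu_ref_out, npu_sim_out))
```

The L∞ norm is a strict choice for this kind of check: a single outlier element fails the comparison, whereas an averaged error metric could hide it.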

※ Project 3 is a scaled-down, synthetic reference model for validating 138-operator coverage and FPGA readiness. The figures are stated against the ZU9EG budget (≤150 ms / ≤3 W).

Core Features

138 operators, 10 DTypes, from spec.yaml to FPGA inference

138 operators (17 groups, A–Q)

The operators needed for NPU inference, organized into 17 groups. Spec 0.5.0 adds the Efficient Attention, Vision Encoder, Diffusion, and Speculative Decoding groups.

Core Math: MatMul, Add, Mul, Div, Sqrt and other basic math ops
Activation: ReLU, GELU, SiLU, Sigmoid, Softmax, etc.
Normalization: LayerNorm, RMSNorm, BatchNorm, GroupNorm
Conv/Vision: Conv2D, DepthwiseConv, Pool, Resize, Patch
Sequence/Attention: ScaledDotProduct, MultiHeadAttention, RoPE, ALiBi
Efficient Attention (NEW): FlashAttention, SlidingWindowAttention, MultiQueryAttention (GQA)
MoE/Sparse: TopKRouter, ExpertDispatch, LoadBalanceLoss
Recurrent: LSTM, GRU, SRU
Graph: Concat, Split, Reshape, Transpose, Gather, Scatter
Multimodal: CrossAttention, VisionProjection, AudioMel
Vision Encoder (NEW): PatchEmbed, ClsTokenPrepend, ImageNorm
Diffusion (NEW): TimestepEmbed, NoiseSample, DDIMStep, CFGScale, FlowMatchStep
Speculative Decoding (NEW): TokenAcceptance, DraftVerify, PrefixCacheLookup/Store
Quantization: Quantize, Dequantize, FakeQuantize, PackInt4/UnpackInt4
Mamba/SSM: SelectiveScan, Discretize, SSMConv
Cache Compress: KVCacheCompress, SlidingWindow, H2O
Autonomy: PointCloud, BEVProject, TrajectoryPredict
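
To make the PackInt4/UnpackInt4 pair in the Quantization group concrete, here is a minimal sketch of one common packing scheme: two signed 4-bit values per byte, low nibble first. The nibble order, the zero padding, and the function names are assumptions made for illustration; the layout EulerNPU actually uses is not stated here.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count with a zero nibble
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in a signed 4-bit integer")
        # Masking with 0xF keeps the two's-complement low 4 bits.
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed, count):
    """Recover `count` signed 4-bit integers from bytes produced by pack_int4."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 represent the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]


weights = [-8, 7, 3, -1, 0]
packed = pack_int4(weights)
print(len(packed), unpack_int4(packed, len(weights)) == weights)
```

Packing halves the storage of INT4 weights relative to one value per byte, at the cost of a shift and mask on every read.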

10-DType System

Classified into three tiers by precision and performance requirements.

Tier 0 (required) fp32, int32 — supported by every operator
Tier 1 (recommended) fp16, bf16, int8, uint8 — supported by most operators
Tier 2 (extended) int16, int4, fp8_e4m3, fp8_e5m2 — specific operators
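
The three tiers above can be expressed as a small lookup table. The sketch below is illustrative only: the dictionary and helper names are hypothetical and are not part of the EulerNPU API. It encodes the one rule the tiers imply, namely that only Tier 0 dtypes can be assumed to work with every operator.

```python
# Tier membership copied from the list above; the structure itself is hypothetical.
DTYPE_TIERS = {
    0: ("fp32", "int32"),
    1: ("fp16", "bf16", "int8", "uint8"),
    2: ("int16", "int4", "fp8_e4m3", "fp8_e5m2"),
}


def tier_of(dtype):
    """Return the tier number (0, 1, or 2) for a dtype name."""
    for tier, names in DTYPE_TIERS.items():
        if dtype in names:
            return tier
    raise KeyError(f"unknown dtype: {dtype}")


def needs_per_operator_check(dtype):
    """Tier 0 is supported by every operator; Tier 1 and 2 must be checked per operator."""
    return tier_of(dtype) > 0


print(tier_of("bf16"), needs_per_operator_check("fp32"), needs_per_operator_check("int4"))
```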

Execution backends (4)

cpu_ref Host NumPy reference (runs immediately, no dependencies beyond NumPy)
npu_sim Functional simulation + execution trace + per-operator cycle/MAC/latency estimates
zynq_ps Zynq ARM PS execution
zynq_pl_stub FPGA PL offload analysis/emulation
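
As a rough picture of what a per-operator cycle/MAC/latency estimate involves, the sketch below works through the arithmetic for a single MatMul. It is not the npu_sim cost model: the MAC-array width (256 MACs per cycle) is a hypothetical figure, the 300 MHz clock is borrowed from the Project 3 target, and memory stalls are ignored, so the result is an ideal lower bound.

```python
def matmul_macs(m, k, n):
    """Multiply-accumulate count for an (m x k) @ (k x n) matrix product."""
    return m * k * n


def estimate_latency(macs, macs_per_cycle, clock_hz):
    """Return (cycles, microseconds) assuming the MAC array is always fully busy."""
    cycles = -(-macs // macs_per_cycle)  # integer ceiling division
    return cycles, cycles / clock_hz * 1e6


macs = matmul_macs(64, 256, 256)  # 4,194,304 MACs
cycles, latency_us = estimate_latency(macs, macs_per_cycle=256, clock_hz=300_000_000)
print(macs, cycles, round(latency_us, 2))  # 4194304 16384 54.61
```

Summing such estimates over a scheduled graph gives a first-order latency figure to compare against a budget such as the 150 ms stated for Project 3.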

FPGA board profiles

Zynq-7000 XC7Z020, AXI-Lite MMIO transport
Zynq UltraScale+ ZU3EG, ZU9EG (INT4 / high-performance target)

Design Principles

Deterministic, reproducible, and auditable inference at every step

Declarative Specification

All inference graphs are defined in spec.yaml — human-readable, version-controllable, and diffable. No hidden state or implicit configuration.

Bit-Exact Reproducibility

Simulation results are bit-exact across runs. The same spec.yaml always produces the same .npuart artifact and the same inference outputs.
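
One simple way to check bit-exactness is to hash the raw bytes of the outputs from two runs and compare the digests. The sketch below assumes fp32 outputs flattened into a list; `run_model` is a hypothetical stand-in for a real inference call, not an EulerNPU function.

```python
import hashlib
import struct


def output_digest(values):
    """SHA-256 over the raw little-endian fp32 bytes of an output tensor."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return hashlib.sha256(raw).hexdigest()


def run_model(inputs):
    """Hypothetical stand-in for an inference run; deterministic by construction."""
    return [x * 0.5 + 1.0 for x in inputs]


first = output_digest(run_model([1.0, 2.0, 3.0]))
second = output_digest(run_model([1.0, 2.0, 3.0]))
print(first == second)
```

Comparing digests rather than values within a tolerance is what distinguishes a bit-exact check from a numerical-match check: any single differing bit changes the hash.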

Hardware-First Validation

Board smoke tests (eulernpu board smoke) check hardware compatibility before deployment. Calibration and profiling help confirm that on-board performance tracks the simulation.

Compilation Pipeline

A 4-stage pipeline from spec.yaml to FPGA inference

Pipeline flow

spec.yaml (operator graph definition)
    |
    v
[1] Validator --- operator/dtype/shape checks, graph integrity
    |
    v
[2] Compiler  --- operator fusion, memory layout, scheduling
    |
    v
[3] .npuart   --- serialized execution artifact (operators + weights + metadata)
    |
    v
[4] Runtime   --- CPU reference or Zynq-7000 / UltraScale+ FPGA execution
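
The graph-integrity check in stage [1] and the scheduling in stage [2] both rest on the operator graph being a DAG whose inputs are all defined. A minimal sketch of that idea using Python's standard graphlib module follows. The dict-based graph representation and the node names are hypothetical and do not reflect the spec.yaml schema.

```python
from graphlib import CycleError, TopologicalSorter


def schedule(graph, graph_inputs):
    """Return an execution order, or raise ValueError for a broken graph.

    `graph` maps each operator node to the names it consumes, which may be
    other nodes or external graph inputs.
    """
    known = set(graph) | set(graph_inputs)
    for node, deps in graph.items():
        missing = [d for d in deps if d not in known]
        if missing:
            raise ValueError(f"{node} consumes undefined tensors: {missing}")
    # Only edges between operator nodes matter for ordering.
    op_edges = {n: [d for d in deps if d in graph] for n, deps in graph.items()}
    try:
        return list(TopologicalSorter(op_edges).static_order())
    except CycleError as exc:
        raise ValueError(f"graph contains a cycle: {exc.args[1]}") from exc


graph = {"matmul": ["x", "w"], "gelu": ["matmul"], "norm": ["gelu"]}
print(schedule(graph, graph_inputs=["x", "w"]))  # ['matmul', 'gelu', 'norm']
```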

FPGA deployment pipeline

Step 1 Write spec.yaml and check it with eulernpu validate
Step 2 Generate the .npuart artifact with eulernpu compile
Step 3 Run a functional host simulation with eulernpu sim to get cycle/MAC/latency estimates
Step 4 Verify the FPGA board connection with eulernpu board smoke, then run it with eulernpu run

Additional tools

calibrate: Collect quantization calibration data
compress-cache: Apply KV-cache compression settings
benchmark: Latency/throughput benchmarks
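
To illustrate what quantization calibration data is used for, the sketch below tracks the largest absolute activation seen across calibration batches and derives a symmetric per-tensor scale from it. Abs-max calibration is one common scheme chosen here for simplicity; the method the calibrate tool actually applies is not specified, and the class and function names are hypothetical.

```python
class AbsMaxObserver:
    """Tracks the largest absolute value seen across calibration batches."""

    def __init__(self):
        self.abs_max = 0.0

    def observe(self, batch):
        for v in batch:
            self.abs_max = max(self.abs_max, abs(v))

    def scale(self, bits):
        """Symmetric scale mapping [-abs_max, abs_max] onto the signed integer range."""
        qmax = 2 ** (bits - 1) - 1
        return self.abs_max / qmax if self.abs_max else 1.0


def quantize(values, scale, bits):
    """Round to the nearest integer step and clamp to the signed range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]


observer = AbsMaxObserver()
for batch in ([0.5, -1.27], [0.9, 1.0]):  # illustrative calibration batches
    observer.observe(batch)

int8_scale = observer.scale(bits=8)
print(quantize([1.27, -0.5, 0.0], int8_scale, bits=8))  # [127, -50, 0]
```

The same observer serves INT4 by calling `scale(bits=4)`, which maps the observed range onto -8..7 with much coarser steps.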

CLI Reference

Single entry point eulernpu — 15 subcommands cover the entire workflow (--lang ko|en|zh|ja|es supported)

Command                        Description
eulernpu info                  Show platform, supported operators, and dtype information
eulernpu validate              Validate the spec.yaml operator graph (JSON-Schema + 23 semantic rules)
eulernpu migrate-spec (NEW)    Auto-migrate specs from 0.4 → 0.5
eulernpu compile               Compile spec.yaml into a .npuart artifact
eulernpu run                   Run a .npuart artifact on the cpu_ref/npu_sim/zynq backends
eulernpu sim                   Functional simulation + cycle/MAC/latency estimates
eulernpu generate (NEW)        Autoregressive token generation (KV cache)
eulernpu quantize (NEW)        INT8/INT4 weight quantization (--weight-bits 4)
eulernpu profile               Profile per-operator execution time and memory usage
eulernpu explain               Visualize the PL offload + memory plan and graph schedule
eulernpu board smoke           Verify FPGA board connectivity and basic operation
eulernpu calibrate             Collect and apply quantization calibration data
eulernpu benchmark             Run latency/throughput benchmarks
eulernpu replay                Replay a saved execution trace
eulernpu compress-cache        Apply and validate KV-cache compression settings

Tutorials

Step-by-step guides to get started with EulerNPU quickly

Tutorials coming soon.

Installation & Getting Started

Install EulerNPU and compile your first inference graph

Installation

pip install -e ".[dev]"

# Validate and compile
eulernpu validate spec.yaml
eulernpu compile spec.yaml -o model.npuart
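
The two commands above can also be chained from a Python script that stops at the first failure. The sketch below uses only the command forms shown above. It assumes the CLI signals failure with a non-zero exit code, which is conventional but not confirmed here, and the helper names are hypothetical.

```python
import subprocess


def build_steps(spec_path, artifact_path):
    """Argument lists for the validate and compile commands shown above."""
    return [
        ["eulernpu", "validate", spec_path],
        ["eulernpu", "compile", spec_path, "-o", artifact_path],
    ]


def run_steps(steps, runner=subprocess.run):
    """Run each step in order and return the first non-zero exit code, else 0.

    `runner` is injectable so the flow can be exercised without the CLI installed.
    """
    for argv in steps:
        result = runner(argv)
        if result.returncode != 0:
            return result.returncode
    return 0


steps = build_steps("spec.yaml", "model.npuart")
for argv in steps:
    print(" ".join(argv))
# With EulerNPU installed, execute the steps with:
#   exit_code = run_steps(steps)
```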

Requirements

Python 3.10+, NumPy

Optional: ONNX import, Zynq-7000 / UltraScale+ boards (FPGA target)

Start NPU Inference Development with EulerNPU

From spec.yaml to hardware deployment, in a single CLI.
