EulerStack

An Architecture Description Language (ADL) for LLMs

EulerStack is an Architecture Description Language (ADL) for LLMs. It pulls architecture out of the PyTorch files where structure, training, and serving usually live tangled together, and expresses it in a declarative language built for that purpose. This is the same abstraction step the semiconductor industry took when Verilog and VHDL replaced schematics-plus-C.

A single YAML spec flows through a 5-layer pipeline (DSL → Schema → IR → Compiler → CLI) for validation and normalization, and compile --output-dir emits a HuggingFace model directory (config.json + model.safetensors) that hands off directly to EulerForge for training. The 57 presets (24 llm_ + 33 arch_, of which 9 are arch_expert_*_mini) are organized along a 3-tier learning path (validated industrial → recent hybrid/MoE → v1 experimental primitives), and all CLI messages are translated into 5 languages (ko / en / zh / ja / es).

v0.1.5 adds three backward-compatible spec extensions: μP scaling (training_hints.scaling), differentiation auxiliary objectives (training_hints.differentiation_objectives), and an organ declaration (tissue). All three default to OFF, and existing v0.1.4 YAML works unchanged.


Core Features

Layer Templates & Schedule

Define named layer templates (mixer + FFN + norm + residual) and use schedules to specify arrangement order and repetition counts.

Mixer Types Attention, Mamba, RetNet, Hyena
FFN Types MLP, Gated MLP (SwiGLU), MoE (top-k routing)
Norm RMSNorm, LayerNorm (pre/post position)
Residual Sequential, Parallel, Hyper-Connection (mHC)
Head causal_lm, causal_lm_mtp (Multi-Token Prediction)
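
As a rough illustration of how a schedule turns named templates into a concrete layer stack, the sketch below flattens a list of { template, repeat } entries into one template name per layer. It is not EulerStack's own implementation: the function name expand_schedule, the template names, the "mlp" type string, and the default of repeat = 1 are assumptions made for this example.

```python
from typing import Dict, List


def expand_schedule(layer_templates: Dict[str, dict], layer_schedule: List[dict]) -> List[str]:
    """Flatten a schedule into one template name per layer, in order."""
    layers: List[str] = []
    for entry in layer_schedule:
        name = entry["template"]
        if name not in layer_templates:
            raise KeyError(f"schedule references unknown template: {name!r}")
        # Assumption: a missing repeat count means "use the template once".
        layers.extend([name] * entry.get("repeat", 1))
    return layers


# Hypothetical templates: three Mamba blocks followed by one attention block.
templates = {
    "mamba_block": {"mixer": {"type": "mamba"}, "ffn": {"type": "mlp"}},
    "attn_block": {"mixer": {"type": "attention"}, "ffn": {"type": "gated_mlp"}},
}
schedule = [
    {"template": "mamba_block", "repeat": 3},
    {"template": "attn_block", "repeat": 1},
]

layers = expand_schedule(templates, schedule)
print(layers)
```

Repeating the same two schedule entries would extend the stack with the same 3:1 pattern, which is how hybrid ratios can be written without listing every layer by hand.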

Validation & Realism

A 3-stage process — schema structure → cross-field compatibility → heuristic realism checks — catches design errors before compilation. Every error is printed in the 3-line format (Category: what / Fix: / See:).

Structure Unknown keys, type/enum, required fields, positive constraints
Compatibility Mixer↔state mismatches (e.g., mamba + kv_cache forbidden)
Realism head_dim range (32–256), target_params mismatch (>30%), MoE expert ratio, seq_len/d_model ratio, family_hint consistency, vocab/tokenizer consistency, tie_weight consistency, rope_scaling bounds
Error Categories ValidationError, CompatibilityError, CompileError, NormalizationError
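
To make the head_dim realism check and the 3-line error format concrete, here is a minimal sketch. It assumes head_dim is computed as d_model / n_heads and that a violation is reported as a ValidationError; the helper names, the message wording, and the See: target are placeholders invented for this example, not EulerStack's actual output.

```python
from typing import Optional

# Placeholder pointer for the "See:" line; the real target is an assumption.
SEE = "docs/tutorials/en/03_spec_reference"


def format_error(category: str, what: str, fix: str, see: str = SEE) -> str:
    """Render one message in the 3-line layout: Category: what / Fix: / See:."""
    return f"{category}: {what}\nFix: {fix}\nSee: {see}"


def check_head_dim(d_model: int, n_heads: int, lo: int = 32, hi: int = 256) -> Optional[str]:
    """Return a 3-line message if head_dim is invalid or outside [lo, hi], else None."""
    if d_model % n_heads != 0:
        return format_error(
            "ValidationError",
            f"d_model={d_model} is not divisible by n_heads={n_heads}",
            "pick an n_heads value that divides d_model",
        )
    head_dim = d_model // n_heads
    if not lo <= head_dim <= hi:
        return format_error(
            "ValidationError",
            f"head_dim={head_dim} is outside the realistic range {lo}-{hi}",
            f"adjust n_heads so that d_model / n_heads falls within {lo}-{hi}",
        )
    return None


print(check_head_dim(2048, 16))   # head_dim = 128, inside the range
print(check_head_dim(2048, 128))  # head_dim = 16, below the range
```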

Start with a single YAML

A declarative spec of ~10 lines fully describes the shape of a model.

schema_version: 1
model: { name: "my-llm", d_model: 2048, vocab_size: 32000, max_seq_len: 4096, n_heads: 16 }
tokenizer_contract: { type: hf, pretrained: gpt2 }
embedding: { type: learned, positional: rope }
layer_templates:
  decoder:
    mixer: { type: attention, attention: {} }
    ffn: { type: gated_mlp, activation: swiglu }
layer_schedule:
  - { template: decoder, repeat: 24 }
head: { type: causal_lm }
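
For a feel of the scale this spec describes, the following back-of-the-envelope estimate counts the dominant weight matrices for d_model 2048, 24 layers, and a 32000-token vocabulary. Several inputs are assumptions, since the spec does not state them: plain multi-head attention (no GQA), a gated-MLP hidden size d_ff of 5632, tied input/output embeddings, and no norm or bias parameters. The relative_mismatch helper only illustrates the kind of comparison a target_params check could make; the target value is hypothetical.

```python
def estimate_params(d_model: int, n_layers: int, vocab_size: int, d_ff: int,
                    tie_embeddings: bool = True) -> int:
    """Rough decoder-only parameter count, ignoring norms and biases."""
    attn = 4 * d_model * d_model   # Q, K, V, O projections (plain MHA, no GQA)
    ffn = 3 * d_model * d_ff       # gate, up, and down projections of a gated MLP
    embed = vocab_size * d_model   # token embedding table
    head = 0 if tie_embeddings else vocab_size * d_model
    return n_layers * (attn + ffn) + embed + head


def relative_mismatch(estimate: int, target: int) -> float:
    """Relative distance between an estimate and a declared target."""
    return abs(estimate - target) / target


# d_ff=5632 is an assumed value; the spec above does not set it.
est = estimate_params(d_model=2048, n_layers=24, vocab_size=32000, d_ff=5632)
print(f"{est:,} parameters")  # 1,298,661,376 parameters

# Hypothetical target of 1.3B: well inside a 30% tolerance.
print(relative_mismatch(est, 1_300_000_000) > 0.30)
```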

v0.1.5 spec extensions (optional, default OFF) — μP scaling, differentiation auxiliary objectives, and the organ (tissue) declaration:

# Add the following to the spec above (existing YAML works unchanged)
training_hints:
  scaling: { parametrization: mup, base_width: 256 }  # μP (W-AS-1)
  differentiation_objectives: { usage_probe_coef: 0.01 }  # differentiation aux objective (W-AS-2)
tissue:  # organ/column declaration (W-AS-3)
  columns:
    - { name: global_integration, templates: [decoder], role: global_binding }
  connectivity: ring
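
This page does not say how a trainer consumes the scaling hint. As a hedged illustration only, the sketch below applies one widely used μP convention for Adam-style optimizers: with a width multiplier of d_model / base_width, the learning rate of hidden (matrix-like) weights and the output logits are both scaled by 1 / multiplier, while vector-like parameters keep the base learning rate. The function name and dictionary keys are hypothetical, and the mapping to EulerStack or EulerForge behavior is an assumption.

```python
def mup_multipliers(d_model: int, base_width: int) -> dict:
    """Scaling factors under one common muP convention for Adam (an assumption here)."""
    width_mult = d_model / base_width
    return {
        "width_mult": width_mult,
        "hidden_lr_scale": 1.0 / width_mult,     # learning rate of hidden weight matrices
        "output_logit_scale": 1.0 / width_mult,  # multiplier applied to the readout logits
        "vector_lr_scale": 1.0,                  # embeddings, biases, norm gains
    }


# d_model=2048 from the base spec, base_width=256 from the scaling hint.
scales = mup_multipliers(d_model=2048, base_width=256)
print(scales["width_mult"])       # 8.0
print(scales["hidden_lr_scale"])  # 0.125
```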

Presets: 57 on a 3-tier learning path

Organized by the v1 "industrial ordering principle": validated industrial → recent hybrid/MoE → v1 experimental primitives. The order is the recommended learning path — start with what industry has already proven, move through recent hybrid/MoE research, then explore v1's new primitives (MLA / MoD / Titans / Dual-Stream) at arch-scale. 24 llm_ + 33 arch_ = 57 (33 arch_ = beginner 2 · intermediate 3 · advanced 5 · expert 23, of which 9 are *_mini). Presets are only starting points — edit d_model, n_heads, and layer count to assemble a model at any scale.

Tier 1 — Validated industrial

Production-grade baselines. Training recipes are well-studied; failure modes are known. Works stably from 0.1B (Stage-1 / CPT warm-up) up to 16B.

Preset | ~Params | One-liner | Research basis
arch_beginner_gpt2 | ~1.1B | Classic Transformer (MHA + LayerNorm post + GeLU) | Vaswani 2017, GPT-2
arch_beginner_llama | ~1.1B | Modern baseline (GQA + RMSNorm pre + SwiGLU) | Llama 2/3
arch_intermediate_mistral | ~1.3B | 1 global : 3 sliding attention | Mistral 7B
arch_intermediate_gemma2 | ~1.3B | 1:1 alternating global/local | Gemma 2
arch_intermediate_qwen_longctx | ~1.3B | RoPE scaling factor 4, 32K ctx | Qwen 2/3
llm_0p1b_{simple,mistral} | ~100M | Stage-1 / CPT warm-up | Sovereign-foundation pilot
llm_*_simple (0.8B–16B) | 0.8B–16B | Pure attention (Llama) |
llm_*_mistral (0.8B–16B) | 0.8B–16B | Attention + sliding window | Mistral 7B

Tier 2 — Recent hybrid / MoE / long-context / KV-compressed

Modern compositions that are already in production somewhere. Sized for 24 GB GPU training. The expert level crosses MoE strategy × mixer × depth/receptive-field as a 3D design space; four entries are speculative compositions not yet in the literature.

Level | Preset | ~Params | One-liner | Research basis
advanced | arch_advanced_jamba | ~1.2B | Mamba + Attention 3:1 hybrid | Jamba-1.5 (AI21 2024)
advanced | arch_advanced_samba | ~1.0B | Mamba + Sliding attention 1:1 | Samba (Microsoft 2024)
advanced | arch_advanced_retnet | ~1.3B | Pure RetNet (attention-free) | Sun 2023
advanced (v1 B2.1) | arch_advanced_mla | ~1.1B | MLA: KV compressed via latent_dim | DeepSeek-V3 (2024)
advanced (v1 B3.1) | arch_advanced_mod | ~1.1B | Mixture-of-Depths (token-level layer skip) | Raposo ICML 2024
expert | arch_expert_research | ~1.5B | 4 mixers + MoE 3-phase | Research-grade
expert | arch_expert_mixtral_moe | ~1.9B | Pure attn + every-layer MoE (8 × top-2) | Mixtral 8x7B (2024)
expert | arch_expert_striped_hyena | ~1.0B | Hyena + Attention 4:1, 128K | StripedHyena
expert | arch_expert_blackmamba_moe | ~1.5B | Mamba + MoE (MoE on non-attn mixer) | BlackMamba, MoE-Mamba
expert | arch_expert_deepseek_moe | ~2.0B | Fine-grained MoE (32 × top-3) | DeepSeek-V2/V3 (2024)
expert NEW | arch_expert_dsv4_v3fallback | ~2.0B | DeepSeek-V4 schema (V3 fallback path) | DeepSeek-V3/V4
expert (speculative) | arch_expert_retnet_moe | ~1.5B | RetNet + MoE (no paper) | Sun 2023 + MoE extrapolation
expert (speculative) | arch_expert_frontier_full_moe | ~2.0B | Attention-free, multi-mixer + all-MoE (most speculative) | Composition prediction
expert (speculative) | arch_expert_progressive_stack | ~1.5B | Depth-wise hyena→mamba→retnet→attn+MoE (no paper) | Hierarchical prediction
expert (speculative) | arch_expert_dilated_longnet | ~2.0B | Temporal pyramid: mamba+sw(1K→4K→16K)+global+MoE (no paper) | Longnet + Jamba extrapolation
expert (capstone) | arch_expert_kitchen_sink | | Combines every available primitive in one spec for max-surface validation | Aggregate validation

Tier 3 — v1 experimental primitives (Phase B at arch-scale)

Each preset showcases one Phase B primitive at arch-scale (~1.2–1.4B). The schema is complete but runtime support is partial: the compiler falls back to a standard block for unimplemented mixers, while the full spec metadata still round-trips via config.v1_extensions. The workflow is: declare a Phase B primitive in YAML, compile, and save as an HF custom model.

Preset | ~Params | One-liner | Research basis
arch_expert_reasoning_r1 | ~1.3B | 2-phase reasoning (think / answer) | DeepSeek-R1 (2025), Quiet-STaR
arch_expert_titans_memory | ~1.2B | Parametric memory + test-time update | Titans (Google 2024–2025)
arch_expert_dual_stream | ~1.4B | Monoidal parallel (Mamba ∥ Attention) | Jamba × PaLM generalization

arch_expert_*_mini — Small-scale speculative experts (9, ~80M–360M)

Mini variants of the speculative expert architectures. Same design ideas, dramatically smaller (d_model 384–512, ~12 layers) so that a full training ablation fits on a single consumer GPU. Intended for quickly iterating on architectural hypotheses before committing to a 2B training run. arch_expert_progressive_stack_mini is the recommended starting point.

Preset | ~Total | ~Active | Mirror of | Pedagogical role
arch_expert_progressive_stack_mini | ~86M | ~86M | arch_expert_progressive_stack | RECOMMENDED first experiment
arch_expert_blackmamba_moe_mini | ~156M | ~90M | arch_expert_blackmamba_moe | Partial-sparse MoE on SSM
arch_expert_mixtral_moe_mini | ~175M | ~90M | arch_expert_mixtral_moe | Classic every-layer MoE baseline
arch_expert_dilated_longnet_mini | ~83M | ~75M | arch_expert_dilated_longnet | Long-context temporal pyramid
arch_expert_deepseek_moe_mini | ~357M | ~60M | arch_expert_deepseek_moe | ⚠ Observe fine-grained MoE failure
arch_expert_frontier_full_moe_mini | ~106M | ~60M | arch_expert_frontier_full_moe | ⚠ Most experimental; expected to fail
arch_expert_dsv4_flash_mini NEW | ~180M | ~70M | DeepSeek-V4 | DSv4 + Flash/NSA compressed attention
arch_expert_dsv4_subset_mini NEW | ~180M | ~70M | DeepSeek-V4 | DSv4 feature subset
arch_expert_mhc_moe_mini NEW | ~150M | ~70M | mHC + MoE | multi-Hyper-Connection residual + MoE
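
The gap between the ~Total and ~Active columns comes from top-k routing: every expert is stored, but each token only passes through k of them per MoE layer. The sketch below shows that arithmetic under simplifying assumptions (router weights ignored, all experts the same size). The function name and the input numbers are illustrative only; they are not the actual parameter breakdown of any preset.

```python
from typing import Tuple


def moe_param_counts(shared: int, per_expert: int, n_experts: int,
                     top_k: int, n_moe_layers: int) -> Tuple[int, int]:
    """Return (total, active) parameters for a model with top-k routed MoE FFNs."""
    total = shared + n_moe_layers * n_experts * per_expert
    active = shared + n_moe_layers * top_k * per_expert
    return total, active


# Illustrative numbers: 60M shared parameters (attention, embeddings, ...),
# 12 MoE layers, 8 experts of 1.2M parameters each, top-2 routing.
total, active = moe_param_counts(
    shared=60_000_000, per_expert=1_200_000, n_experts=8, top_k=2, n_moe_layers=12
)
print(f"total={total:,} active={active:,}")  # total=175,200,000 active=88,800,000
```

With k equal to the number of experts the two counts coincide, which matches the rows where total and active are the same because no MoE sparsity applies.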

llm_ — Size × Architectural Variant (24)

5 sizes (0.1B / 0.8B / 2B / 4B / 16B) × 5 variants (simple / mistral / jamba / moe / mla). moe is omitted at 0.1B.

Scale | simple | mistral | jamba | moe | mla
0.1B | llm_0p1b_simple | llm_0p1b_mistral | llm_0p1b_jamba | (omitted) | llm_0p1b_mla
0.8B | llm_0p8b_simple | llm_0p8b_mistral | llm_0p8b_jamba | llm_0p8b_moe | llm_0p8b_mla
2B | llm_2b_simple | llm_2b_mistral | llm_2b_jamba | llm_2b_moe | llm_2b_mla
4B | llm_4b_simple | llm_4b_mistral | llm_4b_jamba | llm_4b_moe | llm_4b_mla
16B | llm_16b_simple | llm_16b_mistral | llm_16b_jamba | llm_16b_moe | llm_16b_mla
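
The naming grid can be reproduced in a few lines, which is handy when scripting sweeps over presets. This sketch only rebuilds the names listed in the table (5 sizes × 5 variants, minus moe at 0.1B); it does not query EulerStack itself.

```python
sizes = ["0p1b", "0p8b", "2b", "4b", "16b"]
variants = ["simple", "mistral", "jamba", "moe", "mla"]

# moe is omitted at the 0.1B scale, which leaves 5 * 5 - 1 = 24 presets.
llm_presets = [
    f"llm_{size}_{variant}"
    for size in sizes
    for variant in variants
    if not (size == "0p1b" and variant == "moe")
]

print(len(llm_presets))  # 24
print(llm_presets[:4])
```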

Variant semantics: simple = pure attention (Llama) · mistral = attention + sliding window (1 global : 3 sliding per 4 layers) · jamba = Mamba + Attention hybrid (3:1) · moe = attention + MoE FFN (1-in-4 layers, 8 experts, top-2) · mla = Multi-head Latent Attention (DeepSeek-V3 style KV compression).
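
The ratios above can be pictured as a repeating group of four layers. The sketch below generates a (mixer, ffn) pair per layer for each variant. The ratios come from the text; everything else is an assumption made for illustration, including which slot inside each group of four holds the global attention, the attention layer, or the MoE FFN, the use of a gated MLP as the non-MoE FFN, and the label strings themselves.

```python
from typing import List, Tuple


def layer_pattern(variant: str, n_layers: int) -> List[Tuple[str, str]]:
    """Return (mixer, ffn) per layer for one llm_ variant, using groups of 4 layers."""
    layers = []
    for i in range(n_layers):
        slot = i % 4  # position inside the current group of four layers
        mixer, ffn = "attention", "gated_mlp"
        if variant == "mistral":
            # 1 global : 3 sliding per 4 layers (slot choice is an assumption)
            mixer = "attention_global" if slot == 0 else "attention_sliding"
        elif variant == "jamba":
            # 3 Mamba : 1 Attention (slot choice is an assumption)
            mixer = "attention" if slot == 3 else "mamba"
        elif variant == "moe":
            # MoE FFN on 1 in 4 layers, 8 experts, top-2
            ffn = "moe_8x_top2" if slot == 3 else "gated_mlp"
        elif variant == "mla":
            mixer = "mla"
        elif variant != "simple":
            raise ValueError(f"unknown variant: {variant!r}")
        layers.append((mixer, ffn))
    return layers


for name in ("simple", "mistral", "jamba", "moe", "mla"):
    print(name, layer_pattern(name, 8))
```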

No upper limit — presets are starting points. EulerStack can assemble a model of any size by editing d_model, n_heads, and layer count.

CLI Reference

Follows the eulerwa CLI family convention. All errors are printed in the 3-line format (Category: what / Fix: / See:).

Top-Level Commands

validate Validate a YAML spec (--report includes realism checks)
explain Human-readable model summary (layers, parameter estimate)
compile IR → JSON runtime config (--output) or HF model directory (--output-dir)
schema Print YAML schema structure
presets list / show Enumerate presets or show details for one

Common Options

--lang Output language (ko/en/zh/ja/es). Root option; default ko
--preset YAML spec file path
--validate-only Validate and exit without further work
--output / -o JSON runtime config output path
--output-dir HF model directory output (config.json + model.safetensors)
--print-config / --dry-run Print resolved config to stdout

5-language i18n CLI

Every CLI help page, log message, warning, and error is translated into ko / en / zh / ja / es. Default language is Korean (ko); switch via the --lang root option or the EULERSTACK_LANG environment variable. Command names, option names, and the Fix: / See: labels in the 3-line error format stay untranslated for script compatibility.

eulerstack validate --preset my_model.yml
# Korean (default)

eulerstack --lang en validate --preset my_model.yml
# English

EULERSTACK_LANG=ja eulerstack validate --preset my_model.yml
# env var also works
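
A wrapper script that needs to predict which language the CLI will use could mirror the selection logic along these lines. The order of precedence is an assumption: the page states that both --lang and EULERSTACK_LANG work and that ko is the default, but not which one wins when both are set, nor how unsupported values are handled. The function name resolve_lang is hypothetical.

```python
import os
from typing import Mapping, Optional

SUPPORTED = ("ko", "en", "zh", "ja", "es")


def resolve_lang(cli_lang: Optional[str] = None,
                 env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the output language: --lang first, then EULERSTACK_LANG, then ko.

    Assumptions: the flag overrides the environment variable, and unsupported
    values are skipped rather than rejected.
    """
    env = os.environ if env is None else env
    for candidate in (cli_lang, env.get("EULERSTACK_LANG")):
        if candidate in SUPPORTED:
            return candidate
    return "ko"


print(resolve_lang("en", {"EULERSTACK_LANG": "ja"}))
print(resolve_lang(None, {"EULERSTACK_LANG": "ja"}))
print(resolve_lang(None, {}))
```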

HF model directory → EulerForge training

compile --output-dir writes a HuggingFace-compatible directory (config.json + model.safetensors) — the primary handoff path into the EulerForge training pipeline.

eulerstack compile --preset my_model.yml --output-dir ./my_model

# Load it from Python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("./my_model", trust_remote_code=True)
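
Before loading, a script can cheaply confirm that the compile step produced both files named above. This is a small standard-library sketch; the helper name missing_files is hypothetical, and the temporary directory only stands in for a real compile output.

```python
import tempfile
from pathlib import Path
from typing import List

# The two files the text says compile --output-dir writes.
EXPECTED = ("config.json", "model.safetensors")


def missing_files(model_dir) -> List[str]:
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]


# Demo on a throwaway directory that contains only config.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    demo_missing = missing_files(tmp)

print(demo_missing)  # ['model.safetensors']
```

An empty list means the directory has the shape the loader expects; it does not validate the contents of either file.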

5-Layer Architecture

From YAML spec to a trainable model — 5 layers with strict separation of concerns.

Layer 1: DSL User-authored YAML spec (schema_version 1, declarative model definition)
Layer 2: Schema Structural validation — unknown keys, type/enum, required fields, cross-field compatibility
Layer 3: IR Normalized canonical structure (default fills, template expansion)
Layer 4: Compiler IR → JSON runtime config or HF model directory (config.json + model.safetensors) — loadable via AutoModelForCausalLM.from_pretrained() for EulerForge training
Layer 5: CLI validate / explain / compile / schema / presets — all messages i18n-translated across 5 languages

Tutorials

Tutorials are maintained in Korean (ko) and English (en) and are available directly on this site under /en/products/eulerstack/tutorials/ (upstream path: docs/tutorials/{ko,en}/).

Core Tutorials (11)

00_positioning Read first: where EulerStack fits as an Architecture Description Language (ADL) for LLMs
01_validate_a_spec Validate a YAML spec
02_use_presets Use presets
03_spec_reference Spec reference
04_compile_and_explain Compile & explain
05_prepare_data Prepare training data
06_sanity_train Sanity training loop
07_arch_walkthrough Skill-level architecture walkthrough (tour of the arch_ presets)
08_expert_mini_walkthrough Expert mini preset walkthrough (single-GPU ablation)
09_new_primitives_walkthrough NEW: v1 Phase B primitives (MLA / Titans / MoD / Dual-Stream / Neural-ODE / TTT)
10_paper_to_yaml NEW: paper → YAML case studies (DeepSeek-V3 / Jamba / DeepSeek-R1 / Titans)

Mixer deep dives (mixers/, 5)

00_overview Mixers concept: why mix attention / mamba / retnet / hyena
01_attention Attention in depth
02_mamba Mamba in depth
03_retnet RetNet in depth
04_hyena Hyena in depth

Install & Quickstart

Install

pip install -e .

# or include dev dependencies
pip install -e ".[dev]"

Quickstart

# List presets (Korean default)
eulerstack presets list

# Validate with realism report
eulerstack validate --preset my_model.yml --report

# Build an HF model directory → hand off to EulerForge training
eulerstack compile --preset my_model.yml --output-dir ./my_model

# Switch CLI messages to English
eulerstack --lang en validate --preset my_model.yml

Design LLM Architectures with EulerStack

Combine Attention, Mamba, RetNet, Hyena, and MoE into hybrid models with a single YAML spec — then hand off the HuggingFace model directory to EulerForge for training.

Get Started on GitHub