An Architecture Description Language (ADL) for LLMs
EulerStack is an Architecture Description Language (ADL) for LLMs. It lifts architecture out of the PyTorch files where structure, training, and serving usually live tangled together and expresses it in a declarative language built for that purpose. This is the same abstraction step the semiconductor industry took when Verilog and VHDL replaced schematics-plus-C.

A single YAML spec flows through a 5-layer pipeline (DSL → Schema → IR → Compiler → CLI) for validation and normalization, and compile --output-dir emits a HuggingFace model directory (config.json + model.safetensors) that hands off directly to EulerForge for training.

The 57 presets (24 llm_ + 33 arch_, of which 9 are arch_expert_*_mini) are organized along a 3-tier learning path (validated industrial → recent hybrid/MoE → v1 experimental primitives), and all CLI messages are translated into 5 languages (ko / en / zh / ja / es).

v0.1.5 adds μP scaling (training_hints.scaling), differentiation auxiliary objectives (training_hints.differentiation_objectives), and an organ declaration (tissue) as backward-compatible spec extensions. All three default to OFF, and existing v0.1.4 YAML works unchanged.
Define named layer templates (mixer + FFN + norm + residual) and use schedules to specify arrangement order and repetition counts.
| Mixer Types | Attention, Mamba, RetNet, Hyena |
|---|---|
| FFN Types | MLP, Gated MLP (SwiGLU), MoE (top-k routing) |
| Norm | RMSNorm, LayerNorm (pre/post position) |
| Residual | Sequential, Parallel, Hyper-Connection (mHC) |
| Head | causal_lm, causal_lm_mtp (Multi-Token Prediction) |
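
The relationship between named layer templates and a schedule can be sketched in a few lines of Python. This is only an illustration of the idea, not EulerStack's IR or spec format: the `LayerTemplate` class, the `expand_schedule` helper, and the template names `ssm` and `attn` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerTemplate:
    """One named layer recipe: mixer + FFN + norm + residual (hypothetical model)."""
    mixer: str     # e.g. "attention", "mamba", "retnet", "hyena"
    ffn: str       # e.g. "mlp", "gated_mlp", "moe"
    norm: str      # e.g. "rmsnorm", "layernorm"
    residual: str  # e.g. "sequential", "parallel"


def expand_schedule(templates, schedule):
    """Flatten (template_name, repeat) pairs into an ordered list of layers."""
    layers = []
    for name, repeat in schedule:
        if name not in templates:
            raise KeyError(f"schedule references unknown template: {name!r}")
        if repeat < 1:
            raise ValueError(f"repeat for {name!r} must be positive, got {repeat}")
        layers.extend([templates[name]] * repeat)
    return layers


templates = {
    "ssm": LayerTemplate("mamba", "gated_mlp", "rmsnorm", "sequential"),
    "attn": LayerTemplate("attention", "gated_mlp", "rmsnorm", "sequential"),
}

# Illustrative 3:1 Mamba-to-attention pattern, repeated twice (8 layers total).
schedule = [("ssm", 3), ("attn", 1)] * 2
layers = expand_schedule(templates, schedule)
print([layer.mixer for layer in layers])
```

Keeping templates and schedule separate means a ratio change (say 3:1 to 1:1) touches only the schedule, while the layer recipes stay untouched.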
A 3-stage process — schema structure → cross-field compatibility → heuristic realism checks — catches design errors before compilation. Every error is printed in the 3-line format (Category: what / Fix: / See:).
| Structure | Unknown keys, type/enum, required fields, positive constraints |
|---|---|
| Compatibility | Mixer↔state mismatches (e.g., mamba + kv_cache forbidden) |
| Realism | head_dim range (32–256), target_params mismatch (>30%), MoE expert ratio, seq_len/d_model ratio, family_hint consistency, vocab/tokenizer consistency, tie_weight consistency, rope_scaling bounds |
| Error Categories | ValidationError, CompatibilityError, CompileError, NormalizationError |
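
The three stages and the 3-line error format can be pictured with a small Python sketch. It is an assumption-laden illustration, not EulerStack's validator: the flat `spec` dictionary, the field names, the `check_spec` function, and the choice of `ValidationError` as the category for realism findings are all hypothetical. Only the head_dim range (32–256), the mamba + kv_cache rule, and the category names come from the table above.

```python
from dataclasses import dataclass

SEE = "03_spec_reference"  # tutorial name used here as a placeholder pointer


@dataclass
class SpecError:
    category: str
    what: str
    fix: str
    see: str

    def render(self) -> str:
        # Three lines: "Category: what", "Fix: ...", "See: ..."
        return f"{self.category}: {self.what}\nFix: {self.fix}\nSee: {self.see}"


def check_spec(spec: dict) -> list:
    errors = []

    # Stage 1: structure (required fields, positive constraints).
    for key in ("d_model", "n_heads", "mixer"):
        if key not in spec:
            errors.append(SpecError("ValidationError",
                                    f"missing required field '{key}'",
                                    f"add '{key}' to the spec", SEE))
    for key in ("d_model", "n_heads"):
        if key in spec and (not isinstance(spec[key], int) or spec[key] <= 0):
            errors.append(SpecError("ValidationError",
                                    f"'{key}' must be a positive integer",
                                    f"set '{key}' to a value > 0", SEE))
    if errors:
        return errors  # later stages assume a structurally sound spec

    # Stage 2: cross-field compatibility.
    if spec["mixer"] == "mamba" and spec.get("kv_cache"):
        errors.append(SpecError("CompatibilityError",
                                "mixer 'mamba' cannot be combined with kv_cache",
                                "remove kv_cache or use an attention mixer", SEE))

    # Stage 3: heuristic realism check on head_dim.
    head_dim = spec["d_model"] // spec["n_heads"]
    if not 32 <= head_dim <= 256:
        errors.append(SpecError("ValidationError",
                                f"head_dim {head_dim} is outside the 32-256 range",
                                "adjust d_model or n_heads", SEE))
    return errors


bad_spec = {"d_model": 512, "n_heads": 64, "mixer": "mamba", "kv_cache": True}
for err in check_spec(bad_spec):
    print(err.render())
```

Stopping after the structure stage when it fails keeps the later stages simple, since they can assume every required field exists and is well-typed.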
A declarative spec of ~10 lines fully describes the shape of a model.
v0.1.5 spec extensions (optional, default OFF): μP scaling, differentiation auxiliary objectives, and the organ (tissue) declaration.
Organized by the v1 "industrial ordering principle": validated industrial → recent hybrid/MoE → v1 experimental primitives. The order is the recommended learning path — start with what industry has already proven, move through recent hybrid/MoE research, then explore v1's new primitives (MLA / MoD / Titans / Dual-Stream) at arch-scale. 24 llm_ + 33 arch_ = 57 (33 arch_ = beginner 2 · intermediate 3 · advanced 5 · expert 23, of which 9 are *_mini). Presets are only starting points — edit d_model, n_heads, and layer count to assemble a model at any scale.
Production-grade baselines. Training recipes are well-studied; failure modes are known. Works stably from 0.1B (Stage-1 / CPT warm-up) up to 16B.
| Preset | ~Params | One-liner | Research basis |
|---|---|---|---|
| arch_beginner_gpt2 | ~1.1B | Classic Transformer (MHA + LayerNorm post + GeLU) | Vaswani 2017, GPT-2 |
| arch_beginner_llama | ~1.1B | Modern baseline (GQA + RMSNorm pre + SwiGLU) | Llama 2/3 |
| arch_intermediate_mistral | ~1.3B | 1 global : 3 sliding attention | Mistral 7B |
| arch_intermediate_gemma2 | ~1.3B | 1:1 alternating global/local | Gemma 2 |
| arch_intermediate_qwen_longctx | ~1.3B | RoPE scaling factor 4, 32K ctx | Qwen 2/3 |
| llm_0p1b_{simple,mistral} | ~100M | Stage-1 / CPT warm-up | Sovereign-foundation pilot |
| llm_*_simple (0.8B–16B) | 0.8B–16B | Pure attention (Llama) | — |
| llm_*_mistral (0.8B–16B) | 0.8B–16B | Attention + sliding window | Mistral 7B |
Modern compositions that are already in production somewhere. Sized for 24 GB GPU training. The expert level crosses MoE strategy × mixer × depth/receptive-field as a 3D design space; four entries are speculative compositions not yet in the literature.
| Level | Preset | ~Params | One-liner | Research basis |
|---|---|---|---|---|
| advanced | arch_advanced_jamba | ~1.2B | Mamba + Attention 3:1 hybrid | Jamba-1.5 (AI21 2024) |
| advanced | arch_advanced_samba | ~1.0B | Mamba + Sliding attention 1:1 | Samba (Microsoft 2024) |
| advanced | arch_advanced_retnet | ~1.3B | Pure RetNet (attention-free) | Sun 2023 |
| advanced (v1 B2.1) | arch_advanced_mla | ~1.1B | MLA — KV compressed via latent_dim | DeepSeek-V3 (2024) |
| advanced (v1 B3.1) | arch_advanced_mod | ~1.1B | Mixture-of-Depths (token-level layer skip) | Raposo ICML 2024 |
| expert | arch_expert_research | ~1.5B | 4 mixers + MoE 3-phase | Research-grade |
| expert | arch_expert_mixtral_moe | ~1.9B | Pure attn + every-layer MoE (8 × top-2) | Mixtral 8x7B (2024) |
| expert | arch_expert_striped_hyena | ~1.0B | Hyena + Attention 4:1, 128K | StripedHyena |
| expert | arch_expert_blackmamba_moe | ~1.5B | Mamba + MoE (MoE on non-attn mixer) | BlackMamba, MoE-Mamba |
| expert | arch_expert_deepseek_moe | ~2.0B | Fine-grained MoE (32 × top-3) | DeepSeek-V2/V3 (2024) |
| expert NEW | arch_expert_dsv4_v3fallback | ~2.0B | DeepSeek-V4 schema (V3 fallback path) | DeepSeek-V3/V4 |
| expert (speculative) | arch_expert_retnet_moe | ~1.5B | RetNet + MoE (no paper) | Sun 2023 + MoE extrapolation |
| expert (speculative) | arch_expert_frontier_full_moe | ~2.0B | Attention-free, multi-mixer + all-MoE (most speculative) | Composition prediction |
| expert (speculative) | arch_expert_progressive_stack | ~1.5B | Depth-wise hyena→mamba→retnet→attn+MoE (no paper) | Hierarchical prediction |
| expert (speculative) | arch_expert_dilated_longnet | ~2.0B | Temporal pyramid: mamba+sw(1K→4K→16K)+global+MoE (no paper) | Longnet + Jamba extrapolation |
| expert (capstone) | arch_expert_kitchen_sink | — | Combines every available primitive in one spec for max-surface validation | Aggregate validation |
Each preset showcases one Phase B primitive at arch-scale (~1.2–1.4B). These presets are schema-complete but only partially implemented at runtime: the compiler falls back to a standard block for unimplemented mixers, while the full spec metadata still round-trips via config.v1_extensions. The experience is "declare a Phase B primitive in YAML, compile, save as an HF custom model".
| Preset | ~Params | One-liner | Research basis |
|---|---|---|---|
| arch_expert_reasoning_r1 | ~1.3B | 2-phase reasoning (think / answer) | DeepSeek-R1 (2025), Quiet-STaR |
| arch_expert_titans_memory | ~1.2B | Parametric memory + test-time update | Titans (Google 2024–2025) |
| arch_expert_dual_stream | ~1.4B | Monoidal parallel (Mamba ∥ Attention) | Jamba × PaLM generalization |
arch_expert_*_mini: Small-scale speculative experts (9, ~80M–360M)

Mini variants of the speculative expert architectures. Same design ideas, dramatically smaller (d_model 384–512, ~12 layers) so that a full training ablation fits on a single consumer GPU. Intended for quickly iterating on architectural hypotheses before committing to a 2B training run. arch_expert_progressive_stack_mini is the recommended starting point.
| Preset | ~Total | ~Active | Mirror of | Pedagogical role |
|---|---|---|---|---|
| arch_expert_progressive_stack_mini | ~86M | ~86M | arch_expert_progressive_stack | RECOMMENDED first experiment |
| arch_expert_blackmamba_moe_mini | ~156M | ~90M | arch_expert_blackmamba_moe | Partial-sparse MoE on SSM |
| arch_expert_mixtral_moe_mini | ~175M | ~90M | arch_expert_mixtral_moe | Classic every-layer MoE baseline |
| arch_expert_dilated_longnet_mini | ~83M | ~75M | arch_expert_dilated_longnet | Long-context temporal pyramid |
| arch_expert_deepseek_moe_mini | ~357M | ~60M | arch_expert_deepseek_moe | ⚠ Observe fine-grained MoE failure |
| arch_expert_frontier_full_moe_mini | ~106M | ~60M | arch_expert_frontier_full_moe | ⚠ Most experimental; expected to fail |
| arch_expert_dsv4_flash_mini NEW | ~180M | ~70M | DeepSeek-V4 | DSv4 + Flash/NSA compressed attention |
| arch_expert_dsv4_subset_mini NEW | ~180M | ~70M | DeepSeek-V4 | DSv4 feature subset |
| arch_expert_mhc_moe_mini NEW | ~150M | ~70M | mHC + MoE | multi-Hyper-Connection residual + MoE |
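
The gap between the ~Total and ~Active columns comes from top-k routing: every expert counts toward the total, but only the routed experts run for a given token. The sketch below shows that arithmetic with made-up numbers. The function name and all inputs are illustrative assumptions and do not correspond to any preset in the table; router weights are ignored for simplicity.

```python
def moe_param_counts(shared_params, params_per_expert, n_experts, top_k, n_moe_layers):
    """Return (total, active) parameter counts for a simple MoE model.

    shared_params     -- everything outside the experts (mixers, norms, embeddings)
    params_per_expert -- weights in one expert FFN
    """
    if not 1 <= top_k <= n_experts:
        raise ValueError("top_k must be between 1 and n_experts")
    total = shared_params + n_moe_layers * n_experts * params_per_expert
    active = shared_params + n_moe_layers * top_k * params_per_expert
    return total, active


# Illustrative numbers only: 40M shared weights, 6 MoE layers, 8 experts of 2M each, top-2.
total, active = moe_param_counts(
    shared_params=40_000_000,
    params_per_expert=2_000_000,
    n_experts=8,
    top_k=2,
    n_moe_layers=6,
)
print(f"total={total / 1e6:.0f}M active={active / 1e6:.0f}M")
```

With a dense model (no MoE layers) the two numbers coincide, which matches the one row in the table where ~Total and ~Active are equal.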
llm_: Size × Architectural Variant (24)

5 sizes (0.1B / 0.8B / 2B / 4B / 16B) × 5 variants (simple / mistral / jamba / moe / mla). moe is omitted at 0.1B, which gives 5 × 5 − 1 = 24 presets.
| Scale | simple | mistral | jamba | moe | mla |
|---|---|---|---|---|---|
| 0.1B | llm_0p1b_simple | llm_0p1b_mistral | llm_0p1b_jamba | — | llm_0p1b_mla |
| 0.8B | llm_0p8b_simple | llm_0p8b_mistral | llm_0p8b_jamba | llm_0p8b_moe | llm_0p8b_mla |
| 2B | llm_2b_simple | llm_2b_mistral | llm_2b_jamba | llm_2b_moe | llm_2b_mla |
| 4B | llm_4b_simple | llm_4b_mistral | llm_4b_jamba | llm_4b_moe | llm_4b_mla |
| 16B | llm_16b_simple | llm_16b_mistral | llm_16b_jamba | llm_16b_moe | llm_16b_mla |
Variant semantics: simple = pure attention (Llama) · mistral = attention + sliding window (1 global : 3 sliding per 4 layers) · jamba = Mamba + Attention hybrid (3:1) · moe = attention + MoE FFN (1-in-4 layers, 8 experts, top-2) · mla = Multi-head Latent Attention (DeepSeek-V3 style KV compression).
No upper limit — presets are starting points. EulerStack can assemble a model of any size by editing d_model, n_heads, and layer count.
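
To get a feel for how d_model and layer count move the parameter total, a common back-of-envelope approximation for dense Transformers is about 12 · d_model² weights per layer plus the embedding table. The sketch below applies it. This is not EulerStack's own parameter estimator: the function name and the example dimensions are illustrative assumptions, and the formula ignores GQA, MoE, and non-attention mixers.

```python
def estimate_params(d_model, n_layers, vocab_size, tied_embeddings=True):
    """Rough dense-Transformer parameter count.

    Per layer: ~4*d^2 for attention projections (Q, K, V, O)
             + ~8*d^2 for the FFN (4x MLP, or SwiGLU at 8/3 expansion).
    Norms and biases are small enough to ignore at this precision.
    """
    per_layer = 12 * d_model * d_model
    embeddings = vocab_size * d_model
    if not tied_embeddings:
        embeddings *= 2  # separate input embedding and output projection
    return n_layers * per_layer + embeddings


# Illustrative dimensions, not taken from any preset.
for d_model, n_layers in [(768, 12), (2048, 22), (4096, 32)]:
    n = estimate_params(d_model, n_layers, vocab_size=32_000)
    print(f"d_model={d_model:5d} layers={n_layers:3d} -> ~{n / 1e9:.2f}B")
```

Because the per-layer term grows with the square of d_model, widening a model changes its size much faster than adding layers does.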
Follows the eulerwa CLI family convention. All errors are printed in the 3-line format (Category: what / Fix: / See:).
| validate | Validate a YAML spec (--report includes realism checks) |
|---|---|
| explain | Human-readable model summary (layers, parameter estimate) |
| compile | IR → JSON runtime config (--output) or HF model directory (--output-dir) |
| schema | Print YAML schema structure |
| presets list / show | Enumerate presets or show details for one |

| --lang | Output language (ko/en/zh/ja/es). Root option; default ko |
|---|---|
| --preset | YAML spec file path |
| --validate-only | Validate and exit without further work |
| --output / -o | JSON runtime config output path |
| --output-dir | HF model directory output (config.json + model.safetensors) |
| --print-config / --dry-run | Print resolved config to stdout |
Every CLI help page, log message, warning, and error is translated into ko / en / zh / ja / es. Default language is Korean (ko); switch via the --lang root option or the EULERSTACK_LANG environment variable. Command names, option names, and the Fix: / See: labels in the 3-line error format stay untranslated for script compatibility.
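
The language selection described above can be modeled as a small resolution function. This is a sketch under stated assumptions, not the CLI's actual code: the text does not say which source wins when both are set, so giving --lang precedence over EULERSTACK_LANG is an assumption here, as are the `resolve_lang` name and the choice to reject unsupported codes.

```python
import os

SUPPORTED_LANGS = ("ko", "en", "zh", "ja", "es")
DEFAULT_LANG = "ko"


def resolve_lang(cli_lang=None, env=None):
    """Pick the output language.

    Assumed order: --lang value, then EULERSTACK_LANG, then the Korean default.
    """
    env = os.environ if env is None else env
    for candidate in (cli_lang, env.get("EULERSTACK_LANG")):
        if candidate:
            if candidate not in SUPPORTED_LANGS:
                raise ValueError(
                    f"unsupported language {candidate!r}; "
                    f"expected one of {', '.join(SUPPORTED_LANGS)}"
                )
            return candidate
    return DEFAULT_LANG


print(resolve_lang(env={}))                                         # no flag, no env var
print(resolve_lang(env={"EULERSTACK_LANG": "en"}))                  # env var only
print(resolve_lang(cli_lang="ja", env={"EULERSTACK_LANG": "en"}))   # flag and env var
```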
compile --output-dir writes a HuggingFace-compatible directory (config.json + model.safetensors) — the primary handoff path into the EulerForge training pipeline.
From YAML spec to a trainable model — 5 layers with strict separation of concerns.
| Layer 1: DSL | User-authored YAML spec (schema_version 1, declarative model definition) |
|---|---|
| Layer 2: Schema | Structural validation — unknown keys, type/enum, required fields, cross-field compatibility |
| Layer 3: IR | Normalized canonical structure (default fills, template expansion) |
| Layer 4: Compiler | IR → JSON runtime config or HF model directory (config.json + model.safetensors) — loadable via AutoModelForCausalLM.from_pretrained() for EulerForge training |
| Layer 5: CLI | validate / explain / compile / schema / presets — all messages i18n-translated across 5 languages |
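
On the consuming side, the handoff amounts to pointing `AutoModelForCausalLM.from_pretrained()` at the compiled directory. The sketch below first checks that the two files named above are present. The helper names and the directory path in the usage note are hypothetical, and whether a compiled model needs extra loader arguments (for example `trust_remote_code=True` for a custom model class) is not stated in this document, so the call is left at its simplest form.

```python
from pathlib import Path

# The two files the compiler is described as writing with --output-dir.
EXPECTED_FILES = ("config.json", "model.safetensors")


def missing_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def load_compiled_model(model_dir):
    """Load a compiled model directory with HuggingFace transformers."""
    missing = missing_files(model_dir)
    if missing:
        raise FileNotFoundError(f"{model_dir} is missing: {', '.join(missing)}")
    # Imported lazily so the directory check works without transformers installed.
    from transformers import AutoModelForCausalLM
    # Assumption: no extra arguments are needed; a custom model class may
    # require additional loader options.
    return AutoModelForCausalLM.from_pretrained(model_dir)
```

A call such as `load_compiled_model("out/my_model")` (path hypothetical) would fail fast with a clear message if compilation did not finish, instead of surfacing a less direct loader error.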
Tutorials are maintained in Korean (ko) and English (en) and are available directly on this site under /en/products/eulerstack/tutorials/ (upstream path: docs/tutorials/{ko,en}/).
| 00_positioning | Read first: where EulerStack fits as an Architecture Description Language (ADL) for LLMs |
|---|---|
| 01_validate_a_spec | Validate a YAML spec |
| 02_use_presets | Use presets |
| 03_spec_reference | Spec reference |
| 04_compile_and_explain | Compile & explain |
| 05_prepare_data | Prepare training data |
| 06_sanity_train | Sanity training loop |
| 07_arch_walkthrough | Skill-level architecture walkthrough (tour of the arch_ presets) |
| 08_expert_mini_walkthrough | Expert mini preset walkthrough (single-GPU ablation) |
| 09_new_primitives_walkthrough | NEW: v1 Phase B primitives (MLA / Titans / MoD / Dual-Stream / Neural-ODE / TTT) |
| 10_paper_to_yaml | NEW: paper → YAML case studies (DeepSeek-V3 / Jamba / DeepSeek-R1 / Titans) |

mixers/ (5)

| 00_overview | Mixers concept: why mix attention / mamba / retnet / hyena |
|---|---|
| 01_attention | Attention in depth |
| 02_mamba | Mamba in depth |
| 03_retnet | RetNet in depth |
| 04_hyena | Hyena in depth |

Combine Attention, Mamba, RetNet, Hyena, and MoE into hybrid models with a single YAML spec — then hand off the HuggingFace model directory to EulerForge for training.