EulerStack

An Architecture Description Language (ADL) for LLMs + a research framework for proving new structures

Define an LLM architecture in a single YAML file. A declarative language separates the architecture from the tangled Python model file, the same direction of abstraction the semiconductor industry took when it replaced schematics + C with Verilog/VHDL.

A 5-layer pipeline hands off to EulerForge training. DSL → Schema → IR → Compiler → CLI: the pipeline validates, normalizes, and compiles the spec, and compile --output-dir generates a HuggingFace model directory (config.json + model.safetensors) that is ready to train.

v0.1.5 is a research framework for exploring new structural hypotheses without code changes. It adds μP scaling, differentiation objectives, and tissue blocks as backward-compatible spec extensions (default OFF; v0.1.4 YAML works unchanged). With them you can ablate cortical-column and neural-differentiation primitives to investigate three questions at the architecture level: early emergence of reasoning, faster abstraction, and independent development of world knowledge versus reasoning/abstraction.

57 presets · v0.1.5 (current version) · 5-language CLI · 16 tutorials

What Cortical Structure Brings to LLMs

Inspired by the brain's cortical organization — can reasoning and abstraction grow independently within a single model?

🧠

Early emergence of reasoning

The hypothesis: introducing a cortical hierarchy lets reasoning patterns appear earlier, with less data and fewer parameters. EulerStack turns the structure under test on or off with a single YAML line.

⚡

Faster abstraction

The hypothesis: abstraction and reasoning do not have to compete within the same layer. Separated hierarchies let the two capabilities develop faster without interfering with each other.

🔬

Independent development of knowledge ↔ reasoning

If world knowledge (memory) and reasoning/abstraction can be grown independently, the center of LLM research shifts from capital to structural design. EulerStack is that experimental environment.

Core architecture — atlas_diff

A 5-axis backbone declared in a single EulerStack YAML:

local_scan	Sliding window 128 (local processing)
temporal_integrator	Mamba2 (temporal integration)
global_binding_mla	MLA (global binding)
associative_memory_moe_diff	MoE + differentiation (associative memory)
reasoning	execution_modes (reasoning execution)

arch-scale (~hundreds of M params) · atlas_diff_200m.yml · in-house research projects/04 Neural Differentiation Study (in progress)

Observed differentiation axes

Knowledge	How language-modeling ability (perplexity, etc.) moves with data and scale.
Reasoning	Whether reasoning benchmarks improve independently of LM ability — how structural differences affect reasoning.
Differentiation level	The degree of representation/role differentiation across experts and columns — observed via expert-CKA, column-ablation, and gate statistics.
Investigation approach	Comparing alternatives such as hash routing, group-MoE, DSv4 affinity, and cortical dualmem by changing only the EulerStack spec.
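
As one illustration of what a gate statistic could look like, the sketch below computes the normalized entropy of expert usage from routing counts. The function name and the choice of entropy as the statistic are assumptions made for illustration; the text above does not say which gate statistics the study records.

```python
import math


def usage_entropy(expert_counts):
    """Normalized entropy of expert usage (hypothetical helper).

    Close to 1.0 when tokens are spread evenly over experts,
    close to 0.0 when a single expert receives every token.
    """
    total = sum(expert_counts)
    if total == 0 or len(expert_counts) < 2:
        return 0.0
    probs = [count / total for count in expert_counts if count > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(expert_counts))


balanced = usage_entropy([250, 250, 250, 250])   # even routing
collapsed = usage_entropy([1000, 0, 0, 0])       # one expert takes all tokens
print(balanced, collapsed)
```

A value that drifts toward 0 during training would suggest routing collapse, while a value near 1 says nothing by itself about whether the experts have taken on different roles, which is why the table lists it alongside expert-CKA and column-ablation.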

Capital game → design game.
Once the question of which structural choice brings which capability can be answered, results are decided by the researcher's design rather than by tens-of-billions-scale GPU budgets. EulerStack is the research framework in which that hypothesis is being tested in-house right now.

※ Research is in progress; concrete numbers will be published later in formal reports. The v0.1.5 spec extensions W-AS-1 through W-AS-3 (μP scaling · differentiation_objectives · tissue blocks) make the hypotheses above expressible and reproducible as standard spec fields. All are default OFF and compatible with v0.1.4.

Core Features

Layer Templates & Schedule

Define named layer templates (mixer + FFN + norm + residual) and use schedules to specify arrangement order and repetition counts.

Mixer Types	Attention, Mamba, RetNet, Hyena
FFN Types	MLP, Gated MLP (SwiGLU), MoE (top-k routing)
Norm	RMSNorm, LayerNorm (pre/post position)
Residual	Sequential, Parallel, Hyper-Connection (mHC)
Head	causal_lm, causal_lm_mtp (Multi-Token Prediction)
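
To make the template-plus-schedule idea concrete, here is a minimal sketch of how a schedule of {template, repeat} entries could be flattened into one entry per layer. It is not EulerStack's implementation: expand_schedule is a hypothetical helper, the templates are reduced to plain strings, and treating a missing repeat as 1 is an assumption.

```python
def expand_schedule(layer_templates, layer_schedule):
    """Flatten {template, repeat} entries into a list with one dict per layer."""
    layers = []
    for entry in layer_schedule:
        name = entry["template"]
        if name not in layer_templates:
            raise KeyError(f"unknown template: {name}")
        repeat = entry.get("repeat", 1)  # assumption: repeat defaults to 1
        if repeat < 1:
            raise ValueError(f"repeat must be positive, got {repeat}")
        for _ in range(repeat):
            layers.append({"index": len(layers), "template": name,
                           **layer_templates[name]})
    return layers


# Illustrative templates; the real spec nests mixer/ffn as mappings.
templates = {
    "ssm": {"mixer": "mamba", "ffn": "gated_mlp"},
    "attn": {"mixer": "attention", "ffn": "gated_mlp"},
}
schedule = [{"template": "ssm", "repeat": 3}, {"template": "attn", "repeat": 1}]
layers = expand_schedule(templates, schedule)
print([layer["mixer"] for layer in layers])
```

Repeating this two-entry schedule would give a Mamba-to-attention ratio of 3:1, which is the shape the jamba-style presets describe.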

Validation & Realism

A 3-stage process — schema structure → cross-field compatibility → heuristic realism checks — catches design errors before compilation. Every error is printed in the 3-line format (Category: what / Fix: / See:).

Structure	Unknown keys, type/enum, required fields, positive constraints
Compatibility	Mixer↔state mismatches (e.g., mamba + kv_cache forbidden)
Realism	head_dim range (32–256), target_params mismatch (>30%), MoE expert ratio, seq_len/d_model ratio, family_hint consistency, vocab/tokenizer consistency, tie_weight consistency, rope_scaling bounds
Error Categories	`ValidationError`, `CompatibilityError`, `CompileError`, `NormalizationError`
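
Two of the realism checks in the table can be sketched in a few lines. The thresholds (head_dim 32–256, a 30% target_params tolerance) come from the table; everything else is an assumption for illustration, including the function names, the exact wording of the three error lines, and the choice of ValidationError as the category for a realism finding.

```python
def format_error(category, what, fix, see):
    """Render an error in a 3-line layout: 'Category: what' / 'Fix:' / 'See:'."""
    return f"{category}: {what}\nFix: {fix}\nSee: {see}"


def check_head_dim(d_model, n_heads, low=32, high=256):
    """Return None when head_dim falls in [low, high], else a 3-line error."""
    head_dim = d_model / n_heads
    if low <= head_dim <= high:
        return None
    return format_error(
        "ValidationError",  # hypothetical category choice for a realism check
        f"head_dim={head_dim:g} is outside the range {low}-{high}",
        "adjust d_model or n_heads",
        "03_spec_reference",
    )


def params_mismatch(estimated, target, tolerance=0.30):
    """True when the estimate deviates from target_params by more than 30%."""
    return abs(estimated - target) / target > tolerance


print(check_head_dim(2048, 16))    # head_dim 128, inside the range
print(check_head_dim(2048, 128))   # head_dim 16, reported as an error
```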

Start with a single YAML

A declarative spec of ~10 lines fully describes the shape of a model.

schema_version: 1
model: { name: "my-llm", d_model: 2048, vocab_size: 32000, max_seq_len: 4096, n_heads: 16 }
tokenizer_contract: { type: hf, pretrained: gpt2 }
embedding: { type: learned, positional: rope }
layer_templates:
  decoder:
    mixer: { type: attention, attention: {} }
    ffn:   { type: gated_mlp, activation: swiglu }
layer_schedule:
  - { template: decoder, repeat: 24 }
head: { type: causal_lm }
            

v0.1.5 spec extensions (optional, default OFF) — μP scaling, differentiation auxiliary objectives, and the organ (tissue) declaration:

# Add the following to the spec above (existing YAML works unchanged)
training_hints:
  scaling: { parametrization: mup, base_width: 256 }   # μP (W-AS-1)
  differentiation_objectives: { usage_probe_coef: 0.01 } # differentiation aux objective (W-AS-2)
tissue:                                                  # organ/column declaration (W-AS-3)
  columns:
    - { name: global_integration, templates: [decoder], role: global_binding }
  connectivity: ring
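
# What parametrization: mup with base_width: 256 could imply for the d_model: 2048 spec above, sketched in Python. This follows one common μP convention for Adam-style optimizers (hidden-matrix learning rate and output logits both scaled by 1 / width multiplier). How EulerStack and EulerForge actually consume these hints is not specified here, and mup_multipliers is a hypothetical helper.

```python
def mup_multipliers(d_model, base_width):
    """Width multiplier and the scale factors derived from it (assumed convention)."""
    width_mult = d_model / base_width
    return {
        "width_mult": width_mult,
        # Adam learning rate for hidden weight matrices shrinks with width.
        "hidden_lr_scale": 1.0 / width_mult,
        # Output logits are scaled down so they stay O(1) as width grows.
        "output_logit_scale": 1.0 / width_mult,
    }


m = mup_multipliers(d_model=2048, base_width=256)
print(m)
```

# Under this convention, hyperparameters tuned on the base_width model are meant to transfer to the wider model without a new sweep.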
            

Presets: 57 on a 3-tier learning path

Presets follow the v1 "industrial ordering principle": validated industrial → recent hybrid/MoE → v1 experimental primitives. The order doubles as the recommended learning path: start with what industry has already proven, move through recent hybrid/MoE research, then explore v1's new primitives (MLA / MoD / Titans / Dual-Stream) at arch-scale. The count is 24 llm_ + 33 arch_ = 57, where the 33 arch_ presets split into beginner 2 · intermediate 3 · advanced 5 · expert 23 (9 of the expert presets are *_mini). Presets are only starting points: edit d_model, n_heads, and layer count to assemble a model at any scale.


Tier 1 — Validated industrial

Production-grade baselines. Training recipes are well-studied; failure modes are known. Works stably from 0.1B (Stage-1 / CPT warm-up) up to 16B.

Preset	~Params	One-liner	Research basis
`arch_beginner_gpt2`	~1.1B	Classic Transformer (MHA + LayerNorm post + GeLU)	Vaswani 2017, GPT-2
`arch_beginner_llama`	~1.1B	Modern baseline (GQA + RMSNorm pre + SwiGLU)	Llama 2/3
`arch_intermediate_mistral`	~1.3B	1 global : 3 sliding attention	Mistral 7B
`arch_intermediate_gemma2`	~1.3B	1:1 alternating global/local	Gemma 2
`arch_intermediate_qwen_longctx`	~1.3B	RoPE scaling factor 4, 32K ctx	Qwen 2/3
`llm_0p1b_{simple,mistral}`	~100M	Stage-1 / CPT warm-up	Sovereign-foundation pilot
`llm_*_simple` (0.8B–16B)	0.8B–16B	Pure attention (Llama)	—
`llm_*_mistral` (0.8B–16B)	0.8B–16B	Attention + sliding window	Mistral 7B

Tier 2 — Recent hybrid / MoE / long-context / KV-compressed

Modern compositions, most of which mirror architectures already used in production models; four expert entries are speculative compositions not yet in the literature. Sized for 24 GB GPU training. The expert level treats MoE strategy × mixer × depth/receptive field as a 3D design space.

Level	Preset	~Params	One-liner	Research basis
advanced	`arch_advanced_jamba`	~1.2B	Mamba + Attention 3:1 hybrid	Jamba-1.5 (AI21 2024)
advanced	`arch_advanced_samba`	~1.0B	Mamba + Sliding attention 1:1	Samba (Microsoft 2024)
advanced	`arch_advanced_retnet`	~1.3B	Pure RetNet (attention-free)	Sun 2023
advanced (v1 B2.1)	`arch_advanced_mla`	~1.1B	MLA — KV compressed via latent_dim	DeepSeek-V3 (2024)
advanced (v1 B3.1)	`arch_advanced_mod`	~1.1B	Mixture-of-Depths (token-level layer skip)	Raposo ICML 2024
expert	`arch_expert_research`	~1.5B	4 mixers + MoE 3-phase	Research-grade
expert	`arch_expert_mixtral_moe`	~1.9B	Pure attn + every-layer MoE (8 × top-2)	Mixtral 8x7B (2024)
expert	`arch_expert_striped_hyena`	~1.0B	Hyena + Attention 4:1, 128K	StripedHyena
expert	`arch_expert_blackmamba_moe`	~1.5B	Mamba + MoE (MoE on non-attn mixer)	BlackMamba, MoE-Mamba
expert	`arch_expert_deepseek_moe`	~2.0B	Fine-grained MoE (32 × top-3)	DeepSeek-V2/V3 (2024)
expert NEW	`arch_expert_dsv4_v3fallback`	~2.0B	DeepSeek-V4 schema (V3 fallback path)	DeepSeek-V3/V4
expert (speculative)	`arch_expert_retnet_moe`	~1.5B	RetNet + MoE (no paper)	Sun 2023 + MoE extrapolation
expert (speculative)	`arch_expert_frontier_full_moe`	~2.0B	Attention-free, multi-mixer + all-MoE (most speculative)	Composition prediction
expert (speculative)	`arch_expert_progressive_stack`	~1.5B	Depth-wise hyena→mamba→retnet→attn+MoE (no paper)	Hierarchical prediction
expert (speculative)	`arch_expert_dilated_longnet`	~2.0B	Temporal pyramid: mamba+sw(1K→4K→16K)+global+MoE (no paper)	Longnet + Jamba extrapolation
expert (capstone)	`arch_expert_kitchen_sink`	—	Combines every available primitive in one spec for max-surface validation	Aggregate validation

Tier 3 — v1 experimental primitives (Phase B at arch-scale)

Each preset showcases one Phase B primitive at arch-scale (~1.2–1.4B). Schema-complete; runtime is partial — the compiler falls back to a standard block for un-implemented mixers, but the full spec metadata round-trips via config.v1_extensions. The experience is "declare a Phase B primitive in YAML, compile, save as an HF custom model".

Preset	~Params	One-liner	Research basis
`arch_expert_reasoning_r1`	~1.3B	2-phase reasoning (think / answer)	DeepSeek-R1 (2025), Quiet-STaR
`arch_expert_titans_memory`	~1.2B	Parametric memory + test-time update	Titans (Google 2024–2025)
`arch_expert_dual_stream`	~1.4B	Monoidal parallel (Mamba ∥ Attention)	Jamba × PaLM generalization

`arch_expert_*_mini` — Small-scale speculative experts (9, ~80M–360M)

Mini variants of the speculative expert architectures. Same design ideas, dramatically smaller (d_model 384–512, ~12 layers) so that a full training ablation fits on a single consumer GPU. Intended for quickly iterating on architectural hypotheses before committing to a 2B training run. arch_expert_progressive_stack_mini is the recommended starting point.

Preset	~Total	~Active	Mirror of	Pedagogical role
`arch_expert_progressive_stack_mini`	~86M	~86M	`arch_expert_progressive_stack`	RECOMMENDED first experiment
`arch_expert_blackmamba_moe_mini`	~156M	~90M	`arch_expert_blackmamba_moe`	Partial-sparse MoE on SSM
`arch_expert_mixtral_moe_mini`	~175M	~90M	`arch_expert_mixtral_moe`	Classic every-layer MoE baseline
`arch_expert_dilated_longnet_mini`	~83M	~75M	`arch_expert_dilated_longnet`	Long-context temporal pyramid
`arch_expert_deepseek_moe_mini`	~357M	~60M	`arch_expert_deepseek_moe`	⚠ Observe fine-grained MoE failure
`arch_expert_frontier_full_moe_mini`	~106M	~60M	`arch_expert_frontier_full_moe`	⚠ Most experimental; expected to fail
`arch_expert_dsv4_flash_mini` NEW	~180M	~70M	DeepSeek-V4	DSv4 + Flash/NSA compressed attention
`arch_expert_dsv4_subset_mini` NEW	~180M	~70M	DeepSeek-V4	DSv4 feature subset
`arch_expert_mhc_moe_mini` NEW	~150M	~70M	mHC + MoE	multi-Hyper-Connection residual + MoE
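
The gap between the ~Total and ~Active columns comes from top-k routing: every expert counts toward the total, but only the selected experts run for a given token. A minimal sketch of that arithmetic follows. The numbers are illustrative and do not correspond to any preset, the helper name is hypothetical, and router parameters are ignored.

```python
def moe_param_counts(shared_params, expert_params, n_experts, top_k):
    """Return (total, active) parameter counts for a top-k routed MoE model.

    shared_params: parameters every token uses (embeddings, mixers, norms, dense FFNs)
    expert_params: parameters of one expert, summed over all MoE layers
    """
    total = shared_params + n_experts * expert_params
    active = shared_params + top_k * expert_params
    return total, active


# Illustrative values only: 50M shared, 10M per expert, 8 experts, top-2.
total, active = moe_param_counts(50_000_000, 10_000_000, n_experts=8, top_k=2)
print(f"total={total / 1e6:.0f}M active={active / 1e6:.0f}M")
```

A dense model such as arch_expert_progressive_stack_mini has no routed experts, which is why its total and active counts in the table are the same.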

`llm_` — Size × Architectural Variant (24)

5 sizes (0.1B / 0.8B / 2B / 4B / 16B) × 5 variants (simple / mistral / jamba / moe / mla), with moe omitted at 0.1B: 5 × 5 − 1 = 24 presets.

Scale	simple	mistral	jamba	moe	mla
0.1B	`llm_0p1b_simple`	`llm_0p1b_mistral`	`llm_0p1b_jamba`	—	`llm_0p1b_mla`
0.8B	`llm_0p8b_simple`	`llm_0p8b_mistral`	`llm_0p8b_jamba`	`llm_0p8b_moe`	`llm_0p8b_mla`
2B	`llm_2b_simple`	`llm_2b_mistral`	`llm_2b_jamba`	`llm_2b_moe`	`llm_2b_mla`
4B	`llm_4b_simple`	`llm_4b_mistral`	`llm_4b_jamba`	`llm_4b_moe`	`llm_4b_mla`
16B	`llm_16b_simple`	`llm_16b_mistral`	`llm_16b_jamba`	`llm_16b_moe`	`llm_16b_mla`
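
The naming grid in the table can be reproduced mechanically, which is handy for scripting sweeps over presets. The sketch below only builds the names shown above; it does not call EulerStack.

```python
SIZES = ["0p1b", "0p8b", "2b", "4b", "16b"]
VARIANTS = ["simple", "mistral", "jamba", "moe", "mla"]
OMITTED = {("0p1b", "moe")}  # the only empty cell in the table

llm_presets = [
    f"llm_{size}_{variant}"
    for size in SIZES
    for variant in VARIANTS
    if (size, variant) not in OMITTED
]

print(len(llm_presets))
print(llm_presets[:4])
```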

Variant semantics: simple = pure attention (Llama) · mistral = attention + sliding window (1 global : 3 sliding per 4 layers) · jamba = Mamba + Attention hybrid (3:1) · moe = attention + MoE FFN (1-in-4 layers, 8 experts, top-2) · mla = Multi-head Latent Attention (DeepSeek-V3 style KV compression).
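
The variant semantics above translate into simple per-layer patterns. The following sketch generates them under stated assumptions: the ratios come from the text, but which position inside each 4-layer group holds the global attention, the attention layer, or the MoE FFN is a guess, the non-MoE FFN is assumed to be a gated MLP, and the string labels are illustrative rather than spec enum values.

```python
def layer_pattern(variant, n_layers):
    """Return a list of (mixer, ffn) labels, one per layer, for an llm_ variant."""
    layers = []
    for i in range(n_layers):
        pos = i % 4  # position inside the 4-layer group
        mixer, ffn = "attention", "gated_mlp"
        if variant == "mistral":
            # 1 global : 3 sliding per 4 layers (global placed first by assumption)
            mixer = "global_attention" if pos == 0 else "sliding_attention"
        elif variant == "jamba":
            # Mamba : attention = 3 : 1 (attention placed last by assumption)
            mixer = "attention" if pos == 3 else "mamba"
        elif variant == "moe":
            # MoE FFN on 1 in 4 layers (placed last by assumption)
            ffn = "moe" if pos == 3 else "gated_mlp"
        elif variant == "mla":
            mixer = "mla"
        elif variant != "simple":
            raise ValueError(f"unknown variant: {variant}")
        layers.append((mixer, ffn))
    return layers


for variant in ("simple", "mistral", "jamba", "moe", "mla"):
    print(variant, layer_pattern(variant, 4))
```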

No upper limit — presets are starting points. EulerStack can assemble a model of any size by editing d_model, n_heads, and layer count.

CLI Reference

Follows the eulerwa CLI family convention. All errors are printed in the 3-line format (Category: what / Fix: / See:).

Top-Level Commands

`validate`	Validate a YAML spec (`--report` includes realism checks)
`explain`	Human-readable model summary (layers, parameter estimate)
`compile`	IR → JSON runtime config (`--output`) or HF model directory (`--output-dir`)
`schema`	Print YAML schema structure
`presets list` / `show`	Enumerate presets or show details for one

Common Options

`--lang`	Output language (ko/en/zh/ja/es). Root option; default ko
`--preset`	YAML spec file path
`--validate-only`	Validate and exit without further work
`--output` / `-o`	JSON runtime config output path
`--output-dir`	HF model directory output (config.json + model.safetensors)
`--print-config` / `--dry-run`	Print resolved config to stdout

5-language i18n CLI

Every CLI help page, log message, warning, and error is translated into ko / en / zh / ja / es. Default language is Korean (ko); switch via the --lang root option or the EULERSTACK_LANG environment variable. Command names, option names, and the Fix: / See: labels in the 3-line error format stay untranslated for script compatibility.

# Korean (default)
eulerstack validate --preset my_model.yml

# English
eulerstack --lang en validate --preset my_model.yml

# The environment variable also works (Japanese here)
EULERSTACK_LANG=ja eulerstack validate --preset my_model.yml
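
For scripts that wrap the CLI, the language selection can be mirrored with a small helper. This is a sketch, not EulerStack code: resolve_lang is hypothetical, the rule that --lang takes precedence over EULERSTACK_LANG is an assumption (the text names both mechanisms but not their order), and so is rejecting unsupported codes.

```python
import os

SUPPORTED = ("ko", "en", "zh", "ja", "es")


def resolve_lang(cli_lang=None, env=None):
    """Pick the output language: flag first (assumed), then env var, then ko."""
    env = os.environ if env is None else env
    lang = cli_lang or env.get("EULERSTACK_LANG") or "ko"
    if lang not in SUPPORTED:
        raise ValueError(f"unsupported language: {lang}")
    return lang


print(resolve_lang("en", {}))
print(resolve_lang(None, {"EULERSTACK_LANG": "ja"}))
print(resolve_lang(None, {}))
```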

HF model directory → EulerForge training

compile --output-dir writes a HuggingFace-compatible directory (config.json + model.safetensors) — the primary handoff path into the EulerForge training pipeline.

eulerstack compile --preset my_model.yml --output-dir ./my_model

# Load it from Python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("./my_model", trust_remote_code=True)
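
Before handing a directory to training, it can help to confirm that both files named above are present. The check below is a minimal sketch using only the standard library; missing_files is a hypothetical helper, and the temporary directory stands in for a real compile output.

```python
import tempfile
from pathlib import Path

REQUIRED = ("config.json", "model.safetensors")


def missing_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]


# Demo on a throwaway directory that only contains config.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    missing = missing_files(tmp)

print(missing)
```

An empty list means the directory has the layout described above. It does not validate the contents of either file.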

5-Layer Architecture

From YAML spec to a trainable model — 5 layers with strict separation of concerns.

Layer 1: DSL	User-authored YAML spec (schema_version 1, declarative model definition)
Layer 2: Schema	Structural validation — unknown keys, type/enum, required fields, cross-field compatibility
Layer 3: IR	Normalized canonical structure (default fills, template expansion)
Layer 4: Compiler	IR → JSON runtime config or HF model directory (config.json + model.safetensors) — loadable via `AutoModelForCausalLM.from_pretrained()` for EulerForge training
Layer 5: CLI	`validate` / `explain` / `compile` / `schema` / `presets` — all messages i18n-translated across 5 languages
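
The "default fills" step of Layer 3 can be pictured as a recursive merge that adds missing keys without overriding anything the user wrote. The sketch below shows that idea only. The default values and the position key are hypothetical, and the real normalizer may work differently.

```python
import copy


def fill_defaults(spec, defaults):
    """Return a copy of spec with keys from defaults added where spec is silent."""
    out = copy.deepcopy(spec)
    for key, value in defaults.items():
        if key not in out:
            out[key] = copy.deepcopy(value)
        elif isinstance(value, dict) and isinstance(out[key], dict):
            out[key] = fill_defaults(out[key], value)
    return out


# Hypothetical defaults for a layer template.
DEFAULTS = {
    "norm": {"type": "rmsnorm", "position": "pre"},
    "residual": {"type": "sequential"},
}
template = {"mixer": {"type": "attention"}, "norm": {"type": "layernorm"}}
normalized = fill_defaults(template, DEFAULTS)
print(normalized)
```

The user's norm type survives, the missing position is filled in, and the input mapping is left untouched, so the same spec always normalizes to the same canonical form.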

Tutorials

Tutorials are maintained in Korean (ko) and English (en) and are available directly on this site under /en/products/eulerstack/tutorials/ (upstream path: docs/tutorials/{ko,en}/).

Core Tutorials (11)

`00_positioning`	Read first — where EulerStack fits: an Architecture Description Language (ADL) for LLMs
`01_validate_a_spec`	Validate a YAML spec
`02_use_presets`	Use presets
`03_spec_reference`	Spec reference
`04_compile_and_explain`	Compile & explain
`05_prepare_data`	Prepare training data
`06_sanity_train`	Sanity training loop
`07_arch_walkthrough`	Skill-level architecture walkthrough (tour of the `arch_` presets)
`08_expert_mini_walkthrough`	Expert mini preset walkthrough (single-GPU ablation)
`09_new_primitives_walkthrough`	NEW — v1 Phase B primitives (MLA / Titans / MoD / Dual-Stream / Neural-ODE / TTT)
`10_paper_to_yaml`	NEW — paper → YAML case studies (DeepSeek-V3 / Jamba / DeepSeek-R1 / Titans)

Mixer deep dives (`mixers/`, 5)

`00_overview`	Mixers concept — why mix attention / mamba / retnet / hyena
`01_attention`	Attention in depth
`02_mamba`	Mamba in depth
`03_retnet`	RetNet in depth
`04_hyena`	Hyena in depth

Install & Quickstart

Install

pip install -e .

# or include dev dependencies
pip install -e ".[dev]"

Quickstart

# List presets (Korean default)
eulerstack presets list

# Validate with realism report
eulerstack validate --preset my_model.yml --report

# Build an HF model directory → hand off to EulerForge training
eulerstack compile --preset my_model.yml --output-dir ./my_model

# Switch CLI messages to English
eulerstack --lang en validate --preset my_model.yml

Design LLM Architectures with EulerStack

Combine Attention, Mamba, RetNet, Hyena, and MoE into hybrid models with a single YAML spec — then hand off the HuggingFace model directory to EulerForge for training.
