2. Use Presets
CLI messages are translated into ko / en / zh / ja / es. Use
`eulerstack --lang en ...` (or `EULERSTACK_LANG=en`) for English. The default is Korean.
This tutorial describes how to explore and use EulerStack v1's 52 presets. Learning these first is the most efficient path before attempting to design a new architecture from scratch.
Learning path — the v1 3-tier approach
Presets are organized as "validated first, experimental last". The same ordering works for production adoption, research exploration, and verification of new v1 primitives.
Tier 1 — Validated industrial (learn first)
Llama 2/3, Mistral, Gemma 2, Qwen longctx families. Conservative baselines
with well-known failure modes and stable training recipes. Scale variants
llm_0p1b_{simple,mistral} + llm_*_{simple,mistral} (0.8B–16B) plus
arch_beginner_* / arch_intermediate_*.
Tier 2 — Recent / complex (next)
Jamba hybrids, Mixtral / DeepSeek MoE, Samba, RetNet, MLA (DeepSeek-V3 2024),
MoD (Raposo ICML 2024), expert compositions (4 speculative).
arch_advanced_{jamba,samba,retnet,mla,mod}, arch_expert_*,
arch_expert_*_mini, llm_*_{jamba,moe,mla}.
Tier 3 — v1 experimental (last — four Phase-B primitive demos at arch-scale)
arch_expert_reasoning_r1 — 2-phase reasoning (DeepSeek-R1 2025)
arch_expert_titans_memory — Titans parametric memory (Google 2024-2025)
arch_expert_dual_stream — monoidal parallel (Jamba × PaLM generalisation)
arch_expert_kitchen_sink — **every v1 primitive in ONE spec (capstone)**:
MLA + Titans + MoE + branched + TTT + ODE-RK4 + parallel + discrete integrator
+ MoD + per-layer override + let + reserved namespaces + execution_modes.
Proves end-to-end that validate → compile → save/load → HF training survives
combining everything at once. ~5.5M params, ~20 layers.
v1.1 runtime-maturity sprint status: **MLA / MoD / Titans memory runtimes
are Core** (Titans was promoted from plugin-track to core in v1.1),
reasoning_r1 is metadata-Core, dual_stream is a Component. `ode_rk4` /
`ode_euler` graduated from reserved to core, and TTT ships as the
reference plugin `eulerstack.plugins.ttt` providing a real meta-learning
loop. Full matrix: docs/architectures/runtime_primitive_status.md.
0.1B class (new): for sovereign-foundation Stage-1 / CPT warm-up, ~100M parameters. MoE is skipped at this scale because 8 experts × 0.1B gives ~12M active params — not useful for routing learning. 4 variants shipped: simple / mistral / jamba / mla.
Reading in this order lands naturally: validated baselines → recent research recipes in production → new primitives only v1 can express.
Why Presets
Designing an LLM architecture from scratch means balancing more than 20 interdependent parameters simultaneously, including:
- Basic dimensions: `d_model`, `n_heads`, `n_kv_heads`, `mlp_ratio`, `n_layers`
- Mixer choice: attention / mamba / retnet / hyena
- FFN choice: mlp / gated_mlp / moe, plus MoE routing details
- Positional encoding: rope / alibi / learned, with rope_theta and scaling
- Normalization: rmsnorm / layernorm, pre / post placement
- KV cache / SSM state wiring
Getting all of these to agree is hard for beginners and still error-prone for experienced users. EulerStack's presets come pre-balanced: they have already been validated, compiled, and parameter-estimated. The most efficient starting point for most users is to pick a preset and tweak small pieces rather than build from empty YAML.
Three Axes of Presets
Presets are organized on three axes.
- `arch_` — skill-level walkthrough (20 presets, ~1–2B). A guided tour from beginner to expert that reproduces the evolution of modern LLM architectures. Each level includes several competing approaches so you can compare the effect of architectural choices under a fixed parameter budget. v1 adds 2 at the advanced level (MLA, MoD) and 3 at the expert level (Reasoning R1, Titans memory, Dual-stream).
- `llm_` — production (24 presets, 0.1B–16B). 5 sizes (0.1B / 0.8B / 2B / 4B / 16B) × up to 5 variants (simple / mistral / jamba / moe / mla). 0.1B is for Stage-1 / CPT warm-up; MoE is skipped at that scale. These are the presets you typically use as starters for real services or projects.
- `arch_expert_*_mini` — small-scale experts (6 presets, ~80M–360M). Expert-level architectures shrunk down so that ablation experiments fit on a single consumer GPU. Same design ideas, smaller scale.
Each axis is covered in turn below.
1. arch_ Presets (Skill-Level Walkthrough, 20 presets)
Each level deliberately includes competing approaches so you can compare alternatives side by side. To understand why one choice is better than another, you need both options in front of you. A full step-by-step tour is in 07_arch_walkthrough.md, which is the recommended read for this part.
| Level | Preset | One-liner | Research basis |
|---|---|---|---|
| beginner | arch_beginner_gpt2 | Classic Transformer (MHA + LayerNorm post + GeLU) | Vaswani 2017, GPT-2 |
| beginner | arch_beginner_llama | Modern baseline (GQA + RMSNorm pre + SwiGLU) | Llama 2/3 |
| intermediate | arch_intermediate_mistral | 1 global : 3 sliding attention | Mistral 7B |
| intermediate | arch_intermediate_gemma2 | 1:1 alternating global/local | Gemma 2 |
| intermediate | arch_intermediate_qwen_longctx | RoPE scaling (factor 4, 32K) | Qwen 2/3 |
| advanced | arch_advanced_jamba | Mamba + Attention 3:1 hybrid | Jamba-1.5 (AI21) |
| advanced | arch_advanced_samba | Mamba + Sliding attention 1:1 | Samba (Microsoft) |
| advanced | arch_advanced_retnet | Pure RetNet (attention-free) | Sun et al. 2023 |
| expert | arch_expert_research | 4 mixers + MoE 3-phase | Research-grade |
| expert | arch_expert_mixtral_moe | Pure attn + every-layer MoE (8 × top-2) | Mixtral 8x7B |
| expert | arch_expert_striped_hyena | Hyena + Attention 4:1, 128K context | StripedHyena |
| expert | arch_expert_blackmamba_moe | Mamba + MoE (MoE on non-attn mixer) | BlackMamba, MoE-Mamba |
| expert | arch_expert_deepseek_moe | Fine-grained MoE (32 × top-3) | DeepSeek-V2/V3 |
| expert | arch_expert_retnet_moe | RetNet + MoE (predicted, no paper) | Sun 2023 + MoE extrapolation |
| expert | arch_expert_frontier_full_moe | Attention-free, multi-mixer + all-MoE (most speculative) | Composition prediction |
| expert | arch_expert_progressive_stack | Depth-wise: hyena→mamba→retnet→attn+MoE (no paper) | Hierarchical prediction |
| expert | arch_expert_dilated_longnet | Temporal pyramid: mamba+sw(1K→4K→16K)+global+MoE (no paper) | Longnet + Jamba extrapolation |
Most target_params values are in the 1–2B range, allowing direct comparison of
which architecture delivers what advantage at the same budget.
Core Ideas in the Level Progression
From beginner to intermediate — introducing sliding windows.
arch_beginner_llama uses O(N²) global attention at every layer, meaning cost
grows quadratically with sequence length. Intermediate-level
arch_intermediate_mistral keeps only 1-in-4 layers as global attention and
restricts the remaining 3 to 4,096-token sliding windows. This shrinks the total
attention KV cache by roughly 4× and makes long-context processing dramatically
cheaper.
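The cache saving is easy to sketch with a back-of-envelope calculation (this is my own illustration, not EulerStack's estimator; the 32-layer depth is an assumed value):

```python
def kv_cache_entries(n_layers_global, n_layers_sliding, seq_len, window=4096):
    """Relative KV cache size: global layers cache the full sequence,
    sliding-window layers cache at most `window` tokens."""
    return (n_layers_global * seq_len
            + n_layers_sliding * min(seq_len, window))

seq = 131_072  # a long context makes the asymptotics visible
full = kv_cache_entries(32, 0, seq)    # all-global baseline (Llama-style)
hybrid = kv_cache_entries(8, 24, seq)  # 1-in-4 global, the rest sliding
print(f"reduction: {full / hybrid:.1f}x")  # → reduction: 3.7x
```

As the sequence grows, the sliding layers' contribution stays constant, so the reduction approaches the 4× ratio of total layers to global layers.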
From intermediate to advanced — replacing attention with Mamba.
arch_advanced_jamba replaces 75% of layers with Mamba2 SSM. Mamba is O(N),
so the quadratic bottleneck disappears. The remaining 25% attention acts as a
"retrieval anchor" that preserves in-context learning. As a result, 32K-token
long-context becomes practical, and inference throughput goes up substantially.
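The asymptotic gap can be sketched with rough per-layer token-mixing costs (illustrative constants only; the SSM state size of 128 is an assumption, not Jamba's actual configuration):

```python
def attn_mixing_cost(seq_len, d_model):
    # global attention: every token attends to every token -> O(N^2 * d)
    return seq_len * seq_len * d_model

def ssm_mixing_cost(seq_len, d_model, state_size=128):
    # SSM scan: fixed-size recurrent state per channel -> O(N * d * state)
    return seq_len * d_model * state_size

seq, d = 32_768, 2048
ratio = attn_mixing_cost(seq, d) / ssm_mixing_cost(seq, d)
print(f"attention / mamba mixing cost at 32K: {ratio:.0f}x")  # → 256x
```

The ratio is simply seq_len / state_size, which is why the advantage keeps growing with context length.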
From advanced to expert — combining multiple mixers with MoE.
The flagship expert preset arch_expert_research mixes all 4 EulerStack mixer
types in one stack and adds MoE FFN to attention layers. The design splits the
work into three phases by depth:
- Phase 1 (layers 1-8): mamba + hyena — efficient bulk processing
- Phase 2 (layers 9-24): mamba + retnet + attention/MoE — deep reasoning core
- Phase 3 (layers 25-32): attention/MoE + retnet — output refinement and recall anchors
This kind of combination is aimed less at direct production deployment and more at cheaply testing the question "which mixer + MoE combination works best?"
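In spec form, a depth-phased schedule of this shape might be expressed roughly as follows. The field names below are purely illustrative, not the actual EulerStack preset schema; run `eulerstack presets show arch_expert_research` to see the real spec.

```yaml
# Hypothetical sketch of a 3-phase, 32-layer schedule.
# Field names are illustrative, not the actual preset format.
layers:
  - repeat: 8             # Phase 1: efficient bulk processing
    mixers: [mamba, hyena]
  - repeat: 16            # Phase 2: deep reasoning core
    mixers: [mamba, retnet, attention]
    ffn: moe              # MoE FFN on the attention layers
  - repeat: 8             # Phase 3: refinement and recall anchors
    mixers: [attention, retnet]
    ffn: moe
```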
Per-mixer deep dives live under mixers/.
2. llm_ Presets (Production Deployment, 24 presets)
5 sizes (0.1B / 0.8B / 2B / 4B / 16B) × up to 5 variants (simple / mistral / jamba / moe / mla) = 24 presets (MoE is skipped at 0.1B — not meaningful at that scale). What each variant is for:
| Variant | Description | When to use |
|---|---|---|
| simple | Pure attention (Llama-style) | Most universal, best HF tooling compatibility |
| mistral | Attention + sliding window | KV cache savings, medium context (8–16K) |
| jamba | Mamba + Attention hybrid | Long-context priority (32K+), inference speed |
| moe | Attention + MoE FFN (1-in-4) | Maximum model capacity for a given active compute |
Size / variant matrix:
| Scale | simple | mistral | jamba | moe |
|---|---|---|---|---|
| 0.8B | llm_0p8b_simple | llm_0p8b_mistral | llm_0p8b_jamba | llm_0p8b_moe |
| 2B | llm_2b_simple | llm_2b_mistral | llm_2b_jamba | llm_2b_moe |
| 4B | llm_4b_simple | llm_4b_mistral | llm_4b_jamba | llm_4b_moe |
| 16B | llm_16b_simple | llm_16b_mistral | llm_16b_jamba | llm_16b_moe |
Why MoE variants have a smaller d_model
The d_model of MoE variants is intentionally about 65% of the matching
simple variant. MoE layers carry multiple expert FFNs, so keeping the same
d_model would inflate the total parameter count. Reducing d_model
keeps total parameters aligned with the named size (0.8B, 2B, etc.). As a
result, active parameters (the ones each token actually uses) end up at about
1/4 of total parameters.
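A back-of-envelope illustration of the d_model scaling (these are my own simplified per-layer estimates, not EulerStack's parameter estimator; exact figures depend on expert count, MoE placement, and embeddings):

```python
def dense_layer_params(d):
    # rough per-layer count: attention ~4*d^2 + gated FFN ~8*d^2
    return 12 * d * d

def moe_layer_params(d, n_experts=8):
    # attention ~4*d^2 plus one ~8*d^2 FFN per expert
    return (4 + 8 * n_experts) * d * d

d_simple = 2048
d_moe = int(d_simple * 0.65)  # ~65% of the matching simple variant

# 32 layers; MoE on every 4th layer (1-in-4), dense FFN elsewhere
total_simple = 32 * dense_layer_params(d_simple)
total_moe = 24 * dense_layer_params(d_moe) + 8 * moe_layer_params(d_moe)
print(f"simple ~ {total_simple/1e9:.2f}B, moe ~ {total_moe/1e9:.2f}B")
# → simple ~ 1.61B, moe ~ 1.47B
```

Keeping d_model at 2048 instead would push the MoE total well past 2B under the same assumptions; the 65% scaling is what brings it back in line with the named size.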
Size is not capped
Presets are only starting points. EulerStack can assemble models at 70B, 100B,
or beyond by editing d_model and n_layers. Copy an existing preset and
adjust freely.
3. arch_expert_*_mini Presets (Small-Scale Expert Experiments, 6 presets)
Expert-level presets are typically ~2B, which is heavy on a single consumer GPU.
arch_expert_*_mini keeps the same design ideas but shrinks them to ~12 layers
and d_model 384–512 — small enough to train on a 12–16 GB GPU.
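Whether a mini fits on such a card is quick to estimate with standard fp32 Adam accounting (a rough sketch: weights + gradients + two optimizer moments is about 16 bytes per parameter; activation memory, which depends on batch size, is excluded):

```python
def adam_state_bytes(n_params, bytes_per_param=16):
    # fp32 weights (4 B) + grads (4 B) + Adam m (4 B) + Adam v (4 B)
    return n_params * bytes_per_param

for name, params in [("progressive_stack_mini", 86e6),
                     ("deepseek_moe_mini", 357e6)]:
    gb = adam_state_bytes(params) / 2**30
    print(f"{name}: ~{gb:.1f} GiB for weights + optimizer state")
```

Even the largest mini (~357M) needs only ~5.3 GiB for weights and optimizer state, leaving headroom for activations on a 12–16 GB GPU.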
| Preset | ~Total params | Key point |
|---|---|---|
| arch_expert_progressive_stack_mini | ~86M | Run this first. Zone separation is cleanest |
| arch_expert_blackmamba_moe_mini | ~156M | Partial sparse MoE — most plausible at mini scale |
| arch_expert_mixtral_moe_mini | ~175M | Classic MoE baseline |
| arch_expert_dilated_longnet_mini | ~83M | Long-context pyramid (512→2K→8K) |
| arch_expert_deepseek_moe_mini | ~357M | Observe fine-grained MoE failure |
| arch_expert_frontier_full_moe_mini | ~106M | Attention-free + all-MoE, most experimental |
08_expert_mini_walkthrough.md covers each mini's design rationale and trade-offs in detail.
Browsing and Using Presets
The first command to run after installation is the preset listing.
# Full list with parameter estimates
eulerstack presets list
# Details for a single preset
eulerstack presets show arch_advanced_jamba
eulerstack presets show llm_4b_moe
# Switch language
eulerstack --lang ko presets show llm_2b_jamba
presets list produces a tabular listing of names, estimated parameter counts,
and one-line summaries. presets show prints the layer schedule, mixer config,
and MoE config for a specific preset.
Validate → Compile → Export Pipeline
A typical end-to-end flow with a preset looks like this:
# 1) Validate (with realism report)
eulerstack validate --preset configs/presets/arch_advanced_jamba.yml --report
# 2) Inspect structure
eulerstack explain --preset configs/presets/arch_advanced_jamba.yml
# 3) Compile to JSON (inspection / debugging)
eulerstack compile --preset configs/presets/arch_advanced_jamba.yml --print-config
# 4) Export as HF model directory (training-ready)
eulerstack compile --preset configs/presets/arch_advanced_jamba.yml --output-dir ./my_jamba
Each step is covered in more depth in Tutorial 4: Compile and Explain.
Loading in Python
The exported HuggingFace model directory loads through the standard
AutoModelForCausalLM interface.
from transformers import AutoModelForCausalLM
from eulerstack.hf.auto_register import register_eulerstack_auto_classes
register_eulerstack_auto_classes()
# Same API as loading Llama / Mistral
model = AutoModelForCausalLM.from_pretrained("./my_jamba", trust_remote_code=True)
print(model.config.d_model, model.config.n_layers)
Runnable end-to-end scripts are in the examples/ directory.
- examples/01_compile_and_export.py
- examples/02_load_and_generate.py
- examples/03_architecture_evolution.py — compares all 4 arch_ preset levels
Next Steps
- Tutorial 4: Compile and Explain — explain / compile details
- Tutorial 6: Sanity Train — confirm the model actually learns
- 07_arch_walkthrough.md — 20 arch_ presets in depth (beginner 2 · intermediate 3 · advanced 5 · expert 10)
- 08_expert_mini_walkthrough.md — 6 expert mini presets in depth
- 09_new_primitives_walkthrough.md — step-by-step for v1 Phase-B primitives (MLA / R1 reasoning / Titans memory / MoD / dual-stream, …)
- Mixer overview — when each mixer is preferable