2. Use Presets
CLI messages are translated into ko / en / zh / ja / es. Use
`eulerstack --lang en ...` (or `EULERSTACK_LANG=en`) for English. The default is Korean.
This tutorial describes how to explore and use EulerStack v1's 52 presets. Learning these first is the most efficient path before attempting to design a new architecture from scratch.
Learning path — the v1 3-tier approach
Presets are organized as "validated first, experimental last". The same ordering works for production adoption, research exploration, and verification of new v1 primitives.
Tier 1 — Validated industrial (learn first)
Llama 2/3, Mistral, Gemma 2, Qwen longctx families. Conservative baselines
with well-known failure modes and stable training recipes. Scale variants
llm_0p1b_{simple,mistral} + llm_*_{simple,mistral} (0.8B–16B) plus
arch_beginner_* / arch_intermediate_*.
Tier 2 — Recent / complex (next)
Jamba hybrids, Mixtral / DeepSeek MoE, Samba, RetNet, MLA (DeepSeek-V3 2024),
MoD (Raposo ICML 2024), expert compositions (4 speculative).
arch_advanced_{jamba,samba,retnet,mla,mod}, arch_expert_*,
arch_expert_*_mini, llm_*_{jamba,moe,mla}.
Tier 3 — v1 experimental (last — four Phase-B primitive demos at arch-scale)
arch_expert_reasoning_r1 — 2-phase reasoning (DeepSeek-R1 2025)
arch_expert_titans_memory — Titans parametric memory (Google 2024-2025)
arch_expert_dual_stream — monoidal parallel (Jamba × PaLM generalisation)
arch_expert_kitchen_sink — **every v1 primitive in ONE spec (capstone)**:
MLA + Titans + MoE + branched + TTT + ODE-RK4 + parallel + discrete integrator
+ MoD + per-layer override + let + reserved namespaces + execution_modes.
Proves end-to-end that validate → compile → save/load → HF training survives
combining everything at once. ~5.5M params, ~20 layers.
v1.1 runtime-maturity sprint status: **MLA / MoD / Titans memory runtimes
are Core** (Titans was promoted from plugin-track to core in v1.1),
reasoning_r1 is metadata-Core, dual_stream is a Component. `ode_rk4` /
`ode_euler` graduated from reserved to core, and TTT ships as the
reference plugin `eulerstack.plugins.ttt` providing a real meta-learning
loop. Full matrix: docs/architectures/runtime_primitive_status.md.
0.1B class (new): for sovereign-foundation Stage-1 / CPT warm-up, ~100M parameters. MoE is skipped at this scale because 8 experts × 0.1B gives ~12M active params — not useful for routing learning. 4 variants shipped: simple / mistral / jamba / mla.
Reading in this order lands naturally: validated baselines → recent research recipes in production → new primitives only v1 can express.
Why Presets
Designing an LLM architecture from scratch means balancing more than 20 interdependent parameters simultaneously, including:
- Basic dimensions: `d_model`, `n_heads`, `n_kv_heads`, `mlp_ratio`, `n_layers`
- Mixer choice: attention / mamba / retnet / hyena
- FFN choice: mlp / gated_mlp / moe, plus MoE routing details
- Positional encoding: rope / alibi / learned, with rope_theta and scaling
- Normalization: rmsnorm / layernorm, pre / post placement
- KV cache / SSM state wiring
Getting all of these to agree is hard for beginners and still error-prone for experienced users. EulerStack's presets come pre-balanced: they have already been validated, compiled, and parameter-estimated. The most efficient starting point for most users is to pick a preset and tweak small pieces rather than build from empty YAML.
Three Axes of Presets
Presets are organized on three axes.
- `arch_` — skill-level walkthrough (20 presets, ~1–2B). A guided tour from beginner to expert that reproduces the evolution of modern LLM architectures. Each level includes several competing approaches so you can compare the effect of architectural choices under a fixed parameter budget. v1 adds 2 at the advanced level (MLA, MoD) and 3 at the expert level (Reasoning R1, Titans memory, Dual-stream).
- `llm_` — production (24 presets, 0.1B–16B). 5 sizes (0.1B / 0.8B / 2B / 4B / 16B) × up to 5 variants (simple / mistral / jamba / moe / mla). 0.1B is for Stage-1 / CPT warm-up; MoE is skipped at that scale. These are the presets you typically use as starters for real services or projects.
- `arch_expert_*_mini` — small-scale experts (6 presets, ~80M–360M). Expert-level architectures shrunk down so that ablation experiments fit on a single consumer GPU. Same design ideas, smaller scale.
Each axis is covered in turn below.
1. arch_ Presets (Skill-Level Walkthrough, 20 presets)
Each level deliberately includes competing approaches so you can compare alternatives side by side. To understand why one choice is better than another, you need both options in front of you. A full step-by-step tour is in 07_arch_walkthrough.md, which is the recommended read for this part.
| Level | Preset | One-liner | Research basis |
|---|---|---|---|
| beginner | arch_beginner_gpt2 | Classic Transformer (MHA + LayerNorm post + GeLU) | Vaswani 2017, GPT-2 |
| beginner | arch_beginner_llama | Modern baseline (GQA + RMSNorm pre + SwiGLU) | Llama 2/3 |
| intermediate | arch_intermediate_mistral | 1 global : 3 sliding attention | Mistral 7B |
| intermediate | arch_intermediate_gemma2 | 1:1 alternating global/local | Gemma 2 |
| intermediate | arch_intermediate_qwen_longctx | RoPE scaling (factor 4, 32K) | Qwen 2/3 |
| advanced | arch_advanced_jamba | Mamba + Attention 3:1 hybrid | Jamba-1.5 (AI21) |
| advanced | arch_advanced_samba | Mamba + Sliding attention 1:1 | Samba (Microsoft) |
| advanced | arch_advanced_retnet | Pure RetNet (attention-free) | Sun et al. 2023 |
| expert | arch_expert_research | 4 mixers + MoE 3-phase | Research-grade |
| expert | arch_expert_mixtral_moe | Pure attn + every-layer MoE (8 × top-2) | Mixtral 8x7B |
| expert | arch_expert_striped_hyena | Hyena + Attention 4:1, 128K context | StripedHyena |
| expert | arch_expert_blackmamba_moe | Mamba + MoE (MoE on non-attn mixer) | BlackMamba, MoE-Mamba |
| expert | arch_expert_deepseek_moe | Fine-grained MoE (32 × top-3) | DeepSeek-V2/V3 |
| expert | arch_expert_retnet_moe | RetNet + MoE (predicted, no paper) | Sun 2023 + MoE extrapolation |
| expert | arch_expert_frontier_full_moe | Attention-free, multi-mixer + all-MoE (most speculative) | Composition prediction |
| expert | arch_expert_progressive_stack | Depth-wise: hyena→mamba→retnet→attn+MoE (no paper) | Hierarchical prediction |
| expert | arch_expert_dilated_longnet | Temporal pyramid: mamba+sw(1K→4K→16K)+global+MoE (no paper) | Longnet + Jamba extrapolation |
Most target_params values are in the 1–2B range, allowing direct comparison of
which architecture delivers what advantage at the same budget.
Core Ideas in the Level Progression
From beginner to intermediate — introducing sliding windows.
arch_beginner_llama uses O(N²) global attention at every layer, meaning cost
grows quadratically with sequence length. Intermediate-level
arch_intermediate_mistral keeps only 1-in-4 layers as global attention and
restricts the remaining 3 to 4,096-token sliding windows. This shrinks the total
attention KV cache by roughly 4× and makes long-context processing dramatically
cheaper.
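The cache saving is easy to sketch with a back-of-envelope calculation (this is my own illustration, not EulerStack's estimator; the 32-layer depth is an assumed value):

```python
def kv_cache_entries(n_layers_global, n_layers_sliding, seq_len, window=4096):
    """Relative KV cache size: global layers cache the full sequence,
    sliding-window layers cache at most `window` tokens."""
    return (n_layers_global * seq_len
            + n_layers_sliding * min(seq_len, window))

seq = 131_072  # a long context makes the asymptotics visible
full = kv_cache_entries(32, 0, seq)    # all-global baseline (Llama-style)
hybrid = kv_cache_entries(8, 24, seq)  # 1-in-4 global, the rest sliding
print(f"reduction: {full / hybrid:.1f}x")  # → reduction: 3.7x
```

As the sequence grows, the sliding layers' contribution stays constant, so the reduction approaches the 4× ratio of total layers to global layers.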
From intermediate to advanced — replacing attention with Mamba.
arch_advanced_jamba replaces 75% of layers with Mamba2 SSM. Mamba is O(N),
so the quadratic bottleneck disappears. The remaining 25% attention acts as a
"retrieval anchor" that preserves in-context learning. As a result, 32K-token
long-context becomes practical, and inference throughput goes up substantially.
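The asymptotic gap can be sketched with rough per-layer token-mixing costs (illustrative constants only; the SSM state size of 128 is an assumption, not Jamba's actual configuration):

```python
def attn_mixing_cost(seq_len, d_model):
    # global attention: every token attends to every token -> O(N^2 * d)
    return seq_len * seq_len * d_model

def ssm_mixing_cost(seq_len, d_model, state_size=128):
    # SSM scan: fixed-size recurrent state per channel -> O(N * d * state)
    return seq_len * d_model * state_size

seq, d = 32_768, 2048
ratio = attn_mixing_cost(seq, d) / ssm_mixing_cost(seq, d)
print(f"attention / mamba mixing cost at 32K: {ratio:.0f}x")  # → 256x
```

The ratio is simply seq_len / state_size, which is why the advantage keeps growing with context length.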
From advanced to expert — combining multiple mixers with MoE.
The flagship expert preset arch_expert_research mixes all 4 EulerStack mixer
types in one stack and adds MoE FFN to attention layers. The design splits the
work into three phases by depth:
- Phase 1 (layers 1-8): mamba + hyena — efficient bulk processing
- Phase 2 (layers 9-24): mamba + retnet + attention/MoE — deep reasoning core
- Phase 3 (layers 25-32): attention/MoE + retnet — output refinement and recall anchors
This kind of combination is aimed less at direct production deployment and more at cheaply testing the question "which mixer + MoE combination works best?"
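In spec form, a depth-phased schedule of this shape might be expressed roughly as follows. The field names below are purely illustrative, not the actual EulerStack preset schema; run `eulerstack presets show arch_expert_research` to see the real spec.

```yaml
# Hypothetical sketch of a 3-phase, 32-layer schedule.
# Field names are illustrative, not the actual preset format.
layers:
  - repeat: 8             # Phase 1: efficient bulk processing
    mixers: [mamba, hyena]
  - repeat: 16            # Phase 2: deep reasoning core
    mixers: [mamba, retnet, attention]
    ffn: moe              # MoE FFN on the attention layers
  - repeat: 8             # Phase 3: refinement and recall anchors
    mixers: [attention, retnet]
    ffn: moe
```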
Per-mixer deep dives live under mixers/.
2. llm_ Presets (Production Deployment, 24 presets)
5 sizes (0.1B / 0.8B / 2B / 4B / 16B) × up to 5 variants (simple / mistral / jamba / moe / mla) = 24 presets (MoE is skipped at 0.1B — not meaningful at that scale). What each variant is for:
| Variant | Description | When to use |
|---|---|---|
| simple | Pure attention (Llama-style) | Most universal, best HF tooling compatibility |
| mistral | Attention + sliding window | KV cache savings, medium context (8–16K) |
| jamba | Mamba + Attention hybrid | Long-context priority (32K+), inference speed |
| moe | Attention + MoE FFN (1-in-4) | Maximum model capacity for a given active compute |
Size / variant matrix:
| Scale | simple | mistral | jamba | moe |
|---|---|---|---|---|
| 0.8B | llm_0p8b_simple | llm_0p8b_mistral | llm_0p8b_jamba | llm_0p8b_moe |
| 2B | llm_2b_simple | llm_2b_mistral | llm_2b_jamba | llm_2b_moe |
| 4B | llm_4b_simple | llm_4b_mistral | llm_4b_jamba | llm_4b_moe |
| 16B | llm_16b_simple | llm_16b_mistral | llm_16b_jamba | llm_16b_moe |
Why MoE variants have a smaller d_model
The d_model of MoE variants is intentionally about 65% of the matching
simple variant. MoE layers carry multiple expert FFNs, so keeping the same
d_model would inflate the total parameter count. Reducing d_model
keeps total parameters aligned with the named size (0.8B, 2B, etc.). As a
result, active parameters (the ones each token actually uses) end up at about
1/4 of total parameters.
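A back-of-envelope illustration of the d_model scaling (these are my own simplified per-layer estimates, not EulerStack's parameter estimator; exact figures depend on expert count, MoE placement, and embeddings):

```python
def dense_layer_params(d):
    # rough per-layer count: attention ~4*d^2 + gated FFN ~8*d^2
    return 12 * d * d

def moe_layer_params(d, n_experts=8):
    # attention ~4*d^2 plus one ~8*d^2 FFN per expert
    return (4 + 8 * n_experts) * d * d

d_simple = 2048
d_moe = int(d_simple * 0.65)  # ~65% of the matching simple variant

# 32 layers; MoE on every 4th layer (1-in-4), dense FFN elsewhere
total_simple = 32 * dense_layer_params(d_simple)
total_moe = 24 * dense_layer_params(d_moe) + 8 * moe_layer_params(d_moe)
print(f"simple ~ {total_simple/1e9:.2f}B, moe ~ {total_moe/1e9:.2f}B")
# → simple ~ 1.61B, moe ~ 1.47B
```

Keeping d_model at 2048 instead would push the MoE total well past 2B under the same assumptions; the 65% scaling is what brings it back in line with the named size.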
Size is not capped
Presets are only starting points. EulerStack can assemble models at 70B, 100B,
or beyond by editing d_model and n_layers. Copy an existing preset and
adjust freely.
3. arch_expert_*_mini Presets (Small-Scale Expert Experiments, 6 presets)
Expert-level presets are typically ~2B, which is heavy on a single consumer GPU.
arch_expert_*_mini keeps the same design ideas but shrinks them to ~12 layers
and d_model 384–512 — small enough to train on a 12–16 GB GPU.
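Whether a mini fits on such a card is quick to estimate with standard fp32 Adam accounting (a rough sketch: weights + gradients + two optimizer moments is about 16 bytes per parameter; activation memory, which depends on batch size, is excluded):

```python
def adam_state_bytes(n_params, bytes_per_param=16):
    # fp32 weights (4 B) + grads (4 B) + Adam m (4 B) + Adam v (4 B)
    return n_params * bytes_per_param

for name, params in [("progressive_stack_mini", 86e6),
                     ("deepseek_moe_mini", 357e6)]:
    gb = adam_state_bytes(params) / 2**30
    print(f"{name}: ~{gb:.1f} GiB for weights + optimizer state")
```

Even the largest mini (~357M) needs only ~5.3 GiB for weights and optimizer state, leaving headroom for activations on a 12–16 GB GPU.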
| Preset | ~Total params | Key point |
|---|---|---|
| arch_expert_progressive_stack_mini | ~86M | Run this first. Zone separation is cleanest |
| arch_expert_blackmamba_moe_mini | ~156M | Partial sparse MoE — most plausible at mini scale |
| arch_expert_mixtral_moe_mini | ~175M | Classic MoE baseline |
| arch_expert_dilated_longnet_mini | ~83M | Long-context pyramid (512→2K→8K) |
| arch_expert_deepseek_moe_mini | ~357M | Observe fine-grained MoE failure |
| arch_expert_frontier_full_moe_mini | ~106M | Attention-free + all-MoE, most experimental |
08_expert_mini_walkthrough.md covers each mini's design rationale and trade-offs in detail.
Browsing and Using Presets
The first command to run after installation is the preset listing.
# Full list with parameter estimates
eulerstack presets list
# Details for a single preset
eulerstack presets show arch_advanced_jamba
eulerstack presets show llm_4b_moe
# Switch language
eulerstack --lang ko presets show llm_2b_jamba
presets list produces a tabular listing of names, estimated parameter counts,
and one-line summaries. presets show prints the layer schedule, mixer config,
and MoE config for a specific preset.
Validate → Compile → Export Pipeline
A typical end-to-end flow with a preset looks like this:
# 1) Validate (with realism report)
eulerstack validate --preset configs/presets/arch_advanced_jamba.yml --report
# 2) Inspect structure
eulerstack explain --preset configs/presets/arch_advanced_jamba.yml
# 3) Compile to JSON (inspection / debugging)
eulerstack compile --preset configs/presets/arch_advanced_jamba.yml --print-config
# 4) Export as HF model directory (training-ready)
eulerstack compile --preset configs/presets/arch_advanced_jamba.yml --output-dir ./my_jamba
Each step is covered in more depth in Tutorial 4: Compile and Explain.
Loading in Python
The exported HuggingFace model directory loads through the standard
AutoModelForCausalLM interface.
from transformers import AutoModelForCausalLM
from eulerstack.hf.auto_register import register_eulerstack_auto_classes
register_eulerstack_auto_classes()
# Same API as loading Llama / Mistral
model = AutoModelForCausalLM.from_pretrained("./my_jamba", trust_remote_code=True)
print(model.config.d_model, model.config.n_layers)
Runnable end-to-end scripts are in the examples/ directory.
- examples/01_compile_and_export.py
- examples/02_load_and_generate.py
- examples/03_architecture_evolution.py — compares all 4 arch_ preset levels
Next Steps
- Tutorial 4: Compile and Explain — explain / compile details
- Tutorial 6: Sanity Train — confirm the model actually learns
- 07_arch_walkthrough.md — 20 arch_ presets in depth (beginner 2 · intermediate 3 · advanced 5 · expert 10)
- 08_expert_mini_walkthrough.md — 6 expert mini presets in depth
- 09_new_primitives_walkthrough.md — step-by-step for v1 Phase-B primitives (MLA / R1 reasoning / Titans memory / MoD / dual-stream, …)
- Mixer overview — when each mixer is preferable