Home > EulerStack > Tutorials > 2. Use Presets

2. Use Presets

CLI messages are translated into ko / en / zh / ja / es. Use eulerstack --lang en ... (or EULERSTACK_LANG=en) for English. Default is Korean.

This tutorial describes how to explore and use EulerStack v1's 52 presets. Learning these first is the most efficient path before attempting to design a new architecture from scratch.

Learning path — the v1 3-tier approach

Presets are organized as "validated first, experimental last". The same ordering works for production adoption, research exploration, and verification of new v1 primitives.

Tier 1 — Validated industrial (learn first)
  Llama 2/3, Mistral, Gemma 2, Qwen longctx families. Conservative baselines
  with well-known failure modes and stable training recipes. Scale variants
  llm_0p1b_{simple,mistral} + llm_*_{simple,mistral} (0.8B–16B) plus
  arch_beginner_* / arch_intermediate_*.

Tier 2 — Recent / complex (next)
  Jamba hybrids, Mixtral / DeepSeek MoE, Samba, RetNet, MLA (DeepSeek-V3 2024),
  MoD (Raposo ICML 2024), expert compositions (4 speculative).
  arch_advanced_{jamba,samba,retnet,mla,mod}, arch_expert_*,
  arch_expert_*_mini, llm_*_{jamba,moe,mla}.

Tier 3 — v1 experimental (last — four Phase-B primitive demos at arch-scale)
  arch_expert_reasoning_r1   — 2-phase reasoning (DeepSeek-R1 2025)
  arch_expert_titans_memory  — Titans parametric memory (Google 2024-2025)
  arch_expert_dual_stream    — monoidal parallel (Jamba × PaLM generalisation)
  arch_expert_kitchen_sink   — **every v1 primitive in ONE spec (capstone)**:
                               MLA + Titans + MoE + branched + TTT + ODE-RK4
                               + parallel + discrete integrator + MoD +
                               per-layer override + let + reserved namespaces
                               + execution_modes. Proves end-to-end that
                               validate → compile → save/load → HF training
                               survives combining everything at once. ~5.5M
                               params, ~20 layers.

  v1.1 runtime-maturity sprint status: **MLA / MoD / Titans memory runtimes
  are Core** (Titans was promoted from plugin-track to core in v1.1),
  reasoning_r1 is metadata-Core, dual_stream is a Component. `ode_rk4` /
  `ode_euler` graduated from reserved to core, and TTT ships as the
  reference plugin `eulerstack.plugins.ttt` providing a real meta-learning
  loop. Full matrix: docs/architectures/runtime_primitive_status.md.

0.1B class (new): for sovereign-foundation Stage-1 / CPT warm-up, ~100M parameters. MoE is skipped at this scale because 8 experts × 0.1B gives ~12M active params — not useful for routing learning. 4 variants shipped: simple / mistral / jamba / mla.
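The arithmetic behind skipping MoE at this scale can be checked in a couple of lines (a crude upper bound; a real split would also pay for routers and shared layers):

```python
# Why MoE is skipped at the 0.1B scale: splitting a ~100M-parameter
# budget across 8 experts leaves each expert only ~12M parameters,
# too small for the router to learn useful specialisation.
total_params = 100e6   # nominal 0.1B budget
n_experts = 8

per_expert = total_params / n_experts  # crude per-expert upper bound
print(f"~{per_expert / 1e6:.1f}M params per expert")
```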

Read in this order, the progression lands naturally: validated baselines → recent research recipes in production → new primitives only v1 can express.

Why Presets

Designing an LLM architecture from scratch means simultaneously balancing more than 20 interdependent parameters.

Getting all of these to agree is hard for beginners and still error-prone for experienced users. EulerStack's presets come pre-balanced: they have already been validated, compiled, and parameter-estimated. The most efficient starting point for most users is to pick a preset and tweak small pieces rather than build from empty YAML.

Three Axes of Presets

Presets are organized on three axes.

  1. arch_ — skill-level walkthrough (20 presets, ~1–2B). A guided tour from beginner to expert that reproduces the evolution of modern LLM architectures. Each level includes several competing approaches so you can compare the effect of architectural choices under a fixed parameter budget. v1 adds 2 at the advanced level (MLA, MoD) and 3 at the expert level (Reasoning R1, Titans memory, Dual-stream).

  2. llm_ — production (24 presets, 0.1B–16B). 5 sizes (0.1B / 0.8B / 2B / 4B / 16B) × up to 5 variants (simple / mistral / jamba / moe / mla). 0.1B is for Stage-1 / CPT warm-up; MoE is skipped at that scale. These are the presets you typically use as starters for real services or projects.

  3. arch_expert_*_mini — small-scale experts (6 presets, ~80M–360M). Expert-level architectures shrunk down so that ablation experiments fit on a single consumer GPU. Same design ideas, smaller scale.

Each axis is covered in turn below.

1. arch_ Presets (Skill-Level Walkthrough, 20 presets)

Each level deliberately includes competing approaches so you can compare alternatives side by side. To understand why one choice is better than another, you need both options in front of you. A full step-by-step tour is in 07_arch_walkthrough.md, which is the recommended read for this part.

Level Preset One-liner Research basis
beginner arch_beginner_gpt2 Classic Transformer (MHA + LayerNorm post + GeLU) Vaswani 2017, GPT-2
beginner arch_beginner_llama Modern baseline (GQA + RMSNorm pre + SwiGLU) Llama 2/3
intermediate arch_intermediate_mistral 1 global : 3 sliding attention Mistral 7B
intermediate arch_intermediate_gemma2 1:1 alternating global/local Gemma 2
intermediate arch_intermediate_qwen_longctx RoPE scaling (factor 4, 32K) Qwen 2/3
advanced arch_advanced_jamba Mamba + Attention 3:1 hybrid Jamba-1.5 (AI21)
advanced arch_advanced_samba Mamba + Sliding attention 1:1 Samba (Microsoft)
advanced arch_advanced_retnet Pure RetNet (attention-free) Sun et al. 2023
expert arch_expert_research 4 mixers + MoE 3-phase Research-grade
expert arch_expert_mixtral_moe Pure attn + every-layer MoE (8 × top-2) Mixtral 8x7B
expert arch_expert_striped_hyena Hyena + Attention 4:1, 128K context StripedHyena
expert arch_expert_blackmamba_moe Mamba + MoE (MoE on non-attn mixer) BlackMamba, MoE-Mamba
expert arch_expert_deepseek_moe Fine-grained MoE (32 × top-3) DeepSeek-V2/V3
expert arch_expert_retnet_moe RetNet + MoE (predicted, no paper) Sun 2023 + MoE extrapolation
expert arch_expert_frontier_full_moe Attention-free, multi-mixer + all-MoE (most speculative) Composition prediction
expert arch_expert_progressive_stack Depth-wise: hyena→mamba→retnet→attn+MoE (no paper) Hierarchical prediction
expert arch_expert_dilated_longnet Temporal pyramid: mamba+sw(1K→4K→16K)+global+MoE (no paper) Longnet + Jamba extrapolation

Most target_params values are in the 1–2B range, allowing direct comparison of which architecture delivers what advantage at the same budget.

Core Ideas in the Level Progression

From beginner to intermediate — introducing sliding windows. arch_beginner_llama uses O(N²) global attention at every layer, meaning cost grows quadratically with sequence length. Intermediate-level arch_intermediate_mistral keeps only 1-in-4 layers as global attention and restricts the remaining 3 to 4,096-token sliding windows. This shrinks the total attention KV cache by roughly 4× and makes long-context processing dramatically cheaper.
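The "roughly 4×" figure can be sanity-checked with a back-of-envelope count of cached tokens. The layer count below is illustrative, not the preset's exact schedule, and constants that cancel in the ratio (n_kv_heads, head_dim, dtype, batch size) are ignored:

```python
# KV-cache comparison: all-global attention vs a Mistral-style
# 1-in-4 global + sliding-window schedule.

WINDOW = 4096  # sliding-window length used by arch_intermediate_mistral

def cached_tokens_per_layer(seq_len: int, sliding: bool) -> int:
    """Tokens a single layer must keep in its KV cache."""
    return min(seq_len, WINDOW) if sliding else seq_len

def total_cached_tokens(seq_len: int, n_layers: int, global_every: int) -> int:
    """Cached tokens across a stack where 1-in-`global_every` layers
    stays global and the rest use sliding windows."""
    total = 0
    for layer in range(n_layers):
        sliding = (layer % global_every) != 0
        total += cached_tokens_per_layer(seq_len, sliding)
    return total

for seq_len in (8_192, 32_768, 131_072):
    dense = total_cached_tokens(seq_len, n_layers=32, global_every=1)
    mixed = total_cached_tokens(seq_len, n_layers=32, global_every=4)
    print(f"{seq_len:>7} tokens: cache ratio = {dense / mixed:.2f}x")
```

The ratio approaches 4× as the sequence grows, because the sliding layers' cost stops growing once the context exceeds the window.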

From intermediate to advanced — replacing attention with Mamba. arch_advanced_jamba replaces 75% of layers with Mamba2 SSM. Mamba is O(N), so the quadratic bottleneck disappears. The remaining 25% attention acts as a "retrieval anchor" that preserves in-context learning. As a result, 32K-token long-context becomes practical, and inference throughput goes up substantially.
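A toy cost model shows both the win and its limit. Treating global attention as O(N²) and Mamba as O(N), with all constant factors set to 1 (an assumption; real kernel constants differ):

```python
# Rough per-forward-pass mixer cost for a Jamba-style hybrid,
# in arbitrary units: attention layers contribute seq_len^2,
# SSM layers contribute seq_len.

def relative_cost(seq_len: int, attn_frac: float) -> float:
    return attn_frac * seq_len**2 + (1 - attn_frac) * seq_len

for n in (4_096, 32_768):
    full_attn = relative_cost(n, attn_frac=1.0)
    hybrid = relative_cost(n, attn_frac=0.25)  # arch_advanced_jamba: 25% attention
    print(f"N={n}: hybrid is {full_attn / hybrid:.2f}x cheaper")
```

Because the remaining global-attention layers still scale quadratically, the speedup saturates near 1/attn_frac (4× here); pushing past that is one motivation for also windowing the attention layers, as Samba does.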

From advanced to expert — combining multiple mixers with MoE. The flagship expert preset arch_expert_research mixes all 4 EulerStack mixer types in one stack and adds MoE FFN to attention layers, splitting the work into three phases by depth.

This kind of combination is aimed less at direct production deployment and more at cheaply testing the question "which mixer + MoE combination works best?"

Per-mixer deep dives live under mixers/.

2. llm_ Presets (Production Deployment, 24 presets)

5 sizes (0.1B / 0.8B / 2B / 4B / 16B) × up to 5 variants (simple / mistral / jamba / moe / mla) = 24 presets (MoE is skipped at 0.1B — not meaningful at that scale). What each variant is for:

Variant Description When to use
simple Pure attention (Llama-style) Most universal, best HF tooling compatibility
mistral Attention + sliding window KV cache savings, medium context (8–16K)
jamba Mamba + Attention hybrid Long-context priority (32K+), inference speed
moe Attention + MoE FFN (1-in-4) Maximum model capacity for a given active compute
mla Multi-head latent attention (DeepSeek-V3) Compressed KV cache for long-context serving

Size / variant matrix:

Scale simple mistral jamba moe mla
0.1B llm_0p1b_simple llm_0p1b_mistral llm_0p1b_jamba — llm_0p1b_mla
0.8B llm_0p8b_simple llm_0p8b_mistral llm_0p8b_jamba llm_0p8b_moe llm_0p8b_mla
2B llm_2b_simple llm_2b_mistral llm_2b_jamba llm_2b_moe llm_2b_mla
4B llm_4b_simple llm_4b_mistral llm_4b_jamba llm_4b_moe llm_4b_mla
16B llm_16b_simple llm_16b_mistral llm_16b_jamba llm_16b_moe llm_16b_mla

(— : the MoE variant is skipped at 0.1B.)

Why MoE variants have a smaller d_model

The d_model of MoE variants is intentionally about 65% of the matching simple variant. MoE layers carry multiple expert FFNs, so keeping the same d_model would inflate the total parameter count. Reducing d_model keeps total parameters aligned with the named size (0.8B, 2B, etc.). As a result, active parameters (the ones each token actually uses) end up at about 1/4 of total parameters.
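A rough accounting makes the 65% and one-quarter figures concrete. The SwiGLU sizing and the 8-expert top-2 layout below are illustrative assumptions, not the presets' exact hyperparameters:

```python
# Parameter accounting for why MoE presets shrink d_model.
# SwiGLU FFN params per layer ~= 3 * d_model * d_ff; attention and
# embeddings are ignored, so these are rough, not the preset's budget.

def ffn_params(d_model: int, mult: float = 8 / 3) -> int:
    d_ff = int(d_model * mult)
    return 3 * d_model * d_ff

def moe_layer_params(d_model: int, n_experts: int = 8) -> int:
    return n_experts * ffn_params(d_model)       # stored (total) params

def moe_active_params(d_model: int, top_k: int = 2) -> int:
    return top_k * ffn_params(d_model)           # params each token uses

d_dense = 2048
d_moe = int(d_dense * 0.65)   # ~65% of the matching simple variant

total_per_layer = moe_layer_params(d_moe)
active_per_layer = moe_active_params(d_moe)
print(f"total  : {total_per_layer / 1e6:.1f}M params/layer")
print(f"active : {active_per_layer / 1e6:.1f}M params/layer "
      f"({active_per_layer / total_per_layer:.0%})")
```

With top-2 routing over 8 experts, each MoE layer activates exactly 2/8 = 25% of its FFN parameters, which is where the "about 1/4" figure comes from.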

Size is not capped

Presets are only starting points. EulerStack can assemble models at 70B, 100B, or beyond by editing d_model and n_layers. Copy an existing preset and adjust freely.

3. arch_expert_*_mini Presets (Small-Scale Expert Experiments, 6 presets)

Expert-level presets are typically ~2B, which is heavy on a single consumer GPU. arch_expert_*_mini keeps the same design ideas but shrinks them to ~12 layers and d_model 384–512 — small enough to train on a 12–16 GB GPU.

Preset ~Total params Key point
arch_expert_progressive_stack_mini ~86M Run this first. Zone separation is cleanest
arch_expert_blackmamba_moe_mini ~156M Partial sparse MoE — most plausible at mini scale
arch_expert_mixtral_moe_mini ~175M Classic MoE baseline
arch_expert_dilated_longnet_mini ~83M Long-context pyramid (512→2K→8K)
arch_expert_deepseek_moe_mini ~357M Observe fine-grained MoE failure
arch_expert_frontier_full_moe_mini ~106M Attention-free + all-MoE, most experimental
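The 12–16 GB claim can be sanity-checked with the usual mixed-precision AdamW rule of thumb of ~16 bytes per parameter. This is an assumption about the training setup, not an EulerStack measurement, and it excludes activations:

```python
# ~16 bytes/param: 2 (bf16 weights) + 2 (bf16 grads)
# + 8 (fp32 Adam moments) + 4 (fp32 master weights).
BYTES_PER_PARAM = 16

def training_gb(n_params: float) -> float:
    return n_params * BYTES_PER_PARAM / 1024**3

for name, params in [
    ("arch_expert_progressive_stack_mini", 86e6),
    ("arch_expert_blackmamba_moe_mini", 156e6),
    ("arch_expert_deepseek_moe_mini", 357e6),
]:
    print(f"{name}: ~{training_gb(params):.1f} GB + activations")
```

Even the largest mini lands around 5–6 GB of weight and optimizer state, leaving headroom for activations on a 12–16 GB card.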

08_expert_mini_walkthrough.md covers each mini's design rationale and trade-offs in detail.

Browsing and Using Presets

The first command to run after installation is the preset listing.

# Full list with parameter estimates
eulerstack presets list

# Details for a single preset
eulerstack presets show arch_advanced_jamba
eulerstack presets show llm_4b_moe

# Switch language
eulerstack --lang ko presets show llm_2b_jamba

presets list produces a tabular listing of names, estimated parameter counts, and one-line summaries. presets show prints the layer schedule, mixer config, and MoE config for a specific preset.

Validate → Compile → Export Pipeline

A typical end-to-end flow with a preset looks like this:

# 1) Validate (with realism report)
eulerstack validate --preset configs/presets/arch_advanced_jamba.yml --report

# 2) Inspect structure
eulerstack explain --preset configs/presets/arch_advanced_jamba.yml

# 3) Compile to JSON (inspection / debugging)
eulerstack compile --preset configs/presets/arch_advanced_jamba.yml --print-config

# 4) Export as HF model directory (training-ready)
eulerstack compile --preset configs/presets/arch_advanced_jamba.yml --output-dir ./my_jamba

Each step is covered in more depth in Tutorial 4: Compile and Explain.

Loading in Python

The exported HuggingFace model directory loads through the standard AutoModelForCausalLM interface.

from transformers import AutoModelForCausalLM
from eulerstack.hf.auto_register import register_eulerstack_auto_classes

register_eulerstack_auto_classes()

# Same API as loading Llama / Mistral
model = AutoModelForCausalLM.from_pretrained("./my_jamba", trust_remote_code=True)
print(model.config.d_model, model.config.n_layers)

Runnable end-to-end scripts are in the examples/ directory.

Next Steps