
3. Spec Reference

This chapter is the reference document that explains every field in an EulerStack YAML file. The previous chapter gave a tour of presets; this one pins down what each field means, what its constraints are, and how it interacts with other fields.

EulerStack's core value proposition is letting you assemble a model architecture from composable pieces, and that entire assembly happens inside this single YAML file. Understanding this chapter is therefore the foundation for everything else you do with EulerStack.

The shape of a spec

Every spec has eight core blocks plus three optional v1 Phase-B blocks (let, execution_modes, transition). Block order does not matter to the validator, but the conventional order is shown below.

schema_version: 1

# ── Eight core blocks (every YAML's backbone) ──
let:                  # (optional, Phase B1.2) numeric variables — ${let.x}
model:                # 1. Model dimensions and overall size
tokenizer_contract:   # 2. Which tokenizer this model expects
embedding:            # 3. Token embeddings + positional encoding
layer_templates:      # 4. Reusable layer blueprints
layer_schedule:       # 5. How layers are stacked (v1 also supports
                      #    parallel / integrator — see §15)
head:                 # 6. Output head (causal LM)
training_hints:       # 7. Training-time suggestions (advisory)
compatibility:        # 8. Compile target

# ── v1 Phase-B extension blocks (optional) ──
execution_modes:      # (optional, Phase B5) 2-phase think/answer declaration
transition:           # (optional, Phase B5) phase transition rule

Validation (eulerstack validate) confirms every required block is present and the spec is free of contradictions. Missing keys, typos, or type mismatches are caught before any training starts.

v1 Phase-B new primitives (per-layer override, let expressions, MLA latent_dim, branched mixer, ttt_layer, depth_gating / MoD, parallel / integrator schedule, template.memory, template.shape_change, execution_modes) are covered step-by-step in 09_new_primitives_walkthrough.md. Sections §1–§14 here focus on the core structure, and Phase-B extensions are indexed in §15.


1. schema_version

schema_version: 1

Schema version number. The current value is 1. It is mandatory — the validator fails immediately if it is absent or not 1. The field exists so that future schema versions can coexist with existing YAMLs without breaking them. A version 2, if introduced, would be gated behind schema_version: 2.


2. model block

Defines the basic dimensions and overall size of the model. These few numbers alone determine most of the model's "shape".

model:
  name: "arch-beginner-llama"
  family_hint: "llama-like"
  d_model: 1536
  vocab_size: 32000
  max_seq_len: 4096
  n_heads: 16
  n_kv_heads: 4
  mlp_ratio: 4
  dtype: bfloat16
  target_params: 1_000_000_000

2.1 name

A human-readable label. Not a Python identifier, so hyphens and spaces are fine. The real identifier is the file name (arch_beginner_llama.yml). This field shows up in presets show output and logs.

2.2 family_hint

Metadata indicating which architecture family the model belongs to. Examples: llama-like, jamba-like, blackmamba-moe. Realism checks use this to flag inconsistencies ("family_hint says llama, but the mixer is mamba"). Warnings only — never blocks validation.

2.3 d_model (critical)

The hidden dimension of the model. Each token is represented internally as a vector of this size. Every linear projection, attention path, and FFN middle layer derives from d_model.

Typical values:

Model scale Typical d_model
Small (~100M) 512, 768
Sub-GB (~1B) 1024, 1280, 1536
Mid (~2B) 1536, 2048
Large (~7B) 4096
Very large (~70B) 8192

Critical constraint: d_model must be divisible by n_heads (d_model % n_heads == 0). Violations fail validation immediately.
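This rule and the GQA rule from §2.7 can be sketched as plain Python checks — an illustration only, not EulerStack's actual validator:

```python
def check_dims(d_model: int, n_heads: int, n_kv_heads: int) -> list:
    """Return a list of human-readable violations (empty list = valid)."""
    errors = []
    if d_model % n_heads != 0:
        errors.append(f"d_model={d_model} not divisible by n_heads={n_heads}")
    if n_heads % n_kv_heads != 0:
        errors.append(f"n_heads={n_heads} not divisible by n_kv_heads={n_kv_heads}")
    return errors

# 1536 / 16 = 96 per head, 16 / 4 = 4 query heads per KV group -> valid
assert check_dims(1536, 16, 4) == []
# 1536 % 10 != 0 -> one violation
assert len(check_dims(1536, 10, 2)) == 1
```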

2.4 vocab_size

Tokenizer vocabulary size. Determines the embedding table and output head dimensions.

Use case Typical value
English-only (Llama-style) 32000
GPT-2 tokenizer 50257
Multilingual (Qwen) 152064
Very multilingual ~256K

If vocab_size is smaller than the actual tokenizer's vocabulary, training raises index out of bounds. If larger, some embedding slots stay unused (not fatal). Best practice: match the tokenizer's actual vocabulary size exactly (in HuggingFace, len(tokenizer), which includes added special tokens).

2.5 max_seq_len

The maximum number of tokens the model can process in one pass — also called context length.

Typical values:

Use case max_seq_len
Short chat 2048
LLM default 4096
Medium docs 8192, 16384
Long-context 32768
Very long (still plausible) 65536, 131072

Bigger max_seq_len inflates memory consumption fast. On a 24 GB GPU, a 1B model at 32K context with batch_size=1 is already tight.

2.6 n_heads

Number of attention heads. More heads let the model split token-to-token relationships into finer subspaces. head_dim = d_model / n_heads is the per-head dimension and is usually 64–128.

2.7 n_kv_heads (the GQA hinge)

The core of Grouped-Query Attention. Keep n_heads query heads but share key/value across only n_kv_heads groups.

Setting Meaning
n_kv_heads == n_heads MHA (Multi-Head Attention, classic Transformer / GPT-2)
n_kv_heads < n_heads GQA (Llama 2/3 standard)
n_kv_heads == 1 MQA (Multi-Query Attention, extreme sharing)

Constraint: n_heads % n_kv_heads == 0. Example: n_heads=16, n_kv_heads=4 is a 4:1 GQA. n_heads=8, n_kv_heads=3 is not an integer division → fail.

Practical value: KV cache shrinks by n_heads / n_kv_heads. A 4:1 GQA cuts inference KV memory to 1/4, dramatically improving concurrent throughput. GQA is the default from Llama 2 70B onwards.
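The saving is easy to compute. A rough per-sequence sketch, counting only the K/V tensors and assuming 2-byte bfloat16 elements:

```python
def kv_cache_bytes(n_layers, seq_len, n_kv_heads, head_dim, bytes_per_elem=2):
    """Per-sequence KV cache size: K and V tensors, each of shape
    [seq_len, n_kv_heads, head_dim] per layer, at bfloat16 by default."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

mha = kv_cache_bytes(24, 4096, n_kv_heads=16, head_dim=96)  # n_kv_heads == n_heads
gqa = kv_cache_bytes(24, 4096, n_kv_heads=4,  head_dim=96)  # 4:1 GQA
assert mha == 4 * gqa   # ~604 MB vs ~151 MB per 4K-token sequence
```

For the §2 example dimensions (24 layers, head_dim 96), going from 16 KV heads to 4 cuts the per-sequence cache from roughly 0.6 GB to 0.15 GB, which is exactly where the concurrent-throughput gain comes from.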

2.8 mlp_ratio

FFN middle-dimension multiplier. d_model * mlp_ratio is the FFN inner dimension. 4 is the overwhelming default.

Gated MLPs like SwiGLU conventionally use 8/3 * d_model to match non-gated parameter counts, but EulerStack's implementation simply uses 4.

2.9 dtype

Storage format for model weights.

Value Notes
bfloat16 Current standard. Same range as float32, lower precision. Stable for GPU training
float16 Older. Narrow exponent range, underflow-prone
float32 High precision. 2× memory. Special cases only

In practice, almost always bfloat16. All modern GPUs from A100 onward support it natively.

2.10 target_params

Declares the target parameter count. If the actual estimate differs by more than 30%, a realism warning fires. Example: target_params: 1_000_000_000 is a "1B-scale" label.

This only affects validation. The real model size is determined by d_model, layer count, and so on. Think of it as an "intent vs. reality" smoke detector.
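To see how an estimate relates to the declared target, here is a back-of-envelope Llama-style parameter count. It is an illustration only — the real estimator counts terms this sketch ignores (norms, biases, untied heads), so its numbers will not match a validation report exactly:

```python
def rough_params(d_model, n_layers, vocab_size, n_heads, n_kv_heads,
                 mlp_ratio=4, tied_embeddings=True):
    """Back-of-envelope count: GQA attention + gated FFN + embeddings.
    Ignores norm weights and biases."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim   # Q, O + K, V
    ffn = 3 * d_model * (mlp_ratio * d_model)             # gate, up, down
    embed = vocab_size * d_model * (1 if tied_embeddings else 2)
    return n_layers * (attn + ffn) + embed

est = rough_params(1536, 24, 32000, 16, 4)   # the §2 example config
assert est == 870_187_008                    # ~0.87B: within 30% of 1B
```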


3. tokenizer_contract block

Specifies the "contract" between model and tokenizer: which tokenizer to use and how to treat special tokens.

tokenizer_contract:
  type: hf
  pretrained: gpt2
  add_bos: true
  add_eos: true

3.1 type

The only current value is hf (HuggingFace). Other systems may be added later.

3.2 pretrained

A HuggingFace model/tokenizer ID. At tokenizer-load time, AutoTokenizer.from_pretrained(pretrained) is called.

Typical values:

Value Use case
gpt2 English-focused, 50257 vocab
meta-llama/Llama-2-7b-hf Llama family, 32000 vocab
Qwen/Qwen2-7B Multilingual, 152064 vocab

3.3 add_bos / add_eos

Whether to auto-insert Begin-Of-Sequence / End-Of-Sequence tokens at document boundaries. These tell the model where one document ends and the next begins. Almost always true.
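A minimal sketch of what these flags do during data packing. The token ids 1 and 2 below are hypothetical — real ids come from the tokenizer:

```python
BOS, EOS = 1, 2   # hypothetical ids, for illustration only

def pack(docs, add_bos=True, add_eos=True):
    """Concatenate tokenized documents into one stream, inserting
    boundary tokens as the add_bos / add_eos flags request."""
    stream = []
    for doc in docs:
        if add_bos:
            stream.append(BOS)
        stream.extend(doc)
        if add_eos:
            stream.append(EOS)
    return stream

assert pack([[10, 11], [12]]) == [1, 10, 11, 2, 1, 12, 2]
```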


4. embedding block

Token embedding table and positional encoding setup.

embedding:
  type: learned
  positional: rope
  rope_theta: 500000.0
  rope_scaling:
    type: linear
    factor: 2.0
  tie_word_embeddings: true

4.1 type

learned is the only current option. Each token ID gets a learned embedding vector.

4.2 positional

Position encoding type.

Value Description
rope Rotary Position Embedding. Current standard
none No positional encoding (for mixers that don't use positions explicitly, like Mamba)

RoPE comes from the RoFormer paper (2021) and is used by basically every modern LLM since Llama. It expresses relative positions naturally and extrapolates (within limits) beyond trained sequence lengths.

4.3 rope_theta

The RoPE base period. Larger values slow the rotation frequencies, so distant positions remain distinguishable and the model stays usable at longer contexts.

Value Recommended max_seq_len
10000.0 ~2K (GPT-2, early Llama)
500000.0 ~32K (Llama 3)
1000000.0 32K+ (Qwen long-context)

4.4 rope_scaling

Extends RoPE's position range so the model stays stable on sequences longer than it was trained on.

rope_scaling:
  type: linear      # or dynamic, ntk
  factor: 4.0       # 4× extension

linear × 4 means "extend an 8K-trained model to 32K". Linear scaling compresses positions and may slightly hurt long-range accuracy. Verified in the Qwen 2 long-context setup.
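Linear scaling is literally a division of the position index before the rotary angle is computed. A minimal sketch (the head_dim of 96 and the theta from §4.3 are example values, not fixed by the schema):

```python
def rope_angle(pos, dim_pair, d_head=96, theta=500000.0, factor=1.0):
    """Rotation angle for one even/odd dimension pair. Linear scaling
    simply divides the position index by `factor`."""
    freq = theta ** (-2.0 * dim_pair / d_head)
    return (pos / factor) * freq

# With factor=4, position 32768 rotates exactly like position 8192
# did without scaling -- "extend an 8K-trained model to 32K":
assert rope_angle(32768, 0, factor=4.0) == rope_angle(8192, 0)
```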

4.5 tie_word_embeddings

Whether to share weights between input embedding table and output LM head. true saves vocab_size × d_model parameters. Standard in small models, sometimes split in very large ones.

Must match head.tie_weights.


5. layer_templates block

EulerStack's core assembly unit. Each template is a blueprint for what one layer looks like. Templates are named and referenced from layer_schedule.

layer_templates:
  decoder:
    mixer:
      type: attention
      attention:
        qkv_bias: false
        attn_drop: 0.0
    ffn:
      type: gated_mlp
      activation: swiglu
    norm:
      type: rmsnorm
      position: pre
    residual:
      type: sequential
      scaling: 1.0
    state:
      kv_cache: true

Each template has five sub-blocks:

  1. mixer — token-to-token information mixing (attention or alternative)
  2. ffn — per-token processing (feedforward)
  3. norm — normalization
  4. residual — residual path structure
  5. state — runtime state (inference cache)

5.1 mixer

Token mixer. Decides how tokens in the sequence attend to each other. EulerStack supports four mixer families.

attention

Standard multi-head self-attention.

mixer:
  type: attention
  attention:
    qkv_bias: false     # Use bias in QKV projections? (Llama: false, GPT-2: true)
    attn_drop: 0.0      # Attention dropout (0 for large LLMs)
    window: null        # null = global; number = sliding window (e.g. 4096)

mamba

Mamba2 SSM (State Space Model).

mixer:
  type: mamba
  mamba:
    variant: mamba2    # mamba or mamba2
    d_state: 128       # state dimension
    d_conv: 4          # internal conv kernel size
    expand: 2          # internal expansion ratio

Mamba processes the sequence in O(N) time. Much faster than attention at long sequence lengths. Tradeoff: weaker at exact token-level recall.

retnet

Retention (RetNet).

mixer:
  type: retnet
  retnet:
    chunkwise: true      # enable chunked parallel computation
    chunk_size: 128      # chunk size
    rope: true           # use internal RoPE

RetNet's distinctive property: train and inference modes give mathematically equivalent results. Train in parallel mode, infer in recurrent O(1) mode, same weights.
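A scalar toy illustrates the equivalence. Real retention adds query/key projections and rotations, which this sketch omits:

```python
# Toy retention: o_t = sum over s <= t of gamma^(t-s) * v_s.
# The parallel form (full sum) and recurrent form (running state) agree.
gamma, values = 0.9, [1.0, 2.0, 3.0, 4.0]

parallel = [sum(gamma ** (t - s) * values[s] for s in range(t + 1))
            for t in range(len(values))]

recurrent, state = [], 0.0
for v in values:
    state = gamma * state + v   # O(1) memory per step at inference
    recurrent.append(state)

assert all(abs(a - b) < 1e-9 for a, b in zip(parallel, recurrent))
```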

hyena

Hyena long-convolution.

mixer:
  type: hyena
  hyena:
    depth: 2              # Hyena operator depth
    filter_hidden: 64     # filter-generator MLP width
    filter_decay: 0.0     # filter decay rate

Hyena does global convolution at O(N log N). Competitive at 128K+ contexts.

5.2 ffn

Feed-forward network, computed independently per token. Three options.

mlp

Standard 2-layer MLP.

ffn:
  type: mlp
  activation: gelu       # relu, gelu, silu, etc.

Classic Transformer / GPT-2 style.

gated_mlp

Gated MLP (SwiGLU, GeGLU, etc.).

ffn:
  type: gated_mlp
  activation: swiglu     # or geglu

Standard in modern LLMs since Llama. About 1.5× the parameters of a plain MLP at the same hidden width, but meaningfully better quality.

moe

Mixture-of-Experts. Multiple FFNs, only a subset active per token — "keep compute fixed, grow capacity".

ffn:
  type: moe
  moe:
    experts: 8           # number of experts
    top_k: 2             # experts activated per token
    router: softmax      # or sigmoid
    z_loss: 0.001        # router regularization auxiliary loss
    capacity_factor: 1.25 # over-capacity ratio per expert

See later tutorials for details: 07_arch_walkthrough.md expert section and 08_expert_mini_walkthrough.md.
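A minimal sketch of softmax top-k routing — illustrative only, not EulerStack's router, which also applies z_loss and the capacity limit:

```python
import math

def route(logits, top_k=2):
    """Pick the top_k experts per token and renormalize their
    softmax weights so the selected weights sum to 1."""
    probs = [math.exp(l) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    chosen = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    z = sum(probs[i] for i in chosen)
    return [(i, probs[i] / z) for i in chosen]

picks = route([0.1, 2.0, -1.0, 1.5], top_k=2)   # 4 experts, 2 active
assert [i for i, _ in picks] == [1, 3]
assert abs(sum(w for _, w in picks) - 1.0) < 1e-9
```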

5.3 norm

Normalization type and placement.

norm:
  type: rmsnorm          # or layernorm
  position: pre          # or post

Combination Use case
layernorm + post Classic Transformer (GPT-2)
rmsnorm + pre Modern LLM standard (Llama onwards)

Pre-norm trains markedly more stably than post-norm in deep models; since around 2019 virtually all large models have used pre-norm.

RMSNorm is a simplified LayerNorm — skip the mean-subtraction step, only compute RMS. Half the parameters and faster.
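The whole operation fits in a few lines. A pure-Python per-vector sketch:

```python
import math

def rmsnorm(x, gamma, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of the vector.
    No mean subtraction, no bias -- gamma is the only learned parameter."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

out = rmsnorm([3.0, 4.0], [1.0, 1.0])
# Direction is preserved; the output has unit mean square
assert abs(sum(v * v for v in out) / 2 - 1.0) < 1e-4
```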

5.4 residual

Residual connection structure.

residual:
  type: sequential       # or parallel
  scaling: 1.0           # multiplier applied to F(x) before residual add

type Formula Use case
sequential x = x + F(norm(x)) Standard, most models
parallel x = x + Attn(norm(x)) + FFN(norm(x)) computed in parallel PaLM-style

scaling: 1.0 is default. Special uses (e.g., layer scale) may want a smaller value.

5.5 state

What runtime state this layer needs.

state:
  kv_cache: true         # attention key/value cache
  ssm_state: true        # mamba recurrent state

Constraints (critical):

  - kv_cache: true is valid only for attention mixers
  - ssm_state: true is valid only for mamba mixers
  - retnet and hyena layers hold no external state — omit the state block

The validator enforces these automatically.


6. layer_schedule block

Specifies how templates stack — order and repetition. This is the most expressive part of EulerStack.

6.1 Basic syntax

layer_schedule:
  - template: decoder
    repeat: 24

Repeat decoder 24 times. Total layers = 24. Simple Llama-style.

6.2 Multiple templates (Jamba-style)

layer_templates:
  mamba: { ... }
  attn: { ... }

layer_schedule:
  - { template: mamba, repeat: 3 }
  - { template: attn,  repeat: 1 }
  - { template: mamba, repeat: 3 }
  - { template: attn,  repeat: 1 }
  # ...pattern continues; 4 repetitions of this 8-layer block
  # = 32 layers (24 mamba + 8 attn)

Different mixers at different depths. Jamba's 3:1 hybrid is expressed this way.

6.3 Progressive stack style

layer_schedule:
  - { template: hyena_dense,  repeat: 8 }    # early: cheap & global
  - { template: mamba_dense,  repeat: 12 }   # middle: linear sequential
  - { template: retnet_dense, repeat: 8 }    # late: chunked refinement
  - { template: attn_moe,     repeat: 4 }    # tail: exact recall + MoE

Progressively more expensive mixers by depth. Vision's CNN → attention hierarchy applied to sequence modeling.

6.4 Pyramid style (Dilated LongNet)

layer_schedule:
  - { template: mamba_prefix, repeat: 4 }
  - { template: sw_1024,      repeat: 4 }
  - { template: sw_4096,      repeat: 8 }
  - { template: sw_16384,     repeat: 8 }
  - { template: global_moe,   repeat: 8 }

Receptive field grows geometrically with depth. A 32-layer temporal pyramid from 1K through 4K and 16K windows up to global attention.

6.5 Layer count

Total layer count is computed automatically:

n_layers = sum(entry.repeat for entry in layer_schedule)

This value appears in validate --report output and becomes the model's real n_layers.
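Expanding a schedule into concrete layers is a simple flattening. A sketch (the two-template pattern below is illustrative):

```python
def expand(schedule):
    """Flatten a layer_schedule into an ordered list of template names;
    len() of the result is the model's n_layers."""
    layers = []
    for entry in schedule:
        layers.extend([entry["template"]] * entry["repeat"])
    return layers

jamba_ish = [{"template": "mamba", "repeat": 3},
             {"template": "attn", "repeat": 1}] * 4
layers = expand(jamba_ish)
assert len(layers) == 16          # n_layers
assert layers.count("attn") == 4  # one attention layer per 4-layer group
```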


7. head block

Output head specification.

head:
  type: causal_lm
  tie_weights: true

7.1 type

Currently only causal_lm (autoregressive next-token prediction). Classifier / encoder-decoder heads aren't in the schema yet.

7.2 tie_weights

Whether the LM head shares weights with the input embedding. Must match embedding.tie_word_embeddings.


8. training_hints block

Training-time advisory values. Not part of the architecture itself — they tell downstream trainers "here's how to start training this model".

training_hints:
  init: normal_0.02
  dropout: 0.0
  checkpointing: true
  seed: 1234

8.1 init

Weight initialization. normal_0.02 means normal distribution with standard deviation 0.02 (the GPT-2 standard, still in use).

8.2 dropout

Default dropout applied across the model. Large LLMs (1B+) use 0.0 — with enough data, overfitting is rare enough that dropout just slows training.

8.3 checkpointing

Whether gradient checkpointing is enabled. true saves memory at ~30% speed cost. Useful on tight VRAM.

8.4 seed

Random seed for reproducibility.


9. compatibility block

compatibility:
  compile_target: huggingface

Currently huggingface is the only compile target. vllm, tensorrt-llm, and others may be added later.


10. Cross-field rules (very important)

Individual fields may each be valid while their combination is illegal. The validator enforces the following rules automatically.

10.1 Dimension divisibility

d_model % n_heads == 0       # heads divide d_model evenly
n_heads % n_kv_heads == 0    # GQA groups are integer-sized

Violation: ValidationError.

10.2 Mixer / state compatibility

Mixer Allowed state
attention kv_cache: true allowed
mamba ssm_state: true only (no kv_cache)
retnet managed internally (omit state block)
hyena stateless (omit state block)

Violation: CompatibilityError.

10.3 FFN / MoE compatibility

Setting a moe: sub-block when ffn.type != moe is a mismatch. The reverse (ffn.type: moe without moe: block) is also a mismatch.

10.4 MoE internal rules

top_k <= experts
experts > 0
top_k > 0
z_loss >= 0
capacity_factor >= 1.0
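Expressed as one boolean check — a sketch mirroring the rules above, not the actual validator:

```python
def check_moe(experts, top_k, z_loss, capacity_factor):
    """True iff the MoE sub-block satisfies the internal rules above."""
    return (experts > 0
            and 0 < top_k <= experts
            and z_loss >= 0
            and capacity_factor >= 1.0)

assert check_moe(experts=8, top_k=2, z_loss=0.001, capacity_factor=1.25)
assert not check_moe(experts=4, top_k=8, z_loss=0.001, capacity_factor=1.25)
```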

10.5 head / embedding consistency

head.tie_weights == embedding.tie_word_embeddings

Mismatched values fail validation because the compiler can't decide whether to share weights.

10.6 Realism warnings (non-blocking)

Validation passes, but a realism warning flags legal-but-unusual configurations, for example:

  - a family_hint that contradicts the mixers actually used (§2.2)
  - an estimated parameter count more than 30% away from target_params (§2.10)
  - a head_dim far outside the usual 64–128 range (§2.6)


11. Reading the validation report

Sample eulerstack validate --preset X.yml --report output:

==================================================
EulerStack Validation Report
==================================================
Schema: OK
Estimated params: 1.14B (1,141,927,936)
Total layers: 24
Target params: 1.00B

WARNINGS (1):
  [realism.target_params] estimate 1.14B is 114% of target 1.00B

Result: WARN (0 errors, 1 warnings)
==================================================

Reading order:

  1. Schema: OK — schema check pass/fail. If fail, nothing else appears.
  2. Estimated parameters — actual size after compile. Compare against target_params.
  3. Total layers — expanded from layer_schedule.
  4. Errors list — any ValidationError or CompatibilityError.
  5. Warnings list — realism checks. Advisory only.
  6. Result — error / warning count summary.

Any error blocks compilation.


12. Design checklist

Before writing a new YAML, answer these 5 questions:

  1. What task does this model solve? (Summarization / QA / code / multilingual…)
  2. Maximum context length needed? → max_seq_len
  3. How large is the training dataset? → informs target_params
  4. What's the inference GPU environment? → decides d_model, n_kv_heads (GQA ratio)
  5. Training budget? → informs layer count and mixer choice

Before committing, check these 3:

  1. eulerstack validate --preset my_model.yml --report passes
  2. Estimated params within 30% of target_params
  3. eulerstack explain output matches intent

13. Common mistakes

Symptom Root cause Fix
ValidationError: d_model must be divisible by n_heads Dimension constraint Set d_model to a multiple of n_heads
CompatibilityError: mamba + kv_cache KV cache on a Mamba layer Use state.ssm_state: true instead
ValidationError: top_k > experts MoE misconfig Ensure top_k <= experts
Estimate 2× target MoE experts counted toward total Adjust target_params to reflect full total
index out of bounds during training vocab_size smaller than tokenizer Wrap with TokenizedDataset(vocab_size=ir.model.vocab_size)
Quality drops at long context rope_theta too small Raise to 500000.0 or 1000000.0

14. Complete reference example

The complete YAML for a basic Llama-style 1B model. Every field is included.

schema_version: 1

model:
  name: "my-llama-1b"
  family_hint: "llama-like"
  d_model: 1536
  vocab_size: 32000
  max_seq_len: 4096
  n_heads: 16
  n_kv_heads: 4
  mlp_ratio: 4
  dtype: bfloat16
  target_params: 1_000_000_000

tokenizer_contract:
  type: hf
  pretrained: gpt2
  add_bos: true
  add_eos: true

embedding:
  type: learned
  positional: rope
  rope_theta: 500000.0
  tie_word_embeddings: true

layer_templates:
  decoder:
    mixer:
      type: attention
      attention:
        qkv_bias: false
        attn_drop: 0.0
    ffn:
      type: gated_mlp
      activation: swiglu
    norm:
      type: rmsnorm
      position: pre
    residual:
      type: sequential
      scaling: 1.0
    state:
      kv_cache: true

layer_schedule:
  - { template: decoder, repeat: 24 }

head:
  type: causal_lm
  tie_weights: true

training_hints:
  init: normal_0.02
  dropout: 0.0
  checkpointing: true
  seed: 1234

compatibility:
  compile_target: huggingface

Copy this file, change three fields, and you have a completely different model. Examples:

  - mixer.type: attention → mamba (and state.kv_cache → state.ssm_state) for a pure SSM stack
  - ffn.type: gated_mlp → moe (plus a moe: sub-block) for a sparse-expert model
  - a larger max_seq_len with a matching rope_theta for a long-context variant

The feeling of YAML as "composable blueprint" comes quickly once you've copied this file ten times and varied a different field each time.


15. v1 Phase-B extension summary (new primitives)

v1 added 14 primitives in Phase B. The YAML syntax and usage of each are covered step-by-step in 09_new_primitives_walkthrough.md. This §15 is only an index: where each field lives and a one-line summary of what it does.

15.1 At layer_templates[*] scope

Field Use Related §
mixer.attention.latent_dim MLA — compress KV via latent (DeepSeek-V3) §5.1 attention extension
mixer.type: branched + branched: {...} Input-dependent sub-mixer selection (Jamba × per-token) §5.1 mixer.type extension
mixer.type: ttt_layer + ttt: {...} Test-Time Training (Sun 2024) §5.1 mixer.type extension
memory: {type, update_at_inference, ...} Titans parametric memory new sub-block
shape_change: {d_out, projection} Hourglass-style shape change new sub-block

15.2 At layer_schedule[*] scope

Field Use
override: {...} Per-entry partial override (residual.scaling, attention.window, …)
depth_gating: {enabled, capacity, router} Mixture-of-Depths — per-token layer skip
parallel: [...] + merge: {...} Monoidal parallel streams (Mamba ∥ Attention)
integrator: {type, steps, body, output} Universal Transformer / Diffusion-LM / Coconut unifier

15.3 At top-level scope

Field Use
let: {...} Numeric variables — reference as ${let.x * let.y} (Level-1 arithmetic)
execution_modes: [...] R1-style 2-phase (think/answer) declaration
transition: {type, token} Phase transition rule (special_token / budget_exhausted)

15.4 Reserved namespaces

Keys under experimental.* / future.* / vendor.<name>.* produce a WARNING, not an error. Reserved space for plugins and in-progress research.


Next steps