3. Spec Reference
This chapter is the reference document that explains every field in an EulerStack YAML file. The previous chapter gave a tour of presets; this one pins down what each field means, what its constraints are, and how it interacts with other fields.
EulerStack's core value proposition is letting you assemble a model architecture from composable pieces, and that entire assembly happens inside this single YAML file. This chapter is therefore the foundation for everything else you do with EulerStack.
The shape of a spec
Every YAML spec is built from eight core blocks plus three optional v1 Phase-B blocks. Order does not matter to the validator, but the conventional layout is shown below.
schema_version: 1
# ── Eight core blocks (every YAML's backbone) ──
let: # (optional, Phase B1.2) numeric variables — ${let.x}
model: # 1. Model dimensions and overall size
tokenizer_contract: # 2. Which tokenizer this model expects
embedding: # 3. Token embeddings + positional encoding
layer_templates: # 4. Reusable layer blueprints
layer_schedule: # 5. How layers are stacked (v1 also supports
# parallel / integrator — see §15)
head: # 6. Output head (causal LM)
training_hints: # 7. Training-time suggestions (advisory)
compatibility: # 8. Compile target
# ── v1 Phase-B extension blocks (optional) ──
execution_modes: # (optional, Phase B5) 2-phase think/answer declaration
transition: # (optional, Phase B5) phase transition rule
Validation (eulerstack validate) confirms every block is present and free
of contradictions. Missing keys, typos, or type mismatches are caught
before any training starts.
v1 Phase-B new primitives (per-layer override, let expressions, MLA
latent_dim, branched mixer, ttt_layer, depth_gating / MoD, parallel / integrator schedule, template.memory, template.shape_change, execution_modes) are covered step-by-step in 09_new_primitives_walkthrough.md. Sections §1–§14 here focus on the core structure, and Phase-B extensions are indexed in §15.
1. schema_version
schema_version: 1
Schema version number. The current value is 1. It is mandatory — the
validator fails immediately if it is absent or not 1. The field exists so
that future schema versions can coexist with existing YAMLs without
breaking them. A version 2, if introduced, would be gated behind
schema_version: 2.
2. model block
Defines the basic dimensions and overall size of the model. These few numbers alone determine most of the model's "shape".
model:
name: "arch-beginner-llama"
family_hint: "llama-like"
d_model: 1536
vocab_size: 32000
max_seq_len: 4096
n_heads: 16
n_kv_heads: 4
mlp_ratio: 4
dtype: bfloat16
target_params: 1_000_000_000
2.1 name
A human-readable label. Not a Python identifier, so hyphens and spaces are
fine. The real identifier is the file name (arch_beginner_llama.yml).
This field shows up in presets show output and logs.
2.2 family_hint
Metadata indicating which architecture family the model belongs to.
Examples: llama-like, jamba-like, blackmamba-moe. Realism checks use
this to flag inconsistencies ("family_hint says llama, but the mixer is
mamba"). Warnings only — never blocks validation.
2.3 d_model (critical)
The hidden dimension of the model. Each token is represented
internally as a vector of this size. Every linear projection, attention
path, and FFN middle layer derives from d_model.
- Parameter count scales roughly with `d_model²`.
- Attention compute scales with `d_model²`.
- GPU memory is linear in `d_model`.
Typical values:
| Model scale | Typical d_model |
|---|---|
| Small (~100M) | 512, 768 |
| Sub-GB (~1B) | 1024, 1280, 1536 |
| Mid (~2B) | 1536, 2048 |
| Large (~7B) | 4096 |
| Very large (~70B) | 8192 |
Critical constraint: d_model must be divisible by n_heads
(d_model % n_heads == 0). Violations fail validation immediately.
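Both divisibility rules (this one and the GQA rule from §2.7) can be sketched in a few lines. This is an illustration of the logic only, not EulerStack's actual validator code:

```python
def check_dims(d_model: int, n_heads: int, n_kv_heads: int) -> list[str]:
    """Return error strings; an empty list means the dimensions pass."""
    errors = []
    if d_model % n_heads != 0:
        errors.append(f"d_model {d_model} not divisible by n_heads {n_heads}")
    if n_heads % n_kv_heads != 0:
        errors.append(f"n_heads {n_heads} not divisible by n_kv_heads {n_kv_heads}")
    return errors

print(check_dims(1536, 16, 4))  # -> []  (the reference example passes)
print(check_dims(1536, 8, 3))   # one error: 8 % 3 != 0
```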
2.4 vocab_size
Tokenizer vocabulary size. Determines the embedding table and output head dimensions.
| Use case | Typical value |
|---|---|
| English-only (Llama-style) | 32000 |
| GPT-2 tokenizer | 50257 |
| Multilingual (Qwen) | 152064 |
| Very multilingual | ~256K |
If vocab_size is smaller than the actual tokenizer's vocabulary, training raises index out of bounds. If larger, some embedding slots simply stay unused (not fatal). Best practice: match the tokenizer's actual vocabulary size (e.g. `len(tokenizer)` for a HuggingFace tokenizer) exactly.
2.5 max_seq_len
The maximum number of tokens the model can process in one pass — also called context length.
- RoPE buffers are precomputed in proportion to this length.
- KV cache memory scales linearly with `max_seq_len`.
Typical values:
| Use case | max_seq_len |
|---|---|
| Short chat | 2048 |
| LLM default | 4096 |
| Medium docs | 8192, 16384 |
| Long-context | 32768 |
| Very long (still plausible) | 65536, 131072 |
Bigger max_seq_len inflates memory consumption fast. On a 24 GB GPU, a
1B model at 32K context with batch_size=1 is already tight.
2.6 n_heads
Number of attention heads. More heads let the model split relationships
between tokens into finer subspaces. `head_dim = d_model / n_heads` is the
per-head dimension and is usually 64–128.
- Too small (`head_dim < 32`) — representation too narrow → realism warning
- Too large (`head_dim > 256`) — head benefits diluted → realism warning
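As a quick sanity check, `head_dim` and the two warning thresholds look like this in Python (the function name is hypothetical, not part of EulerStack's API):

```python
def head_dim_check(d_model: int, n_heads: int):
    """Return (head_dim, status) using the realism thresholds above."""
    hd = d_model // n_heads
    if hd < 32:
        return hd, "warn: head_dim too small"
    if hd > 256:
        return hd, "warn: head_dim too large"
    return hd, "ok"

print(head_dim_check(1536, 16))   # -> (96, 'ok')
print(head_dim_check(512, 32))    # -> (16, 'warn: head_dim too small')
```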
2.7 n_kv_heads (the GQA hinge)
The core of Grouped-Query Attention. Keep n_heads query heads but share
key/value across only n_kv_heads groups.
| Setting | Meaning |
|---|---|
| `n_kv_heads == n_heads` | MHA (Multi-Head Attention, classic Transformer / GPT-2) |
| `n_kv_heads < n_heads` | GQA (Llama 2/3 standard) |
| `n_kv_heads == 1` | MQA (Multi-Query Attention, extreme sharing) |
Constraint: n_heads % n_kv_heads == 0. Example: n_heads=16, n_kv_heads=4
is a 4:1 GQA. n_heads=8, n_kv_heads=3 is not an integer division → fail.
Practical value: KV cache shrinks by n_heads / n_kv_heads. A 4:1 GQA
cuts inference KV memory to 1/4, dramatically improving concurrent
throughput. GQA is the default from Llama 2 70B onwards.
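The savings are easy to verify with back-of-the-envelope arithmetic. The sketch below assumes bfloat16 K/V tensors of shape `[seq_len, n_kv_heads, head_dim]` per layer; EulerStack's exact memory layout may differ:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # two tensors (K and V) per layer, each [seq_len, n_kv_heads, head_dim]
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

mha = kv_cache_bytes(n_layers=24, n_kv_heads=16, head_dim=96, seq_len=4096)  # MHA
gqa = kv_cache_bytes(n_layers=24, n_kv_heads=4,  head_dim=96, seq_len=4096)  # 4:1 GQA
print(mha / gqa)                 # -> 4.0
print(f"{gqa / 2**20:.0f} MiB")  # -> 144 MiB for the 4:1 GQA cache at 4K context
```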
2.8 mlp_ratio
FFN middle-dimension multiplier. d_model * mlp_ratio is the FFN inner
dimension. 4 is the overwhelming default.
Gated MLPs like SwiGLU conventionally use an inner dimension of 8/3 × d_model to match non-gated parameter counts, but EulerStack's implementation simply uses `d_model * mlp_ratio` (so 4× by default).
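The arithmetic for the reference model:

```python
d_model, mlp_ratio = 1536, 4
ffn_dim = d_model * mlp_ratio   # EulerStack's FFN inner dimension
print(ffn_dim)                  # -> 6144

# the 8/3 parameter-matching convention would instead give:
swiglu_dim = 8 * d_model // 3
print(swiglu_dim)               # -> 4096
```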
2.9 dtype
Storage format for model weights.
| Value | Notes |
|---|---|
| `bfloat16` | Current standard. Same range as float32, lower precision. Stable for GPU training |
| `float16` | Older. Narrow exponent range, underflow-prone |
| `float32` | High precision. 2× memory. Special cases only |
In practice, almost always bfloat16. All modern GPUs from A100
onward support it natively.
2.10 target_params
Declares the target parameter count. If the actual estimate differs by
more than 30%, a realism warning fires. Example: target_params: 1_000_000_000
is a "1B-scale" label.
This only affects validation. Real model size is determined by d_model,
number of layers, etc. Think of it as an "intent vs. reality" smoke
detector.
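The 30% rule itself is simple. A hedged sketch of the band check (not the validator's actual code):

```python
def within_target(target: int, estimate: int, tol: float = 0.30) -> bool:
    """True when the estimate sits inside the +/-30% realism band."""
    return abs(estimate - target) / target <= tol

print(within_target(1_000_000_000, 1_200_000_000))  # -> True  (+20%, no warning)
print(within_target(1_000_000_000, 2_000_000_000))  # -> False (2x target -> warning)
```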
3. tokenizer_contract block
Specifies the "contract" between model and tokenizer: which tokenizer to use and how to treat special tokens.
tokenizer_contract:
type: hf
pretrained: gpt2
add_bos: true
add_eos: true
3.1 type
The only current value is hf (HuggingFace). Other systems may be added
later.
3.2 pretrained
A HuggingFace model/tokenizer ID. At tokenizer-load time,
AutoTokenizer.from_pretrained(pretrained) is called.
Typical values:
| Value | Use case |
|---|---|
| `gpt2` | English-focused, 50257 vocab |
| `meta-llama/Llama-2-7b-hf` | Llama family, 32000 vocab |
| `Qwen/Qwen2-7B` | Multilingual, 152064 vocab |
3.3 add_bos / add_eos
Whether to auto-insert Begin-Of-Sequence / End-Of-Sequence tokens at
document boundaries. These tell the model where one document ends and the
next begins. Almost always true.
4. embedding block
Token embedding table and positional encoding setup.
embedding:
type: learned
positional: rope
rope_theta: 500000.0
rope_scaling:
type: linear
factor: 2.0
tie_word_embeddings: true
4.1 type
learned is the only current option. Each token ID gets a learned
embedding vector.
4.2 positional
Position encoding type.
| Value | Description |
|---|---|
| `rope` | Rotary Position Embedding. Current standard |
| `none` | No positional encoding (for mixers that don't use positions explicitly, like Mamba) |
RoPE comes from the RoFormer paper (2021) and is used by basically every modern LLM since Llama. It expresses relative positions naturally and extrapolates (within limits) beyond trained sequence lengths.
4.3 rope_theta
The RoPE period parameter. Larger values give better positional resolution at long context.
| Value | Recommended max_seq_len |
|---|---|
| `10000.0` | ~2K (GPT-2, early Llama) |
| `500000.0` | ~32K (Llama 3) |
| `1000000.0` | 32K+ (Qwen long-context) |
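Why a larger theta helps: each RoPE rotation pair has wavelength 2π·θ^(2i/d), so raising θ stretches the slowest-rotating pairs, which are the ones that must distinguish far-apart positions. A small illustration using the standard RoPE formula (generic math, not tied to EulerStack internals):

```python
import math

def rope_wavelengths(theta: float, head_dim: int):
    """Wavelength (in tokens) of each RoPE rotation pair."""
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]

slow_10k = max(rope_wavelengths(10_000.0, 96))
slow_500k = max(rope_wavelengths(500_000.0, 96))
print(f"slowest pair: {slow_10k:,.0f} vs {slow_500k:,.0f} tokens")
```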
4.4 rope_scaling
Extends RoPE's position range so the model stays stable on sequences longer than it was trained on.
rope_scaling:
type: linear # or dynamic, ntk
factor: 4.0 # 4× extension
`linear` with `factor: 4` means "extend an 8K-trained model to 32K". Linear scaling compresses positions, so it may slightly hurt long-range accuracy; the same approach is used in the Qwen 2 long-context setup.
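Mechanically, linear scaling just divides position ids by the factor before the RoPE rotation, so a 32K position lands inside the trained 8K range. A one-line sketch of that idea (illustrative, not EulerStack's implementation):

```python
def scale_positions(positions, factor):
    """rope_scaling type: linear -- divide position ids by the factor."""
    return [p / factor for p in positions]

# An 8K-trained model reading a 32K sequence with factor 4:
print(scale_positions([0, 8_192, 32_767], 4.0))  # -> [0.0, 2048.0, 8191.75]
```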
4.5 tie_word_embeddings
Whether to share weights between input embedding table and output LM head.
true saves vocab_size × d_model parameters. Standard in small models,
sometimes split in very large ones.
Must match head.tie_weights.
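The savings from tying, at the reference model's dimensions:

```python
vocab_size, d_model = 32_000, 1536
saved = vocab_size * d_model   # one full vocab-by-hidden matrix
print(f"{saved:,}")            # -> 49,152,000
```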
5. layer_templates block
EulerStack's core assembly unit. Each template is a blueprint for
what one layer looks like. Templates are named and referenced from
layer_schedule.
layer_templates:
decoder:
mixer:
type: attention
attention:
qkv_bias: false
attn_drop: 0.0
ffn:
type: gated_mlp
activation: swiglu
norm:
type: rmsnorm
position: pre
residual:
type: sequential
scaling: 1.0
state:
kv_cache: true
Each template has five sub-blocks:

- `mixer` — token-to-token information mixing (attention or an alternative)
- `ffn` — per-token processing (feedforward)
- `norm` — normalization
- `residual` — residual path structure
- `state` — runtime state (inference cache)
5.1 mixer
Token mixer. Decides how tokens in the sequence attend to each other. EulerStack supports four mixer families.
attention
Standard multi-head self-attention.
mixer:
type: attention
attention:
qkv_bias: false # Use bias in QKV projections? (Llama: false, GPT-2: true)
attn_drop: 0.0 # Attention dropout (0 for large LLMs)
window: null # null = global; number = sliding window (e.g. 4096)
- `window: null` (default) — global, O(N²) cost
- `window: 4096` — sliding window; each token sees only the last 4096 tokens
mamba
Mamba2 SSM (State Space Model).
mixer:
type: mamba
mamba:
variant: mamba2 # mamba or mamba2
d_state: 128 # state dimension
d_conv: 4 # internal conv kernel size
expand: 2 # internal expansion ratio
Mamba processes the sequence in O(N) time. Much faster than attention at
long sequence lengths. Tradeoff: weaker at exact token-level recall.
retnet
Retention (RetNet).
mixer:
type: retnet
retnet:
chunkwise: true # enable chunked parallel computation
chunk_size: 128 # chunk size
rope: true # use internal RoPE
RetNet's distinctive property: train and inference modes give
mathematically equivalent results. Train in parallel mode, infer in
recurrent O(1) mode, same weights.
hyena
Hyena long-convolution.
mixer:
type: hyena
hyena:
depth: 2 # Hyena operator depth
filter_hidden: 64 # filter-generator MLP width
filter_decay: 0.0 # filter decay rate
Hyena does global convolution at O(N log N). Competitive at 128K+
contexts.
5.2 ffn
Feed-forward network, computed independently per token. Three options.
mlp
Standard 2-layer MLP.
ffn:
type: mlp
activation: gelu # relu, gelu, silu, etc.
Classic Transformer / GPT-2 style.
gated_mlp
Gated MLP (SwiGLU, GeGLU, etc.).
ffn:
type: gated_mlp
activation: swiglu # or geglu
Standard in modern LLMs since Llama. 1.5× more parameters than plain MLP but meaningfully better quality.
moe
Mixture-of-Experts. Multiple FFNs, only a subset active per token — "keep compute fixed, grow capacity".
ffn:
type: moe
moe:
experts: 8 # number of experts
top_k: 2 # experts activated per token
router: softmax # or sigmoid
z_loss: 0.001 # router regularization auxiliary loss
capacity_factor: 1.25 # over-capacity ratio per expert
See later tutorials for details: 07_arch_walkthrough.md expert section and 08_expert_mini_walkthrough.md.
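The routing step at the heart of MoE is small. Below is a pure-Python sketch of softmax top-k routing over one token's router logits (illustrative only — real routers operate on batched tensors, and this is not EulerStack's implementation):

```python
import math

def route(logits, top_k):
    """Pick the top_k experts and softmax-normalize their weights."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    m = max(logits[i] for i in idx)                      # for numerical stability
    exps = [math.exp(logits[i] - m) for i in idx]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(idx, exps)]   # (expert_id, weight)

# experts=8, top_k=2: only two experts run for this token
print(route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], top_k=2))
```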
5.3 norm
Normalization type and placement.
norm:
type: rmsnorm # or layernorm
position: pre # or post
| Combination | Use case |
|---|---|
| `layernorm` + `post` | Classic Transformer (GPT-2) |
| `rmsnorm` + `pre` | Modern LLM standard (Llama onwards) |
Pre-norm is markedly more stable than post-norm for deep models; since around 2019, virtually all large models have used pre-norm.
RMSNorm is a simplified LayerNorm — skip the mean-subtraction step, only compute RMS. Half the parameters and faster.
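The simplification is visible in a few lines — a minimal reference implementation of the RMSNorm formula (not EulerStack's kernel):

```python
import math

def rmsnorm(x, g, eps=1e-6):
    """RMSNorm: no mean subtraction, no bias -- scale by the root-mean-square only."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * gi for v, gi in zip(x, g)]

out = rmsnorm([3.0, -4.0], [1.0, 1.0])   # RMS = sqrt(12.5) ~ 3.536
print(out)
```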
5.4 residual
Residual connection structure.
residual:
type: sequential # or parallel
scaling: 1.0 # multiplier applied to F(x) before residual add
| type | Formula | Use case |
|---|---|---|
| `sequential` | `x = x + F(norm(x))` | Standard, most models |
| `parallel` | `x = x + Attn(norm(x)) + FFN(norm(x))`, computed in parallel | PaLM-style |
scaling: 1.0 is default. Special uses (e.g., layer scale) may want a
smaller value.
5.5 state
What runtime state this layer needs.
state:
kv_cache: true # attention key/value cache
ssm_state: true # mamba recurrent state
Constraints (critical):
- `mixer.type: attention` layers may set `kv_cache: true`
- `mixer.type: mamba` layers use `ssm_state: true`
- Not both: `mamba` + `kv_cache: true` fails validation
- `retnet` manages its own state (the `state` block can be omitted)
- `hyena` has no state
The validator enforces these automatically.
6. layer_schedule block
Specifies how templates stack — order and repetition. This is the most expressive part of EulerStack.
6.1 Basic syntax
layer_schedule:
- template: decoder
repeat: 24
Repeat decoder 24 times. Total layers = 24. Simple Llama-style.
6.2 Multiple templates (Jamba-style)
layer_templates:
mamba: { ... }
attn: { ... }
layer_schedule:
- { template: mamba, repeat: 3 }
- { template: attn, repeat: 1 }
- { template: mamba, repeat: 3 }
- { template: attn, repeat: 1 }
# the 3:1 pair repeated 8 times in the full schedule = 32 layers (24 mamba + 8 attn)
Different mixers at different depths. Jamba's 3:1 hybrid is expressed this way.
6.3 Progressive stack style
layer_schedule:
- { template: hyena_dense, repeat: 8 } # early: cheap & global
- { template: mamba_dense, repeat: 12 } # middle: linear sequential
- { template: retnet_dense, repeat: 8 } # late: chunked refinement
- { template: attn_moe, repeat: 4 } # tail: exact recall + MoE
Progressively more expensive mixers by depth. Vision's CNN → attention hierarchy applied to sequence modeling.
6.4 Pyramid style (Dilated LongNet)
layer_schedule:
- { template: mamba_prefix, repeat: 4 }
- { template: sw_1024, repeat: 4 }
- { template: sw_4096, repeat: 8 }
- { template: sw_16384, repeat: 8 }
- { template: global_moe, repeat: 8 }
Receptive field grows geometrically with depth. A 32-layer temporal pyramid from 1K to 16K to global.
6.5 Layer count
Total layer count is computed automatically:
n_layers = sum(entry.repeat for entry in layer_schedule)
This value appears in validate --report output and becomes the model's
real n_layers.
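For example, the Jamba-style schedule from §6.2, with the 3:1 pair written out eight times:

```python
pattern = [{"template": "mamba", "repeat": 3},
           {"template": "attn",  "repeat": 1}]
layer_schedule = pattern * 8   # the pair repeated 8 times

n_layers = sum(entry["repeat"] for entry in layer_schedule)
print(n_layers)  # -> 32
```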
7. head block
Output head specification.
head:
type: causal_lm
tie_weights: true
7.1 type
Currently only causal_lm (autoregressive next-token prediction).
Classifier / encoder-decoder heads aren't in the schema yet.
7.2 tie_weights
Whether the LM head shares weights with the input embedding. Must match
embedding.tie_word_embeddings.
8. training_hints block
Training-time advisory values. Not part of the architecture itself — they tell downstream trainers "here's how to start training this model".
training_hints:
init: normal_0.02
dropout: 0.0
checkpointing: true
seed: 1234
8.1 init
Weight initialization. normal_0.02 means normal distribution with
standard deviation 0.02 (the GPT-2 standard, still in use).
8.2 dropout
Default dropout applied across the model. Large LLMs (1B+) use 0.0 —
with enough data, overfitting is rare enough that dropout just slows
training.
8.3 checkpointing
Whether gradient checkpointing is enabled. true saves memory at ~30%
speed cost. Useful on tight VRAM.
8.4 seed
Random seed for reproducibility.
9. compatibility block
compatibility:
compile_target: huggingface
Currently huggingface is the only compile target. vllm,
tensorrt-llm, and others may be added later.
10. Cross-field rules (very important)
Individual fields may each be valid while their combination is illegal. The validator enforces the following rules automatically.
10.1 Dimension divisibility
d_model % n_heads == 0 # heads divide d_model evenly
n_heads % n_kv_heads == 0 # GQA groups are integer-sized
Violation: ValidationError.
10.2 Mixer / state compatibility
| Mixer | Allowed state |
|---|---|
| `attention` | `kv_cache: true` allowed |
| `mamba` | `ssm_state: true` only (no kv_cache) |
| `retnet` | managed internally (omit `state` block) |
| `hyena` | stateless (omit `state` block) |
Violation: CompatibilityError.
10.3 FFN / MoE compatibility
Setting a moe: sub-block when ffn.type != moe is a mismatch. The
reverse (ffn.type: moe without moe: block) is also a mismatch.
10.4 MoE internal rules
top_k <= experts
experts > 0
top_k > 0
z_loss >= 0
capacity_factor >= 1.0
10.5 head / embedding consistency
head.tie_weights == embedding.tie_word_embeddings
Mismatched values fail validation because the compiler can't decide whether to share weights.
10.6 Realism warnings (non-blocking)
Validation passes but not recommended:
- `head_dim < 32` or `head_dim > 256`
- `target_params` diverges from estimate by more than 30%
- `experts > 64` with small `top_k` (likely router collapse)
- `max_seq_len / d_model > 64` (unusual ratio)
- `family_hint` inconsistent with actual mixer composition
- RoPE scaling factor > 8
11. Reading the validation report
Sample eulerstack validate --preset X.yml --report output:
==================================================
EulerStack Validation Report
==================================================
Schema: OK
Estimated params: 1.14B (1,141,927,936)
Total layers: 24
Target params: 1.00B
WARNINGS (1):
[realism.target_params] estimate 1.14B is 114% of target 1.00B
Result: WARN (0 errors, 1 warnings)
==================================================
Reading order:

1. `Schema: OK` — schema check pass/fail. If this fails, nothing else appears.
2. Estimated parameters — actual size after compile. Compare against `target_params`.
3. Total layers — expanded from `layer_schedule`.
4. Errors list — any `ValidationError` or `CompatibilityError`.
5. Warnings list — realism checks. Advisory only.
6. Result — error / warning count summary.
Any error blocks compilation.
12. Design checklist
Before writing a new YAML, answer these 5 questions:

1. What task does this model solve? (Summarization / QA / code / multilingual…)
2. Maximum context length needed? → `max_seq_len`
3. How large is the training dataset? → informs `target_params`
4. What's the inference GPU environment? → decides `d_model` and `n_kv_heads` (GQA ratio)
5. Training budget? → informs layer count and mixer choice

Before committing, check these 3:

1. `eulerstack validate --preset my_model.yml --report` passes
2. Estimated params within 30% of `target_params`
3. `eulerstack explain` output matches intent
13. Common mistakes
| Symptom | Root cause | Fix |
|---|---|---|
| `ValidationError: d_model must be divisible by n_heads` | Dimension constraint | Set `d_model` to a multiple of `n_heads` |
| `CompatibilityError: mamba + kv_cache` | KV cache on a Mamba layer | Use `state.ssm_state: true` instead |
| `ValidationError: top_k > experts` | MoE misconfig | Ensure `top_k <= experts` |
| Estimate 2× target | MoE experts counted toward total | Adjust `target_params` to reflect the full total |
| `index out of bounds` during training | `vocab_size` smaller than tokenizer | Wrap with `TokenizedDataset(vocab_size=ir.model.vocab_size)` |
| Quality drops at long context | `rope_theta` too small | Raise to `500000.0` or `1000000.0` |
14. Complete reference example
The complete YAML for a basic Llama-style 1B model. Every field is included.
schema_version: 1
model:
name: "my-llama-1b"
family_hint: "llama-like"
d_model: 1536
vocab_size: 32000
max_seq_len: 4096
n_heads: 16
n_kv_heads: 4
mlp_ratio: 4
dtype: bfloat16
target_params: 1_000_000_000
tokenizer_contract:
type: hf
pretrained: gpt2
add_bos: true
add_eos: true
embedding:
type: learned
positional: rope
rope_theta: 500000.0
tie_word_embeddings: true
layer_templates:
decoder:
mixer:
type: attention
attention:
qkv_bias: false
attn_drop: 0.0
ffn:
type: gated_mlp
activation: swiglu
norm:
type: rmsnorm
position: pre
residual:
type: sequential
scaling: 1.0
state:
kv_cache: true
layer_schedule:
- { template: decoder, repeat: 24 }
head:
type: causal_lm
tie_weights: true
training_hints:
init: normal_0.02
dropout: 0.0
checkpointing: true
seed: 1234
compatibility:
compile_target: huggingface
Copy this file, change a field or two, and you have a completely different model. Examples:

- `d_model: 1536` → `d_model: 2048` → 2B scale
- `mixer.type: attention` → `mixer.type: mamba` → pure Mamba
- `ffn.type: gated_mlp` → `ffn.type: moe` + `moe: { experts: 8, top_k: 2 }` → MoE
The feeling of YAML as "composable blueprint" comes quickly once you've copied this file ten times and varied a different field each time.
15. v1 Phase-B extension summary (new primitives)
v1 added 14 primitives in Phase B. The YAML syntax and usage of each is covered step-by-step in 09_new_primitives_walkthrough.md. This §15 is just an index: where each primitive lives and a one-line summary of what it does.
15.1 At layer_templates[*] scope
| Field | Use | Related § |
|---|---|---|
| `mixer.attention.latent_dim` | MLA — compress KV via latent (DeepSeek-V3) | §5.1 attention extension |
| `mixer.type: branched` + `branched: {...}` | Input-dependent sub-mixer selection (Jamba × per-token) | §5.1 mixer.type extension |
| `mixer.type: ttt_layer` + `ttt: {...}` | Test-Time Training (Sun 2024) | §5.1 mixer.type extension |
| `memory: {type, update_at_inference, ...}` | Titans parametric memory | new sub-block |
| `shape_change: {d_out, projection}` | Hourglass-style shape change | new sub-block |
15.2 At layer_schedule[*] scope
| Field | Use |
|---|---|
| `override: {...}` | Per-entry partial override (residual.scaling, attention.window, …) |
| `depth_gating: {enabled, capacity, router}` | Mixture-of-Depths — per-token layer skip |
| `parallel: [...]` + `merge: {...}` | Monoidal parallel streams (Mamba ∥ Attention) |
| `integrator: {type, steps, body, output}` | Universal Transformer / Diffusion-LM / Coconut unifier |
15.3 At top-level scope
| Field | Use |
|---|---|
| `let: {...}` | Numeric variables — reference as `${let.x * let.y}` (Level-1 arithmetic) |
| `execution_modes: [...]` | R1-style 2-phase (think/answer) declaration |
| `transition: {type, token}` | Phase transition rule (special_token / budget_exhausted) |
15.4 Reserved namespaces
Keys under experimental.* / future.* / vendor.<name>.* produce a
WARNING, not an error. Reserved space for plugins and in-progress
research.
Next steps
- Tutorial 4: Compile and Explain — turn this YAML into a real HF model
- Tutorial 7: arch walkthrough — case studies of how arch_presets compose these fields (beginner 2 · intermediate 3 · advanced 5 · expert 12)
- Tutorial 9: v1 new primitives walkthrough — step-by-step for each primitive indexed in §15
- Mixer overview — how attention / mamba / retnet / hyena / branched / ttt_layer work internally
- Tutorial 1: Validate a Spec — review the 3-layer validation structure