3. Spec Reference
This chapter is the reference document that explains every field in an EulerStack YAML file. The previous chapter gave a tour of presets; this one pins down what each field means, what its constraints are, and how it interacts with other fields.
EulerStack's core value proposition is letting you assemble a model architecture from composable pieces, and that entire assembly happens inside this single YAML file. This chapter is therefore the foundation for everything else you do with EulerStack.
The shape of a spec
Every YAML spec is built from eight core blocks plus three optional v1 Phase-B blocks. Order does not matter to the validator, but the conventional layout is shown below.
schema_version: 1
# ── Eight core blocks (every YAML's backbone) ──
let: # (optional, Phase B1.2) numeric variables — ${let.x}
model: # 1. Model dimensions and overall size
tokenizer_contract: # 2. Which tokenizer this model expects
embedding: # 3. Token embeddings + positional encoding
layer_templates: # 4. Reusable layer blueprints
layer_schedule: # 5. How layers are stacked (v1 also supports
# parallel / integrator — see §15)
head: # 6. Output head (causal LM)
training_hints: # 7. Training-time suggestions (advisory)
compatibility: # 8. Compile target
# ── v1 Phase-B extension blocks (optional) ──
execution_modes: # (optional, Phase B5) 2-phase think/answer declaration
transition: # (optional, Phase B5) phase transition rule
Validation (eulerstack validate) confirms every block is present and free
of contradictions. Missing keys, typos, or type mismatches are caught
before any training starts.
v1 Phase-B new primitives (per-layer override, let expressions, MLA
latent_dim, branched mixer, ttt_layer, depth_gating / MoD, parallel / integrator schedule, template.memory, template.shape_change, execution_modes) are covered step-by-step in 09_new_primitives_walkthrough.md. Sections §1–§14 here focus on the core structure, and Phase-B extensions are indexed in §15.
1. schema_version
schema_version: 1
Schema version number. The current value is 1. It is mandatory — the
validator fails immediately if it is absent or not 1. The field exists so
that future schema versions can coexist with existing YAMLs without
breaking them. A version 2, if introduced, would be gated behind
schema_version: 2.
2. model block
Defines the basic dimensions and overall size of the model. These few numbers alone determine most of the model's "shape".
model:
name: "arch-beginner-llama"
family_hint: "llama-like"
d_model: 1536
vocab_size: 32000
max_seq_len: 4096
n_heads: 16
n_kv_heads: 4
mlp_ratio: 4
dtype: bfloat16
target_params: 1_000_000_000
2.1 name
A human-readable label. Not a Python identifier, so hyphens and spaces are
fine. The real identifier is the file name (arch_beginner_llama.yml).
This field shows up in presets show output and logs.
2.2 family_hint
Metadata indicating which architecture family the model belongs to.
Examples: llama-like, jamba-like, blackmamba-moe. Realism checks use
this to flag inconsistencies ("family_hint says llama, but the mixer is
mamba"). Warnings only — never blocks validation.
2.3 d_model (critical)
The hidden dimension of the model. Each token is represented
internally as a vector of this size. Every linear projection, attention
path, and FFN middle layer derives from d_model.
- Parameter count scales roughly with `d_model²`.
- Attention compute scales with `d_model²`.
- GPU memory is linear in `d_model`.
Typical values:
| Model scale | Typical d_model |
|---|---|
| Small (~100M) | 512, 768 |
| Sub-GB (~1B) | 1024, 1280, 1536 |
| Mid (~2B) | 1536, 2048 |
| Large (~7B) | 4096 |
| Very large (~70B) | 8192 |
Critical constraint: d_model must be divisible by n_heads
(d_model % n_heads == 0). Violations fail validation immediately.
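Both divisibility rules (this one and the GQA rule from §2.7) can be sketched in a few lines. This is an illustration of the logic only, not EulerStack's actual validator code:

```python
def check_dims(d_model: int, n_heads: int, n_kv_heads: int) -> list[str]:
    """Return error strings; an empty list means the dimensions pass."""
    errors = []
    if d_model % n_heads != 0:
        errors.append(f"d_model {d_model} not divisible by n_heads {n_heads}")
    if n_heads % n_kv_heads != 0:
        errors.append(f"n_heads {n_heads} not divisible by n_kv_heads {n_kv_heads}")
    return errors

print(check_dims(1536, 16, 4))  # -> []  (the reference example passes)
print(check_dims(1536, 8, 3))   # one error: 8 % 3 != 0
```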
2.4 vocab_size
Tokenizer vocabulary size. Determines the embedding table and output head dimensions.
| Use case | Typical value |
|---|---|
| English-only (Llama-style) | 32000 |
| GPT-2 tokenizer | 50257 |
| Multilingual (Qwen) | 152064 |
| Very multilingual | ~256K |
If vocab_size is smaller than the actual tokenizer's vocabulary, training raises index out of bounds. If larger, some embedding slots simply stay unused (not fatal). Best practice: match the tokenizer's actual vocabulary size (e.g. `len(tokenizer)` for a HuggingFace tokenizer) exactly.
2.5 max_seq_len
The maximum number of tokens the model can process in one pass — also called context length.
- RoPE buffers are precomputed in proportion to this length.
- KV cache memory scales linearly with `max_seq_len`.
Typical values:
| Use case | max_seq_len |
|---|---|
| Short chat | 2048 |
| LLM default | 4096 |
| Medium docs | 8192, 16384 |
| Long-context | 32768 |
| Very long (still plausible) | 65536, 131072 |
Bigger max_seq_len inflates memory consumption fast. On a 24 GB GPU, a
1B model at 32K context with batch_size=1 is already tight.
2.6 n_heads
Number of attention heads. More heads let the model split relationships
between tokens into finer subspaces. `head_dim = d_model / n_heads` is the
per-head dimension and is usually 64–128.
- Too small (`head_dim < 32`) — representation too narrow → realism warning
- Too large (`head_dim > 256`) — head benefits diluted → realism warning
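As a quick sanity check, `head_dim` and the two warning thresholds look like this in Python (the function name is hypothetical, not part of EulerStack's API):

```python
def head_dim_check(d_model: int, n_heads: int):
    """Return (head_dim, status) using the realism thresholds above."""
    hd = d_model // n_heads
    if hd < 32:
        return hd, "warn: head_dim too small"
    if hd > 256:
        return hd, "warn: head_dim too large"
    return hd, "ok"

print(head_dim_check(1536, 16))   # -> (96, 'ok')
print(head_dim_check(512, 32))    # -> (16, 'warn: head_dim too small')
```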
2.7 n_kv_heads (the GQA hinge)
The core of Grouped-Query Attention. Keep n_heads query heads but share
key/value across only n_kv_heads groups.
| Setting | Meaning |
|---|---|
| `n_kv_heads == n_heads` | MHA (Multi-Head Attention, classic Transformer / GPT-2) |
| `n_kv_heads < n_heads` | GQA (Llama 2/3 standard) |
| `n_kv_heads == 1` | MQA (Multi-Query Attention, extreme sharing) |
Constraint: n_heads % n_kv_heads == 0. Example: n_heads=16, n_kv_heads=4
is a 4:1 GQA. n_heads=8, n_kv_heads=3 is not an integer division → fail.
Practical value: KV cache shrinks by n_heads / n_kv_heads. A 4:1 GQA
cuts inference KV memory to 1/4, dramatically improving concurrent
throughput. GQA is the default from Llama 2 70B onwards.
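The savings are easy to verify with back-of-the-envelope arithmetic. The sketch below assumes bfloat16 K/V tensors of shape `[seq_len, n_kv_heads, head_dim]` per layer; EulerStack's exact memory layout may differ:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # two tensors (K and V) per layer, each [seq_len, n_kv_heads, head_dim]
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

mha = kv_cache_bytes(n_layers=24, n_kv_heads=16, head_dim=96, seq_len=4096)  # MHA
gqa = kv_cache_bytes(n_layers=24, n_kv_heads=4,  head_dim=96, seq_len=4096)  # 4:1 GQA
print(mha / gqa)                 # -> 4.0
print(f"{gqa / 2**20:.0f} MiB")  # -> 144 MiB for the 4:1 GQA cache at 4K context
```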
2.8 mlp_ratio
FFN middle-dimension multiplier. d_model * mlp_ratio is the FFN inner
dimension. 4 is the overwhelming default.
Gated MLPs like SwiGLU conventionally use an inner dimension of 8/3 × d_model to match non-gated parameter counts, but EulerStack's implementation simply uses `d_model * mlp_ratio` (so 4× by default).
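The arithmetic for the reference model:

```python
d_model, mlp_ratio = 1536, 4
ffn_dim = d_model * mlp_ratio   # EulerStack's FFN inner dimension
print(ffn_dim)                  # -> 6144

# the 8/3 parameter-matching convention would instead give:
swiglu_dim = 8 * d_model // 3
print(swiglu_dim)               # -> 4096
```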
2.9 dtype
Storage format for model weights.
| Value | Notes |
|---|---|
| `bfloat16` | Current standard. Same range as float32, lower precision. Stable for GPU training |
| `float16` | Older. Narrow exponent range, underflow-prone |
| `float32` | High precision. 2× memory. Special cases only |
In practice, almost always bfloat16. All modern GPUs from A100
onward support it natively.
2.10 target_params
Declares the target parameter count. If the actual estimate differs by
more than 30%, a realism warning fires. Example: target_params: 1_000_000_000
is a "1B-scale" label.
This only affects validation. Real model size is determined by d_model,
number of layers, etc. Think of it as an "intent vs. reality" smoke
detector.
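The 30% rule itself is simple. A hedged sketch of the band check (not the validator's actual code):

```python
def within_target(target: int, estimate: int, tol: float = 0.30) -> bool:
    """True when the estimate sits inside the +/-30% realism band."""
    return abs(estimate - target) / target <= tol

print(within_target(1_000_000_000, 1_200_000_000))  # -> True  (+20%, no warning)
print(within_target(1_000_000_000, 2_000_000_000))  # -> False (2x target -> warning)
```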
3. tokenizer_contract block
Specifies the "contract" between model and tokenizer: which tokenizer to use and how to treat special tokens.
tokenizer_contract:
type: hf
pretrained: gpt2
add_bos: true
add_eos: true
3.1 type
The only current value is hf (HuggingFace). Other systems may be added
later.
3.2 pretrained
A HuggingFace model/tokenizer ID. At tokenizer-load time,
AutoTokenizer.from_pretrained(pretrained) is called.
Typical values:
| Value | Use case |
|---|---|
| `gpt2` | English-focused, 50257 vocab |
| `meta-llama/Llama-2-7b-hf` | Llama family, 32000 vocab |
| `Qwen/Qwen2-7B` | Multilingual, 152064 vocab |
3.3 add_bos / add_eos
Whether to auto-insert Begin-Of-Sequence / End-Of-Sequence tokens at
document boundaries. These tell the model where one document ends and the
next begins. Almost always true.
4. embedding block
Token embedding table and positional encoding setup.
embedding:
type: learned
positional: rope
rope_theta: 500000.0
rope_scaling:
type: linear
factor: 2.0
tie_word_embeddings: true
4.1 type
learned is the only current option. Each token ID gets a learned
embedding vector.
4.2 positional
Position encoding type.
| Value | Description |
|---|---|
| `rope` | Rotary Position Embedding. Current standard |
| `none` | No positional encoding (for mixers that don't use positions explicitly, like Mamba) |
RoPE comes from the RoFormer paper (2021) and is used by basically every modern LLM since Llama. It expresses relative positions naturally and extrapolates (within limits) beyond trained sequence lengths.
4.3 rope_theta
The RoPE period parameter. Larger values give better positional resolution at long context.
| Value | Recommended max_seq_len |
|---|---|
| `10000.0` | ~2K (GPT-2, early Llama) |
| `500000.0` | ~32K (Llama 3) |
| `1000000.0` | 32K+ (Qwen long-context) |
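Why a larger theta helps: each RoPE rotation pair has wavelength 2π·θ^(2i/d), so raising θ stretches the slowest-rotating pairs, which are the ones that must distinguish far-apart positions. A small illustration using the standard RoPE formula (generic math, not tied to EulerStack internals):

```python
import math

def rope_wavelengths(theta: float, head_dim: int):
    """Wavelength (in tokens) of each RoPE rotation pair."""
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]

slow_10k = max(rope_wavelengths(10_000.0, 96))
slow_500k = max(rope_wavelengths(500_000.0, 96))
print(f"slowest pair: {slow_10k:,.0f} vs {slow_500k:,.0f} tokens")
```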
4.4 rope_scaling
Extends RoPE's position range so the model stays stable on sequences longer than it was trained on.
rope_scaling:
type: linear # or dynamic, ntk
factor: 4.0 # 4× extension
`linear` with `factor: 4` means "extend an 8K-trained model to 32K". Linear scaling compresses positions, so it may slightly hurt long-range accuracy; the same approach is used in the Qwen 2 long-context setup.
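Mechanically, linear scaling just divides position ids by the factor before the RoPE rotation, so a 32K position lands inside the trained 8K range. A one-line sketch of that idea (illustrative, not EulerStack's implementation):

```python
def scale_positions(positions, factor):
    """rope_scaling type: linear -- divide position ids by the factor."""
    return [p / factor for p in positions]

# An 8K-trained model reading a 32K sequence with factor 4:
print(scale_positions([0, 8_192, 32_767], 4.0))  # -> [0.0, 2048.0, 8191.75]
```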
4.5 tie_word_embeddings
Whether to share weights between input embedding table and output LM head.
true saves vocab_size × d_model parameters. Standard in small models,
sometimes split in very large ones.
Must match head.tie_weights.
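The savings from tying, at the reference model's dimensions:

```python
vocab_size, d_model = 32_000, 1536
saved = vocab_size * d_model   # one full vocab-by-hidden matrix
print(f"{saved:,}")            # -> 49,152,000
```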
5. layer_templates block
EulerStack's core assembly unit. Each template is a blueprint for
what one layer looks like. Templates are named and referenced from
layer_schedule.
layer_templates:
decoder:
mixer:
type: attention
attention:
qkv_bias: false
attn_drop: 0.0
ffn:
type: gated_mlp
activation: swiglu
norm:
type: rmsnorm
position: pre
residual:
type: sequential
scaling: 1.0
state:
kv_cache: true
Each template has five sub-blocks:

- `mixer` — token-to-token information mixing (attention or an alternative)
- `ffn` — per-token processing (feedforward)
- `norm` — normalization
- `residual` — residual path structure
- `state` — runtime state (inference cache)
5.1 mixer
Token mixer. Decides how tokens in the sequence attend to each other. EulerStack supports four mixer families.
attention
Standard multi-head self-attention.
mixer:
type: attention
attention:
qkv_bias: false # Use bias in QKV projections? (Llama: false, GPT-2: true)
attn_drop: 0.0 # Attention dropout (0 for large LLMs)
window: null # null = global; number = sliding window (e.g. 4096)
- `window: null` (default) — global, O(N²) cost
- `window: 4096` — sliding window; each token sees only the last 4096 tokens
mamba
Mamba2 SSM (State Space Model).
mixer:
type: mamba
mamba:
variant: mamba2 # mamba or mamba2
d_state: 128 # state dimension
d_conv: 4 # internal conv kernel size
expand: 2 # internal expansion ratio
Mamba processes the sequence in O(N) time. Much faster than attention at
long sequence lengths. Tradeoff: weaker at exact token-level recall.
retnet
Retention (RetNet).
mixer:
type: retnet
retnet:
chunkwise: true # enable chunked parallel computation
chunk_size: 128 # chunk size
rope: true # use internal RoPE
RetNet's distinctive property: train and inference modes give
mathematically equivalent results. Train in parallel mode, infer in
recurrent O(1) mode, same weights.
hyena
Hyena long-convolution.
mixer:
type: hyena
hyena:
depth: 2 # Hyena operator depth
filter_hidden: 64 # filter-generator MLP width
filter_decay: 0.0 # filter decay rate
Hyena does global convolution at O(N log N). Competitive at 128K+
contexts.
5.2 ffn
Feed-forward network, computed independently per token. Three options.
mlp
Standard 2-layer MLP.
ffn:
type: mlp
activation: gelu # relu, gelu, silu, etc.
Classic Transformer / GPT-2 style.
gated_mlp
Gated MLP (SwiGLU, GeGLU, etc.).
ffn:
type: gated_mlp
activation: swiglu # or geglu
Standard in modern LLMs since Llama. 1.5× more parameters than plain MLP but meaningfully better quality.
moe
Mixture-of-Experts. Multiple FFNs, only a subset active per token — "keep compute fixed, grow capacity".
ffn:
type: moe
moe:
experts: 8 # number of experts
top_k: 2 # experts activated per token
router: softmax # or sigmoid
z_loss: 0.001 # router regularization auxiliary loss
capacity_factor: 1.25 # over-capacity ratio per expert
See later tutorials for details: 07_arch_walkthrough.md expert section and 08_expert_mini_walkthrough.md.
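The routing step at the heart of MoE is small. Below is a pure-Python sketch of softmax top-k routing over one token's router logits (illustrative only — real routers operate on batched tensors, and this is not EulerStack's implementation):

```python
import math

def route(logits, top_k):
    """Pick the top_k experts and softmax-normalize their weights."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    m = max(logits[i] for i in idx)                      # for numerical stability
    exps = [math.exp(logits[i] - m) for i in idx]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(idx, exps)]   # (expert_id, weight)

# experts=8, top_k=2: only two experts run for this token
print(route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], top_k=2))
```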
5.3 norm
Normalization type and placement.
norm:
type: rmsnorm # or layernorm
position: pre # or post
| Combination | Use case |
|---|---|
| `layernorm` + `post` | Classic Transformer (GPT-2) |
| `rmsnorm` + `pre` | Modern LLM standard (Llama onwards) |
Pre-norm is markedly more stable than post-norm for deep models; since around 2019, virtually all large models have used pre-norm.
RMSNorm is a simplified LayerNorm — skip the mean-subtraction step, only compute RMS. Half the parameters and faster.
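The simplification is visible in a few lines — a minimal reference implementation of the RMSNorm formula (not EulerStack's kernel):

```python
import math

def rmsnorm(x, g, eps=1e-6):
    """RMSNorm: no mean subtraction, no bias -- scale by the root-mean-square only."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * gi for v, gi in zip(x, g)]

out = rmsnorm([3.0, -4.0], [1.0, 1.0])   # RMS = sqrt(12.5) ~ 3.536
print(out)
```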
5.4 residual
Residual connection structure.
residual:
type: sequential # or parallel
scaling: 1.0 # multiplier applied to F(x) before residual add
| type | Formula | Use case |
|---|---|---|
| `sequential` | `x = x + F(norm(x))` | Standard, most models |
| `parallel` | `x = x + Attn(norm(x)) + FFN(norm(x))`, computed in parallel | PaLM-style |
scaling: 1.0 is default. Special uses (e.g., layer scale) may want a
smaller value.
5.5 state
What runtime state this layer needs.
state:
kv_cache: true # attention key/value cache
ssm_state: true # mamba recurrent state
Constraints (critical):
- `mixer.type: attention` layers may set `kv_cache: true`
- `mixer.type: mamba` layers use `ssm_state: true`
- Not both: `mamba` + `kv_cache: true` fails validation
- `retnet` manages its own state (the `state` block can be omitted)
- `hyena` has no state
The validator enforces these automatically.
6. layer_schedule block
Specifies how templates stack — order and repetition. This is the most expressive part of EulerStack.
6.1 Basic syntax
layer_schedule:
- template: decoder
repeat: 24
Repeat decoder 24 times. Total layers = 24. Simple Llama-style.
6.2 Multiple templates (Jamba-style)
layer_templates:
mamba: { ... }
attn: { ... }
layer_schedule:
- { template: mamba, repeat: 3 }
- { template: attn, repeat: 1 }
- { template: mamba, repeat: 3 }
- { template: attn, repeat: 1 }
# the 3:1 pair repeated 8 times in the full schedule = 32 layers (24 mamba + 8 attn)
Different mixers at different depths. Jamba's 3:1 hybrid is expressed this way.
6.3 Progressive stack style
layer_schedule:
- { template: hyena_dense, repeat: 8 } # early: cheap & global
- { template: mamba_dense, repeat: 12 } # middle: linear sequential
- { template: retnet_dense, repeat: 8 } # late: chunked refinement
- { template: attn_moe, repeat: 4 } # tail: exact recall + MoE
Progressively more expensive mixers by depth. Vision's CNN → attention hierarchy applied to sequence modeling.
6.4 Pyramid style (Dilated LongNet)
layer_schedule:
- { template: mamba_prefix, repeat: 4 }
- { template: sw_1024, repeat: 4 }
- { template: sw_4096, repeat: 8 }
- { template: sw_16384, repeat: 8 }
- { template: global_moe, repeat: 8 }
Receptive field grows geometrically with depth. A 32-layer temporal pyramid from 1K to 16K to global.
6.5 Layer count
Total layer count is computed automatically:
n_layers = sum(entry.repeat for entry in layer_schedule)
This value appears in validate --report output and becomes the model's
real n_layers.
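For example, the Jamba-style schedule from §6.2, with the 3:1 pair written out eight times:

```python
pattern = [{"template": "mamba", "repeat": 3},
           {"template": "attn",  "repeat": 1}]
layer_schedule = pattern * 8   # the pair repeated 8 times

n_layers = sum(entry["repeat"] for entry in layer_schedule)
print(n_layers)  # -> 32
```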
7. head block
Output head specification.
head:
type: causal_lm
tie_weights: true
7.1 type
Currently only causal_lm (autoregressive next-token prediction).
Classifier / encoder-decoder heads aren't in the schema yet.
7.2 tie_weights
Whether the LM head shares weights with the input embedding. Must match
embedding.tie_word_embeddings.
8. training_hints block
Training-time advisory values. Not part of the architecture itself — they tell downstream trainers "here's how to start training this model".
training_hints:
init: normal_0.02
dropout: 0.0
checkpointing: true
seed: 1234
8.1 init
Weight initialization. normal_0.02 means normal distribution with
standard deviation 0.02 (the GPT-2 standard, still in use).
8.2 dropout
Default dropout applied across the model. Large LLMs (1B+) use 0.0 —
with enough data, overfitting is rare enough that dropout just slows
training.
8.3 checkpointing
Whether gradient checkpointing is enabled. true saves memory at ~30%
speed cost. Useful on tight VRAM.
8.4 seed
Random seed for reproducibility.
9. compatibility block
compatibility:
compile_target: huggingface
Currently huggingface is the only compile target. vllm,
tensorrt-llm, and others may be added later.
10. Cross-field rules (very important)
Individual fields may each be valid while their combination is illegal. The validator enforces the following rules automatically.
10.1 Dimension divisibility
d_model % n_heads == 0 # heads divide d_model evenly
n_heads % n_kv_heads == 0 # GQA groups are integer-sized
Violation: ValidationError.
10.2 Mixer / state compatibility
| Mixer | Allowed state |
|---|---|
| `attention` | `kv_cache: true` allowed |
| `mamba` | `ssm_state: true` only (no kv_cache) |
| `retnet` | managed internally (omit `state` block) |
| `hyena` | stateless (omit `state` block) |
Violation: CompatibilityError.
10.3 FFN / MoE compatibility
Setting a moe: sub-block when ffn.type != moe is a mismatch. The
reverse (ffn.type: moe without moe: block) is also a mismatch.
10.4 MoE internal rules
top_k <= experts
experts > 0
top_k > 0
z_loss >= 0
capacity_factor >= 1.0
10.5 head / embedding consistency
head.tie_weights == embedding.tie_word_embeddings
Mismatched values fail validation because the compiler can't decide whether to share weights.
10.6 Realism warnings (non-blocking)
Validation passes but not recommended:
- `head_dim < 32` or `head_dim > 256`
- `target_params` diverges from estimate by more than 30%
- `experts > 64` with small `top_k` (likely router collapse)
- `max_seq_len / d_model > 64` (unusual ratio)
- `family_hint` inconsistent with actual mixer composition
- RoPE scaling factor > 8
11. Reading the validation report
Sample eulerstack validate --preset X.yml --report output:
==================================================
EulerStack Validation Report
==================================================
Schema: OK
Estimated params: 1.14B (1,141,927,936)
Total layers: 24
Target params: 1.00B
WARNINGS (1):
[realism.target_params] estimate 1.14B is 114% of target 1.00B
Result: WARN (0 errors, 1 warnings)
==================================================
Reading order:

1. `Schema: OK` — schema check pass/fail. If this fails, nothing else appears.
2. Estimated parameters — actual size after compile. Compare against `target_params`.
3. Total layers — expanded from `layer_schedule`.
4. Errors list — any `ValidationError` or `CompatibilityError`.
5. Warnings list — realism checks. Advisory only.
6. Result — error / warning count summary.
Any error blocks compilation.
12. Design checklist
Before writing a new YAML, answer these 5 questions:

1. What task does this model solve? (Summarization / QA / code / multilingual…)
2. Maximum context length needed? → `max_seq_len`
3. How large is the training dataset? → informs `target_params`
4. What's the inference GPU environment? → decides `d_model` and `n_kv_heads` (GQA ratio)
5. Training budget? → informs layer count and mixer choice

Before committing, check these 3:

1. `eulerstack validate --preset my_model.yml --report` passes
2. Estimated params within 30% of `target_params`
3. `eulerstack explain` output matches intent
13. Common mistakes
| Symptom | Root cause | Fix |
|---|---|---|
| `ValidationError: d_model must be divisible by n_heads` | Dimension constraint | Set `d_model` to a multiple of `n_heads` |
| `CompatibilityError: mamba + kv_cache` | KV cache on a Mamba layer | Use `state.ssm_state: true` instead |
| `ValidationError: top_k > experts` | MoE misconfig | Ensure `top_k <= experts` |
| Estimate 2× target | MoE experts counted toward total | Adjust `target_params` to reflect the full total |
| `index out of bounds` during training | `vocab_size` smaller than tokenizer | Wrap with `TokenizedDataset(vocab_size=ir.model.vocab_size)` |
| Quality drops at long context | `rope_theta` too small | Raise to `500000.0` or `1000000.0` |
14. Complete reference example
The complete YAML for a basic Llama-style 1B model. Every field is included.
schema_version: 1
model:
name: "my-llama-1b"
family_hint: "llama-like"
d_model: 1536
vocab_size: 32000
max_seq_len: 4096
n_heads: 16
n_kv_heads: 4
mlp_ratio: 4
dtype: bfloat16
target_params: 1_000_000_000
tokenizer_contract:
type: hf
pretrained: gpt2
add_bos: true
add_eos: true
embedding:
type: learned
positional: rope
rope_theta: 500000.0
tie_word_embeddings: true
layer_templates:
decoder:
mixer:
type: attention
attention:
qkv_bias: false
attn_drop: 0.0
ffn:
type: gated_mlp
activation: swiglu
norm:
type: rmsnorm
position: pre
residual:
type: sequential
scaling: 1.0
state:
kv_cache: true
layer_schedule:
- { template: decoder, repeat: 24 }
head:
type: causal_lm
tie_weights: true
training_hints:
init: normal_0.02
dropout: 0.0
checkpointing: true
seed: 1234
compatibility:
compile_target: huggingface
Copy this file, change a field or two, and you have a completely different model. Examples:

- `d_model: 1536` → `d_model: 2048` → 2B scale
- `mixer.type: attention` → `mixer.type: mamba` → pure Mamba
- `ffn.type: gated_mlp` → `ffn.type: moe` + `moe: { experts: 8, top_k: 2 }` → MoE
The feeling of YAML as "composable blueprint" comes quickly once you've copied this file ten times and varied a different field each time.
15. v1 Phase-B extension summary (new primitives)
v1 added 14 primitives in Phase B. The YAML syntax and usage of each is covered step-by-step in 09_new_primitives_walkthrough.md. This §15 is just an index: where each primitive lives and a one-line summary of what it does.
15.1 At layer_templates[*] scope
| Field | Use | Related § |
|---|---|---|
| `mixer.attention.latent_dim` | MLA — compress KV via latent (DeepSeek-V3) | §5.1 attention extension |
| `mixer.type: branched` + `branched: {...}` | Input-dependent sub-mixer selection (Jamba × per-token) | §5.1 mixer.type extension |
| `mixer.type: ttt_layer` + `ttt: {...}` | Test-Time Training (Sun 2024) | §5.1 mixer.type extension |
| `memory: {type, update_at_inference, ...}` | Titans parametric memory | new sub-block |
| `shape_change: {d_out, projection}` | Hourglass-style shape change | new sub-block |
15.2 At layer_schedule[*] scope
| Field | Use |
|---|---|
| `override: {...}` | Per-entry partial override (residual.scaling, attention.window, …) |
| `depth_gating: {enabled, capacity, router}` | Mixture-of-Depths — per-token layer skip |
| `parallel: [...]` + `merge: {...}` | Monoidal parallel streams (Mamba ∥ Attention) |
| `integrator: {type, steps, body, output}` | Universal Transformer / Diffusion-LM / Coconut unifier |
15.3 At top-level scope
| Field | Use |
|---|---|
| `let: {...}` | Numeric variables — reference as `${let.x * let.y}` (Level-1 arithmetic) |
| `execution_modes: [...]` | R1-style 2-phase (think/answer) declaration |
| `transition: {type, token}` | Phase transition rule (special_token / budget_exhausted) |
15.4 Reserved namespaces
Keys under experimental.* / future.* / vendor.<name>.* produce a
WARNING, not an error. Reserved space for plugins and in-progress
research.
Next steps
- Tutorial 4: Compile and Explain — turn this YAML into a real HF model
- Tutorial 7: arch walkthrough — case studies of how arch_presets compose these fields (beginner 2 · intermediate 3 · advanced 5 · expert 12)
- Tutorial 9: v1 new primitives walkthrough — step-by-step for each primitive indexed in §15
- Mixer overview — how attention / mamba / retnet / hyena / branched / ttt_layer work internally
- Tutorial 1: Validate a Spec — review the 3-layer validation structure