9. v1 Phase B New Primitives (MLA / Titans / MoD / Dual-Stream / Neural-ODE / TTT)
CLI messages are translated into ko / en / zh / ja / es. Use
eulerstack --lang en ... (or EULERSTACK_LANG=en) for English.
This tutorial walks through the 14 new primitives added by EulerStack v1 Phase B. For each primitive you get:
- What it enables (research basis / when to reach for it)
- Minimal YAML (annotated)
- When to use it (practical guidelines)
- Runtime status (Core / Component / Plugin-track)
schema_version: 1 is assumed throughout. Every primitive layers on top of
the baseline attention / mamba / retnet / hyena / moe blocks.
Full runtime matrix: runtime_primitive_status.md (internal asset).
0. Setup
pip install -e .
eulerstack --lang en schema # prints the schema summary (5 langs)
Every example below validates instantly with
eulerstack validate --preset <file>. Add --report for parameter
estimation, realism checks, and reserved-namespace warnings.
1. Per-layer override (B1.1)
Use case: keep the template as-is but nudge a specific schedule group (residual scaling, attention window, ...) without cloning the template.
schema_version: 1
layer_templates:
attn_dense:
mixer: { type: attention, attention: { window: null } }
ffn: { type: gated_mlp }
residual: { type: sequential, scaling: 1.0 }
layer_schedule:
- template: attn_dense
repeat: 6
override: # first 6 layers only
residual: { scaling: 0.5 }
attention: { window: 128 }
- template: attn_dense
repeat: 6 # remaining 6 use template defaults
Whitelist (value-typed only — param count is preserved):
- residual.scaling
- attention.window, attention.attn_drop
- norm.type, norm.position
- ffn.activation
Changing mixer.type is not allowed — create a new template instead.
Runtime: ✅ Core. Applied during IR materialization.
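As a mental model, the whitelist check plus merge can be sketched in a few lines. This is illustrative Python only, not the actual EulerStack materialization code, and it flattens the real template nesting (`attention` actually lives under `mixer`):

```python
# Hypothetical sketch of whitelisted per-layer override merging.
OVERRIDE_WHITELIST = {
    ("residual", "scaling"),
    ("attention", "window"), ("attention", "attn_drop"),
    ("norm", "type"), ("norm", "position"),
    ("ffn", "activation"),
}

def apply_override(template, override):
    # Copy so the shared template stays untouched, then merge each
    # whitelisted value-typed field onto the copy.
    layer = {k: dict(v) if isinstance(v, dict) else v
             for k, v in template.items()}
    for section, fields in override.items():
        for field, value in fields.items():
            if (section, field) not in OVERRIDE_WHITELIST:
                raise ValueError(f"override {section}.{field} is not allowed")
            layer.setdefault(section, {})[field] = value
    return layer

tmpl = {"residual": {"type": "sequential", "scaling": 1.0},
        "attention": {"window": None}}
layer = apply_override(tmpl, {"residual": {"scaling": 0.5},
                              "attention": {"window": 128}})
```

Because only value-typed fields are merged, the parameter count of the materialized layer never changes; structural edits like `mixer.type` fail the whitelist check.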
2. let: + ${…} expressions (B1.2)
Use case: express dependencies like d_model = n_heads × d_head right
in the YAML.
schema_version: 1
let:
n_heads: 16
d_head: 64
layers: 24
model:
name: "let-demo"
d_model: ${let.n_heads * let.d_head} # 1024
max_seq_len: ${let.layers * 512} # 12288
n_heads: ${let.n_heads}
n_kv_heads: ${let.n_heads // 2} # 8
layer_schedule:
- template: decoder
repeat: ${let.layers}
Allowed operators: + - * / // and parentheses; let.<name> references
only. Conditionals, function calls, and string operations are rejected
(Level 1).
Runtime: ✅ Core. Resolved in a pre-pass before validation.
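A Level-1 resolver of this kind can be sketched with the standard-library `ast` module. This is an illustrative sketch, not the EulerStack pre-pass: only the five arithmetic operators, numeric constants, parentheses, and `let.<name>` references survive the walk; anything else (calls, conditionals, strings) raises:

```python
import ast

# Hypothetical Level-1 ${...} expression resolver.
ALLOWED = (ast.Expression, ast.BinOp, ast.Add, ast.Sub, ast.Mult,
           ast.Div, ast.FloorDiv, ast.Constant, ast.Attribute,
           ast.Name, ast.Load)

def resolve(expr, lets):
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED):
            raise ValueError(f"rejected: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id != "let":
            raise ValueError("only let.<name> references are allowed")

    def ev(n):
        if isinstance(n, ast.Expression):
            return ev(n.body)
        if isinstance(n, ast.Constant):
            return n.value
        if isinstance(n, ast.Attribute):   # let.<name> lookup
            return lets[n.attr]
        ops = {ast.Add: lambda a, b: a + b, ast.Sub: lambda a, b: a - b,
               ast.Mult: lambda a, b: a * b, ast.Div: lambda a, b: a / b,
               ast.FloorDiv: lambda a, b: a // b}
        return ops[type(n.op)](ev(n.left), ev(n.right))

    return ev(tree)

d_model = resolve("let.n_heads * let.d_head", {"n_heads": 16, "d_head": 64})  # 1024
```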
3. Reserved namespaces (B0 / B6)
Use case: let plugins and in-progress research coexist with strict schema.
schema_version: 1
# ... (regular fields)
experimental.online_adaptation: # WARNING only, never ERROR
reward_source: reward_model
vendor.acme.telemetry:
endpoint: "https://telemetry.acme"
future.symbolic_interface:
mode: sidecar
- experimental.* — in-progress research
- future.* — reserved for v1.x+ additions
- vendor.<name>.* — third-party plugins
eulerstack validate --report shows [reserved_namespace] findings.
With a plugin registered, the same keys get functional interpretation.
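The scan itself is simple enough to sketch in a few lines of pure Python. Names here are hypothetical, not the EulerStack validator internals:

```python
# Hypothetical sketch of the reserved-namespace check: keys under the
# three reserved prefixes produce WARNING-level findings, never errors.
RESERVED_PREFIXES = ("experimental.", "future.", "vendor.")

def reserved_namespace_findings(spec):
    return [f"[reserved_namespace] {key}" for key in spec
            if key.startswith(RESERVED_PREFIXES)]

spec = {
    "schema_version": 1,
    "experimental.online_adaptation": {"reward_source": "reward_model"},
    "vendor.acme.telemetry": {"endpoint": "https://telemetry.acme"},
}
findings = reserved_namespace_findings(spec)   # two warnings, zero errors
```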
4. MLA — attention.latent_dim (B2.1)
Based on: DeepSeek-V3 Technical Report (2024). Compress KV through a shared latent; shrink KV cache memory.
layer_templates:
mla_decoder:
mixer:
type: attention
attention:
latent_dim: 384 # half of d_model=768 → ~50% KV cache savings
ffn: { type: gated_mlp }
Practical guidelines:
- Start at latent_dim ≈ d_model / 2
- Biggest wins at long context (≥ 16K)
- latent_dim ≥ d_model is rejected
Runtime: ✅ Core. CausalSelfAttention(latent_dim=…) performs real
compressed KV projection. Forward, backward, and KV-cache all live.
Demo preset: configs/presets/arch_advanced_mla.yml
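The compression idea can be sketched in plain PyTorch. This is a toy, not the EulerStack `CausalSelfAttention`: K and V are reconstructed from one shared latent, so the cache only needs to hold `latent_dim` values per token instead of separate full-width K and V:

```python
import torch
import torch.nn as nn

# Toy MLA-style latent KV compression (illustrative only).
class LatentKV(nn.Module):
    def __init__(self, d_model, latent_dim):
        super().__init__()
        assert latent_dim < d_model, "latent_dim >= d_model is rejected"
        self.down = nn.Linear(d_model, latent_dim, bias=False)  # cached output
        self.up_k = nn.Linear(latent_dim, d_model, bias=False)
        self.up_v = nn.Linear(latent_dim, d_model, bias=False)

    def forward(self, x):
        latent = self.down(x)   # [B, T, latent_dim] — this is what the cache stores
        return self.up_k(latent), self.up_v(latent), latent

mod = LatentKV(d_model=768, latent_dim=384)
x = torch.randn(1, 16, 768)
k, v, latent = mod(x)
```

At decode time only `latent` is appended to the cache; K and V are re-expanded on the fly, trading a small matmul for memory.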
5. Branched mixer — mixer.type: branched (B2.2)
Based on: Jamba (Lieber et al., AI21, 2024); this generalises the layer-level hybrid to per-token routing across mixer families.
layer_templates:
branched_layer:
mixer:
type: branched
branched:
branches:
ssm: { type: mamba, mamba: { variant: mamba2 } }
attn: { type: attention, attention: {} }
selector:
type: learned_gate # or: top_k
top_k: 1
input: hidden
ffn: { type: gated_mlp }
Constraints:
- At least 2 branches
- Nesting another branched inside a branch is forbidden in v1
Runtime: 🟡 Fallback — the compiler runs the first branch; the full
spec is preserved under config.stack.pattern[]._v1_extras.branched.
Real routing is plugin-track.
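For intuition, per-token routing across branches could look like the following toy module. This is a hypothetical sketch of what a routing plugin might do, not the fallback path the compiler runs today:

```python
import torch
import torch.nn as nn

# Toy per-token branch routing (illustrative, plugin-track semantics).
class BranchedMixer(nn.Module):
    def __init__(self, branches, d_model, top_k=1):
        super().__init__()
        self.branches = nn.ModuleDict(branches)
        self.gate = nn.Linear(d_model, len(branches))  # learned_gate selector
        self.top_k = top_k

    def forward(self, x):                          # x: [B, T, D]
        scores = self.gate(x).softmax(dim=-1)      # [B, T, n_branches]
        outs = torch.stack([b(x) for b in self.branches.values()], dim=-1)
        if self.top_k == 1:                        # hard top-1 routing
            idx = scores.argmax(dim=-1, keepdim=True).unsqueeze(-2)
            return outs.gather(-1, idx.expand(*x.shape, 1)).squeeze(-1)
        return (outs * scores.unsqueeze(-2)).sum(-1)   # soft mixture

# Stand-in branches; real ones would be mamba / attention mixers.
mixer = BranchedMixer({"ssm": nn.Linear(64, 64),
                       "attn": nn.Linear(64, 64)}, d_model=64)
y = mixer(torch.randn(2, 5, 64))
```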
6. TTT layer — mixer.type: ttt_layer (B2.3)
Based on: Sun et al. 2024, "Learning to (Learn at Test Time): RNNs with Expressive Hidden States".
layer_templates:
ttt_block:
mixer:
type: ttt_layer
ttt:
inner_model: { type: mlp, hidden: 256 }
inner_optimizer: sgd
inner_lr: 0.01
inner_steps_per_token: 1
ffn: { type: gated_mlp }
state:
ssm_state: true # persistent inner weights
Runtime: 🔌✅ Plugin-reference available (v1.1). Import
eulerstack.plugins.ttt and the plugin registers a real per-token
meta-learning TTTBlock into the plugin registry; core modeling then
upgrades the Mamba fallback for you. Without the import, the Mamba
fallback remains active (common-prompt §7 isolation).
import eulerstack.plugins.ttt # one-line activation
from eulerstack.compiler.compile import compile_to_hf_model
model = compile_to_hf_model(ir, seed=0) # real TTTBlock is instantiated
Implementation detail (functional fast-weights):
- Per token, the inner loss (MSE reconstruction against the projected
target) is differentiated via torch.autograd.grad, and the
fast-weight tensors (not the persistent parameters) are updated.
- At end of forward we copy the final fast weights back into the
persistent nn.Parameter under torch.no_grad() — no in-place
version conflicts with outer training.
Tests: tests/test_ttt_plugin_runtime.py (10) +
tests/test_runtime_hf_training_e2e.py::TestHFExportTTTTraining.
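The functional fast-weight pattern described above can be reduced to a toy loop. This is an illustrative sketch, not the plugin's `TTTBlock`: the inner target is simply the input itself, and fast weights are detached between tokens to keep the toy graph simple:

```python
import torch
import torch.nn as nn
from torch.func import functional_call

# Toy functional fast-weight TTT step (illustrative only).
inner = nn.Linear(8, 8)
fast = {k: v.detach().clone().requires_grad_(True)
        for k, v in inner.named_parameters()}

def ttt_step(x_t, target_t, fast, lr=0.01):
    pred = functional_call(inner, fast, (x_t,))
    loss = ((pred - target_t) ** 2).mean()            # inner MSE loss
    grads = torch.autograd.grad(loss, list(fast.values()))
    # Update the fast-weight tensors, not the persistent parameters.
    return {k: (w - lr * g).detach().requires_grad_(True)
            for (k, w), g in zip(fast.items(), grads)}

x = torch.randn(4, 8)
for t in range(x.shape[0]):                           # one inner step per token
    fast = ttt_step(x[t:t + 1], x[t:t + 1], fast)

with torch.no_grad():                                 # copy-back at end of forward
    for k, p in inner.named_parameters():
        p.copy_(fast[k])
```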
7. Mixture-of-Depths — schedule[].depth_gating (B3.1)
Based on: Raposo et al., ICML 2024.
layer_schedule:
- template: attn_block
repeat: 32
depth_gating:
enabled: true
capacity: 0.5 # route 50% of tokens through this layer
router: top_k # or: learned_gate
Practical guidelines:
- capacity: 0.5 is the paper default
- router: top_k is deterministic / reproducible; learned_gate is soft
Runtime: 🟢 Component:
from eulerstack.components.depth_gate import DepthGate
gate = DepthGate(d_model=768, capacity=0.5, router="top_k")
y = gate(x, body_fn=my_attention_layer)
Demo preset: configs/presets/arch_advanced_mod.yml
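The routing itself is easy to picture as a gather/scatter around the layer body. The following is an illustrative sketch of top-k MoD routing, not the `DepthGate` component:

```python
import torch
import torch.nn as nn

# Toy Mixture-of-Depths routing: the top capacity*T tokens per sequence
# run the body; the rest skip it via the residual path.
def depth_gate(x, body_fn, router, capacity=0.5):
    B, T, D = x.shape
    k = max(1, int(T * capacity))
    scores = router(x).squeeze(-1)                     # [B, T]
    idx = scores.topk(k, dim=1).indices                # deterministic top-k
    sel = idx.unsqueeze(-1).expand(B, k, D)
    picked = torch.gather(x, 1, sel)                   # routed tokens
    out = x.clone()                                    # skipped tokens pass through
    out.scatter_(1, sel, body_fn(picked))
    return out

router = nn.Linear(32, 1)
x = torch.randn(2, 8, 32)
y = depth_gate(x, lambda t: t + 1.0, router, capacity=0.5)
```

With `capacity: 0.5` and 8 positions, exactly 4 tokens per sequence pass through the body; the rest are copied unchanged, which is what keeps the expected FLOPs at roughly half of a dense layer.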
8. Parallel (monoidal) schedule — schedule[] parallel: (B3.2)
Based on: PaLM (2023), Flamingo (2022), Jamba (2024).
layer_schedule:
- parallel:
- stream: fast
body:
- { template: mamba_block, repeat: 6 }
- stream: slow
body:
- { template: attn_block, repeat: 6 }
merge:
type: concat # concat | add | gated | cross_attn
projection: true
Constraints:
- At least 2 streams
- Cannot nest parallel / integrator inside a stream body (flat only)
- Unique stream names
Runtime: 🟢 Component:
from eulerstack.components.parallel_stream import ParallelStream
p = ParallelStream(
[fast_mod, slow_mod],
d_model=768,
merge_type="concat",
merge_projection=True,
stream_names=["fast", "slow"],
)
y = p(x)
Demo preset: configs/presets/arch_expert_dual_stream.yml
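The `concat` merge with `projection: true` is worth seeing in isolation. A minimal sketch (not the `ParallelStream` component): both streams read the same input, concat doubles the width, and the merge projection restores `d_model` for the residual path:

```python
import torch
import torch.nn as nn

# Toy two-stream concat merge (illustrative only).
class TwoStreamConcat(nn.Module):
    def __init__(self, fast, slow, d_model):
        super().__init__()
        self.fast, self.slow = fast, slow
        self.proj = nn.Linear(2 * d_model, d_model)  # merge projection

    def forward(self, x):
        merged = torch.cat([self.fast(x), self.slow(x)], dim=-1)
        return self.proj(merged)

# Stand-in stream bodies; real ones would be mamba / attention stacks.
p = TwoStreamConcat(nn.Linear(64, 64), nn.Linear(64, 64), d_model=64)
y = p(torch.randn(2, 5, 64))
```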
9. Integrator — schedule[] integrator: (B3.3)
Based on: Universal Transformer (2019), PonderNet (2021), Diffusion-LM (2022), Coconut (2024) — unified under one primitive. From v1.1 the Neural-ODE reading (Chen et al. 2018) is also supported in core.
9a. discrete — K independent-weight steps (default)
layer_schedule:
- integrator:
type: discrete # Diffusion-LM style — K distinct weight copies
steps: 4
body: refine_block
output: token # or: hidden (Coconut latent reasoning)
Default compile materializes K independent copies. If you want the same
module applied K times with shared weights, assemble a
DiscreteIntegrator manually:
from eulerstack.components.integrator import DiscreteIntegrator
integrator = DiscreteIntegrator(refine_block, steps=4) # shared weights
9b. ode_euler / ode_rk4 — Neural-ODE shared weights (v1.1 core ✨)
layer_schedule:
- integrator:
type: ode_rk4 # or ode_euler
steps: 4
body: refine_block
output: token
Semantics: the body is interpreted as the derivative f(x) of an
ODE; the integrator runs K numerical steps with dt = 1/steps and
shares weights across all steps — raising steps does not raise
the parameter count.
- ode_euler — 1 body call per step (cheapest)
- ode_rk4 — 4 body calls per step (classic 4th-order Runge-Kutta)
from eulerstack.components.integrator import ODEIntegrator
odeint = ODEIntegrator(refine_block, steps=4, method="rk4")
Runtime path: EulerStackLayer._forward_ode carries per-step RoPE
and attention-mask plumbing through the RK4 iterations. KV cache is
disabled along the ODE path (its semantics are ambiguous there).
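The numerics are standard fixed-step integration. A self-contained sketch (illustrative, not `ODEIntegrator` itself — in particular it omits the RoPE and mask plumbing): the body `f` is the derivative, `dt = 1/steps`, and the same `f` serves every step, so raising `steps` adds compute but no parameters:

```python
# Toy fixed-step Euler / RK4 integrator over a derivative function f.
def ode_integrate(f, x, steps=4, method="rk4"):
    dt = 1.0 / steps
    for _ in range(steps):
        if method == "euler":                  # 1 body call per step
            x = x + dt * f(x)
        else:                                  # classic RK4: 4 body calls per step
            k1 = f(x)
            k2 = f(x + 0.5 * dt * k1)
            k3 = f(x + 0.5 * dt * k2)
            k4 = f(x + dt * k3)
            x = x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

# Sanity check on dx/dt = x from x(0) = 1: the exact value at t=1 is e.
approx = ode_integrate(lambda v: v, 1.0, steps=4, method="rk4")
```

Even at 4 steps, RK4 lands within about 1e-4 of e on this test problem, while Euler gives (1.25)^4 ≈ 2.441; that accuracy gap is the usual reason to pay 4 body calls per step.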
9c. ode_adaptive — reserved (plugin-only)
Adaptive step-size control (torchdiffeq etc.) is still reserved. The validator rejects it until a plugin registers the kind.
10. Memory module — template.memory: (B4.1)
Based on: Titans (Behrouz et al., Google, 2024-2025).
layer_templates:
attn_with_memory:
mixer: { type: attention, attention: {} }
ffn: { type: gated_mlp }
memory:
type: neural_memory
update_at_inference: true
params:
hidden: 2048
inner_lr: 0.001
persistence: session # per_query | session | persistent
Runtime: ✅ Core (v1.1). TitansMemoryModule auto-wires into any
layer whose template declares memory:. During training, the outer
optimiser learns the memory weights jointly; during inference, the
standardized hook step_memory_at_inference drives one inner SGD step
per call. Survives HF save_pretrained → from_pretrained out of the box.
from transformers import AutoModelForCausalLM
from eulerstack.hf.auto_register import register_eulerstack_auto_classes
register_eulerstack_auto_classes()
model = AutoModelForCausalLM.from_pretrained("./titans_model", trust_remote_code=True)
ids = tokenizer("a fact to memorize", return_tensors="pt").input_ids
out = model(ids, output_hidden_states=True)
surprise = model.step_memory_at_inference(out.hidden_states[-1])
# surprise: {"eulerstack.layers.0.titans_memory": 0.42, ...}
Demo preset: configs/presets/arch_expert_titans_memory.yml
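The inference-time inner step can be pictured as one SGD step on a reconstruction "surprise". A toy sketch (illustrative, not `TitansMemoryModule`): the memory MLP tries to reproduce the incoming hidden state, and the MSE both serves as the surprise signal and drives the update:

```python
import torch
import torch.nn as nn

# Toy surprise-driven memory update at inference (illustrative only).
class NeuralMemory(nn.Module):
    def __init__(self, d_model, hidden, inner_lr=1e-3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, hidden), nn.GELU(),
                                 nn.Linear(hidden, d_model))
        self.inner_lr = inner_lr

    @torch.enable_grad()                        # works even under no_grad callers
    def step_at_inference(self, h):             # h: [B, T, D]
        surprise = ((self.net(h) - h.detach()) ** 2).mean()
        grads = torch.autograd.grad(surprise, list(self.net.parameters()))
        with torch.no_grad():
            for p, g in zip(self.net.parameters(), grads):
                p -= self.inner_lr * g          # one inner SGD step per call
        return surprise.item()

mem = NeuralMemory(d_model=16, hidden=64)
h = torch.randn(2, 4, 16)
s1 = mem.step_at_inference(h)
s2 = mem.step_at_inference(h)   # surprise shrinks as the fact is absorbed
```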
11. Shape-change layer — template.shape_change: (B4.2)
Based on: Hourglass Transformer (Nawrot et al. 2021).
layer_templates:
wide_block: { mixer: { type: attention, attention: {} }, ffn: { type: gated_mlp } }
bottleneck:
mixer: { type: attention, attention: {} }
ffn: { type: gated_mlp }
shape_change:
d_out: 128
projection: linear # linear | conv1d | mlp
Runtime: 🔌 Plugin-track. Core modeling assumes constant d_model; the shape-changing residual wire-up ships with a plugin.
12. Reasoning mode — execution_modes: + transition: (B5)
Based on: DeepSeek-R1 (2025), OpenAI o1/o3 (2024), Quiet-STaR (NeurIPS 2024).
Architecture is unchanged — this is training-recipe and generate-path metadata.
execution_modes:
- name: think
max_tokens: 8192
kv_share: true
loss_weight: 0.0 # aux phase, excluded from primary LM loss
visible_to_user: false
- name: answer
max_tokens: 2048
loss_weight: 1.0
visible_to_user: true
transition:
type: special_token
token: "<think_end>"
Quiet-STaR variant (per-token rationale):
execution_modes:
- name: rationale
max_tokens: 16
per_token_rationale: true # Zelikman 2024
loss_weight: 0.1
visible_to_user: false
- name: answer
max_tokens: 256
loss_weight: 1.0
visible_to_user: true
Runtime: ✅ Core (declarative). Metadata round-trips; a custom
generate() honours the phases.
Demo preset: configs/presets/arch_expert_reasoning_r1.yml
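One concrete way `loss_weight` could enter training is as a per-token mask over the LM loss. A hypothetical sketch (`phased_lm_loss` is not an EulerStack API): think-phase tokens carry weight 0.0 and are excluded from the primary loss, answer-phase tokens carry 1.0:

```python
import torch
import torch.nn.functional as F

# Toy per-phase loss weighting (illustrative only).
def phased_lm_loss(logits, targets, phase_weight):
    # logits: [B, T, V]; targets, phase_weight: [B, T]
    per_token = F.cross_entropy(logits.transpose(1, 2), targets,
                                reduction="none")          # [B, T]
    denom = phase_weight.sum().clamp(min=1.0)
    return (per_token * phase_weight).sum() / denom

logits = torch.randn(1, 6, 10)
targets = torch.randint(0, 10, (1, 6))
weights = torch.tensor([[0., 0., 0., 1., 1., 1.]])  # think x3, answer x3
loss = phased_lm_loss(logits, targets, weights)
```

With these weights the result equals the mean cross-entropy over the answer tokens alone, matching the `loss_weight: 0.0` / `1.0` contract above.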
13. Reserved integrator types (B3.3, v1.x+)
ode_rk4 and ode_euler were promoted to core in v1.1 (see §9b).
Of the originally reserved types, only ode_adaptive remains:
layer_schedule:
- integrator:
type: ode_adaptive # RESERVED — needs a plugin (torchdiffeq)
steps: 8
body: refine_block
Status: schema reservation. Adaptive step-size control needs a
dedicated library, so it stays on the plugin track. Fixed-step ODEs
are already served by ode_rk4.
14. Weight form reservation (future)
Tensor-network weight forms (MPS / MERA / TT) are reserved for a future minor version. Today you can keep the intent recorded via reserved namespace:
vendor.tensor.weight_form: mera
A plugin implementing tensor-network weights can consume this key directly when it ships.
Validate & compile chain
One command validates any combination:
eulerstack --lang en validate --preset my_spec.yml --report
The report includes:
- schema ok
- estimated params
- layer count (integrator-expanded)
- realism warnings (RoPE head_dim, MoE expert count, ...)
- reserved-namespace warnings (if any)
Export to an HF custom model directory:
eulerstack --lang en compile --preset my_spec.yml --output-dir ./my_model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("./my_model", trust_remote_code=True)
# config.v1_extensions preserves execution_modes / schedule_kinds / _v1_extras
"Does it really work when you combine everything?" — Capstone preset
A single preset stitches every v1 primitive together:
configs/presets/arch_expert_kitchen_sink.yml.
- let: + ${…} expressions ✓
- reserved namespaces (experimental.* / vendor.*.* / future.*) ✓
- per-layer override (B1.1) ✓
- MLA (attention.latent_dim) + Titans neural memory on the same layer ✓
- all six mixers: mamba / retnet / hyena / attention / branched / ttt_layer ✓
- MoE FFN alongside gated_mlp ✓
- depth_gating (MoD) + parallel schedule + discrete integrator + ODE RK4 ✓
- execution_modes + transition (R1 contract) ✓
TDD coverage in tests/test_kitchen_sink_preset.py (10 tests):
- YAML passes validate_v2
- normalize_to_ir expands into 20 layers
- override preserves Titans memory / ODE metadata (regression guard)
- compile_to_hf_model produces an actual EulerStackForCausalLM
- one-step forward returns the expected logits shape
- save_pretrained → AutoModelForCausalLM.from_pretrained is deterministic (atol=1e-5)
- config.v1_extensions preserves execution_modes + ode_rk4
- HF training over 25 steps drives loss down (without plugins)
- Same training still descends after importing eulerstack.plugins.ttt
In short: "can you really combine all of this and still
compile → save_pretrained → train?" is answered by a green regression
test, every run.
Next steps
- Preset learning order: 02_use_presets.md (v1 three tiers: validated → hybrid → experimental)
- Full runtime matrix: docs/architectures/runtime_primitive_status.md
- Authoritative spec: docs/architectures/yaml_v1_spec.md