9. v1 Phase B New Primitives (MLA / Titans / MoD / Dual-Stream / Neural-ODE / TTT)
CLI messages are translated into ko / en / zh / ja / es. Use
eulerstack --lang en ... (or EULERSTACK_LANG=en) for English.
This tutorial walks through the 14 new primitives added by EulerStack v1 Phase B. For each primitive you get:
- What it enables (research basis / when to reach for it)
- Minimal YAML (annotated)
- When to use it (practical guidelines)
- Runtime status (Core / Component / Plugin-track)
schema_version: 1 is assumed throughout. Every primitive layers on top of
the baseline attention / mamba / retnet / hyena / moe blocks.
Full runtime matrix: runtime_primitive_status.md (internal asset).
0. Setup
pip install -e .
eulerstack --lang en schema # prints the schema summary (5 langs)
Every example below validates instantly with
eulerstack validate --preset <file>. Add --report for parameter
estimation, realism checks, and reserved-namespace warnings.
1. Per-layer override (B1.1)
Use case: keep the template as-is but nudge a specific schedule group (residual scaling, attention window, ...) without cloning the template.
schema_version: 1
layer_templates:
attn_dense:
mixer: { type: attention, attention: { window: null } }
ffn: { type: gated_mlp }
residual: { type: sequential, scaling: 1.0 }
layer_schedule:
- template: attn_dense
repeat: 6
override: # first 6 layers only
residual: { scaling: 0.5 }
attention: { window: 128 }
- template: attn_dense
repeat: 6 # remaining 6 use template defaults
Whitelist (value-typed only — param count is preserved):
- residual.scaling
- attention.window, attention.attn_drop
- norm.type, norm.position
- ffn.activation
Changing mixer.type is not allowed — create a new template instead.
Runtime: ✅ Core. Applied during IR materialization.
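As a mental model, the whitelist check plus merge can be sketched in a few lines. This is illustrative Python only, not the actual EulerStack materialization code, and it flattens the real template nesting (`attention` actually lives under `mixer`):

```python
# Hypothetical sketch of whitelisted per-layer override merging.
OVERRIDE_WHITELIST = {
    ("residual", "scaling"),
    ("attention", "window"), ("attention", "attn_drop"),
    ("norm", "type"), ("norm", "position"),
    ("ffn", "activation"),
}

def apply_override(template, override):
    # Copy so the shared template stays untouched, then merge each
    # whitelisted value-typed field onto the copy.
    layer = {k: dict(v) if isinstance(v, dict) else v
             for k, v in template.items()}
    for section, fields in override.items():
        for field, value in fields.items():
            if (section, field) not in OVERRIDE_WHITELIST:
                raise ValueError(f"override {section}.{field} is not allowed")
            layer.setdefault(section, {})[field] = value
    return layer

tmpl = {"residual": {"type": "sequential", "scaling": 1.0},
        "attention": {"window": None}}
layer = apply_override(tmpl, {"residual": {"scaling": 0.5},
                              "attention": {"window": 128}})
```

Because only value-typed fields are merged, the parameter count of the materialized layer never changes; structural edits like `mixer.type` fail the whitelist check.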
2. let: + ${…} expressions (B1.2)
Use case: express dependencies like d_model = n_heads × d_head right
in the YAML.
schema_version: 1
let:
n_heads: 16
d_head: 64
layers: 24
model:
name: "let-demo"
d_model: ${let.n_heads * let.d_head} # 1024
max_seq_len: ${let.layers * 512} # 12288
n_heads: ${let.n_heads}
n_kv_heads: ${let.n_heads // 2} # 8
layer_schedule:
- template: decoder
repeat: ${let.layers}
Allowed operators: + - * / // and parentheses; let.<name> references
only. Conditionals, function calls, and string operations are rejected
(Level 1).
Runtime: ✅ Core. Resolved in a pre-pass before validation.
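A Level-1 resolver of this kind can be sketched with the standard-library `ast` module. This is an illustrative sketch, not the EulerStack pre-pass: only the five arithmetic operators, numeric constants, parentheses, and `let.<name>` references survive the walk; anything else (calls, conditionals, strings) raises:

```python
import ast

# Hypothetical Level-1 ${...} expression resolver.
ALLOWED = (ast.Expression, ast.BinOp, ast.Add, ast.Sub, ast.Mult,
           ast.Div, ast.FloorDiv, ast.Constant, ast.Attribute,
           ast.Name, ast.Load)

def resolve(expr, lets):
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED):
            raise ValueError(f"rejected: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id != "let":
            raise ValueError("only let.<name> references are allowed")

    def ev(n):
        if isinstance(n, ast.Expression):
            return ev(n.body)
        if isinstance(n, ast.Constant):
            return n.value
        if isinstance(n, ast.Attribute):   # let.<name> lookup
            return lets[n.attr]
        ops = {ast.Add: lambda a, b: a + b, ast.Sub: lambda a, b: a - b,
               ast.Mult: lambda a, b: a * b, ast.Div: lambda a, b: a / b,
               ast.FloorDiv: lambda a, b: a // b}
        return ops[type(n.op)](ev(n.left), ev(n.right))

    return ev(tree)

d_model = resolve("let.n_heads * let.d_head", {"n_heads": 16, "d_head": 64})  # 1024
```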
3. Reserved namespaces (B0 / B6)
Use case: let plugins and in-progress research coexist with strict schema.
schema_version: 1
# ... (regular fields)
experimental.online_adaptation: # WARNING only, never ERROR
reward_source: reward_model
vendor.acme.telemetry:
endpoint: "https://telemetry.acme"
future.symbolic_interface:
mode: sidecar
- experimental.* — in-progress research
- future.* — reserved for v1.x+ additions
- vendor.<name>.* — third-party plugins
eulerstack validate --report shows [reserved_namespace] findings.
With a plugin registered, the same keys get functional interpretation.
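The scan itself is simple enough to sketch in a few lines of pure Python. Names here are hypothetical, not the EulerStack validator internals:

```python
# Hypothetical sketch of the reserved-namespace check: keys under the
# three reserved prefixes produce WARNING-level findings, never errors.
RESERVED_PREFIXES = ("experimental.", "future.", "vendor.")

def reserved_namespace_findings(spec):
    return [f"[reserved_namespace] {key}" for key in spec
            if key.startswith(RESERVED_PREFIXES)]

spec = {
    "schema_version": 1,
    "experimental.online_adaptation": {"reward_source": "reward_model"},
    "vendor.acme.telemetry": {"endpoint": "https://telemetry.acme"},
}
findings = reserved_namespace_findings(spec)   # two warnings, zero errors
```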
4. MLA — attention.latent_dim (B2.1)
Based on: DeepSeek-V3 Technical Report (2024). Compress KV through a shared latent; shrink KV cache memory.
layer_templates:
mla_decoder:
mixer:
type: attention
attention:
latent_dim: 384 # half of d_model=768 → ~50% KV cache savings
ffn: { type: gated_mlp }
Practical guidelines:
- Start at latent_dim ≈ d_model / 2
- Biggest wins at long context (≥ 16K)
- latent_dim ≥ d_model is rejected
Runtime: ✅ Core. CausalSelfAttention(latent_dim=…) performs real
compressed KV projection. Forward, backward, and KV-cache all live.
Demo preset: configs/presets/arch_advanced_mla.yml
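The compression idea can be sketched in plain PyTorch. This is a toy, not the EulerStack `CausalSelfAttention`: K and V are reconstructed from one shared latent, so the cache only needs to hold `latent_dim` values per token instead of separate full-width K and V:

```python
import torch
import torch.nn as nn

# Toy MLA-style latent KV compression (illustrative only).
class LatentKV(nn.Module):
    def __init__(self, d_model, latent_dim):
        super().__init__()
        assert latent_dim < d_model, "latent_dim >= d_model is rejected"
        self.down = nn.Linear(d_model, latent_dim, bias=False)  # cached output
        self.up_k = nn.Linear(latent_dim, d_model, bias=False)
        self.up_v = nn.Linear(latent_dim, d_model, bias=False)

    def forward(self, x):
        latent = self.down(x)   # [B, T, latent_dim] — this is what the cache stores
        return self.up_k(latent), self.up_v(latent), latent

mod = LatentKV(d_model=768, latent_dim=384)
x = torch.randn(1, 16, 768)
k, v, latent = mod(x)
```

At decode time only `latent` is appended to the cache; K and V are re-expanded on the fly, trading a small matmul for memory.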
5. Branched mixer — mixer.type: branched (B2.2)
Based on: Jamba (Lieber et al., AI21, 2024); this generalises the layer-level hybrid to per-token routing across mixer families.
layer_templates:
branched_layer:
mixer:
type: branched
branched:
branches:
ssm: { type: mamba, mamba: { variant: mamba2 } }
attn: { type: attention, attention: {} }
selector:
type: learned_gate # or: top_k
top_k: 1
input: hidden
ffn: { type: gated_mlp }
Constraints:
- At least 2 branches
- Nesting another branched inside a branch is forbidden in v1
Runtime: 🟡 Fallback — the compiler runs the first branch; the full
spec is preserved under config.stack.pattern[]._v1_extras.branched.
Real routing is plugin-track.
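For intuition, per-token routing across branches could look like the following toy module. This is a hypothetical sketch of what a routing plugin might do, not the fallback path the compiler runs today:

```python
import torch
import torch.nn as nn

# Toy per-token branch routing (illustrative, plugin-track semantics).
class BranchedMixer(nn.Module):
    def __init__(self, branches, d_model, top_k=1):
        super().__init__()
        self.branches = nn.ModuleDict(branches)
        self.gate = nn.Linear(d_model, len(branches))  # learned_gate selector
        self.top_k = top_k

    def forward(self, x):                          # x: [B, T, D]
        scores = self.gate(x).softmax(dim=-1)      # [B, T, n_branches]
        outs = torch.stack([b(x) for b in self.branches.values()], dim=-1)
        if self.top_k == 1:                        # hard top-1 routing
            idx = scores.argmax(dim=-1, keepdim=True).unsqueeze(-2)
            return outs.gather(-1, idx.expand(*x.shape, 1)).squeeze(-1)
        return (outs * scores.unsqueeze(-2)).sum(-1)   # soft mixture

# Stand-in branches; real ones would be mamba / attention mixers.
mixer = BranchedMixer({"ssm": nn.Linear(64, 64),
                       "attn": nn.Linear(64, 64)}, d_model=64)
y = mixer(torch.randn(2, 5, 64))
```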
6. TTT layer — mixer.type: ttt_layer (B2.3)
Based on: Sun et al. 2024, "Learning to (Learn at Test Time): RNNs with Expressive Hidden States".
layer_templates:
ttt_block:
mixer:
type: ttt_layer
ttt:
inner_model: { type: mlp, hidden: 256 }
inner_optimizer: sgd
inner_lr: 0.01
inner_steps_per_token: 1
ffn: { type: gated_mlp }
state:
ssm_state: true # persistent inner weights
Runtime: 🔌✅ Plugin-reference available (v1.1). Import
eulerstack.plugins.ttt and the plugin registers a real per-token
meta-learning TTTBlock into the plugin registry; core modeling then
upgrades the Mamba fallback for you. Without the import, the Mamba
fallback remains active (common-prompt §7 isolation).
import eulerstack.plugins.ttt # one-line activation
from eulerstack.compiler.compile import compile_to_hf_model
model = compile_to_hf_model(ir, seed=0) # real TTTBlock is instantiated
Implementation detail (functional fast-weights):
- Per token, the inner loss (MSE reconstruction against the projected
target) is differentiated via torch.autograd.grad, and the
fast-weight tensors (not the persistent parameters) are updated.
- At end of forward we copy the final fast weights back into the
persistent nn.Parameter under torch.no_grad() — no in-place
version conflicts with outer training.
Tests: tests/test_ttt_plugin_runtime.py (10) +
tests/test_runtime_hf_training_e2e.py::TestHFExportTTTTraining.
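The functional fast-weight pattern described above can be reduced to a toy loop. This is an illustrative sketch, not the plugin's `TTTBlock`: the inner target is simply the input itself, and fast weights are detached between tokens to keep the toy graph simple:

```python
import torch
import torch.nn as nn
from torch.func import functional_call

# Toy functional fast-weight TTT step (illustrative only).
inner = nn.Linear(8, 8)
fast = {k: v.detach().clone().requires_grad_(True)
        for k, v in inner.named_parameters()}

def ttt_step(x_t, target_t, fast, lr=0.01):
    pred = functional_call(inner, fast, (x_t,))
    loss = ((pred - target_t) ** 2).mean()            # inner MSE loss
    grads = torch.autograd.grad(loss, list(fast.values()))
    # Update the fast-weight tensors, not the persistent parameters.
    return {k: (w - lr * g).detach().requires_grad_(True)
            for (k, w), g in zip(fast.items(), grads)}

x = torch.randn(4, 8)
for t in range(x.shape[0]):                           # one inner step per token
    fast = ttt_step(x[t:t + 1], x[t:t + 1], fast)

with torch.no_grad():                                 # copy-back at end of forward
    for k, p in inner.named_parameters():
        p.copy_(fast[k])
```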
7. Mixture-of-Depths — schedule[].depth_gating (B3.1)
Based on: Raposo et al., ICML 2024.
layer_schedule:
- template: attn_block
repeat: 32
depth_gating:
enabled: true
capacity: 0.5 # route 50% of tokens through this layer
router: top_k # or: learned_gate
Practical guidelines:
- capacity: 0.5 is the paper default
- router: top_k is deterministic / reproducible; learned_gate is soft
Runtime: 🟢 Component:
from eulerstack.components.depth_gate import DepthGate
gate = DepthGate(d_model=768, capacity=0.5, router="top_k")
y = gate(x, body_fn=my_attention_layer)
Demo preset: configs/presets/arch_advanced_mod.yml
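The routing itself is easy to picture as a gather/scatter around the layer body. The following is an illustrative sketch of top-k MoD routing, not the `DepthGate` component:

```python
import torch
import torch.nn as nn

# Toy Mixture-of-Depths routing: the top capacity*T tokens per sequence
# run the body; the rest skip it via the residual path.
def depth_gate(x, body_fn, router, capacity=0.5):
    B, T, D = x.shape
    k = max(1, int(T * capacity))
    scores = router(x).squeeze(-1)                     # [B, T]
    idx = scores.topk(k, dim=1).indices                # deterministic top-k
    sel = idx.unsqueeze(-1).expand(B, k, D)
    picked = torch.gather(x, 1, sel)                   # routed tokens
    out = x.clone()                                    # skipped tokens pass through
    out.scatter_(1, sel, body_fn(picked))
    return out

router = nn.Linear(32, 1)
x = torch.randn(2, 8, 32)
y = depth_gate(x, lambda t: t + 1.0, router, capacity=0.5)
```

With `capacity: 0.5` and 8 positions, exactly 4 tokens per sequence pass through the body; the rest are copied unchanged, which is what keeps the expected FLOPs at roughly half of a dense layer.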
8. Parallel (monoidal) schedule — schedule[] parallel: (B3.2)
Based on: PaLM (2023), Flamingo (2022), Jamba (2024).
layer_schedule:
- parallel:
- stream: fast
body:
- { template: mamba_block, repeat: 6 }
- stream: slow
body:
- { template: attn_block, repeat: 6 }
merge:
type: concat # concat | add | gated | cross_attn
projection: true
Constraints:
- At least 2 streams
- Cannot nest parallel / integrator inside a stream body (flat only)
- Unique stream names
Runtime: 🟢 Component:
from eulerstack.components.parallel_stream import ParallelStream
p = ParallelStream(
[fast_mod, slow_mod],
d_model=768,
merge_type="concat",
merge_projection=True,
stream_names=["fast", "slow"],
)
y = p(x)
Demo preset: configs/presets/arch_expert_dual_stream.yml
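The `concat` merge with `projection: true` is worth seeing in isolation. A minimal sketch (not the `ParallelStream` component): both streams read the same input, concat doubles the width, and the merge projection restores `d_model` for the residual path:

```python
import torch
import torch.nn as nn

# Toy two-stream concat merge (illustrative only).
class TwoStreamConcat(nn.Module):
    def __init__(self, fast, slow, d_model):
        super().__init__()
        self.fast, self.slow = fast, slow
        self.proj = nn.Linear(2 * d_model, d_model)  # merge projection

    def forward(self, x):
        merged = torch.cat([self.fast(x), self.slow(x)], dim=-1)
        return self.proj(merged)

# Stand-in stream bodies; real ones would be mamba / attention stacks.
p = TwoStreamConcat(nn.Linear(64, 64), nn.Linear(64, 64), d_model=64)
y = p(torch.randn(2, 5, 64))
```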
9. Integrator — schedule[] integrator: (B3.3)
Based on: Universal Transformer (2019), PonderNet (2021), Diffusion-LM (2022), Coconut (2024) — unified under one primitive. From v1.1 the Neural-ODE reading (Chen et al. 2018) is also supported in core.
9a. discrete — K independent-weight steps (default)
layer_schedule:
- integrator:
type: discrete # Diffusion-LM style — K distinct weight copies
steps: 4
body: refine_block
output: token # or: hidden (Coconut latent reasoning)
Default compile materializes K independent copies. If you want the same
module applied K times with shared weights, assemble a
DiscreteIntegrator manually:
from eulerstack.components.integrator import DiscreteIntegrator
integrator = DiscreteIntegrator(refine_block, steps=4) # shared weights
9b. ode_euler / ode_rk4 — Neural-ODE shared weights (v1.1 core ✨)
layer_schedule:
- integrator:
type: ode_rk4 # or ode_euler
steps: 4
body: refine_block
output: token
Semantics: the body is interpreted as the derivative f(x) of an
ODE; the integrator runs K numerical steps with dt = 1/steps and
shares weights across all steps — raising steps does not raise
the parameter count.
- ode_euler — 1 body call per step (cheapest)
- ode_rk4 — 4 body calls per step (classic 4th-order Runge-Kutta)
from eulerstack.components.integrator import ODEIntegrator
odeint = ODEIntegrator(refine_block, steps=4, method="rk4")
Runtime path: EulerStackLayer._forward_ode carries per-step RoPE
and attention-mask plumbing through the RK4 iterations. KV cache is
disabled along the ODE path (its semantics are ambiguous there).
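The numerics are standard fixed-step integration. A self-contained sketch (illustrative, not `ODEIntegrator` itself — in particular it omits the RoPE and mask plumbing): the body `f` is the derivative, `dt = 1/steps`, and the same `f` serves every step, so raising `steps` adds compute but no parameters:

```python
# Toy fixed-step Euler / RK4 integrator over a derivative function f.
def ode_integrate(f, x, steps=4, method="rk4"):
    dt = 1.0 / steps
    for _ in range(steps):
        if method == "euler":                  # 1 body call per step
            x = x + dt * f(x)
        else:                                  # classic RK4: 4 body calls per step
            k1 = f(x)
            k2 = f(x + 0.5 * dt * k1)
            k3 = f(x + 0.5 * dt * k2)
            k4 = f(x + dt * k3)
            x = x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

# Sanity check on dx/dt = x from x(0) = 1: the exact value at t=1 is e.
approx = ode_integrate(lambda v: v, 1.0, steps=4, method="rk4")
```

Even at 4 steps, RK4 lands within about 1e-4 of e on this test problem, while Euler gives (1.25)^4 ≈ 2.441; that accuracy gap is the usual reason to pay 4 body calls per step.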
9c. ode_adaptive — reserved (plugin-only)
Adaptive step-size control (torchdiffeq etc.) is still reserved. The validator rejects it until a plugin registers the kind.
10. Memory module — template.memory: (B4.1)
Based on: Titans (Behrouz et al., Google, 2024-2025).
layer_templates:
attn_with_memory:
mixer: { type: attention, attention: {} }
ffn: { type: gated_mlp }
memory:
type: neural_memory
update_at_inference: true
params:
hidden: 2048
inner_lr: 0.001
persistence: session # per_query | session | persistent
Runtime: ✅ Core (v1.1). TitansMemoryModule auto-wires into any
layer whose template declares memory:. During training, the outer
optimiser learns the memory weights jointly; during inference, the
standardized hook step_memory_at_inference drives one inner SGD step
per call. Survives HF save_pretrained → from_pretrained out of the box.
from transformers import AutoModelForCausalLM
from eulerstack.hf.auto_register import register_eulerstack_auto_classes
register_eulerstack_auto_classes()
model = AutoModelForCausalLM.from_pretrained("./titans_model", trust_remote_code=True)
ids = tokenizer("a fact to memorize", return_tensors="pt").input_ids
out = model(ids, output_hidden_states=True)
surprise = model.step_memory_at_inference(out.hidden_states[-1])
# surprise: {"eulerstack.layers.0.titans_memory": 0.42, ...}
Demo preset: configs/presets/arch_expert_titans_memory.yml
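The inference-time inner step can be pictured as one SGD step on a reconstruction "surprise". A toy sketch (illustrative, not `TitansMemoryModule`): the memory MLP tries to reproduce the incoming hidden state, and the MSE both serves as the surprise signal and drives the update:

```python
import torch
import torch.nn as nn

# Toy surprise-driven memory update at inference (illustrative only).
class NeuralMemory(nn.Module):
    def __init__(self, d_model, hidden, inner_lr=1e-3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, hidden), nn.GELU(),
                                 nn.Linear(hidden, d_model))
        self.inner_lr = inner_lr

    @torch.enable_grad()                        # works even under no_grad callers
    def step_at_inference(self, h):             # h: [B, T, D]
        surprise = ((self.net(h) - h.detach()) ** 2).mean()
        grads = torch.autograd.grad(surprise, list(self.net.parameters()))
        with torch.no_grad():
            for p, g in zip(self.net.parameters(), grads):
                p -= self.inner_lr * g          # one inner SGD step per call
        return surprise.item()

mem = NeuralMemory(d_model=16, hidden=64)
h = torch.randn(2, 4, 16)
s1 = mem.step_at_inference(h)
s2 = mem.step_at_inference(h)   # surprise shrinks as the fact is absorbed
```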
11. Shape-change layer — template.shape_change: (B4.2)
Based on: Hourglass Transformer (Nawrot et al. 2021).
layer_templates:
wide_block: { mixer: { type: attention, attention: {} }, ffn: { type: gated_mlp } }
bottleneck:
mixer: { type: attention, attention: {} }
ffn: { type: gated_mlp }
shape_change:
d_out: 128
projection: linear # linear | conv1d | mlp
Runtime: 🔌 Plugin-track. Core modeling assumes constant d_model; the shape-changing residual wire-up ships with a plugin.
12. Reasoning mode — execution_modes: + transition: (B5)
Based on: DeepSeek-R1 (2025), OpenAI o1/o3 (2024), Quiet-STaR (NeurIPS 2024).
Architecture is unchanged — this is training-recipe and generate-path metadata.
execution_modes:
- name: think
max_tokens: 8192
kv_share: true
loss_weight: 0.0 # aux phase, excluded from primary LM loss
visible_to_user: false
- name: answer
max_tokens: 2048
loss_weight: 1.0
visible_to_user: true
transition:
type: special_token
token: "<think_end>"
Quiet-STaR variant (per-token rationale):
execution_modes:
- name: rationale
max_tokens: 16
per_token_rationale: true # Zelikman 2024
loss_weight: 0.1
visible_to_user: false
- name: answer
max_tokens: 256
loss_weight: 1.0
visible_to_user: true
Runtime: ✅ Core (declarative). Metadata round-trips; a custom
generate() honours the phases.
Demo preset: configs/presets/arch_expert_reasoning_r1.yml
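One concrete way `loss_weight` could enter training is as a per-token mask over the LM loss. A hypothetical sketch (`phased_lm_loss` is not an EulerStack API): think-phase tokens carry weight 0.0 and are excluded from the primary loss, answer-phase tokens carry 1.0:

```python
import torch
import torch.nn.functional as F

# Toy per-phase loss weighting (illustrative only).
def phased_lm_loss(logits, targets, phase_weight):
    # logits: [B, T, V]; targets, phase_weight: [B, T]
    per_token = F.cross_entropy(logits.transpose(1, 2), targets,
                                reduction="none")          # [B, T]
    denom = phase_weight.sum().clamp(min=1.0)
    return (per_token * phase_weight).sum() / denom

logits = torch.randn(1, 6, 10)
targets = torch.randint(0, 10, (1, 6))
weights = torch.tensor([[0., 0., 0., 1., 1., 1.]])  # think x3, answer x3
loss = phased_lm_loss(logits, targets, weights)
```

With these weights the result equals the mean cross-entropy over the answer tokens alone, matching the `loss_weight: 0.0` / `1.0` contract above.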
13. Reserved integrator types (B3.3, v1.x+)
ode_rk4 and ode_euler were promoted to core in v1.1 (see §9b).
Of the originally reserved types, only ode_adaptive remains:
layer_schedule:
- integrator:
type: ode_adaptive # RESERVED — needs a plugin (torchdiffeq)
steps: 8
body: refine_block
Status: schema reservation. Adaptive step-size control needs a
dedicated library, so it stays on the plugin track. Fixed-step ODEs
are already served by ode_rk4.
14. Weight form reservation (future)
Tensor-network weight forms (MPS / MERA / TT) are reserved for a future minor version. Today you can keep the intent recorded via reserved namespace:
vendor.tensor.weight_form: mera
A plugin implementing tensor-network weights can consume this key directly when it ships.
Validate & compile chain
One command validates any combination:
eulerstack --lang en validate --preset my_spec.yml --report
The report includes:
- schema ok
- estimated params
- layer count (integrator-expanded)
- realism warnings (RoPE head_dim, MoE expert count, ...)
- reserved-namespace warnings (if any)
Export to an HF custom model directory:
eulerstack --lang en compile --preset my_spec.yml --output-dir ./my_model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("./my_model", trust_remote_code=True)
# config.v1_extensions preserves execution_modes / schedule_kinds / _v1_extras
"Does it really work when you combine everything?" — Capstone preset
A single preset stitches every v1 primitive together:
configs/presets/arch_expert_kitchen_sink.yml.
- let: + ${…} expressions ✓
- reserved namespaces (experimental.* / vendor.*.* / future.*) ✓
- per-layer override (B1.1) ✓
- MLA (attention.latent_dim) + Titans neural memory on the same layer ✓
- all six mixers: mamba / retnet / hyena / attention / branched / ttt_layer ✓
- MoE FFN alongside gated_mlp ✓
- depth_gating (MoD) + parallel schedule + discrete integrator + ODE RK4 ✓
- execution_modes + transition (R1 contract) ✓
TDD coverage in tests/test_kitchen_sink_preset.py (10 tests):
- YAML passes validate_v2
- normalize_to_ir expands into 20 layers
- override preserves Titans memory / ODE metadata (regression guard)
- compile_to_hf_model produces an actual EulerStackForCausalLM
- one-step forward returns the expected logits shape
- save_pretrained → AutoModelForCausalLM.from_pretrained is deterministic (atol=1e-5)
- config.v1_extensions preserves execution_modes + ode_rk4
- HF training over 25 steps drives loss down (without plugins)
- Same training still descends after importing eulerstack.plugins.ttt
In short: "can you really combine all of this and still
compile → save_pretrained → train?" is answered by a green regression
test, every run.
Next steps
- Preset learning order: 02_use_presets.md (v1 three tiers: validated → hybrid → experimental)
- Full runtime matrix: docs/architectures/runtime_primitive_status.md
- Authoritative spec: docs/architectures/yaml_v1_spec.md