10. Paper → YAML Case Studies (DeepSeek-V3 / Jamba / DeepSeek-R1 / Titans)

This tutorial walks through four recent architectures as a dialogue between a professor (P) and a student (S). Unlike a quick cheat-sheet, the goal here is not just "write this YAML" but "understand why each field is there". Every case follows the same eight-step structure.

Step	Content
§X.1 Background & motivation	Why did this paper appear? What bottleneck does it fix, and how does it relate to prior work?
§X.2 Mechanism in depth	Equations and ASCII diagrams of the internals. How it differs from alternatives.
§X.3 YAML derivation	Start from an empty YAML and fill in each field with its rationale.
§X.4 Validate & compile	Actual CLI commands, expected output, parameter-count arithmetic.
§X.5 Parameter tuning	Sensible ranges for `latent_dim`, `capacity_factor`, etc.
§X.6 Traps & debugging	4-6 common pitfalls beginners hit, mapped to error messages.
§X.7 Combining with other primitives	Orthogonality / conflict matrix for 2nd-order combinations.
§X.8 Further reading	Paper links, follow-up research, relevant presets and tutorials.

Table of contents: - Case 1. DeepSeek-V3 — MLA (Multi-head Latent Attention) - Case 2. Jamba — Mamba × Attention × MoE hybrid - Case 3. DeepSeek-R1 — 2-phase reasoning (execution_modes) - Case 4. Titans — neural memory that learns at inference time - Closing — common shape across the four cases

Case 1. DeepSeek-V3 — MLA (Multi-head Latent Attention)

1.1 Background & motivation

S: Professor, DeepSeek-V3 (2024-12) makes a big deal of Multi-head Latent Attention. It looks like just another attention variant — why the hype?

P: Context first. In 2023-24 "long context" became the main axis of LLM competition. Llama 2's 4K → Llama 3's 8K → 32K → 128K in a year. The problem is this race came with unequal costs.

S: Unequal how?

P: Attention's compute cost collapsed thanks to FlashAttention, but the memory cost stayed. Specifically the KV cache size.

KV cache ≈ 2 · n_layers · n_heads · head_dim · seq_len · batch · dtype_bytes

For Llama 3 70B at 128K context in bf16 that's about 5 GB per layer, or ~400 GB across 80 layers. That won't fit on a single H100.

S: What did the field do about it?

P: Three paths:

Grouped-Query Attention (GQA) — Llama 2/3. Heads share K/V, cutting cache up to 8×. Cost: quality regression.
Sliding-window attention — Mistral / Longformer. Give up global attention, look inside a window only. Cost: weaker long-range.
Multi-head Latent Attention (MLA) — DeepSeek-V2/V3. Compress K/V into one low-dimensional latent and only cache that latent. No quality loss, memory savings.

Option 3 is MLA. GQA sacrifices head diversity, sliding window sacrifices locality, MLA sacrifices dimensionality, but it rebuilds diversity at query time via up-projection — which is why it (roughly) matches MHA quality.

S: Is MLA the current king?

P: As of 2025, MLA is effectively the standard for long-context memory efficiency. DeepSeek-V2/V3, some Qwen 2.5 variants, and Kimi's MoBA family all adopted it. Llama-family is still on GQA but next generations will likely move to MLA.

S: What's the difference between DeepSeek-V2 and V3?

P: MLA itself was introduced in V2. V3 is a scaled version (236B → 671B, MTP / Multi-Token Prediction training trick). The MLA structure is identical. The YAML we'll write is just "one MLA line" — the size comes from hyperparameters.

1.2 Mechanism in depth

S: Can I see MLA in equations?

P: Standard MHA first. Given x ∈ R^{T×d}:

Q = x W_q    (d → d)
K = x W_k    (d → d)
V = x W_v    (d → d)

attention = softmax(Q K^T / √d_h) V

MHA splits d into n_heads × head_dim, each head runs independently. What lives in the KV cache is both K and V, each shaped (B, n_heads, T, head_dim).

S: And MLA?

# Down-projection (once, shared between K and V)
kv_latent = x W_kv      (d → l)              ← this is what you cache

# Up-projection (reconstructed at compute time)
K = kv_latent W_k_up    (l → d)              ← rebuilt each time
V = kv_latent W_v_up    (l → d)

# Q stays standard
Q = x W_q               (d → d)

S: So "throw K and V away, keep only the latent that built them."

P: Exactly. The cache drops from 2 × d → 1 × l. With l = d/2 that's a 75% reduction in cache memory. With l = d/4 it's 87.5%. Without sharing heads.

S: But now you do an up-projection on every step. Doesn't that add compute?

P: Two tricks mitigate it:

Absorb trick: at inference you can fuse Q W_k_up^T ahead of time so the up-projection effectively disappears. DeepSeek does this.
Parallelism: the up-projection is a single matmul, and because attention is memory-bound on long context, the extra FLOPs cost is almost free relative to the KV read.

┌─────────────────────────────────────────────────┐
│  Cache layout comparison                         │
├─────────────────────────────────────────────────┤
│  MHA:         [K: H×D]  [V: H×D]     → 2HD/token │
│  GQA (4:1):   [K: H/4×D] [V: H/4×D]  → HD/2      │
│  MLA l=d/2:   [kv_latent: L]          → L = D/2   │
│                                                    │
│  d_model=768, 32 layers, 128K ctx, bf16:          │
│    MHA:   24 GB                                   │
│    GQA:    6 GB                                   │
│    MLA:    6 GB  (with MHA-level quality)         │
└─────────────────────────────────────────────────┘

S: What about parameter count?

P: Per attention block:

MHA: W_q, W_k, W_v, W_o each d² → 4d²
GQA: W_q (d²), W_k, W_v (d×d/g), W_o (d²) → 2d² + 2d²/g
MLA: W_q (d²), W_kv (d×l), W_k_up, W_v_up (l×d), W_o (d²) → 2d² + 3dl

With l = d/2 that's 2d² + 1.5d² = 3.5d² — MLA has fewer parameters than MHA and 25 % of the cache and comparable quality. Hence the "free lunch" framing.

1.3 YAML derivation

S: What does this look like in EulerStack?

P: Start empty. Model metadata first.

schema_version: 1

model:
  name: deepseek-v3-clone
  d_model: 768          # small for education. Real V3 has d_model=7168
  vocab_size: 32000
  max_seq_len: 32768    # 32K; V3 pushes to 128K
  n_heads: 12           # head_dim = 768/12 = 64
  n_kv_heads: 12        # MLA is orthogonal to GQA; both can coexist
  mlp_ratio: 4

S: n_kv_heads = n_heads means "no GQA"?

P: Right. MLA already compresses KV through a latent; layering GQA on top gets redundant. DeepSeek's paper uses MLA only. If you wanted "MLA + GQA 4:1" for aggressive savings you could set n_kv_heads: 3, but quality falls fast. Not recommended.

S: Tokenizer and embedding next.

P: DeepSeek's real tokenizer is a proprietary 100K BPE; for educational use we point at gpt2 just to pass validation.

tokenizer_contract:
  type: hf
  pretrained: gpt2
  add_bos: true
  add_eos: true

embedding:
  type: learned
  positional: rope
  rope_theta: 500000.0          # large theta for long context
  tie_word_embeddings: true

S: Why is rope_theta so big?

P: Great question. RoPE's default theta = 10000 dates to GPT-2's 4K-context era. Scale the context to 32K+ and the high-frequency axes start to wrap around — the model loses its ability to distinguish far-apart positions. Bigger theta means longer periods, keeping each position distinguishable. Llama 3 uses 500K, DeepSeek goes up to 10M.

S: Now layer_templates.

P: This is where MLA actually lives.

layer_templates:
  mla_decoder:
    mixer:
      type: attention
      attention:
        latent_dim: 384        # ← MLA in one line; half of d_model=768
        qkv_bias: false
    ffn:
      type: gated_mlp
      activation: swiglu
    norm:
      type: rmsnorm
      position: pre
    residual:
      type: sequential
      scaling: 1.0
    state:
      kv_cache: true

S: latent_dim: 384 is really the whole thing?

P: Yes. EulerStack's CausalSelfAttention instantiates five projections (Q, kv_latent, K_up, V_up, out) when latent_dim is set; otherwise the standard fused QKV path. At the spec layer you just declare "use MLA" and the runtime does the rest.

S: Schedule:

layer_schedule:
  - template: mla_decoder
    repeat: 12          # educational. V3 has 61 layers.

head:
  type: causal_lm
  tie_weights: true

compatibility:
  compile_target: huggingface

Done. That is a DeepSeek-V3 MLA structure.

1.4 Validate & compile

S: How do I check it?

P: Save it, one CLI call.

eulerstack --lang en validate --preset deepseek_v3_clone.yml --report

Expected output (abridged):

[validation] schema ... PASS
[validation] cross-field ... PASS
  - mixer.latent_dim=384 < d_model=768 ✓
  - d_model=768 % n_heads=12 == 0 ✓
  - head_dim = 64 (within 32-256 recommended range) ✓
[params] estimated: ~85M
  - embedding (learned + tied): ~25M
  - attention × 12: ~32M  (MLA savings reflected)
  - ffn (swiglu) × 12: ~28M
[realism] PASS
  - head_dim: 64 (recommended range)
  - rope_theta: 500000 (long-context justification OK)
  - n_kv_heads == n_heads: GQA off, MLA handles KV compression alone
OK: deepseek_v3_clone.yml is valid.

S: Is 85M correct? Can I cross-check?

Embedding (tied): vocab × d = 32000 × 768 ≈ 24.6M
Per attention layer (MLA, l=384): 2 × 768² + 3 × 768 × 384 = 1.18M + 0.88M = 2.06M
Per FFN (SwiGLU, ratio=4): 3 matrices × 768 × 3072 ≈ 7.1M
Per norm: 2 × 768 ≈ 1.5K (negligible)
12 layers: 12 × (2.06M + 7.1M) ≈ 110M
Tied embed: 24.6M once

Total ~134M. The CLI estimate is approximate and can be ±20%.

S: Let's actually build the HF model.

eulerstack --lang en compile --preset deepseek_v3_clone.yml --output-dir ./v3_clone

from transformers import AutoModelForCausalLM
from eulerstack.hf.auto_register import register_eulerstack_auto_classes

register_eulerstack_auto_classes()
model = AutoModelForCausalLM.from_pretrained("./v3_clone", trust_remote_code=True)

from eulerstack.blocks.attention import CausalSelfAttention
mla_layers = [m for m in model.modules()
              if isinstance(m, CausalSelfAttention) and m.latent_dim is not None]
print(f"MLA layers: {len(mla_layers)}, latent_dim={mla_layers[0].latent_dim}")
# → MLA layers: 12, latent_dim=384

print([n for n in mla_layers[0].state_dict() if 'proj' in n or 'latent' in n])
# → ['q_proj.weight', 'kv_latent.weight', 'k_up.weight', 'v_up.weight', 'out_proj.weight']

Seeing kv_latent / k_up / v_up in the state dict means real MLA is instantiated.

1.5 Parameter tuning guide

S: Is there a rule for picking latent_dim?

P: Practical table:

`latent_dim`	Cache savings	Quality impact	Use case
`d_model`	0%	Identical to MHA	(meaningless — validator rejects)
`d_model × 0.75`	~25%	~0	Conservative — first MLA adoption
`d_model × 0.5`	~75%	~0	Recommended starting point. DeepSeek default
`d_model × 0.33`	~83%	Slight	Tight-memory 128K+
`d_model × 0.25`	~87.5%	Noticeable	Extreme — require A/B
`d_model × 0.125`	~94%	Severe	Avoid

S: What's its relationship with head_dim?

P: Keep latent_dim an integer multiple of head_dim. The KV cache stride stays clean and FlashAttention-like kernels map well. With d_model=768, head_dim=64, sensible values are {128, 192, 256, 384, 512}.

S: How long can max_seq_len realistically go with MLA?

P: MLA's win grows with context. Below 2K it barely matters. Above 16K you feel it. By 128K it's essentially mandatory on single-node serving with A100/H100. DeepSeek runs at 128K out of the box.

S: Relation between rope_theta and max_seq_len?

P: Rule of thumb:

rope_theta ≈ max(10000, 10000 × (max_seq_len / 2048)²)

2K: 10K
4K: 40K
8K: 160K
32K: 2.6M (DeepSeek actually uses 500K and it suffices)
128K: 42M (at this point YaRN / NTK-aware scaling typically takes over)

1.6 Traps & debugging

S: What are the most common mistakes authoring an MLA YAML?

P: Six:

Trap 1. latent_dim >= d_model

ValidationError (R): mla_latent_dim_range — latent_dim=1024 exceeds d_model=768
Fix: set latent_dim < d_model (commonly d_model/2)

→ MLA means "compression"; ≥ d_model is meaningless.

Trap 2. latent_dim is odd or not a multiple of head_dim

warning: mla_latent_dim_range — latent_dim=100 is not a multiple of head_dim=64

→ Validate passes but the KV cache stride breaks, causing subtle perf regressions. Round to a multiple of 64.

Trap 3. "MLA alone solves long context" → MLA only cuts memory. If rope_theta is too small, positions still collide. Tune both together.

Trap 4. MLA layer with kv_cache: false

state:
  kv_cache: false        # ← eliminates MLA's raison d'être

→ MLA only pays off with the cache on. During pure training it may be off; for inference deployment it must be on.

Trap 5. Overlapping GQA and MLA

n_kv_heads: 2         # 4:1 GQA
attention:
  latent_dim: 128     # plus MLA

→ Technically legal but quality falls quickly. MLA alone is enough; keep n_kv_heads == n_heads.

Trap 6. "I can just load DeepSeek-V3's pretrained weights into my MLA YAML" → EulerStack ships randomly initialised models. DeepSeek-V3's real weights require a separate downloader and a conversion script (roadmap v1.2).

1.7 Combining with other primitives

S: What can I combine MLA with?

P: Four combinations matter in practice.

1.7.1 MLA + MoE (DeepSeek-V3's actual architecture)

layer_templates:
  mla_moe:
    mixer:
      type: attention
      attention: { latent_dim: 384 }
    ffn:
      type: moe
      moe: { experts: 32, top_k: 3, router: softmax, capacity_factor: 1.25 }

Orthogonal — MLA compresses attention memory, MoE scales FFN capacity. They never fight for the same real estate. This is what DeepSeek-V3 actually ships.

1.7.2 MLA + Jamba-style hybrid

layer_schedule:
  - template: mla_decoder
    repeat: 6
  - template: mamba_block
    repeat: 18

Complementary — only 1/4 of the layers are MLA; the rest are Mamba. Global-retrieval layers keep KV-cache savings; the rest enjoy Mamba's O(N). A plausible Samba / Jamba 1.5 upgrade.

1.7.3 MLA + Mixture-of-Depths

layer_schedule:
  - template: mla_decoder
    repeat: 24
    depth_gating:
      enabled: true
      capacity: 0.5
      router: top_k

Synergy — MoD skips 50% of tokens through this layer, so MLA's KV cache only grows by the surviving tokens. MoD's token savings multiply with MLA's memory savings. Very strong for 128K contexts.

1.7.4 MLA + execution_modes (R1 + V3)

execution_modes:
  - { name: think, max_tokens: 16384, kv_share: true,  loss_weight: 0.0 }
  - { name: answer, max_tokens: 2048, loss_weight: 1.0 }

Ideal — R1-style reasoning has long think phases (16K+) where KV cache is the bottleneck. MLA solves it directly. Likely to become the standard shape for next-gen reasoning models.

1.8 Further reading

Paper: DeepSeek-V3 Technical Report, arXiv:2412.19437 (2024)
Prior: DeepSeek-V2, arXiv:2405.04434 — MLA first introduced
Follow-ups: multiple open-source FlashAttention-3 integrations with MLA
EulerStack presets: configs/presets/arch_advanced_mla.yml, llm_{0p1b,0p8b,2b,4b,16b}_mla.yml
Related tutorial: 09 new primitives §4
Next experiment: split arch_advanced_mla's schedule as mla_decoder:repeat=8 → mamba_block:repeat=4 → mla_decoder:repeat=8 — a 3:1 hybrid. Measure whether MLA and Mamba complement each other.

Case 2. Jamba — Mamba × Attention × MoE hybrid

2.1 Background & motivation

S: Jamba mixes Mamba and attention — is it really just alternating them?

P: Deeper than that. 2023-24 was an SSM (state-space model) renaissance. S4 → H3 → Mamba built a serious case that "attention isn't necessary".

S: What's Mamba's edge?

P: Mamba processes sequences in linear time O(N) with near- attention expressiveness. Concretely:

Compute: attention O(N²) vs Mamba O(N)
Memory: attention's KV cache grows with N; Mamba's state is fixed size O(1)
Throughput: Mamba is 10-20× faster on long sequences

S: Then why didn't "no attention" win outright?

P: Mamba has a weakness: in-context learning (ICL). Precisely pointing to a past token — "pick the exact key/value pair from context" — is much worse than attention.

S: Why? Mamba also remembers the past.

P: It remembers compressed. Attention's KV cache carries every past token losslessly, and softmax can point to any of them. Mamba state holds the entire past in a fixed-size latent — lossy by design. That compression hurts ICL.

Empirically a Mamba-only model roughly matches attention on LAMBADA (short next-word) but loses badly on needle-in-haystack at 128K.

S: So Jamba comes in.

P: AI21 Labs, March 2024: Jamba-1 validated "mix Mamba and attention." Their ratio landed at Mamba 7 : Attention 1.

S: Why 7:1?

P: Empirically tuned. The minimum attention ratio to recover needle performance was 1/8 (12.5%). Going lower collapsed ICL; going higher gave up Mamba's speed. Sweet spot.

S: How does Jamba differ from similar work?

Work	Composition	Feature
Striped Hyena (Together AI, 2023)	Hyena + Attn	Hyena as attention-substitute
Samba (Microsoft, 2024)	Mamba + SWA 1:1	Sliding-window local + SSM global
Jamba (AI21, 2024)	Mamba + Attn 7:1 + MoE	MoE replaces FFN for capacity
Jamba 1.5 (AI21, 2024-08)	Same shape, scaled	256K context

Jamba's distinctive move is MoE in the FFN slot. Attention is already down to 1/8, so capacity goes into the FFN.

2.2 Mechanism in depth

S: Let me see a block diagram.

P: Jamba-32-layer schedule:

Layer 1:  Mamba    Layer 5:  Mamba     Layer 9:  Mamba     ...
Layer 2:  Mamba    Layer 6:  Mamba     Layer 10: Mamba
Layer 3:  Mamba    Layer 7:  Mamba     Layer 11: Mamba
Layer 4:  Attn     Layer 8:  Attn      Layer 12: Attn

  ↑ 8× repetitions of this pattern → 32 layers, 8 attention layers

Every 8th layer is attention, and at those positions the FFN is MoE:

Layer 4 (Attn + MoE):
  x → norm → attention → + residual
    → norm → MoE(8 experts, top-2) → + residual

Layers 1-3 (Mamba):
  x → norm → mamba2 → + residual
    → norm → gated_mlp(SwiGLU) → + residual

S: Why is the FFN on Mamba layers plain gated_mlp, not MoE?

P: Two reasons.

Mamba already blends mixer + FFN. A Mamba block contains state update and input-dependent gating — it's not a "pure mixer" the way attention is. Putting MoE on top is redundant.
MoE and attention scale on different axes. Attention = token communication; MoE = per-token processing variety. Putting them together is complementary.

S: Why does 1/8 attention suffice for ICL?

P: Solving a needle task only needs "pick a past token exactly" once per forward pass through the model. With 8 attention layers you get 8 global-lookup opportunities. The Mamba layers in between handle local patterns.

2.3 YAML derivation

S: Build the Jamba YAML from scratch.

P: Model metadata first.

schema_version: 1

model:
  name: jamba-clone
  d_model: 768               # educational. Jamba-1 is d_model=4096
  vocab_size: 32000
  max_seq_len: 32768
  n_heads: 12
  n_kv_heads: 4              # GQA 3:1 — extra savings on the attention layers
  mlp_ratio: 4

tokenizer_contract:
  type: hf
  pretrained: gpt2
  add_bos: true
  add_eos: true

S: n_kv_heads: 4 means GQA 3:1. Does Jamba actually use GQA?

P: Yes. Only 1/8 of the layers are attention, but those few are responsible for the entire KV cache. Adding GQA compresses it further. The real Jamba uses GQA 4:1.

S: Next, embedding.

embedding:
  type: learned
  positional: rope
  rope_theta: 500000.0
  tie_word_embeddings: true

S: Wait — I heard Mamba doesn't need positional encoding.

P: Mamba is recurrent and encodes position implicitly in its state. But attention layers still need it. Jamba applies RoPE only to attention layers. In the YAML you declare positional: rope and the Mamba layers transparently ignore it.

S: Templates.

P: Two — one for Mamba, one for Attention+MoE.

layer_templates:
  mamba_block:
    mixer:
      type: mamba
      mamba:
        variant: mamba2        # Mamba2 (SSD formulation)
        d_state: 128           # internal state dimension
        d_conv: 4              # local conv kernel size
        expand: 2              # internal expansion ratio
    ffn:
      type: gated_mlp
      activation: swiglu
    norm: { type: rmsnorm, position: pre }
    state: { ssm_state: true, kv_cache: false }

  attn_moe:
    mixer:
      type: attention
      attention:
        qkv_bias: false
    ffn:
      type: moe
      moe:
        experts: 8
        top_k: 2
        router: softmax
        capacity_factor: 1.25
        z_loss: 0.001
    norm: { type: rmsnorm, position: pre }
    state: { kv_cache: true }

S: Is d_state: 128 big or small?

P: Mamba2 defaults to 64-128. Larger state = more "memory capacity" but slower. Jamba picks 128.

S: And capacity_factor: 1.25?

P: MoE load-balancing parameter. 1.0 = each expert gets slots for exactly tokens / experts tokens. 1.25 adds a 25 % buffer so that training-time routing imbalances don't overflow. At inference you can lower to 1.0.

S: The schedule is where Jamba lives.

layer_schedule:
  # 8 cycles × (3 Mamba + 1 Attn+MoE) = 32 layers
  - template: mamba_block
    repeat: 3
  - template: attn_moe
    repeat: 1
  - template: mamba_block
    repeat: 3
  - template: attn_moe
    repeat: 1
  - template: mamba_block
    repeat: 3
  - template: attn_moe
    repeat: 1
  - template: mamba_block
    repeat: 3
  - template: attn_moe
    repeat: 1
  - template: mamba_block
    repeat: 3
  - template: attn_moe
    repeat: 1
  - template: mamba_block
    repeat: 3
  - template: attn_moe
    repeat: 1
  - template: mamba_block
    repeat: 3
  - template: attn_moe
    repeat: 1
  - template: mamba_block
    repeat: 3
  - template: attn_moe
    repeat: 1

S: That's verbose.

P: You can tidy the arithmetic with let::

let:
  mamba_per_cycle: 3
  total_cycles: 8
# use ${let.mamba_per_cycle * 8 + 8} = 32 in places you need the
# total layer count.

YAML doesn't have for-loops (v1 is Level-1 arithmetic only); that trade-off is characteristic of dedicated design languages. HDLs hit the same wall early: Verilog lacked repetition until generate ... endgenerate was added, and VHDL solved it with its generate statement. EulerStack's schedule: iterator is on the v1.x roadmap for the same reason.

S: Head and hints:

head:
  type: causal_lm
  tie_weights: true

training_hints:
  init: normal_0.02
  dropout: 0.0
  checkpointing: true          # 32 layers — save memory
  seed: 1234

compatibility:
  compile_target: huggingface

2.4 Validate & compile

S: Validate output?

eulerstack validate --preset jamba_clone.yml --report

[validation] schema ... PASS
[validation] cross-field ... PASS
  - mamba_block.state.ssm_state=true ↔ mixer=mamba ✓
  - attn_moe.state.kv_cache=true ↔ mixer=attention ✓
  - MoE: top_k=2 ≤ experts=8 ✓
[params] estimated: ~1.04B
  - embedding: ~25M
  - mamba × 24: ~240M
  - attn+moe × 8: ~280M
  - ffn gated_mlp (mamba side): ~480M
  - norms, head: ~15M
[realism] PASS
  - positional: rope + mamba mixer: OK (attention is present)
  - MoE expert ratio: 8 × 1/4 layers: OK
OK: jamba_clone.yml is valid.

S: Why are active parameters smaller?

P: MoE. Total ~1.04B, but each forward activates:

embedding (25M)
mamba × 24 (240M)
attn+moe × 8: attn (~50M) + only 2/8 experts → ~17.5M

Active total: ~330M. "Small-model compute, big-model capacity" — the classic MoE win. DeepSeek-V3 pushes this to 671B total / 37B active.

2.5 Parameter tuning guide

S: Why Mamba:Attn = 7:1?

P: Practical guide:

Mamba:Attn	ICL (needle)	Speed	Use case
0:1 (pure attn)	Best	Slow (O(N²))	Baseline. Llama-family
1:1 (Samba)	Nearly best	Medium	Balanced, long-ctx first
3:1	High	Fast	General recommendation
7:1 (Jamba)	Sufficient	Fast	Jamba default
15:1	Noticeable drop	Very fast	Risky — needle fails
1:0 (pure Mamba)	Very low	Fastest	Special use only

S: experts × top_k?

`experts` × `top_k`	Active ratio	Model	Note
8 × 2	25%	Mixtral, Jamba	Standard. Stable
16 × 2	12.5%	Phi-3.5-MoE	More sparse, more capacity
64 × 6	9.4%	DeepSeek-V2	Fine-grained
256 × 8	3.1%	DeepSeek-V3	Extremely fine-grained

Fine-grained = more experts with smaller dim, top_k grows accordingly. Each expert becomes a narrower specialist. DeepSeek pushed this and showed the gains.

S: capacity_factor?

1.0: theoretical minimum — load-balancing must be near-perfect.
1.25: recommended default. 25% buffer during training.
1.5: very unbalanced early training; drop to 1.25 once stable.
2.0+: rarely useful — wasted memory.

2.6 Traps & debugging

Trap 2.1. Mamba-only model, ICL collapses → Symptom: needle benchmark crashes. Cause: 0% attention. Fix: keep at least 1/8 attention.

Trap 2.2. ssm_state: false on a Mamba layer

state:
  ssm_state: false     # ← Mamba without state

→ Mamba is defined by its state; the cross-field validator raises CompatibilityError. ssm_state: true is required.

Trap 2.3. Attention sub-object on a Mamba layer

mixer:
  type: mamba
  attention:             # ← wrong sub-object
    latent_dim: 128

→ Validator: "sub-object 'attention' not allowed when type='mamba'".

Trap 2.4. MoE top_k > experts

moe: { experts: 4, top_k: 8 }   # impossible

→ ValidationError (R14): top_k=8 > experts=4.

Trap 2.5. positional: none with any attention → Allowed, but attention can't distinguish positions → unstable training. Jamba's answer is positional: rope; Mamba transparently ignores it.

Trap 2.6. Starting training with capacity_factor: 1.0 → Early routing is very unbalanced → many dropped tokens → loss spikes. Start at 1.25, lower once stable.

Trap 2.7. MoE on every single layer

- template: attn_moe
  repeat: 32

→ Total params × 8, active only 2/8. Jamba's philosophy is "MoE only on the attention layers." Mixtral-style "MoE everywhere" is a separate design.

2.7 Combining with other primitives

2.7.1 Jamba + MLA — next-gen inference

layer_templates:
  attn_mla_moe:
    mixer:
      type: attention
      attention: { latent_dim: 384 }
    ffn:
      type: moe
      moe: { experts: 8, top_k: 2, router: softmax, capacity_factor: 1.25 }

Strong — Jamba's 1/8 attention holds the whole KV cache, so MLA on top is pure win. Conceptually DeepSeek-V3 × Jamba.

2.7.2 Jamba + execution_modes

execution_modes:
  - { name: think, max_tokens: 8192, kv_share: true, loss_weight: 0.0 }
  - { name: answer, max_tokens: 2048, loss_weight: 1.0 }

Complementary — Mamba handles long think in O(N), attention layers do retrieval for the final answer.

2.7.3 Jamba + Titans memory

layer_templates:
  attn_with_mem:
    mixer: { type: attention, attention: {} }
    ffn: { type: moe, moe: {...} }
    memory:
      type: neural_memory
      update_at_inference: true
      params: { hidden: 1024 }
      inner_lr: 0.001

Experimental — attach Titans memory only to the attention layers. Test-time learning within multi-turn sessions.

2.7.4 Jamba + Mixture-of-Depths

layer_schedule:
  - template: mamba_block
    repeat: 3
  - template: attn_moe
    repeat: 1
    depth_gating:
      enabled: true
      capacity: 0.5
      router: top_k

Synergy — only 50% of tokens go through the retrieval (attention) layer. "Easy tokens go through Mamba; only the hard ones need attention."

2.8 Further reading

Paper: Lieber et al., "Jamba: A Hybrid Transformer-Mamba Language Model," arXiv:2403.19887 (2024)
Follow-up: Jamba 1.5 report (2024-08)
Background: Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," arXiv:2312.00752 (2023)
Related hybrids: Samba (Microsoft, 2024), Zamba (Zyphra, 2024)
EulerStack presets: configs/presets/arch_advanced_jamba.yml, configs/presets/arch_advanced_samba.yml, llm_{0p1b,0p8b,2b,4b,16b}_jamba.yml
Related tutorial: 09 new primitives, mixers/02_mamba.md
Next experiment: change 7:1 to 3:1 and compare against Samba. Needle score at 32K context — how does quality change?

Case 3. DeepSeek-R1 — 2-phase reasoning (execution_modes)

3.1 Background & motivation

S: R1 "reasons first, answers later" — does the architecture itself change?

P: The answer is almost no. R1's architecture is essentially DeepSeek-V3 (MLA + fine-grained MoE). What changes is the training and generation strategy, not the structure.

S: Then why all the buzz?

P: Late 2024 OpenAI announced o1, a black-box model trained with RL to do long chain-of-thought reasoning. The field wondered "how does it work?". In January 2025 DeepSeek released R1 open source, demonstrating what o1 essentially does. The recipe:

Train a base model (DeepSeek-V3) to produce long reasoning inside <think> tags.
Learn "reasoning that reaches correct answers" via RL (a GRPO variant).
At serve time, hide the think phase from the user and output only answer.

S: So structurally it's V3?

P: Exactly. The real innovation is "RL alone can bootstrap reasoning", not a new attention. It's a training paper more than an architecture paper.

S: How does EulerStack model R1 then?

P: v1 exposes execution_modes + transition as declarative metadata. Keep the architecture standard; carve a "this model does 2-phase reasoning" contract into the config. Training scripts and the custom generate() loop consume it.

S: Why not an architecture field?

P: Three reasons:

The structure doesn't actually change. Putting it in a structure field would be lying.
Many phase strategies exist — R1 (think/answer), Quiet-STaR (per-token rationale), o3 (multi-round). Metadata lets you swap strategies at the YAML-diff level.
Training / serving pipelines share one contract. Training reads loss_weight; serving reads visible_to_user; eval reads kv_share. One metadata block.

3.2 Mechanism in depth

S: Show me R1's training pipeline in detail.

P: Four stages.

Stage 1 — Cold start

Input:   "Prove that √2 is irrational."
Output:  "<think>
           Assume √2 = a/b in lowest terms...
           Then a² = 2b², so a is even.
           Let a = 2c, then 4c² = 2b², b² = 2c²
           So b is also even, contradiction.
           </think>
         <answer>
           √2 is irrational because assuming rationality
           leads to a contradiction.
           </answer>"

Supervised fine-tune with rule-based reward.

Stage 2 — Reasoning-focused RL - RL (GRPO) with reward based on <think> length and accuracy of the final answer. - <answer> stays in a fixed template.

Stage 3 — Rejection sampling + SFT - Collect only the trajectories that reached the correct answer; use them for re-supervised training.

Stage 4 — Full RLHF + preference - <answer> is shaped for helpfulness / harmlessness.

S: How long is <think> in practice?

P: R1's think averages 2000-4000 tokens, stretching to 16K+ on difficult problems. With a 32K total context, think can occupy 28K sometimes.

S: So max_seq_len has to be large.

P: And KV sharing matters — if you throw away think's KV before answer starts, you waste memory. R1 shares.

S: How is Quiet-STaR different?

P: Zelikman et al. (NeurIPS 2024). Alternative approach:

	R1	Quiet-STaR
Phase shape	One long think, then answer	Short rationale per token
Rationale length	2K-16K	16 (short)
Transition	`<think_end>` special token	Fixed-length steps
Training signal	Final-answer reward	Next-token loss reduction

Both fit the same execution_modes schema.

3.3 YAML derivation

S: Let's build R1's YAML from empty.

P: Base is DeepSeek-V3-like. I'll keep MLA and drop MoE to stay simple.

schema_version: 1

model:
  name: r1-clone
  d_model: 1024              # educational
  vocab_size: 32000
  max_seq_len: 16384         # think(8K) + answer(2K) + headroom
  n_heads: 16
  n_kv_heads: 4              # GQA
  mlp_ratio: 4

tokenizer_contract:
  type: hf
  pretrained: gpt2
  add_bos: true
  add_eos: true

embedding:
  type: learned
  positional: rope
  rope_theta: 500000.0       # long context
  tie_word_embeddings: true

S: How do I pick max_seq_len?

P: sum(execution_modes.max_tokens) + buffer. Here 8192 + 2048 + 6144 = 16384. The execution_modes_budget realism rule auto-checks it.

S: Template:

P: A single MLA attention template.

layer_templates:
  reasoner:
    mixer:
      type: attention
      attention:
        latent_dim: 512       # d_model/2 — helps long think
        qkv_bias: false
    ffn:
      type: gated_mlp
      activation: swiglu
    norm: { type: rmsnorm, position: pre }
    state: { kv_cache: true }

(Real R1 also has MoE; we drop it for brevity.)

layer_schedule:
  - template: reasoner
    repeat: 24

head:
  type: causal_lm
  tie_weights: true

compatibility:
  compile_target: huggingface

S: That's basically V3. Where does R1 kick in?

P: Only the execution_modes block at the bottom.

execution_modes:
  - name: think
    max_tokens: 8192
    kv_share: true                # answer reuses think's KV
    loss_weight: 0.0              # excluded from primary LM loss
    visible_to_user: false        # hidden from user at serve time
    per_token_rationale: false    # R1 style, not Quiet-STaR

  - name: answer
    max_tokens: 2048
    kv_share: false
    loss_weight: 1.0              # answer participates in LM loss
    visible_to_user: true
    per_token_rationale: false

transition:
  type: special_token
  token: "<think_end>"

S: Explain kv_share: true once more.

P: During think, the model builds a KV cache. When answer starts, continue using that same cache rather than starting fresh. If false, answer begins with empty KV and has to "re-read" think — wasteful. R1 is true on think.

S: per_token_rationale?

P: Quiet-STaR's switch. R1 keeps it false. Quiet-STaR flips it on:

execution_modes:
  - name: rationale
    max_tokens: 16
    per_token_rationale: true
    loss_weight: 0.1
    visible_to_user: false

  - name: answer
    max_tokens: 256
    loss_weight: 1.0
    visible_to_user: true

S: What transition.type values exist?

special_token: phase switches when a token like <think_end> is generated. R1.
budget_exhausted: force a switch when max_tokens is hit. Anti- infinite-loop guard.

special_token is the common primary; budget_exhausted is a fallback. Many recipes combine them as "whichever happens first".

3.4 Validate & compile

S: Validate:

eulerstack validate --preset r1_clone.yml --report

[validation] schema ... PASS
[validation] cross-field ... PASS
  - MLA latent_dim=512 < d_model=1024 ✓
  - d_model % n_heads == 0 (head_dim=64) ✓
[validation] execution_modes ... PASS
  - 2 modes: [think, answer]
  - transition: special_token ("<think_end>") ✓
  - execution_modes_budget: 8192 + 2048 = 10240 ≤ max_seq_len=16384 ✓
[params] estimated: ~480M
[realism] PASS
  - kv_share=true + think-before-answer: canonical R1 shape
OK: r1_clone.yml is valid.

S: What ends up in config.v1_extensions after compile?

from transformers import AutoConfig
cfg = AutoConfig.from_pretrained("./r1_clone", trust_remote_code=True)
print(cfg.v1_extensions)
# {
#   'execution_modes': [...],
#   'transition': {'type': 'special_token', 'token': '<think_end>'},
#   'schedule_kinds': [...]
# }

This is the shared contract the training script and the serving generate() both read.

3.5 Parameter tuning guide

S: How big should think.max_tokens be?

P: Depends on difficulty.

Use case	think max_tokens	answer max_tokens
Simple QA (history, trivia)	512	512
Mid-level math proof	2048	1024
Olympiad math / code	8192	2048
Multi-step agent	16384	4096
Open-research hard	32768	8192

R1 recommends 8K think as default.

S: loss_weight: 0.0 — does that mean think isn't trained?

P: Not trained by primary LM loss. The RL stage uses a separate reward. If you train think with LM loss you end up memorising long strings, which hurts generalisation.

Quiet-STaR uses loss_weight: 0.1 instead — small weight, so LM loss does contribute a little. Different philosophy — "learn rationales that improve next-token prediction".

S: When should kv_share be false?

P: Two cases:

Multi-turn reasoning where each turn is an independent think. Don't share across turns.
Think-restart protocols that discard failed thinks and retry.

Default: think.kv_share: true, answer.kv_share: false. Answer reuses think's KV, but answer's own KV is discarded when done.

3.6 Traps & debugging

Trap 3.1. sum(execution_modes.max_tokens) > model.max_seq_len

model: { max_seq_len: 4096 }
execution_modes:
  - { name: think, max_tokens: 8192 }    # ← budget overflow
  - { name: answer, max_tokens: 2048 }

→ execution_modes_budget realism warning. Position overflow at runtime.

Trap 3.2. transition.token not in the tokenizer vocab

transition:
  type: special_token
  token: "<think_end>"         # without registering, becomes unk

→ Validates fine, but the model splits it into "<", "think_end" etc. at runtime — transition detection fails. Register the special token:

tokenizer.add_special_tokens({"additional_special_tokens": ["<think_end>"]})

Trap 3.3. "Declaring execution_modes makes it automatically work" → EulerStack only preserves the metadata. Phase-aware generate and RL training must be implemented by the user. Sample generate loop in examples/r1_generate.py (roadmap v1.2).

Trap 3.4. Using per_token_rationale: true in an R1 mode → R1 is "one long think". per_token_rationale is Quiet-STaR-only. Easy to confuse.

Trap 3.5. kv_share: true but position indexing resets → If answer reuses KV, its position indices must continue — answer's first token is at index len(think) + 1. Easy to miss in implementations.

Trap 3.6. "loss_weight=0 → not learned" so skip RL too → loss_weight is the LM-loss weight; RL reward is separate. If both are 0, nothing is learned. R1: LM loss 0, RL reward based on answer correctness.

3.7 Combining with other primitives

3.7.1 R1 + MLA (recommended)

The YAML above already combines them. Long think makes KV cache the bottleneck, and MLA is a natural partner.

3.7.2 R1 + Titans memory

layer_templates:
  reasoner_with_memory:
    mixer: { type: attention, attention: { latent_dim: 512 } }
    ffn: { type: gated_mlp }
    memory:
      type: neural_memory
      update_at_inference: true
      params: { hidden: 2048 }
      inner_lr: 0.001
      persistence: session       # think accumulates insight, answer uses it

Experimental — think populates memory; answer consults it. No established academic result yet, but an interesting research direction.

3.7.3 R1 + MoE

Real R1 rides on V3, so it has MoE. We dropped it for clarity; add it back for scale:

ffn:
  type: moe
  moe: { experts: 64, top_k: 6, router: softmax, capacity_factor: 1.25 }

3.7.4 R1 + Neural-ODE integrator

layer_schedule:
  - integrator:
      type: ode_rk4
      steps: 4
      body: reasoner
      output: token

Research — interpret reasoning as a continuous ODE flow. CoT as a dynamical system. Early days.

3.8 Further reading

Paper: "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning," arXiv:2501.12948 (2025)
Prior: OpenAI "Learning to Reason with LLMs" blog (o1 announcement, 2024-09)
Related: Zelikman et al., "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking," arXiv:2403.09629 (NeurIPS 2024)
Follow-ups: o3 technical report, Kimi k1.5 report
EulerStack preset: configs/presets/arch_expert_reasoning_r1.yml
Related tutorial: 09 new primitives §12
Next experiment: add MLA to arch_expert_reasoning_r1, push think.max_tokens 8192 → 16384, rope_theta up to 1M — verify a true long-think flow still works.

Case 4. Titans — neural memory that learns at inference time

4.1 Background & motivation

S: "Learns at test time" — does that even make sense?

P: It genuinely does. Titans (Behrouz, Zhong, Mirrokni — Google, 2024-2025) makes it academically rigorous. The idea:

Only a small "memory module" attached to each layer — not the whole model — takes inner gradient steps during inference.

S: Why do we need it? We have a KV cache.

P: Watch what KV cache can't do.

Mechanism	Parametric?	Grows with seq len?	Update mode
KV cache	✗	✓ linear growth	Token append only
Mamba state	✗ (fixed size)	✗	Recurrent update
Titans memory	✓	✗	Gradient step

Titans' twist: it's a small parametric network whose parameters get gradient-updated during inference. KV stores "what was seen"; Titans memory stores "what was learned from seeing it".

S: Any prior work?

Work	Year	Idea	Relation to Titans
Memorizing Transformers (Wu et al.)	ICLR 2022	External KV memory + k-NN lookup	Non-parametric ancestor
RETRO (Borgeaud et al.)	ICML 2022	Retrieval-augmented	External DB; Titans is internal
Infini-attention (Google, 2024)	2024-04	Compressive memory with MLP	Closest precursor
TTT layer (Sun et al., 2024)	2024-07	Inner-optim for the whole mixer	Layer-wide; Titans is a side-module
Titans (Behrouz et al.)	2024-12	Small MLP per layer, inner SGD at inference	★

Think of Titans as the TTT-style inner-optim idea restricted to a small auxiliary memory module — a practical engineering.

S: What can it actually do?

P: From the paper's benchmarks:

Needle-in-haystack at 2M context: beats Transformer + RAG.
Ongoing-conversation consistency: session-wide coherence improves.
Long-form knowledge accumulation: reading a book gradually builds a summary inside memory.

Regular LLMs forget once context overflows; Titans remembers by "summarising into memory." That's what "test-time learning" actually buys.

4.2 Mechanism in depth

S: Math please.

P: The Titans memory module M is a small MLP:

M_θ(x) = W_2 · GeLU(W_1 · LayerNorm(x) + b_1) + b_2
         θ = {W_1, b_1, W_2, b_2, m_state}
         m_state ∈ R^h   ← "persistent memory vector"

Forward is a standard residual:

y = x + g(x) ⊙ M_θ(x)        where g(x) = sigmoid(W_g x)

g(x) gates memory's contribution per token.

S: The "inference-time update"?

P: After each forward an optional hook runs:

# 1. compute "surprise"
surprise = || M_θ(x) - x ||²      ← how well memory reconstructs the
                                      current hidden

# 2. gradient step on memory params only (not base model!)
θ ← θ - η · ∇_θ surprise

S: Why reconstruction MSE for surprise?

P: Information-theoretic intuition. If memory can't reconstruct x, then x is "unexpected" → worth remembering. Large update, memory learns x. If x is already well-predicted, update is small — already known.

┌───────────────────────────────────────────────────────┐
│  Surprise learning loop                                │
├───────────────────────────────────────────────────────┤
│                                                         │
│   token_t ──→ layer ──→ hidden_t                        │
│                            │                            │
│                    ┌───────┴───────┐                    │
│                    │               │                    │
│                 forward         surprise                │
│                    │               │                    │
│                    ▼               ▼                    │
│               y = x + M(x)    || M(x) - x ||²          │
│                                    │                    │
│                                    ▼                    │
│                            ∇_θ surprise                 │
│                                    │                    │
│                                    ▼                    │
│                         θ ← θ - η · ∇_θ                │
│                                                         │
└───────────────────────────────────────────────────────┘

S: What prevents gradient leaking into the base model?

P: Gradients are isolated to the memory parameters:

# pseudo-code
with torch.enable_grad():
    for p in memory.parameters():
        p.requires_grad_(True)
    for p in base_model.parameters():
        p.requires_grad_(False)

    surprise = mse(memory(x_det), x_det)   # x_det = x.detach()
    surprise.backward()

    with torch.no_grad():
        for p in memory.parameters():
            p.sub_(p.grad, alpha=inner_lr)
            p.grad = None

x is detached so no grad flows into the outer graph. Only memory parameters move.

S: persistence?

P: Lifetime policy:

per_query: reset to initial state after each forward. Single-shot.
session: persist through a conversation. Reset at the end. Default.
persistent: never reset; serialised to model.safetensors on save_pretrained. Carries over across sessions.

persistent is dangerous for multi-user serving — every user's data lands in the same memory. Serving should stay on session with per- user model instances.

4.3 YAML derivation

S: Empty YAML first.

schema_version: 1

model:
  name: titans-demo
  d_model: 768
  vocab_size: 32000
  max_seq_len: 8192
  n_heads: 12
  n_kv_heads: 4
  mlp_ratio: 4

tokenizer_contract:
  type: hf
  pretrained: gpt2
  add_bos: true
  add_eos: true

embedding:
  type: learned
  positional: rope
  rope_theta: 500000.0
  tie_word_embeddings: true

S: Now memory enters the template.

P: Yes. The key piece.

layer_templates:
  attn_with_memory:
    mixer:
      type: attention
      attention:
        qkv_bias: false
    ffn:
      type: gated_mlp
      activation: swiglu
    norm: { type: rmsnorm, position: pre }
    state: { kv_cache: true }

    memory:
      type: neural_memory
      update_at_inference: true       # enable grad step at inference
      params:
        hidden: 1024                  # internal MLP size
      inner_lr: 0.001                 # SGD step size
      persistence: session            # per_query | session | persistent

S: How do I choose hidden: 1024?

P: That knob controls memory's expressive capacity.

`hidden`	Added params (d=768)	Use case
256	~0.4M/layer	Minimal — memory mostly idle
512	~0.8M/layer	Light — basic fact memorisation
1024	~1.6M/layer	Standard — Titans paper default
2048	~3.2M/layer	Larger — long-form summarisation
4096	~6.4M/layer	Excessive — overfitting risk

1× to 2× d_model is the sweet spot.

S: inner_lr: 0.001?

P: Bigger than outer LR (AdamW 3e-4) is OK — inner updates only touch memory and only one step at a time.

0.0001: very conservative, barely updates
0.001: standard
0.01: aggressive, fast learning, possible instability
0.1+: dangerous — memory overreacts to recent surprises

Titans paper uses 0.001.

S: Does memory need to be on every layer?

P: Optional. For example, only the last 8 of 24 layers:

layer_templates:
  plain_attn:
    mixer: { type: attention, attention: {} }
    ffn:   { type: gated_mlp }

  attn_with_memory:
    mixer: { type: attention, attention: {} }
    ffn:   { type: gated_mlp }
    memory: { ... }

layer_schedule:
  - template: plain_attn
    repeat: 16                # first 16 — plain
  - template: attn_with_memory
    repeat: 8                 # last 8 — memory

Intuition: "early layers recognise patterns, late layers accumulate memory."

S: Full YAML:

layer_schedule:
  - template: attn_with_memory
    repeat: 18

head:
  type: causal_lm
  tie_weights: true

compatibility:
  compile_target: huggingface

4.4 Validate & compile

S: Validate:

[validation] schema ... PASS
[validation] memory ... PASS
  - type: neural_memory ∈ {neural_memory, associative, retrieval} ✓
  - persistence: session ∈ {per_query, session, persistent} ✓
  - params.hidden: 1024 > 0 ✓
[params] estimated: ~310M
  - base attention + ffn: ~280M
  - Titans memory × 18 layers: ~29M
[realism] PASS
OK: titans_demo.yml is valid.

S: How do I confirm memory actually instantiated?

from transformers import AutoModelForCausalLM
from eulerstack.hf.auto_register import register_eulerstack_auto_classes
from eulerstack.components.titans_memory import TitansMemoryModule

register_eulerstack_auto_classes()
model = AutoModelForCausalLM.from_pretrained("./titans_demo",
                                              trust_remote_code=True)

mems = [m for m in model.modules() if isinstance(m, TitansMemoryModule)]
print(f"Titans memory modules: {len(mems)}")
# → Titans memory modules: 18
print(f"params per memory: {sum(p.numel() for p in mems[0].parameters()):,}")
# → params per memory: ~1.6M

S: How to invoke the inference-time update?

P: There's a standardised hook on the model.

model.eval()
ids = tokenizer("Paris is the capital of France.",
                return_tensors="pt").input_ids

# 1. forward
out = model(ids, output_hidden_states=True)

# 2. standard hook — this one line *is* test-time learning
surprise = model.step_memory_at_inference(out.hidden_states[-1])

print(surprise)
# {
#   'eulerstack.layers.0.titans_memory': 0.4231,
#   'eulerstack.layers.1.titans_memory': 0.3894,
#   ...
# }

Each value is the surprise loss per layer's memory. Over time these values should trend down — memory is getting better at predicting the incoming hiddens, i.e. "learning."

4.5 Parameter tuning guide

S: Which layer depths should carry memory?

P: Two approaches:

Strategy A — every layer (paper default)

layer_schedule:
  - template: attn_with_memory
    repeat: 24

+10-15 % params, everyone contributes.

Strategy B — back half only (efficiency)

layer_schedule:
  - template: plain_attn
    repeat: 18
  - template: attn_with_memory
    repeat: 6       # back 25%

+3 % params, empirically retains ~75 % of memory's gains. Recommended.

Strategy C — middle only (hypothesis)

layer_schedule:
  - template: plain_attn
    repeat: 8
  - template: attn_with_memory
    repeat: 8       # middle third
  - template: plain_attn
    repeat: 8

Research hypothesis: memory helps most on representational-middle layers. Under study.

S: When to use update_at_inference: false?

P: Training-only mode. Outer optimiser learns memory, but at inference it's frozen; the hook is a no-op.

Use cases:

A/B testing serving — avoid memory drift.
Train once, freeze for serving — same memory for all users.
Debugging — isolate effects of inference updates.

Production default: session persistence + update_at_inference: true.

4.6 Traps & debugging

Trap 4.1. persistence: persistent in multi-user serving → Privacy/security disaster. All users' data accumulates in the same memory. A's memory leaks into B's answers. Serving must use session with isolated per-session model instances.

Trap 4.2. update_at_inference=true + huge inner_lr

memory:
  update_at_inference: true
  inner_lr: 0.5              # ← too big

→ After a few tokens memory diverges. Answer quality degrades mid- session. Keep inner_lr ≤ 0.001.

Trap 4.3. Memory on every mixer type indiscriminately

layer_templates:
  mamba_with_memory:
    mixer: { type: mamba, ... }
    memory: { type: neural_memory, ... }

→ Legal, but Mamba already "remembers" via state. Roles overlap and training gets unstable. Keep memory on attention layers.

Trap 4.4. params.hidden < d_model/2

memory:
  params: { hidden: 128 }   # too small for d_model=768

→ Memory lacks capacity and "remembers nothing". Use hidden ≥ d_model/2 as minimum.

Trap 4.5. Surprise is ~0 from the start

surprise = model.step_memory_at_inference(hidden)
print(surprise)  # {'...': 0.0001, ...}

→ TitansMemoryModule.out_proj.weight is initialised to zero (so the module starts near-identity). Expected for fresh models. Run a few training steps, then re-check.

Trap 4.6. Surprise trending up during inference → Two common causes: 1. inner_lr too big → overshooting. Lower it. 2. Input distribution shifted drastically (e.g. different language). Memory is adapting. Observe whether it settles after 10-20 tokens.

4.7 Combining with other primitives

4.7.1 Titans + MLA

layer_templates:
  mla_with_memory:
    mixer:
      type: attention
      attention: { latent_dim: 384 }
    ffn: { type: gated_mlp }
    memory:
      type: neural_memory
      update_at_inference: true
      params: { hidden: 1024 }
      inner_lr: 0.001

Strong — MLA shrinks KV cache; memory handles long-term recall. Particularly effective in 128K+ contexts.

4.7.2 Titans + R1 (execution_modes)

layer_templates:
  reasoner_with_memory:
    mixer: { type: attention, attention: { latent_dim: 512 } }
    ffn: { type: gated_mlp }
    memory:
      type: neural_memory
      update_at_inference: true
      params: { hidden: 2048 }           # bigger memory
      inner_lr: 0.001
      persistence: session

execution_modes:
  - { name: think, max_tokens: 8192, kv_share: true, loss_weight: 0.0 }
  - { name: answer, max_tokens: 2048, loss_weight: 1.0 }

High research value — think accumulates insight into memory; answer consults it. Multi-turn sessions can "carry" earlier reasoning. Likely a heavily-studied combination in 2025-26.

4.7.3 Titans + Jamba

layer_templates:
  attn_moe_with_memory:
    mixer: { type: attention, attention: {} }
    ffn:
      type: moe
      moe: { experts: 8, top_k: 2, router: softmax, capacity_factor: 1.25 }
    memory:
      type: neural_memory
      update_at_inference: true
      params: { hidden: 1024 }

Memory only on attention layers (1/8). O(N) + retrieval + long-term remembrance.

4.7.4 Titans + TTT (double inner-optim)

Experimental — TTT already uses inner gradient; adding Titans memory means two inner optimisers in the same model. Stability hard to reason about; no strong empirical reports yet. Research topic.

4.8 Further reading

Paper: Behrouz, Zhong, Mirrokni, "Titans: Learning to Memorize at Test Time," Google (2024-12 / 2025)
Predecessor: Wu et al., "Memorizing Transformers," ICLR 2022
Predecessor: Borgeaud et al., "RETRO," ICML 2022
Related: Munkhdalai et al., "Infini-attention," arXiv:2404.07143 (2024)
Conceptual ancestor: Ha & Schmidhuber, "Fast Weights" (2017)
EulerStack preset: configs/presets/arch_expert_titans_memory.yml
Related tutorial: 09 new primitives §10
Tests: tests/test_titans_memory_runtime.py (12 tests — module unit through HF round-trip)
Next experiment: run arch_expert_titans_memory with inner_lr ∈ {0.0001, 0.001, 0.01}, compare final-session surprise. At what value does "learning" actually happen?

Closing — common shape across the four cases

Case	Changed primitive	EulerStack field	Runtime (v1.1)	Core idea
DeepSeek-V3	KV-compressed attention	`attention.latent_dim`	✅ Core	Trade dimensionality for memory; keep quality
Jamba	Mamba/attention hybrid + MoE	`layer_schedule` + `ffn.type: moe`	✅ Core	7:1 mix captures both speed and ICL
DeepSeek-R1	Reasoning-phase separation	`execution_modes:` + `transition:`	✅ Core (metadata)	Structure unchanged, recipe changes
Titans	Test-time parametric memory	`template.memory:` + `step_memory_at_inference`	✅ Core (v1.1)	Grad step on memory only, at inference

Four common lessons

"One YAML file ≈ one paper." In every case the architectural change lives in 5-20 lines of YAML, not 200 lines of modeling_custom.py.
Orthogonality by design — MLA doesn't touch FFN, MoE doesn't touch attention, execution_modes doesn't touch structure, Titans memory doesn't touch forward path. This is why combinations just work.
Metadata first, runtime progressively — schema-first. R1's execution_modes works for round-trip even before generate is fully wired. Titans shipped schema in v1.0 and runtime in v1.1.
Lego blocks really compose — combinations are arithmetic — MLA × Jamba × R1 × Titans adds up to a four-paper YAML. The §1.7 / §2.7 / §3.7 / §4.7 combination examples roll up into arch_expert_kitchen_sink.yml.

Port your own 5th paper using the same eight-step template: background → mechanism → YAML → validate → tuning → traps → combinations → further reading. If your YAML passes eulerstack validate --report you almost certainly captured the paper correctly.
Diff the four flagship presets (arch_advanced_mla, arch_advanced_jamba, arch_expert_reasoning_r1, arch_expert_titans_memory) to see how "one primitive changes at a time" looks in practice.
The capstone — arch_expert_kitchen_sink combines every v1 primitive in one spec, and tests/test_kitchen_sink_preset.py auto-verifies that the full pipeline (validate → compile → save/load → HF training) still works.
Revisit Tutorial 9 for per-field reference after this high-level tour.
Browse all 53 presets: eulerstack --lang en presets list.

← Prev 9. v1 Phase B New Primitives (MLA / Titans / MoD / Dual-Stream / Neural-ODE / TTT)