10. Paper → YAML Case Studies (DeepSeek-V3 / Jamba / DeepSeek-R1 / Titans)
This tutorial walks through four recent architectures as a dialogue between a professor (P) and a student (S). Unlike a quick cheat-sheet, the goal here is not just "write this YAML" but "understand why each field is there". Every case follows the same eight-step structure.
| Step | Content |
|---|---|
| §X.1 Background & motivation | Why did this paper appear? What bottleneck does it fix, and how does it relate to prior work? |
| §X.2 Mechanism in depth | Equations and ASCII diagrams of the internals. How it differs from alternatives. |
| §X.3 YAML derivation | Start from an empty YAML and fill in each field with its rationale. |
| §X.4 Validate & compile | Actual CLI commands, expected output, parameter-count arithmetic. |
| §X.5 Parameter tuning | Sensible ranges for latent_dim, capacity_factor, etc. |
| §X.6 Traps & debugging | 4-6 common pitfalls beginners hit, mapped to error messages. |
| §X.7 Combining with other primitives | Orthogonality / conflict matrix for 2nd-order combinations. |
| §X.8 Further reading | Paper links, follow-up research, relevant presets and tutorials. |
Table of contents:
- Case 1. DeepSeek-V3 — MLA (Multi-head Latent Attention)
- Case 2. Jamba — Mamba × Attention × MoE hybrid
- Case 3. DeepSeek-R1 — 2-phase reasoning (execution_modes)
- Case 4. Titans — neural memory that learns at inference time
- Closing — common shape across the four cases
Case 1. DeepSeek-V3 — MLA (Multi-head Latent Attention)
1.1 Background & motivation
S: Professor, DeepSeek-V3 (2024-12) makes a big deal of Multi-head Latent Attention. It looks like just another attention variant — why the hype?
P: Context first. In 2023-24 "long context" became the main axis of LLM competition. Llama 2's 4K → Llama 3's 8K → 32K → 128K in a year. The problem is this race came with unequal costs.
S: Unequal how?
P: Attention's compute cost collapsed thanks to FlashAttention, but the memory cost stayed. Specifically the KV cache size.
KV cache ≈ 2 · n_layers · n_heads · head_dim · seq_len · batch · dtype_bytes
For a Llama 3 70B-shaped model at 128K context in bf16 (full MHA, ignoring GQA) that's about 4.3 GB per layer, or ~340 GB across 80 layers. That won't fit on a single H100.
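A quick sanity check of the formula in plain Python (full-MHA assumption, no GQA; the helper name is ours, not an EulerStack API):

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    # KV cache ≈ 2 (K and V) · n_layers · n_heads · head_dim · seq_len · batch · dtype_bytes
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * dtype_bytes

# Llama 3 70B-like shape: 80 layers, 64 heads, head_dim 128, 128K context, bf16
total = kv_cache_bytes(80, 64, 128, 128 * 1024)
per_layer = total / 80
print(f"per layer: {per_layer / 2**30:.1f} GiB, total: {total / 2**30:.0f} GiB")
```

Running this gives 4.0 GiB per layer and 320 GiB total, confirming the ballpark above.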
S: What did the field do about it?
P: Three paths:
- Grouped-Query Attention (GQA) — Llama 2/3. Heads share K/V, cutting cache up to 8×. Cost: quality regression.
- Sliding-window attention — Mistral / Longformer. Give up global attention, look inside a window only. Cost: weaker long-range.
- Multi-head Latent Attention (MLA) — DeepSeek-V2/V3. Compress K/V into one low-dimensional latent and only cache that latent. No quality loss, memory savings.
Option 3 is MLA. GQA sacrifices head diversity, sliding window sacrifices locality, MLA sacrifices dimensionality, but it rebuilds diversity at query time via up-projection — which is why it (roughly) matches MHA quality.
S: Is MLA the current king?
P: As of 2025, MLA is effectively the standard for long-context memory efficiency. DeepSeek-V2/V3, some Qwen 2.5 variants, and Kimi's MoBA family all adopted it. Llama-family is still on GQA but next generations will likely move to MLA.
S: What's the difference between DeepSeek-V2 and V3?
P: MLA itself was introduced in V2. V3 is a scaled version (236B → 671B, MTP / Multi-Token Prediction training trick). The MLA structure is identical. The YAML we'll write is just "one MLA line" — the size comes from hyperparameters.
1.2 Mechanism in depth
S: Can I see MLA in equations?
P: Standard MHA first. Given x ∈ R^{T×d}:
Q = x W_q (d → d)
K = x W_k (d → d)
V = x W_v (d → d)
attention = softmax(Q K^T / √d_h) V
MHA splits d into n_heads × head_dim, each head runs independently.
What lives in the KV cache is both K and V, each shaped
(B, n_heads, T, head_dim).
S: And MLA?
P:
# Down-projection (once, shared between K and V)
kv_latent = x W_kv (d → l) ← this is what you cache
# Up-projection (reconstructed at compute time)
K = kv_latent W_k_up (l → d) ← rebuilt each time
V = kv_latent W_v_up (l → d)
# Q stays standard
Q = x W_q (d → d)
S: So "throw K and V away, keep only the latent that built them."
P: Exactly. The cache drops from 2 × d → 1 × l. With l = d/2
that's a 75% reduction in cache memory. With l = d/4 it's 87.5%.
Without sharing heads.
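The savings arithmetic as a tiny sketch (measured against MHA's 2·d_model per token; the function name is illustrative):

```python
def mla_cache_reduction(latent_dim, d_model):
    """Fraction of KV-cache memory saved vs MHA, which stores K and V (2·d_model per token)."""
    return 1 - latent_dim / (2 * d_model)

d = 768
print(mla_cache_reduction(d // 2, d))  # 0.75  → 75% with l = d/2
print(mla_cache_reduction(d // 4, d))  # 0.875 → 87.5% with l = d/4
```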
S: But now you do an up-projection on every step. Doesn't that add compute?
P: Two tricks mitigate it:
- Absorb trick: at inference you can fold W_k_up into the query side by computing Q W_k_up^T ahead of time, so the K up-projection effectively disappears. DeepSeek does this.
- Parallelism: the up-projection is a single matmul, and because attention is memory-bound on long context, the extra FLOPs are almost free relative to the KV read.
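A toy numpy demonstration of the absorb trick: attention scores computed via a reconstructed K equal the scores computed with W_k_up folded into the query, so K never has to be materialised. Dimensions are arbitrary; this is a sketch, not DeepSeek's kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d, l, T = 64, 32, 10                  # toy dims: d_model, latent, cached seq_len
W_k_up = rng.standard_normal((l, d))  # latent → K up-projection
q = rng.standard_normal((1, d))       # one query row
C = rng.standard_normal((T, l))       # cached kv_latent for T past tokens

# Naive path: rebuild K from the latent, then score
K = C @ W_k_up                        # (T, d)
scores_naive = q @ K.T                # (1, T)

# Absorbed path: fold W_k_up into the query once; never materialise K
q_absorbed = q @ W_k_up.T             # (1, l)
scores_absorbed = q_absorbed @ C.T    # (1, T)

assert np.allclose(scores_naive, scores_absorbed)
```

The algebra is just q (C W)^T = (q W^T) C^T; at serve time the fold happens once per query projection, not per step.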
┌─────────────────────────────────────────────────┐
│ Cache layout comparison │
├─────────────────────────────────────────────────┤
│ MHA: [K: H×D] [V: H×D] → 2HD/token │
│ GQA (4:1): [K: H/4×D] [V: H/4×D] → HD/2 │
│ MLA l=d/2: [kv_latent: L] → L = D/2 │
│ │
│ d_model=768, 32 layers, 128K ctx, bf16:         │
│   MHA: ~12.9 GB                                 │
│   GQA: ~3.2 GB                                  │
│   MLA: ~3.2 GB (with MHA-level quality)         │
└─────────────────────────────────────────────────┘
S: What about parameter count?
P: Per attention block:
- MHA: W_q, W_k, W_v, W_o, each d² → 4d²
- GQA: W_q (d²), W_k and W_v (d × d/g each), W_o (d²) → 2d² + 2d²/g
- MLA: W_q (d²), W_kv (d × l), W_k_up and W_v_up (l × d each), W_o (d²) → 2d² + 3dl
With l = d/2 that's 2d² + 1.5d² = 3.5d² — MLA has fewer parameters
than MHA, 25% of the cache, and comparable quality. Hence the
"free lunch" framing.
1.3 YAML derivation
S: What does this look like in EulerStack?
P: Start empty. Model metadata first.
schema_version: 1
model:
name: deepseek-v3-clone
d_model: 768 # small for education. Real V3 has d_model=7168
vocab_size: 32000
max_seq_len: 32768 # 32K; V3 pushes to 128K
n_heads: 12 # head_dim = 768/12 = 64
n_kv_heads: 12 # MLA is orthogonal to GQA; both can coexist
mlp_ratio: 4
S: n_kv_heads = n_heads means "no GQA"?
P: Right. MLA already compresses KV through a latent; layering GQA
on top gets redundant. DeepSeek's paper uses MLA only. If you wanted
"MLA + GQA 4:1" for aggressive savings you could set n_kv_heads: 3,
but quality falls fast. Not recommended.
S: Tokenizer and embedding next.
P: DeepSeek's real tokenizer is a proprietary 100K BPE; for educational use we point at gpt2 just to pass validation.
tokenizer_contract:
type: hf
pretrained: gpt2
add_bos: true
add_eos: true
embedding:
type: learned
positional: rope
rope_theta: 500000.0 # large theta for long context
tie_word_embeddings: true
S: Why is rope_theta so big?
P: Great question. RoPE's default theta = 10000 dates to the 2-4K-context
era. Scale the context to 32K+ and the high-frequency axes
start to wrap around — the model loses its ability to distinguish
far-apart positions. Bigger theta means longer periods, keeping each
position distinguishable. Llama 3 uses 500K, DeepSeek goes up to 10M.
S: Now layer_templates.
P: This is where MLA actually lives.
layer_templates:
mla_decoder:
mixer:
type: attention
attention:
latent_dim: 384 # ← MLA in one line; half of d_model=768
qkv_bias: false
ffn:
type: gated_mlp
activation: swiglu
norm:
type: rmsnorm
position: pre
residual:
type: sequential
scaling: 1.0
state:
kv_cache: true
S: latent_dim: 384 is really the whole thing?
P: Yes. EulerStack's CausalSelfAttention instantiates five
projections (Q, kv_latent, K_up, V_up, out) when latent_dim is set;
otherwise the standard fused QKV path. At the spec layer you just
declare "use MLA" and the runtime does the rest.
S: Schedule:
P:
layer_schedule:
- template: mla_decoder
repeat: 12 # educational. V3 has 61 layers.
head:
type: causal_lm
tie_weights: true
compatibility:
compile_target: huggingface
Done. That is a DeepSeek-V3 MLA structure.
1.4 Validate & compile
S: How do I check it?
P: Save it, one CLI call.
eulerstack --lang en validate --preset deepseek_v3_clone.yml --report
Expected output (abridged):
[validation] schema ... PASS
[validation] cross-field ... PASS
- mixer.latent_dim=384 < d_model=768 ✓
- d_model=768 % n_heads=12 == 0 ✓
- head_dim = 64 (within 32-256 recommended range) ✓
[params] estimated: ~134M
- embedding (learned + tied): ~25M
- attention × 12: ~25M (MLA savings reflected)
- ffn (swiglu) × 12: ~85M
[realism] PASS
- head_dim: 64 (recommended range)
- rope_theta: 500000 (long-context justification OK)
- n_kv_heads == n_heads: GQA off, MLA handles KV compression alone
OK: deepseek_v3_clone.yml is valid.
S: Can I cross-check that estimate by hand?
P:
- Embedding (tied): vocab × d = 32000 × 768 ≈ 24.6M
- Per attention layer (MLA, l=384): 2 × 768² + 3 × 768 × 384 = 1.18M + 0.88M ≈ 2.06M
- Per FFN (SwiGLU, ratio=4): 3 matrices × 768 × 3072 ≈ 7.1M
- Per norm: 2 × 768 ≈ 1.5K (negligible)
- 12 layers: 12 × (2.06M + 7.1M) ≈ 110M
- Tied embedding: 24.6M once
Total ≈ 134M. The CLI estimate is approximate, so expect small deviations from a hand count.
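The same cross-check as a script (pure arithmetic, no framework needed):

```python
d, vocab, n_layers, l, ratio = 768, 32000, 12, 384, 4

emb = vocab * d                # tied input/output embedding
attn = 2 * d * d + 3 * d * l   # MLA: W_q, W_o plus W_kv, W_k_up, W_v_up
ffn = 3 * d * (ratio * d)      # SwiGLU: gate, up, down projections
total = emb + n_layers * (attn + ffn)
print(f"{total / 1e6:.0f}M")   # ≈ 134M
```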
S: Let's actually build the HF model.
P:
eulerstack --lang en compile --preset deepseek_v3_clone.yml --output-dir ./v3_clone
from transformers import AutoModelForCausalLM
from eulerstack.hf.auto_register import register_eulerstack_auto_classes
register_eulerstack_auto_classes()
model = AutoModelForCausalLM.from_pretrained("./v3_clone", trust_remote_code=True)
from eulerstack.blocks.attention import CausalSelfAttention
mla_layers = [m for m in model.modules()
if isinstance(m, CausalSelfAttention) and m.latent_dim is not None]
print(f"MLA layers: {len(mla_layers)}, latent_dim={mla_layers[0].latent_dim}")
# → MLA layers: 12, latent_dim=384
print([n for n in mla_layers[0].state_dict() if 'proj' in n or 'latent' in n])
# → ['q_proj.weight', 'kv_latent.weight', 'k_up.weight', 'v_up.weight', 'out_proj.weight']
Seeing kv_latent / k_up / v_up in the state dict means real MLA is
instantiated.
1.5 Parameter tuning guide
S: Is there a rule for picking latent_dim?
P: Practical table:
| latent_dim | Cache savings | Quality impact | Use case |
|---|---|---|---|
| d_model | — | — | (meaningless — validator rejects) |
| d_model × 0.75 | ~62% | ~0 | Conservative — first MLA adoption |
| d_model × 0.5 | ~75% | ~0 | Recommended starting point. DeepSeek default |
| d_model × 0.33 | ~83% | Slight | Tight-memory 128K+ |
| d_model × 0.25 | ~87.5% | Noticeable | Extreme — require A/B |
| d_model × 0.125 | ~94% | Severe | Avoid |
Savings are measured against MHA's 2 × d_model per token.
S: What's its relationship with head_dim?
P: Keep latent_dim an integer multiple of head_dim. The KV
cache stride stays clean and FlashAttention-like kernels map well.
With d_model=768, head_dim=64, sensible values are
{128, 192, 256, 384, 512}.
S: How long can max_seq_len realistically go with MLA?
P: MLA's win grows with context. Below 2K it barely matters. Above 16K you feel it. By 128K it's essentially mandatory on single-node serving with A100/H100. DeepSeek runs at 128K out of the box.
S: Relation between rope_theta and max_seq_len?
P: Rule of thumb:
rope_theta ≈ max(10000, 10000 × (max_seq_len / 2048)²)
- 2K: 10K
- 4K: 40K
- 8K: 160K
- 32K: 2.6M (DeepSeek actually uses 500K and it suffices)
- 128K: ~41M (at this point YaRN / NTK-aware scaling typically takes over)
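The rule of thumb as a sketch (a heuristic only; the function name is ours):

```python
def rope_theta_hint(max_seq_len, base=10000, base_ctx=2048):
    # rule of thumb above: theta ≈ max(base, base · (max_seq_len / base_ctx)²)
    return max(base, base * (max_seq_len / base_ctx) ** 2)

for ctx in (2048, 4096, 8192, 32768, 131072):
    print(f"{ctx:>6}: {rope_theta_hint(ctx):>12,.0f}")
```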
1.6 Traps & debugging
S: What are the most common mistakes authoring an MLA YAML?
P: Six:
Trap 1. latent_dim >= d_model
ValidationError (R): mla_latent_dim_range — latent_dim=1024 exceeds d_model=768
Fix: set latent_dim < d_model (commonly d_model/2)
→ MLA means "compression"; ≥ d_model is meaningless.
Trap 2. latent_dim is odd or not a multiple of head_dim
warning: mla_latent_dim_range — latent_dim=100 is not a multiple of head_dim=64
→ Validate passes but the KV cache stride breaks, causing subtle perf regressions. Round to a multiple of 64.
Trap 3. "MLA alone solves long context"
→ MLA only cuts memory. If rope_theta is too small, positions
still collide. Tune both together.
Trap 4. MLA layer with kv_cache: false
state:
kv_cache: false # ← eliminates MLA's raison d'être
→ MLA only pays off with the cache on. During pure training it may be off; for inference deployment it must be on.
Trap 5. Overlapping GQA and MLA
n_kv_heads: 2 # 4:1 GQA
attention:
latent_dim: 128 # plus MLA
→ Technically legal but quality falls quickly. MLA alone is enough;
keep n_kv_heads == n_heads.
Trap 6. "I can just load DeepSeek-V3's pretrained weights into my MLA YAML" → EulerStack ships randomly initialised models. DeepSeek-V3's real weights require a separate downloader and a conversion script (roadmap v1.2).
1.7 Combining with other primitives
S: What can I combine MLA with?
P: Four combinations matter in practice.
1.7.1 MLA + MoE (DeepSeek-V3's actual architecture)
layer_templates:
mla_moe:
mixer:
type: attention
attention: { latent_dim: 384 }
ffn:
type: moe
moe: { experts: 32, top_k: 3, router: softmax, capacity_factor: 1.25 }
Orthogonal — MLA compresses attention memory, MoE scales FFN capacity. They never fight for the same real estate. This is what DeepSeek-V3 actually ships.
1.7.2 MLA + Jamba-style hybrid
layer_schedule:
- template: mla_decoder
repeat: 6
- template: mamba_block
repeat: 18
Complementary — only 1/4 of the layers are MLA; the rest are Mamba. Global-retrieval layers keep KV-cache savings; the rest enjoy Mamba's O(N). A plausible Samba / Jamba 1.5 upgrade.
1.7.3 MLA + Mixture-of-Depths
layer_schedule:
- template: mla_decoder
repeat: 24
depth_gating:
enabled: true
capacity: 0.5
router: top_k
Synergy — MoD skips 50% of tokens through this layer, so MLA's KV cache only grows by the surviving tokens. MoD's token savings multiply with MLA's memory savings. Very strong for 128K contexts.
1.7.4 MLA + execution_modes (R1 + V3)
execution_modes:
- { name: think, max_tokens: 16384, kv_share: true, loss_weight: 0.0 }
- { name: answer, max_tokens: 2048, loss_weight: 1.0 }
Ideal — R1-style reasoning has long think phases (16K+) where KV cache is the bottleneck. MLA solves it directly. Likely to become the standard shape for next-gen reasoning models.
1.8 Further reading
- Paper: DeepSeek-V3 Technical Report, arXiv:2412.19437 (2024)
- Prior: DeepSeek-V2, arXiv:2405.04434 — MLA first introduced
- Follow-ups: multiple open-source FlashAttention-3 integrations with MLA
- EulerStack presets:
configs/presets/arch_advanced_mla.yml, llm_{0p1b,0p8b,2b,4b,16b}_mla.yml
- Related tutorial: 09 new primitives §4
- Next experiment: split arch_advanced_mla's schedule as mla_decoder: repeat=8 → mamba_block: repeat=4 → mla_decoder: repeat=8 — a 4:1 hybrid. Measure whether MLA and Mamba complement each other.
Case 2. Jamba — Mamba × Attention × MoE hybrid
2.1 Background & motivation
S: Jamba mixes Mamba and attention — is it really just alternating them?
P: Deeper than that. 2023-24 was an SSM (state-space model) renaissance. S4 → H3 → Mamba built a serious case that "attention isn't necessary".
S: What's Mamba's edge?
P: Mamba processes sequences in linear time O(N) with near-attention expressiveness. Concretely:
- Compute: attention O(N²) vs Mamba O(N)
- Memory: attention's KV cache grows with N; Mamba's state is fixed size O(1)
- Throughput: Mamba is 10-20× faster on long sequences
S: Then why didn't "no attention" win outright?
P: Mamba has a weakness: in-context learning (ICL). Precisely pointing to a past token — "pick the exact key/value pair from context" — is much worse than attention.
S: Why? Mamba also remembers the past.
P: It remembers compressed. Attention's KV cache carries every past token losslessly, and softmax can point to any of them. Mamba state holds the entire past in a fixed-size latent — lossy by design. That compression hurts ICL.
Empirically a Mamba-only model roughly matches attention on LAMBADA (short next-word) but loses badly on needle-in-haystack at 128K.
S: So Jamba comes in.
P: AI21 Labs, March 2024: Jamba-1 validated "mix Mamba and attention." Their ratio landed at Mamba 7 : Attention 1.
S: Why 7:1?
P: Empirically tuned. The minimum attention ratio to recover needle performance was 1/8 (12.5%). Going lower collapsed ICL; going higher gave up Mamba's speed. Sweet spot.
S: How does Jamba differ from similar work?
P:
| Work | Composition | Feature |
|---|---|---|
| Striped Hyena (Together AI, 2023) | Hyena + Attn | Hyena as attention-substitute |
| Samba (Microsoft, 2024) | Mamba + SWA 1:1 | Sliding-window local + SSM global |
| Jamba (AI21, 2024) | Mamba + Attn 7:1 + MoE | MoE replaces FFN for capacity |
| Jamba 1.5 (AI21, 2024-08) | Same shape, scaled | 256K context |
Jamba's distinctive move is MoE in the FFN slot. Attention is already down to 1/8, so capacity goes into the FFN.
2.2 Mechanism in depth
S: Let me see a block diagram.
P: Jamba-32-layer schedule:
Layer 1: Mamba Layer 5: Mamba Layer 9: Mamba ...
Layer 2: Mamba Layer 6: Mamba Layer 10: Mamba
Layer 3: Mamba Layer 7: Mamba Layer 11: Mamba
Layer 4: Attn Layer 8: Attn Layer 12: Attn
↑ 8× repetitions of this pattern → 32 layers, 8 attention layers
In this 32-layer clone every 4th layer is attention (denser than real Jamba's 7:1, to keep the educational model small), and at those positions the FFN is MoE:
Layer 4 (Attn + MoE):
x → norm → attention → + residual
→ norm → MoE(8 experts, top-2) → + residual
Layers 1-3 (Mamba):
x → norm → mamba2 → + residual
→ norm → gated_mlp(SwiGLU) → + residual
S: Why is the FFN on Mamba layers plain gated_mlp, not MoE?
P: Two reasons.
- Mamba already blends mixer + FFN. A Mamba block contains state update and input-dependent gating — it's not a "pure mixer" the way attention is. Putting MoE on top is redundant.
- MoE and attention scale on different axes. Attention = token communication; MoE = per-token processing variety. Putting them together is complementary.
S: Why does 1/8 attention suffice for ICL?
P: Solving a needle task only needs "pick a past token exactly" once per forward pass through the model. With 8 attention layers you get 8 global-lookup opportunities. The Mamba layers in between handle local patterns.
2.3 YAML derivation
S: Build the Jamba YAML from scratch.
P: Model metadata first.
schema_version: 1
model:
name: jamba-clone
d_model: 768 # educational. Jamba-1 is d_model=4096
vocab_size: 32000
max_seq_len: 32768
n_heads: 12
n_kv_heads: 4 # GQA 3:1 — extra savings on the attention layers
mlp_ratio: 4
tokenizer_contract:
type: hf
pretrained: gpt2
add_bos: true
add_eos: true
S: n_kv_heads: 4 means GQA 3:1. Does Jamba actually use GQA?
P: Yes. Only a small fraction of the layers are attention (1/8 in real Jamba, 1/4 in this clone), but those few are responsible for the entire KV cache. Adding GQA compresses it further. The real Jamba uses GQA 4:1.
S: Next, embedding.
P:
embedding:
type: learned
positional: rope
rope_theta: 500000.0
tie_word_embeddings: true
S: Wait — I heard Mamba doesn't need positional encoding.
P: Mamba is recurrent and encodes position implicitly in its state.
But attention layers still need it. Jamba applies RoPE only to
attention layers. In the YAML you declare positional: rope and the
Mamba layers transparently ignore it.
S: Templates.
P: Two — one for Mamba, one for Attention+MoE.
layer_templates:
mamba_block:
mixer:
type: mamba
mamba:
variant: mamba2 # Mamba2 (SSD formulation)
d_state: 128 # internal state dimension
d_conv: 4 # local conv kernel size
expand: 2 # internal expansion ratio
ffn:
type: gated_mlp
activation: swiglu
norm: { type: rmsnorm, position: pre }
state: { ssm_state: true, kv_cache: false }
attn_moe:
mixer:
type: attention
attention:
qkv_bias: false
ffn:
type: moe
moe:
experts: 8
top_k: 2
router: softmax
capacity_factor: 1.25
z_loss: 0.001
norm: { type: rmsnorm, position: pre }
state: { kv_cache: true }
S: Is d_state: 128 big or small?
P: Mamba2 defaults to 64-128. Larger state = more "memory capacity" but slower. Jamba picks 128.
S: And capacity_factor: 1.25?
P: MoE load-balancing parameter. 1.0 = each expert gets slots for
exactly tokens / experts tokens. 1.25 adds a 25 % buffer so that
training-time routing imbalances don't overflow. At inference you can
lower to 1.0.
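A minimal sketch of the capacity computation (the rounding convention is an assumption; EulerStack's actual router may differ):

```python
import math

def expert_capacity(tokens, experts, capacity_factor):
    # slots per expert; tokens routed past this limit get dropped
    return math.ceil(tokens / experts * capacity_factor)

print(expert_capacity(4096, 8, 1.0))   # 512 — no slack
print(expert_capacity(4096, 8, 1.25))  # 640 — 25% buffer
```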
S: The schedule is where Jamba lives.
P:
layer_schedule:
  # 8 cycles × (3 Mamba + 1 Attn+MoE) = 32 layers (3:1 here; real Jamba is 7:1)
- template: mamba_block
repeat: 3
- template: attn_moe
repeat: 1
- template: mamba_block
repeat: 3
- template: attn_moe
repeat: 1
- template: mamba_block
repeat: 3
- template: attn_moe
repeat: 1
- template: mamba_block
repeat: 3
- template: attn_moe
repeat: 1
- template: mamba_block
repeat: 3
- template: attn_moe
repeat: 1
- template: mamba_block
repeat: 3
- template: attn_moe
repeat: 1
- template: mamba_block
repeat: 3
- template: attn_moe
repeat: 1
- template: mamba_block
repeat: 3
- template: attn_moe
repeat: 1
S: That's verbose.
P: You can tidy the arithmetic with let:
let:
mamba_per_cycle: 3
total_cycles: 8
# use ${let.mamba_per_cycle * 8 + 8} = 32 in places you need the
# total layer count.
YAML doesn't have for-loops (v1 is Level-1 arithmetic only); that
trade-off is characteristic of dedicated design languages. HDLs hit
the same wall early: Verilog lacked repetition until generate ...
endgenerate was added, and VHDL solved it with its generate
statement. EulerStack's schedule: iterator is on the v1.x roadmap
for the same reason.
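Until a schedule iterator lands, the repetitive block can also be generated outside the YAML with a throwaway Python script (plain string templating, nothing EulerStack-specific):

```python
# One Jamba cycle: 3 Mamba blocks followed by 1 Attn+MoE block
cycle = ("- template: mamba_block\n"
         "  repeat: 3\n"
         "- template: attn_moe\n"
         "  repeat: 1\n")

print("layer_schedule:")
print(cycle * 8, end="")  # 8 cycles → 32 layers, 8 attention layers
```

Paste the output into the preset; the validator treats it identically to the hand-written version.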
S: Head and hints:
P:
head:
type: causal_lm
tie_weights: true
training_hints:
init: normal_0.02
dropout: 0.0
checkpointing: true # 32 layers — save memory
seed: 1234
compatibility:
compile_target: huggingface
2.4 Validate & compile
S: Validate output?
P:
eulerstack validate --preset jamba_clone.yml --report
[validation] schema ... PASS
[validation] cross-field ... PASS
- mamba_block.state.ssm_state=true ↔ mixer=mamba ✓
- attn_moe.state.kv_cache=true ↔ mixer=attention ✓
- MoE: top_k=2 ≤ experts=8 ✓
[params] estimated: ~760M
- embedding: ~25M
- mamba mixer × 24: ~95M
- ffn gated_mlp (mamba side) × 24: ~170M
- attn+moe × 8: ~465M (MoE experts dominate)
- norms, head: ~1M
[realism] PASS
- positional: rope + mamba mixer: OK (attention is present)
- MoE expert ratio: 8 × 1/4 layers: OK
OK: jamba_clone.yml is valid.
S: Why are active parameters smaller?
P: MoE. Each token passes through only top_k = 2 of the 8 experts in each MoE layer, so 6/8 of every MoE layer's expert weights sit idle on any given forward. The embedding, the Mamba layers, and the dense gated_mlp FFNs are always active, so the active set per token is substantially smaller than the total. "Small-model compute, big-model capacity" — the classic MoE win. DeepSeek-V3 pushes this to 671B total / 37B active.
2.5 Parameter tuning guide
S: Why Mamba:Attn = 7:1?
P: Practical guide:
| Mamba:Attn | ICL (needle) | Speed | Use case |
|---|---|---|---|
| 0:1 (pure attn) | Best | Slow (O(N²)) | Baseline. Llama-family |
| 1:1 (Samba) | Nearly best | Medium | Balanced, long-ctx first |
| 3:1 | High | Fast | General recommendation |
| 7:1 (Jamba) | Sufficient | Fast | Jamba default |
| 15:1 | Noticeable drop | Very fast | Risky — needle fails |
| 1:0 (pure Mamba) | Very low | Fastest | Special use only |
S: experts × top_k?
P:
| experts × top_k | Active ratio | Model | Note |
|---|---|---|---|
| 8 × 2 | 25% | Mixtral, Jamba | Standard. Stable |
| 16 × 2 | 12.5% | Phi-3.5-MoE | More sparse, more capacity |
| 64 × 6 | 9.4% | DeepSeek-V2 | Fine-grained |
| 256 × 8 | 3.1% | DeepSeek-V3 | Extremely fine-grained |
Fine-grained = more experts with smaller dim, top_k grows accordingly. Each expert becomes a narrower specialist. DeepSeek pushed this and showed the gains.
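The active-ratio column is just top_k / experts; a quick check:

```python
def active_ratio(experts, top_k):
    # fraction of expert parameters a single token touches
    return top_k / experts

for e, k in [(8, 2), (16, 2), (64, 6), (256, 8)]:
    print(f"{e} × {k}: {active_ratio(e, k):.1%}")  # 25.0%, 12.5%, 9.4%, 3.1%
```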
S: capacity_factor?
P:
- 1.0: theoretical minimum — load-balancing must be near-perfect.
- 1.25: recommended default. 25% buffer during training.
- 1.5: for very unbalanced early training; drop to 1.25 once stable.
- 2.0+: rarely useful — wasted memory.
2.6 Traps & debugging
Trap 2.1. Mamba-only model, ICL collapses → Symptom: needle-in-haystack scores collapse. Cause: 0% attention. Fix: keep at least 1/8 attention.
Trap 2.2. ssm_state: false on a Mamba layer
state:
ssm_state: false # ← Mamba without state
→ Mamba is defined by its state; the cross-field validator raises
CompatibilityError. ssm_state: true is required.
Trap 2.3. Attention sub-object on a Mamba layer
mixer:
type: mamba
attention: # ← wrong sub-object
latent_dim: 128
→ Validator: "sub-object 'attention' not allowed when type='mamba'".
Trap 2.4. MoE top_k > experts
moe: { experts: 4, top_k: 8 } # impossible
→ ValidationError (R14): top_k=8 > experts=4.
Trap 2.5. positional: none with any attention
→ Allowed, but attention can't distinguish positions → unstable
training. Jamba's answer is positional: rope; Mamba transparently
ignores it.
Trap 2.6. Starting training with capacity_factor: 1.0
→ Early routing is very unbalanced → many dropped tokens → loss spikes.
Start at 1.25, lower once stable.
Trap 2.7. MoE on every single layer
- template: attn_moe
repeat: 32
→ Total params × 8, active only 2/8. Jamba's philosophy is "MoE only on the attention layers." Mixtral-style "MoE everywhere" is a separate design.
2.7 Combining with other primitives
2.7.1 Jamba + MLA — next-gen inference
layer_templates:
attn_mla_moe:
mixer:
type: attention
attention: { latent_dim: 384 }
ffn:
type: moe
moe: { experts: 8, top_k: 2, router: softmax, capacity_factor: 1.25 }
Strong — Jamba's 1/8 attention holds the whole KV cache, so MLA on top is pure win. Conceptually DeepSeek-V3 × Jamba.
2.7.2 Jamba + execution_modes
execution_modes:
- { name: think, max_tokens: 8192, kv_share: true, loss_weight: 0.0 }
- { name: answer, max_tokens: 2048, loss_weight: 1.0 }
Complementary — Mamba handles long think in O(N), attention layers do retrieval for the final answer.
2.7.3 Jamba + Titans memory
layer_templates:
attn_with_mem:
mixer: { type: attention, attention: {} }
ffn: { type: moe, moe: {...} }
memory:
type: neural_memory
update_at_inference: true
params: { hidden: 1024 }
inner_lr: 0.001
Experimental — attach Titans memory only to the attention layers. Test-time learning within multi-turn sessions.
2.7.4 Jamba + Mixture-of-Depths
layer_schedule:
- template: mamba_block
repeat: 3
- template: attn_moe
repeat: 1
depth_gating:
enabled: true
capacity: 0.5
router: top_k
Synergy — only 50% of tokens go through the retrieval (attention) layer. "Easy tokens go through Mamba; only the hard ones need attention."
2.8 Further reading
- Paper: Lieber et al., "Jamba: A Hybrid Transformer-Mamba Language Model," arXiv:2403.19887 (2024)
- Follow-up: Jamba 1.5 report (2024-08)
- Background: Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," arXiv:2312.00752 (2023)
- Related hybrids: Samba (Microsoft, 2024), Zamba (Zyphra, 2024)
- EulerStack presets: configs/presets/arch_advanced_jamba.yml, configs/presets/arch_advanced_samba.yml, llm_{0p1b,0p8b,2b,4b,16b}_jamba.yml
- Related tutorials: 09 new primitives, mixers/02_mamba.md
- Next experiment: change the clone's 3:1 ratio to 7:1 (real Jamba) or 1:1 (Samba) and compare needle scores at 32K context. How does quality change?
Case 3. DeepSeek-R1 — 2-phase reasoning (execution_modes)
3.1 Background & motivation
S: R1 "reasons first, answers later" — does the architecture itself change?
P: The answer is almost no. R1's architecture is essentially DeepSeek-V3 (MLA + fine-grained MoE). What changes is the training and generation strategy, not the structure.
S: Then why all the buzz?
P: Late 2024 OpenAI announced o1, a black-box model trained with RL to do long chain-of-thought reasoning. The field wondered "how does it work?". In January 2025 DeepSeek released R1 open source, demonstrating what o1 essentially does. The recipe:
- Train a base model (DeepSeek-V3) to produce long reasoning inside <think> tags.
- Learn "reasoning that reaches correct answers" via RL (a GRPO variant).
- At serve time, hide the think phase from the user and output only answer.
S: So structurally it's V3?
P: Exactly. The real innovation is "RL alone can bootstrap reasoning", not a new attention. It's a training paper more than an architecture paper.
S: How does EulerStack model R1 then?
P: v1 exposes execution_modes + transition as declarative
metadata. Keep the architecture standard; carve a "this model does
2-phase reasoning" contract into the config. Training scripts and the
custom generate() loop consume it.
S: Why not an architecture field?
P: Three reasons:
- The structure doesn't actually change. Putting it in a structure field would be lying.
- Many phase strategies exist — R1 (think/answer), Quiet-STaR (per-token rationale), o3 (multi-round). Metadata lets you swap strategies at the YAML-diff level.
- Training / serving pipelines share one contract. Training reads loss_weight; serving reads visible_to_user; eval reads kv_share. One metadata block.
3.2 Mechanism in depth
S: Show me R1's training pipeline in detail.
P: Four stages.
Stage 1 — Cold start
Input: "Prove that √2 is irrational."
Output: "<think>
Assume √2 = a/b in lowest terms...
Then a² = 2b², so a is even.
Let a = 2c, then 4c² = 2b², b² = 2c²
So b is also even, contradiction.
</think>
<answer>
√2 is irrational because assuming rationality
leads to a contradiction.
</answer>"
Supervised fine-tune on a small set of curated long-CoT examples in this format (the "cold start" data).
Stage 2 — Reasoning-focused RL
- RL (GRPO) with rule-based rewards on the accuracy and format of the
final answer; long <think> traces emerge rather than being rewarded directly.
- <answer> stays in a fixed template.
Stage 3 — Rejection sampling + SFT
- Collect only the trajectories that reached the correct answer; use them for re-supervised training.
Stage 4 — Full RLHF + preference
- <answer> is shaped for helpfulness / harmlessness.
S: How long is <think> in practice?
P: R1's think averages 2000-4000 tokens, stretching to 16K+ on difficult problems. With a 32K total context, think can occupy 28K sometimes.
S: So max_seq_len has to be large.
P: And KV sharing matters — if you throw away think's KV before answer starts, you waste memory. R1 shares.
S: How is Quiet-STaR different?
P: Zelikman et al. (NeurIPS 2024). Alternative approach:
| | R1 | Quiet-STaR |
|---|---|---|
| Phase shape | One long think, then answer | Short rationale per token |
| Rationale length | 2K-16K | 16 (short) |
| Transition | <think_end> special token | Fixed-length steps |
| Training signal | Final-answer reward | Next-token loss reduction |
Both fit the same execution_modes schema.
3.3 YAML derivation
S: Let's build R1's YAML from empty.
P: Base is DeepSeek-V3-like. I'll keep MLA and drop MoE to stay simple.
schema_version: 1
model:
name: r1-clone
d_model: 1024 # educational
vocab_size: 32000
max_seq_len: 16384 # think(8K) + answer(2K) + headroom
n_heads: 16
  n_kv_heads: 16 # no GQA — MLA handles KV compression (cf. Case 1, Trap 5)
mlp_ratio: 4
tokenizer_contract:
type: hf
pretrained: gpt2
add_bos: true
add_eos: true
embedding:
type: learned
positional: rope
rope_theta: 500000.0 # long context
tie_word_embeddings: true
S: How do I pick max_seq_len?
P: sum(execution_modes.max_tokens) + buffer. Here 8192 + 2048 +
6144 = 16384. The execution_modes_budget realism rule auto-checks it.
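A sketch of the budget rule (it mirrors the description above; EulerStack's real execution_modes_budget check may differ in details):

```python
def check_budget(modes, max_seq_len):
    # sum of per-mode generation budgets must fit inside the context window
    used = sum(m["max_tokens"] for m in modes)
    assert used <= max_seq_len, f"budget {used} exceeds max_seq_len {max_seq_len}"
    return max_seq_len - used  # leftover headroom for the prompt

modes = [{"name": "think", "max_tokens": 8192},
         {"name": "answer", "max_tokens": 2048}]
print(check_budget(modes, 16384))  # 6144 tokens of headroom
```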
S: Template:
P: A single MLA attention template.
layer_templates:
reasoner:
mixer:
type: attention
attention:
latent_dim: 512 # d_model/2 — helps long think
qkv_bias: false
ffn:
type: gated_mlp
activation: swiglu
norm: { type: rmsnorm, position: pre }
state: { kv_cache: true }
(Real R1 also has MoE; we drop it for brevity.)
layer_schedule:
- template: reasoner
repeat: 24
head:
type: causal_lm
tie_weights: true
compatibility:
compile_target: huggingface
S: That's basically V3. Where does R1 kick in?
P: Only the execution_modes block at the bottom.
execution_modes:
- name: think
max_tokens: 8192
kv_share: true # answer reuses think's KV
loss_weight: 0.0 # excluded from primary LM loss
visible_to_user: false # hidden from user at serve time
per_token_rationale: false # R1 style, not Quiet-STaR
- name: answer
max_tokens: 2048
kv_share: false
loss_weight: 1.0 # answer participates in LM loss
visible_to_user: true
per_token_rationale: false
transition:
type: special_token
token: "<think_end>"
S: Explain kv_share: true once more.
P: During think, the model builds a KV cache. When answer starts,
continue using that same cache rather than starting fresh. If
false, answer begins with empty KV and has to "re-read" think —
wasteful. R1 is true on think.
S: per_token_rationale?
P: Quiet-STaR's switch. R1 keeps it false. Quiet-STaR flips it on:
execution_modes:
- name: rationale
max_tokens: 16
per_token_rationale: true
loss_weight: 0.1
visible_to_user: false
- name: answer
max_tokens: 256
loss_weight: 1.0
visible_to_user: true
S: What transition.type values exist?
P:
- special_token: phase switches when a token like <think_end> is generated. R1 uses this.
- budget_exhausted: force a switch when max_tokens is hit. Anti-infinite-loop guard.
special_token is the common primary; budget_exhausted is a
fallback. Many recipes combine them as "whichever happens first".
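The "whichever happens first" policy can be sketched as a minimal loop (pure Python with a stubbed sampler; run_mode and the token strings are illustrative, not the EulerStack runtime):

```python
THINK_END = "<think_end>"

def run_mode(sample, max_tokens, end_token=THINK_END):
    """Generate until the end token appears or the budget is exhausted.
    `sample` is any callable returning the next token string."""
    out = []
    for _ in range(max_tokens):
        tok = sample()
        if tok == end_token:          # special_token transition
            return out, "special_token"
        out.append(tok)
    return out, "budget_exhausted"    # fallback transition

# Stub sampler that "thinks" for three tokens, then emits the end token.
stream = iter(["step1", "step2", "step3", THINK_END])
tokens, reason = run_mode(lambda: next(stream), max_tokens=8192)
print(tokens, reason)  # → ['step1', 'step2', 'step3'] special_token
```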
3.4 Validate & compile
S: Validate:
P:
eulerstack validate --preset r1_clone.yml --report
[validation] schema ... PASS
[validation] cross-field ... PASS
- MLA latent_dim=512 < d_model=1024 ✓
- d_model % n_heads == 0 (head_dim=64) ✓
[validation] execution_modes ... PASS
- 2 modes: [think, answer]
- transition: special_token ("<think_end>") ✓
- execution_modes_budget: 8192 + 2048 = 10240 ≤ max_seq_len=16384 ✓
[params] estimated: ~480M
[realism] PASS
- kv_share=true + think-before-answer: canonical R1 shape
OK: r1_clone.yml is valid.
S: What ends up in config.v1_extensions after compile?
P:
from transformers import AutoConfig
cfg = AutoConfig.from_pretrained("./r1_clone", trust_remote_code=True)
print(cfg.v1_extensions)
# {
# 'execution_modes': [...],
# 'transition': {'type': 'special_token', 'token': '<think_end>'},
# 'schedule_kinds': [...]
# }
This is the shared contract the training script and the serving
generate() both read.
3.5 Parameter tuning guide
S: How big should think.max_tokens be?
P: Depends on difficulty.
| Use case | think max_tokens | answer max_tokens |
|---|---|---|
| Simple QA (history, trivia) | 512 | 512 |
| Mid-level math proof | 2048 | 1024 |
| Olympiad math / code | 8192 | 2048 |
| Multi-step agent | 16384 | 4096 |
| Open-research hard | 32768 | 8192 |
R1 recommends 8K think as default.
S: loss_weight: 0.0 — does that mean think isn't trained?
P: Not trained by primary LM loss. The RL stage uses a separate reward. If you train think with LM loss you end up memorising long strings, which hurts generalisation.
Quiet-STaR uses loss_weight: 0.1 instead — small weight, so LM loss
does contribute a little. Different philosophy — "learn rationales
that improve next-token prediction".
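A minimal sketch of per-mode loss weighting, assuming each token carries a mode tag (illustrative pure Python, not EulerStack's training code):

```python
MODE_WEIGHTS = {"think": 0.0, "answer": 1.0}  # R1 style; Quiet-STaR would use 0.1 for rationales

def weighted_lm_loss(token_losses, token_modes, weights=MODE_WEIGHTS):
    """Scale each token's LM loss by its mode's weight, then average
    over the total weight so zero-weight modes don't dilute the loss."""
    scaled = [l * weights[m] for l, m in zip(token_losses, token_modes)]
    denom = sum(weights[m] for m in token_modes) or 1.0
    return sum(scaled) / denom

losses = [2.0, 1.5, 1.0, 0.5]
modes  = ["think", "think", "answer", "answer"]
print(weighted_lm_loss(losses, modes))  # → 0.75 (only answer tokens count)
```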
S: When should kv_share be false?
P: Two cases:
- Multi-turn reasoning where each turn is an independent think. Don't share across turns.
- Think-restart protocols that discard failed thinks and retry.
Default: think.kv_share: true, answer.kv_share: false. Answer reuses
think's KV, but answer's own KV is discarded when done.
3.6 Traps & debugging
Trap 3.1. sum(execution_modes.max_tokens) > model.max_seq_len
model: { max_seq_len: 4096 }
execution_modes:
- { name: think, max_tokens: 8192 } # ← budget overflow
- { name: answer, max_tokens: 2048 }
→ execution_modes_budget realism warning. Position overflow at
runtime.
Trap 3.2. transition.token not in the tokenizer vocab
transition:
type: special_token
token: "<think_end>" # without registering, becomes unk
→ Validates fine, but at runtime the tokenizer splits it into "<",
"think_end", etc., so transition detection fails. Register the special token:
tokenizer.add_special_tokens({"additional_special_tokens": ["<think_end>"]})
Trap 3.3. "Declaring execution_modes makes it automatically work"
→ EulerStack only preserves the metadata. Phase-aware generate and
RL training must be implemented by the user. Sample generate loop
in examples/r1_generate.py (roadmap v1.2).
Trap 3.4. Using per_token_rationale: true in an R1 mode
→ R1 is "one long think". per_token_rationale is Quiet-STaR-only.
Easy to confuse.
Trap 3.5. kv_share: true but position indexing resets
→ If answer reuses KV, its position indices must continue rather than
reset: with 0-based positions, answer's first token sits at index
len(think), not 0. Easy to miss in implementations.
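A sketch of the continuation rule (phase_position_ids is a hypothetical helper, not part of the library):

```python
def phase_position_ids(prev_len, n_new, kv_share=True):
    """Position ids for a new phase. With kv_share the indices continue
    from the cached prefix; without it they restart at 0."""
    start = prev_len if kv_share else 0
    return list(range(start, start + n_new))

think_len = 5
print(phase_position_ids(think_len, 3, kv_share=True))   # → [5, 6, 7]
print(phase_position_ids(think_len, 3, kv_share=False))  # → [0, 1, 2]
```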
Trap 3.6. "loss_weight=0 → not learned" so skip RL too
→ loss_weight is the LM-loss weight; RL reward is separate. If
both are 0, nothing is learned. R1: LM loss 0, RL reward based on
answer correctness.
3.7 Combining with other primitives
3.7.1 R1 + MLA (recommended)
The YAML above already combines them. Long think makes KV cache the bottleneck, and MLA is a natural partner.
3.7.2 R1 + Titans memory
layer_templates:
reasoner_with_memory:
mixer: { type: attention, attention: { latent_dim: 512 } }
ffn: { type: gated_mlp }
memory:
type: neural_memory
update_at_inference: true
params: { hidden: 2048 }
inner_lr: 0.001
persistence: session # think accumulates insight, answer uses it
Experimental — think populates memory; answer consults it. No established academic result yet, but an interesting research direction.
3.7.3 R1 + MoE
Real R1 rides on V3, so it has MoE. We dropped it for clarity; add it back for scale:
ffn:
type: moe
moe: { experts: 64, top_k: 6, router: softmax, capacity_factor: 1.25 }
3.7.4 R1 + Neural-ODE integrator
layer_schedule:
- integrator:
type: ode_rk4
steps: 4
body: reasoner
output: token
Research — interpret reasoning as a continuous ODE flow. CoT as a dynamical system. Early days.
3.8 Further reading
- Paper: "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning," arXiv:2501.12948 (2025)
- Prior: OpenAI "Learning to Reason with LLMs" blog (o1 announcement, 2024-09)
- Related: Zelikman et al., "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking," arXiv:2403.09629 (NeurIPS 2024)
- Follow-ups: o3 technical report, Kimi k1.5 report
- EulerStack preset: configs/presets/arch_expert_reasoning_r1.yml
- Related tutorial: 09 new primitives §12
- Next experiment: add MLA to arch_expert_reasoning_r1, push think.max_tokens 8192 → 16384 and rope_theta up to 1M, then verify a true long-think flow still works.
Case 4. Titans — neural memory that learns at inference time
4.1 Background & motivation
S: "Learns at test time" — does that even make sense?
P: It genuinely does. Titans (Behrouz, Zhong, Mirrokni — Google, 2024-2025) makes it academically rigorous. The idea:
Only a small "memory module" attached to each layer — not the whole model — takes inner gradient steps during inference.
S: Why do we need it? We have a KV cache.
P: Watch what KV cache can't do.
| Mechanism | Parametric? | Grows with seq len? | Update mode |
|---|---|---|---|
| KV cache | ✗ | ✓ linear growth | Token append only |
| Mamba state | ✗ (fixed size) | ✗ | Recurrent update |
| Titans memory | ✓ | ✗ | Gradient step |
Titans' twist: it's a small parametric network whose parameters get gradient-updated during inference. KV stores "what was seen"; Titans memory stores "what was learned from seeing it".
S: Any prior work?
P:
| Work | Year | Idea | Relation to Titans |
|---|---|---|---|
| Memorizing Transformers (Wu et al.) | ICLR 2022 | External KV memory + k-NN lookup | Non-parametric ancestor |
| RETRO (Borgeaud et al.) | ICML 2022 | Retrieval-augmented | External DB; Titans is internal |
| Infini-attention (Google, 2024) | 2024-04 | Compressive memory with MLP | Closest precursor |
| TTT layer (Sun et al., 2024) | 2024-07 | Inner-optim for the whole mixer | Layer-wide; Titans is a side-module |
| Titans (Behrouz et al.) | 2024-12 | Small MLP per layer, inner SGD at inference | ★ |
Think of Titans as the TTT-style inner-optim idea restricted to a small auxiliary memory module: a pragmatic piece of engineering.
S: What can it actually do?
P: From the paper's benchmarks:
- Needle-in-haystack at 2M context: beats Transformer + RAG.
- Ongoing-conversation consistency: session-wide coherence improves.
- Long-form knowledge accumulation: reading a book gradually builds a summary inside memory.
Regular LLMs forget once context overflows; Titans remembers by "summarising into memory." That's what "test-time learning" actually buys.
4.2 Mechanism in depth
S: Math please.
P: The Titans memory module M is a small MLP:
M_θ(x) = W_2 · GeLU(W_1 · LayerNorm(x) + b_1) + b_2
θ = {W_1, b_1, W_2, b_2, m_state}
m_state ∈ R^h ← "persistent memory vector"
Forward is a standard residual:
y = x + g(x) ⊙ M_θ(x) where g(x) = sigmoid(W_g x)
g(x) gates memory's contribution per token.
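The two equations above can be checked with a toy numpy sketch (sizes and weight names are illustrative; the zero-initialised output projection mirrors the near-identity start discussed later under the traps):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 16  # toy sizes; the text uses d_model=768, hidden=1024

W1, b1 = rng.normal(0, 0.1, (h, d)), np.zeros(h)
W2, b2 = np.zeros((d, h)), np.zeros(d)  # zero-init out_proj → module starts near-identity
Wg = rng.normal(0, 0.1, (d, d))

def layernorm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def gelu(z):  # tanh approximation of GeLU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def memory(x):   # M_θ(x) = W_2 · GeLU(W_1 · LayerNorm(x) + b_1) + b_2
    return W2 @ gelu(W1 @ layernorm(x) + b1) + b2

def forward(x):  # y = x + g(x) ⊙ M_θ(x), g(x) = sigmoid(W_g x)
    g = 1 / (1 + np.exp(-(Wg @ x)))
    return x + g * memory(x)

x = rng.normal(size=d)
print(np.allclose(forward(x), x))  # → True (zero-init W2 makes M_θ ≡ 0)
```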
S: The "inference-time update"?
P: After each forward an optional hook runs:
# 1. compute "surprise": how well memory reconstructs the current hidden
surprise = || M_θ(x) - x ||²
# 2. gradient step on memory params only (not base model!)
θ ← θ - η · ∇_θ surprise
S: Why reconstruction MSE for surprise?
P: Information-theoretic intuition. If memory can't reconstruct x, then x is "unexpected" → worth remembering. Large update, memory learns x. If x is already well-predicted, update is small — already known.
┌───────────────────────────────────────────────────────┐
│ Surprise learning loop │
├───────────────────────────────────────────────────────┤
│ │
│ token_t ──→ layer ──→ hidden_t │
│ │ │
│ ┌───────┴───────┐ │
│ │ │ │
│ forward surprise │
│ │ │ │
│ ▼ ▼ │
│ y = x + M(x) || M(x) - x ||² │
│ │ │
│ ▼ │
│ ∇_θ surprise │
│ │ │
│ ▼ │
│ θ ← θ - η · ∇_θ │
│ │
└───────────────────────────────────────────────────────┘
S: What prevents gradient leaking into the base model?
P: Gradients are isolated to the memory parameters:
# pseudo-code
with torch.enable_grad():
for p in memory.parameters():
p.requires_grad_(True)
for p in base_model.parameters():
p.requires_grad_(False)
surprise = mse(memory(x_det), x_det) # x_det = x.detach()
surprise.backward()
with torch.no_grad():
for p in memory.parameters():
p.sub_(p.grad, alpha=inner_lr)
p.grad = None
x is detached so no grad flows into the outer graph. Only memory
parameters move.
S: persistence?
P: Lifetime policy:
- per_query: reset to initial state after each forward. Single-shot.
- session: persist through a conversation; reset at the end. Default.
- persistent: never reset; serialised to model.safetensors on save_pretrained. Carries over across sessions.
persistent is dangerous for multi-user serving — every user's data
lands in the same memory. Serving should stay on session with per-
user model instances.
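The three lifetimes reduce to a small reset policy, sketched here in pure Python (the event names and should_reset are illustrative, not the EulerStack API):

```python
def should_reset(persistence, event):
    """Decide whether memory resets at a lifecycle event.
    Events: 'after_forward', 'session_end', 'process_exit'."""
    policy = {
        "per_query":  {"after_forward"},  # single-shot
        "session":    {"session_end"},    # default
        "persistent": set(),              # never reset; serialised instead
    }
    return event in policy[persistence]

print(should_reset("session", "after_forward"))  # → False
print(should_reset("session", "session_end"))    # → True
print(should_reset("persistent", "session_end")) # → False
```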
4.3 YAML derivation
S: Empty YAML first.
P:
schema_version: 1
model:
name: titans-demo
d_model: 768
vocab_size: 32000
max_seq_len: 8192
n_heads: 12
n_kv_heads: 4
mlp_ratio: 4
tokenizer_contract:
type: hf
pretrained: gpt2
add_bos: true
add_eos: true
embedding:
type: learned
positional: rope
rope_theta: 500000.0
tie_word_embeddings: true
S: Now memory enters the template.
P: Yes. The key piece.
layer_templates:
attn_with_memory:
mixer:
type: attention
attention:
qkv_bias: false
ffn:
type: gated_mlp
activation: swiglu
norm: { type: rmsnorm, position: pre }
state: { kv_cache: true }
memory:
type: neural_memory
update_at_inference: true # enable grad step at inference
params:
hidden: 1024 # internal MLP size
inner_lr: 0.001 # SGD step size
persistence: session # per_query | session | persistent
S: How do I choose hidden: 1024?
P: That knob controls memory's expressive capacity.
| hidden | Added params (d=768) | Use case |
|---|---|---|
| 256 | ~0.4M/layer | Minimal — memory mostly idle |
| 512 | ~0.8M/layer | Light — basic fact memorisation |
| 1024 | ~1.6M/layer | Standard — Titans paper default |
| 2048 | ~3.2M/layer | Larger — long-form summarisation |
| 4096 | ~6.4M/layer | Excessive — overfitting risk |
1× to 2× d_model is the sweet spot.
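The table's per-layer figures follow from simple arithmetic, counting the two projection matrices plus biases (the gate is excluded, which approximately reproduces the table; memory_params is an illustrative helper):

```python
def memory_params(d_model, hidden):
    """Approximate added params per layer for the Titans memory MLP:
    W1 (hidden×d) + W2 (d×hidden) + the two bias vectors."""
    return d_model * hidden + hidden * d_model + hidden + d_model

for h in (256, 512, 1024, 2048, 4096):
    print(h, f"~{memory_params(768, h) / 1e6:.1f}M/layer")
# hidden=1024 gives ~1.6M/layer, matching the table and the
# "~29M across 18 layers" line in the validation report.
```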
S: inner_lr: 0.001?
P: Bigger than outer LR (AdamW 3e-4) is OK — inner updates only
touch memory and only one step at a time.
- 0.0001: very conservative, barely updates
- 0.001: standard
- 0.01: aggressive, fast learning, possible instability
- 0.1+: dangerous — memory overreacts to recent surprises
Titans paper uses 0.001.
S: Does memory need to be on every layer?
P: Optional. For example, only the last 8 of 24 layers:
layer_templates:
plain_attn:
mixer: { type: attention, attention: {} }
ffn: { type: gated_mlp }
attn_with_memory:
mixer: { type: attention, attention: {} }
ffn: { type: gated_mlp }
memory: { ... }
layer_schedule:
- template: plain_attn
repeat: 16 # first 16 — plain
- template: attn_with_memory
repeat: 8 # last 8 — memory
Intuition: "early layers recognise patterns, late layers accumulate memory."
S: Full YAML:
P:
layer_schedule:
- template: attn_with_memory
repeat: 18
head:
type: causal_lm
tie_weights: true
compatibility:
compile_target: huggingface
4.4 Validate & compile
S: Validate:
P:
[validation] schema ... PASS
[validation] memory ... PASS
- type: neural_memory ∈ {neural_memory, associative, retrieval} ✓
- persistence: session ∈ {per_query, session, persistent} ✓
- params.hidden: 1024 > 0 ✓
[params] estimated: ~310M
- base attention + ffn: ~280M
- Titans memory × 18 layers: ~29M
[realism] PASS
OK: titans_demo.yml is valid.
S: How do I confirm memory actually instantiated?
P:
from transformers import AutoModelForCausalLM
from eulerstack.hf.auto_register import register_eulerstack_auto_classes
from eulerstack.components.titans_memory import TitansMemoryModule
register_eulerstack_auto_classes()
model = AutoModelForCausalLM.from_pretrained("./titans_demo",
trust_remote_code=True)
mems = [m for m in model.modules() if isinstance(m, TitansMemoryModule)]
print(f"Titans memory modules: {len(mems)}")
# → Titans memory modules: 18
print(f"params per memory: {sum(p.numel() for p in mems[0].parameters()):,}")
# → params per memory: ~1.6M
S: How to invoke the inference-time update?
P: There's a standardised hook on the model.
model.eval()
ids = tokenizer("Paris is the capital of France.",
return_tensors="pt").input_ids
# 1. forward
out = model(ids, output_hidden_states=True)
# 2. standard hook — this one line *is* test-time learning
surprise = model.step_memory_at_inference(out.hidden_states[-1])
print(surprise)
# {
# 'eulerstack.layers.0.titans_memory': 0.4231,
# 'eulerstack.layers.1.titans_memory': 0.3894,
# ...
# }
Each value is the surprise loss per layer's memory. Over time these values should trend down — memory is getting better at predicting the incoming hiddens, i.e. "learning."
4.5 Parameter tuning guide
S: Which layer depths should carry memory?
P: Two approaches:
Strategy A — every layer (paper default)
layer_schedule:
- template: attn_with_memory
repeat: 24
+10-15 % params, everyone contributes.
Strategy B — back half only (efficiency)
layer_schedule:
- template: plain_attn
repeat: 18
- template: attn_with_memory
repeat: 6 # back 25%
+3 % params, empirically retains ~75 % of memory's gains. Recommended.
Strategy C — middle only (hypothesis)
layer_schedule:
- template: plain_attn
repeat: 8
- template: attn_with_memory
repeat: 8 # middle third
- template: plain_attn
repeat: 8
Research hypothesis: memory helps most on representational-middle layers. Under study.
S: When to use update_at_inference: false?
P: Training-only mode. Outer optimiser learns memory, but at inference it's frozen; the hook is a no-op.
Use cases:
- A/B testing serving — avoid memory drift.
- Train once, freeze for serving — same memory for all users.
- Debugging — isolate effects of inference updates.
Production default: session persistence + update_at_inference: true.
4.6 Traps & debugging
Trap 4.1. persistence: persistent in multi-user serving
→ Privacy/security disaster. All users' data accumulates in the
same memory. A's memory leaks into B's answers. Serving must use
session with isolated per-session model instances.
Trap 4.2. update_at_inference=true + huge inner_lr
memory:
update_at_inference: true
inner_lr: 0.5 # ← too big
→ After a few tokens memory diverges. Answer quality degrades mid-
session. Keep inner_lr ≤ 0.001.
Trap 4.3. Memory on every mixer type indiscriminately
layer_templates:
mamba_with_memory:
mixer: { type: mamba, ... }
memory: { type: neural_memory, ... }
→ Legal, but Mamba already "remembers" via state. Roles overlap and training gets unstable. Keep memory on attention layers.
Trap 4.4. params.hidden < d_model/2
memory:
params: { hidden: 128 } # too small for d_model=768
→ Memory lacks capacity and "remembers nothing". Use
hidden ≥ d_model/2 as minimum.
Trap 4.5. Surprise is ~0 from the start
surprise = model.step_memory_at_inference(hidden)
print(surprise) # {'...': 0.0001, ...}
→ TitansMemoryModule.out_proj.weight is initialised to zero (so the
module starts near-identity). Expected for fresh models. Run a few
training steps, then re-check.
Trap 4.6. Surprise trending up during inference
→ Two common causes:
1. inner_lr too big → overshooting. Lower it.
2. Input distribution shifted drastically (e.g. different language).
Memory is adapting. Observe whether it settles after 10-20 tokens.
4.7 Combining with other primitives
4.7.1 Titans + MLA
layer_templates:
mla_with_memory:
mixer:
type: attention
attention: { latent_dim: 384 }
ffn: { type: gated_mlp }
memory:
type: neural_memory
update_at_inference: true
params: { hidden: 1024 }
inner_lr: 0.001
Strong — MLA shrinks KV cache; memory handles long-term recall. Particularly effective in 128K+ contexts.
4.7.2 Titans + R1 (execution_modes)
layer_templates:
reasoner_with_memory:
mixer: { type: attention, attention: { latent_dim: 512 } }
ffn: { type: gated_mlp }
memory:
type: neural_memory
update_at_inference: true
params: { hidden: 2048 } # bigger memory
inner_lr: 0.001
persistence: session
execution_modes:
- { name: think, max_tokens: 8192, kv_share: true, loss_weight: 0.0 }
- { name: answer, max_tokens: 2048, loss_weight: 1.0 }
High research value — think accumulates insight into memory; answer consults it. Multi-turn sessions can "carry" earlier reasoning. Likely a heavily-studied combination in 2025-26.
4.7.3 Titans + Jamba
layer_templates:
attn_moe_with_memory:
mixer: { type: attention, attention: {} }
ffn:
type: moe
moe: { experts: 8, top_k: 2, router: softmax, capacity_factor: 1.25 }
memory:
type: neural_memory
update_at_inference: true
params: { hidden: 1024 }
Memory attaches only to the attention layers (the 1-in-8 in a Jamba-style schedule). O(N) + retrieval + long-term remembrance.
4.7.4 Titans + TTT (double inner-optim)
Experimental — TTT already uses inner gradient; adding Titans memory means two inner optimisers in the same model. Stability hard to reason about; no strong empirical reports yet. Research topic.
4.8 Further reading
- Paper: Behrouz, Zhong, Mirrokni, "Titans: Learning to Memorize at Test Time," Google (2024-12 / 2025)
- Predecessor: Wu et al., "Memorizing Transformers," ICLR 2022
- Predecessor: Borgeaud et al., "RETRO," ICML 2022
- Related: Munkhdalai et al., "Infini-attention," arXiv:2404.07143 (2024)
- Conceptual ancestor: Ha & Schmidhuber, "Fast Weights" (2017)
- EulerStack preset: configs/presets/arch_expert_titans_memory.yml
- Related tutorial: 09 new primitives §10
- Tests: tests/test_titans_memory_runtime.py (12 tests — module unit through HF round-trip)
- Next experiment: run arch_expert_titans_memory with inner_lr ∈ {0.0001, 0.001, 0.01}, compare final-session surprise. At what value does "learning" actually happen?
Closing — common shape across the four cases
| Case | Changed primitive | EulerStack field | Runtime (v1.1) | Core idea |
|---|---|---|---|---|
| DeepSeek-V3 | KV-compressed attention | attention.latent_dim | ✅ Core | Trade dimensionality for memory; keep quality |
| Jamba | Mamba/attention hybrid + MoE | layer_schedule + ffn.type: moe | ✅ Core | 7:1 mix captures both speed and ICL |
| DeepSeek-R1 | Reasoning-phase separation | execution_modes: + transition: | ✅ Core (metadata) | Structure unchanged, recipe changes |
| Titans | Test-time parametric memory | template.memory: + step_memory_at_inference | ✅ Core (v1.1) | Grad step on memory only, at inference |
Four common lessons
- "One YAML file ≈ one paper." In every case the architectural change lives in 5-20 lines of YAML, not 200 lines of modeling_custom.py.
- Orthogonality by design — MLA doesn't touch FFN, MoE doesn't touch attention, execution_modes doesn't touch structure, Titans memory doesn't touch the forward path. This is why combinations just work.
- Metadata first, runtime progressively — schema-first. R1's execution_modes round-trips even before generate is fully wired. Titans shipped schema in v1.0 and runtime in v1.1.
- Lego blocks really compose — combinations are arithmetic: MLA × Jamba × R1 × Titans adds up to a four-paper YAML. The §1.7 / §2.7 / §3.7 / §4.7 combination examples roll up into arch_expert_kitchen_sink.yml.
Next
- Port your own 5th paper using the same eight-step template: background → mechanism → YAML → validate → tuning → traps → combinations → further reading. If your YAML passes eulerstack validate --report you almost certainly captured the paper correctly.
- Diff the four flagship presets (arch_advanced_mla, arch_advanced_jamba, arch_expert_reasoning_r1, arch_expert_titans_memory) to see how "one primitive changes at a time" looks in practice.
- The capstone: arch_expert_kitchen_sink combines every v1 primitive in one spec, and tests/test_kitchen_sink_preset.py auto-verifies that the full pipeline (validate → compile → save/load → HF training) still works.
- Revisit Tutorial 9 for per-field reference after this high-level tour.
- Browse all 53 presets: eulerstack --lang en presets list.