8. Expert Mini Preset Walkthrough
This tutorial introduces the 6 arch_expert_*_mini presets. All of them are in
the ~80M–360M parameter range — small-scale counterparts of the ~2B
arch_expert_* variants covered in 07_arch_walkthrough.md.
These sizes are trainable on a single consumer GPU, which makes them the ideal
step for "touching the design space by running code" before committing to a full
2B training run.
Why Mini Presets Exist
The full-scale arch_expert_* presets are unified at ~2B parameters so that
architectural choices can be compared at a fixed budget. But ~2B is:
- Tight even on RTX 3090/4090/5090 with AdamW + bf16
- Multiple hours per ablation → iteration is slow
- Expensive just to "try an idea"
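To make the VRAM point concrete, here is a rough back-of-the-envelope sketch. The byte counts assume a typical mixed-precision AdamW setup (bf16 weights and grads, fp32 optimizer states and master weights); they are illustrative, not measured numbers, and ignore activations and framework overhead.

```python
# Rough sketch: why ~2B parameters is tight on a 24 GB consumer GPU.
# Assumed byte counts for mixed-precision AdamW; activations excluded.

def adamw_memory_gb(n_params: float) -> float:
    bytes_per_param = (
        2      # bf16 weights
        + 2    # bf16 gradients
        + 4    # fp32 Adam first moment (m)
        + 4    # fp32 Adam second moment (v)
        + 4    # fp32 master weights (common in mixed precision)
    )
    return n_params * bytes_per_param / 1024**3

print(f"2B model:  {adamw_memory_gb(2e9):.0f} GB")    # well past 24 GB
print(f"150M mini: {adamw_memory_gb(150e6):.1f} GB")  # fits easily
```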
Mini presets solve this by keeping the ideas while shrinking the scale. Common principles:
- d_model 384–512
- max_seq_len 4K–16K (full-scale uses 8K–128K)
- ~12 layers (full-scale: 32)
- MoE experts reduced to 4–16 with top_k 1–2
- Dropout 0.1 to prevent overfitting (full-scale uses 0.0)
How Shrinking Changes Architectural Meaning (Important)
MoE often does NOT work as well at mini scale as at 2B+. Reasons:
- Weaker router specialization: experts need enough width AND data to carve out
  meaningful subspaces. With d_model=384 the bandwidth is narrow, making router
  collapse (one expert dominates) much more likely.
- Routing overhead is relatively larger: in small models, the MoE router and
  dispatch cost become proportionally larger, and per-token FLOPs may actually
  exceed those of a dense model with the same active-param count.
- Fine-grained routing benefit shrinks: DeepSeekMoE's "32 × top-3" strategy has
  public evidence mainly at 2B+. Below that, thin experts fail to specialize and
  fine-grained routing often underperforms dense.
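Router collapse is easy to monitor in practice. A minimal sketch (not eulerstack's actual routing code) measures the entropy of the per-expert load distribution; a healthy router spreads load near-uniformly, while a collapsed one funnels almost everything to one expert.

```python
# Sketch (not eulerstack's code): detecting router collapse via the
# entropy of each expert's average routing load.
import math

def load_entropy(expert_load: list[float]) -> float:
    """Entropy (nats) of the per-expert load distribution."""
    total = sum(expert_load)
    probs = [x / total for x in expert_load]
    return -sum(p * math.log(p) for p in probs if p > 0)

healthy   = [0.26, 0.24, 0.25, 0.25]   # 4 experts, near-uniform load
collapsed = [0.97, 0.01, 0.01, 0.01]   # one expert dominates

print(load_entropy(healthy))    # close to log(4) ≈ 1.386
print(load_entropy(collapsed))  # close to 0
```

Watching this quantity drop toward zero in the first few hundred steps is the symptom the later presets are designed to expose.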
The minis in this tutorial are designed so you can directly experience these
failure modes. deepseek_moe_mini and frontier_full_moe_mini are deliberately
positioned as "show the failure mode" teaching presets. Poor performance on
those two presets is an intended outcome, not a bug.
Recommended Experiment Order
1. arch_expert_progressive_stack_mini ← start here (clean zones, easy to debug)
2. arch_expert_blackmamba_moe_mini ← partial sparse — most plausible mini MoE
3. arch_expert_mixtral_moe_mini ← classical MoE baseline
4. arch_expert_dilated_longnet_mini ← long-context pyramid experiment
5. arch_expert_deepseek_moe_mini ← observe fine-grained MoE failure
6. arch_expert_frontier_full_moe_mini ← most experimental (failure expected)
The same order is enforced by the integration test
(tests/integration/expert_mini_e2e/conftest.py::_RECOMMENDED_ORDER).
Run:
eulerstack presets list | grep _mini
eulerstack explain --preset configs/presets/arch_expert_progressive_stack_mini.yml
eulerstack compile --preset configs/presets/arch_expert_progressive_stack_mini.yml \
--output-dir ./out/progressive_mini
Mini 01: arch_expert_progressive_stack_mini (RECOMMENDED FIRST)
Depth-wise monotonically increasing mixer complexity. Applies the representation hierarchy hypothesis from vision (CNN → attention) and biology (convolutional early layers → transformer late layers) directly to LLM layer stacks.
Depth zones (12 layers):
- Zone 1 (layers 1–3): Hyena — cheapest (O(N log N) FFT conv), broad pattern capture
- Zone 2 (layers 4–7): Mamba2 — linear-time selective SSM, bulk token summarization
- Zone 3 (layers 8–10): RetNet — chunkwise retention, stable train/infer handoff
- Zone 4 (layers 11–12): Attention + MoE (4 × top-1) — exact recall + conditional capacity
d_model: 512
max_seq_len: 8192
layer_schedule:
- { template: hyena_dense, repeat: 3 }
- { template: mamba_dense, repeat: 4 }
- { template: retnet_dense, repeat: 3 }
- { template: attn_moe, repeat: 2 }
Strengths
- Clean zone separation — replacing a zone with identity is a meaningful ablation
- MoE concentrated in just 2 tail layers → low router-collapse risk
- ~86M total params → trains comfortably on 12–16 GB GPUs
Weaknesses
- Includes all 4 mixer families → correctness depends on all 4 implementations
- Hyena's FFT kernel offers little advantage over attention at short sequences (<256)
Basis: Poli et al. 2023 (Hyena); Gu & Dao 2023/2024 (Mamba); Sun et al. 2023 (RetNet); Fedus et al. 2022 (MoE); vision conv→attention hierarchy literature.
Mini 02: arch_expert_blackmamba_moe_mini
Mamba (linear-time, high inference throughput) serves as the bulk mixer, with sparse attention as "anchors" for exact recall. MoE is applied on the non-attention (Mamba) layers — demonstrating that conditional compute works on SSM families too.
Pattern (12 layers):
(mamba_moe × 3, attn_anchor × 1) × 3 = 12 layers
→ 9 Mamba+MoE + 3 global attention anchors
d_model: 512
max_seq_len: 8192
layer_schedule:
- { template: mamba_moe, repeat: 3 }
- { template: attn_anchor, repeat: 1 }
- { template: mamba_moe, repeat: 3 }
- { template: attn_anchor, repeat: 1 }
- { template: mamba_moe, repeat: 3 }
- { template: attn_anchor, repeat: 1 }
Strengths
- Mamba's O(N) scaling helps even at small scale with long contexts
- 3 attention anchors provide the exact recall that Mamba alone struggles with
- MoE confined to Mamba layers — structurally clean
Weaknesses (small-scale specific)
- MoE (4 × top-1) specialization is weaker than at 2B+
- Router collapse is possible in the first few hundred steps — trust z_loss
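The z_loss mentioned above penalizes large router logits, which helps keep routing numerically stable. The sketch below is in the spirit of the ST-MoE router z-loss (Zoph et al. 2022), not eulerstack's actual loss implementation; the logit values are made up.

```python
# Illustrative router z-loss: mean squared log-sum-exp of per-token
# router logits. Large logits are penalized quadratically.
import math

def router_z_loss(logits: list[list[float]]) -> float:
    total = 0.0
    for token_logits in logits:
        lse = math.log(sum(math.exp(x) for x in token_logits))
        total += lse ** 2
    return total / len(logits)

calm  = [[0.1, -0.2, 0.0, 0.1]]   # small logits -> small penalty
spiky = [[12.0, -3.0, 0.5, 1.0]]  # one huge logit -> large penalty

print(router_z_loss(calm) < router_z_loss(spiky))  # True
```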
Basis: Anthony et al. 2024 (BlackMamba, Zyphra); Pióro et al. 2024 (MoE-Mamba); Gu & Dao 2023/2024 (Mamba2).
Mini 03: arch_expert_mixtral_moe_mini
The small-scale MoE baseline. Pure attention + MoE-FFN on every layer — the
textbook sparse configuration. Useful as a control against which partial-sparse
designs (blackmamba_mini, progressive_stack_mini) should be compared.
d_model: 512
max_seq_len: 4096
layer_schedule:
- { template: attn_moe, repeat: 12 } # attn + MoE (4 × top-2) every layer
Strengths
- Cleanest illustration of "conditional compute for capacity" at small scale
- Compatible with all HF tooling (attention-based)
Weaknesses (small-scale specific)
- Mixtral was 8 × top-2 @ 46.7B — shrinking uniformly to 4 × top-2 @ ~150M means expert specialization barely happens
- Every-layer MoE adds both FLOPs and optimization difficulty
- Likely slightly worse sample efficiency than an equivalent dense attention model of the same active params
Why include it: if "MoE everywhere" underperforms "MoE only on tail", that itself is useful ablation evidence. A classic MoE baseline is educationally essential.
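The total-vs-active distinction used throughout this comparison can be sketched numerically. The dimensions below are illustrative assumptions, not the preset's exact values; the structure (a 2-matrix FFN per expert, with only top_k experts running per token) is the standard MoE-FFN accounting.

```python
# Sketch: total vs. active parameters for one MoE FFN layer.
# d_model/d_ff below are assumed illustrative dimensions.

def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, top_k: int):
    per_expert = 2 * d_model * d_ff  # up- and down-projection matrices
    total = n_experts * per_expert   # all experts exist in memory
    active = top_k * per_expert      # only top_k experts run per token
    return total, active

total, active = moe_ffn_params(d_model=512, d_ff=2048, n_experts=4, top_k=2)
print(f"total FFN params: {total / 1e6:.1f}M")   # 8.4M
print(f"active per token: {active / 1e6:.1f}M")  # 4.2M
```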
Basis: Jiang et al. 2024 (Mixtral); Fedus et al. 2022 (Switch Transformers); Shazeer et al. 2017 (MoE foundations).
Mini 04: arch_expert_dilated_longnet_mini
Small-scale long-context experiment. Temporal pyramid:
- Mamba prefix (2 layers) for O(N) bulk processing
- Sliding-window attention widening stepwise (512 → 2048 → 8192)
- Global attention + MoE tail (2 layers)
RoPE uses scaling_factor=2.0 (linear) for safe positional encoding up to 16K.
d_model: 512
max_seq_len: 16384
rope_scaling: { type: linear, factor: 2.0 }
layer_schedule:
- { template: mamba_prefix, repeat: 2 }
- { template: sw_512, repeat: 2 }
- { template: sw_2048, repeat: 3 }
- { template: sw_8192, repeat: 3 }
- { template: global_moe, repeat: 2 }
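The rope_scaling entry above can be sketched as follows: linear RoPE scaling simply divides positions by the factor before computing rotary angles, so a 16K context reuses the angle range a model trained at 8K has already seen. The head dimension and base below are common defaults assumed for illustration.

```python
# Sketch of linear RoPE position scaling (illustrative, assumed
# d_head=64 and base=10000; not the preset's exact kernel).
import math

def rope_angle(pos: int, dim_pair: int, d_head: int = 64,
               base: float = 10000.0, factor: float = 1.0) -> float:
    inv_freq = base ** (-2 * dim_pair / d_head)
    return (pos / factor) * inv_freq  # linear scaling divides position

# With factor=2.0, position 16384 maps to the same angle as 8192 unscaled:
a = rope_angle(16384, dim_pair=4, factor=2.0)
b = rope_angle(8192, dim_pair=4, factor=1.0)
print(abs(a - b) < 1e-9)  # True
```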
Strengths
- Mimics LongNet's receptive-field expansion without dilated kernels — simpler
- Enables 16K-context experiments at small scale
Weaknesses (small-scale specific)
- 16K context + sliding-window KV cache still burdens VRAM; under 24 GB drop
max_seq_len to 4K but keep the window hierarchy
- Global MoE tail is only 2 layers × 4 experts — specialization gains are
limited (kept for pipeline parity)
Basis: Ding et al. 2023 (LongNet); Lieber et al. 2024 (Jamba); Gu & Dao 2023/2024.
Mini 05: arch_expert_deepseek_moe_mini (⚠ observe-failure preset)
Weakens DeepSeekMoE's fine-grained routing (32 × top-3 @ 2B) to 16 × top-2 @ 350M. This mini is positioned as a teaching vehicle for observing fine-grained MoE failure at small scale.
d_model: 384
max_seq_len: 4096
target_params: 350_000_000 # total (includes 16 experts × MLP); ~60M active
layer_schedule:
- { template: fine_grained_moe, repeat: 12 } # 16 experts × top-2 every layer
Claimed strengths at 2B+
- Splitting experts finely gives each a sharper, more specific function
- Higher top-k increases routing flexibility for complex compositions
Expected failure modes at mini scale (you will feel them)
- Splitting d_model=384 into 16 experts leaves each one too thin to learn
a meaningful subspace
- Top-2 router entropy stays high → slow convergence (budget ≥10k steps)
- Likely worse than a dense model of equivalent active params
- DeepSeek-V2/V3's shared-expert concept is NOT in the current schema — this
is an approximation
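Why the experts end up "too thin" can be shown with simple arithmetic. The sketch assumes the dense FFN hidden size (4 × d_model) is split evenly across experts, a common fine-grained MoE convention; this is not necessarily the preset's exact parameterization.

```python
# Sketch: per-expert hidden width under fine-grained MoE, assuming
# the dense 4*d_model FFN budget is divided evenly across experts.

def expert_hidden(d_model: int, n_experts: int, dense_mult: int = 4) -> int:
    return (dense_mult * d_model) // n_experts

print(expert_hidden(d_model=2048, n_experts=16))  # 512 -> workable at ~2B
print(expert_hidden(d_model=384,  n_experts=16))  # 96  -> very thin at mini scale
```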
Why include it
1. Cheapest reproduction of fine-grained MoE failure
2. Sanity-checks the MoE routing code when experts > heads
3. Contrast against blackmamba_mini (partial sparse) shows how to AVOID
over-finely-grained routing at small scale
Basis: Dai et al. 2024 (DeepSeekMoE); DeepSeek-AI 2024 (V2/V3).
Mini 06: arch_expert_frontier_full_moe_mini (⚠ most experimental — expected to fail)
The most speculative of the 6 minis. The full-scale version combines three frontier ideas at once — the mini keeps the same structure:
- No attention at all (Mamba + Hyena + RetNet only)
- MoE on every layer (4 × top-1)
- Three non-attention mixers rotating (mamba×2, hyena×1, retnet×1) × 3
d_model: 384
max_seq_len: 8192
layer_schedule:
- { template: mamba_moe, repeat: 2 }
- { template: hyena_moe, repeat: 1 }
- { template: retnet_moe, repeat: 1 }
# ... × 3 total → 12 layers, 0 attention
Why failure is expected at mini scale (this preset is a teaching tool)
1. Attention-free = no exact-recall anchor
   - Mamba/Hyena/RetNet approximate in-context recall — none do exact token-level matching as well as even one attention layer
   - At 2B+, parameter capacity compensates; at ~120M that margin is gone
   - Expect collapse on "find the token that appeared N positions ago" tasks
2. Every-layer MoE at d_model=384
   - Router specialization needs both width AND data — both are lacking
   - Router collapse is likely despite z_loss
3. Multi-mixer debugging hell
   - A loss spike could come from any of 3 mixer families — attribution is hard
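The exact-recall task this preset is expected to fail is easy to generate synthetically. Below is a hypothetical probe generator (not part of the eulerstack test suite): the model must reproduce the token that appeared exactly `lag` positions back, which requires copying rather than approximate summarization.

```python
# Hypothetical sketch: a "token N positions ago" recall probe.
# Data format is illustrative; not eulerstack's eval harness.
import random

def make_recall_example(vocab_size: int, seq_len: int, lag: int,
                        seed: int = 0) -> tuple[list[int], int]:
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size) for _ in range(seq_len)]
    target = tokens[seq_len - lag]  # the token exactly `lag` positions back
    return tokens, target

tokens, target = make_recall_example(vocab_size=256, seq_len=128, lag=32)
print(target == tokens[-32])  # True: the answer is a verbatim copy
```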
When to use this mini anyway
- Pipeline sanity check (if this runs without crashes, MoE routing + all 3 mixer implementations are numerically healthy)
- Counterexample of what NOT to attempt at ~100M
- Smoke test for numeric-stability regressions
Basis: Anthony et al. 2024 (BlackMamba); Pióro et al. 2024 (MoE-Mamba); Poli et al. 2023 (Hyena); Sun et al. 2023 (RetNet); Fedus et al. 2022 (Switch).
Comparison Table
| Preset | ~Total | ~Active | Layers | Mixer composition | MoE | Difficulty |
|---|---|---|---|---|---|---|
| progressive_stack_mini | ~86M | ~86M | 12 | Hyena→Mamba→RetNet→Attn+MoE | 2 layers | ⭐ (easy) |
| blackmamba_moe_mini | ~156M | ~90M | 12 | Mamba+MoE + Attn anchor | 9 layers | ⭐⭐ |
| mixtral_moe_mini | ~175M | ~90M | 12 | Attn+MoE (every layer) | 12 layers | ⭐⭐ |
| dilated_longnet_mini | ~83M | ~75M | 12 | Mamba + SW-attn pyramid + Global+MoE | 2 layers | ⭐⭐⭐ |
| deepseek_moe_mini | ~357M | ~60M | 12 | Attn + FG-MoE (16×top-2) | 12 layers | ⭐⭐⭐⭐ |
| frontier_full_moe_mini | ~106M | ~60M | 12 | Mamba/Hyena/RetNet + MoE (all) | 12 layers | ⭐⭐⭐⭐⭐ |
Integration Test
RUN_EXPERT_MINI_E2E=1 python -m pytest tests/integration/expert_mini_e2e/ \
--mini-presets=all -v -s
- Gate: RUN_EXPERT_MINI_E2E=1
- Defaults: 1000 steps, batch_size=1, max_tokens=512, dolma_10k
- Per-model recommended lr: conftest.py::RECOMMENDED_LR (lower for riskier minis)
- Log directory: outputs/expert_mini_e2e/
Details: docs/architectures/tutorials/arch_e2e_guide.md
Next Steps
- 07_arch_walkthrough.md: ~2B full-scale arch_expert_* comparison
- Mixer overview: principles and convergence behavior of each mixer
- Write your own mini: following the design principles (d_model 384–512, ~12 layers, weakened MoE), you can port any idea from 07_arch_walkthrough.md into a mini.