8. Expert Mini Preset Walkthrough
This tutorial introduces the 6 arch_expert_*_mini presets. All of them are in
the ~80M–360M parameter range — small-scale counterparts of the ~2B
arch_expert_* variants covered in 07_arch_walkthrough.md.
These sizes are trainable on a single consumer GPU, which makes them the ideal
step for "touching the design space by running code" before committing to a full
2B training run.
Why Mini Presets Exist
The full-scale arch_expert_* presets are unified at ~2B parameters so that
architectural choices can be compared at a fixed budget. But ~2B is:
- Tight even on RTX 3090/4090/5090 with AdamW + bf16
- Multiple hours per ablation → iteration is slow
- Expensive just to "try an idea"
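To make the VRAM point concrete, here is a rough back-of-the-envelope sketch. The byte counts assume a typical mixed-precision AdamW setup (bf16 weights and grads, fp32 optimizer states and master weights); they are illustrative, not measured numbers, and ignore activations and framework overhead.

```python
# Rough sketch: why ~2B parameters is tight on a 24 GB consumer GPU.
# Assumed byte counts for mixed-precision AdamW; activations excluded.

def adamw_memory_gb(n_params: float) -> float:
    bytes_per_param = (
        2      # bf16 weights
        + 2    # bf16 gradients
        + 4    # fp32 Adam first moment (m)
        + 4    # fp32 Adam second moment (v)
        + 4    # fp32 master weights (common in mixed precision)
    )
    return n_params * bytes_per_param / 1024**3

print(f"2B model:  {adamw_memory_gb(2e9):.0f} GB")    # well past 24 GB
print(f"150M mini: {adamw_memory_gb(150e6):.1f} GB")  # fits easily
```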
Mini presets solve this by keeping the ideas while shrinking the scale. Common principles:
- d_model 384–512
- max_seq_len 4K–16K (full-scale uses 8K–128K)
- ~12 layers (full-scale: 32)
- MoE experts reduced to 4–16 with top_k 1–2
- Dropout 0.1 to prevent overfitting (full-scale uses 0.0)
How Shrinking Changes Architectural Meaning (Important)
MoE often does NOT work as well at mini scale as at 2B+. Reasons:
- Weaker router specialization: experts need enough width AND data to carve out
  meaningful subspaces. With d_model=384 the bandwidth is narrow, making router
  collapse (one expert dominates) much more likely.
- Routing overhead is relatively larger: in small models, the MoE router and
  dispatch cost become proportionally larger, and per-token FLOPs may actually
  exceed those of a dense model with the same active-param count.
- Fine-grained routing benefit shrinks: DeepSeekMoE's "32 × top-3" strategy has
  public evidence mainly at 2B+. Below that, thin experts fail to specialize and
  fine-grained routing often underperforms dense.
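Router collapse is easy to monitor in practice. A minimal sketch (not eulerstack's actual routing code) measures the entropy of the per-expert load distribution; a healthy router spreads load near-uniformly, while a collapsed one funnels almost everything to one expert.

```python
# Sketch (not eulerstack's code): detecting router collapse via the
# entropy of each expert's average routing load.
import math

def load_entropy(expert_load: list[float]) -> float:
    """Entropy (nats) of the per-expert load distribution."""
    total = sum(expert_load)
    probs = [x / total for x in expert_load]
    return -sum(p * math.log(p) for p in probs if p > 0)

healthy   = [0.26, 0.24, 0.25, 0.25]   # 4 experts, near-uniform load
collapsed = [0.97, 0.01, 0.01, 0.01]   # one expert dominates

print(load_entropy(healthy))    # close to log(4) ≈ 1.386
print(load_entropy(collapsed))  # close to 0
```

Watching this quantity drop toward zero in the first few hundred steps is the symptom the later presets are designed to expose.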
The minis in this tutorial are designed so you can directly experience these
failure modes. deepseek_moe_mini and frontier_full_moe_mini are deliberately
positioned as "show the failure mode" teaching presets. Poor performance on
those two presets is an intended outcome, not a bug.
Recommended Experiment Order
1. arch_expert_progressive_stack_mini ← start here (clean zones, easy to debug)
2. arch_expert_blackmamba_moe_mini ← partial sparse — most plausible mini MoE
3. arch_expert_mixtral_moe_mini ← classical MoE baseline
4. arch_expert_dilated_longnet_mini ← long-context pyramid experiment
5. arch_expert_deepseek_moe_mini ← observe fine-grained MoE failure
6. arch_expert_frontier_full_moe_mini ← most experimental (failure expected)
The same order is enforced by the integration test
(tests/integration/expert_mini_e2e/conftest.py::_RECOMMENDED_ORDER).
Run:
eulerstack presets list | grep _mini
eulerstack explain --preset configs/presets/arch_expert_progressive_stack_mini.yml
eulerstack compile --preset configs/presets/arch_expert_progressive_stack_mini.yml \
--output-dir ./out/progressive_mini
Mini 01: arch_expert_progressive_stack_mini (RECOMMENDED FIRST)
Depth-wise monotonically increasing mixer complexity. Applies the representation hierarchy hypothesis from vision (CNN → attention) and biology (convolutional early layers → transformer late layers) directly to LLM layer stacks.
Depth zones (12 layers):
- Zone 1 (layers 1–3): Hyena — cheapest (O(N log N) FFT conv), broad pattern capture
- Zone 2 (layers 4–7): Mamba2 — linear-time selective SSM, bulk token summarization
- Zone 3 (layers 8–10): RetNet — chunkwise retention, stable train/infer handoff
- Zone 4 (layers 11–12): Attention + MoE (4 × top-1) — exact recall + conditional capacity
d_model: 512
max_seq_len: 8192
layer_schedule:
- { template: hyena_dense, repeat: 3 }
- { template: mamba_dense, repeat: 4 }
- { template: retnet_dense, repeat: 3 }
- { template: attn_moe, repeat: 2 }
Strengths
- Clean zone separation — replacing a zone with identity is a meaningful ablation
- MoE concentrated in just 2 tail layers → low router-collapse risk
- ~86M total params → trains comfortably on 12–16 GB GPUs
Weaknesses
- Includes all 4 mixer families → correctness depends on all 4 implementations
- Hyena's FFT kernel offers little advantage over attention at short sequences (<256)
Basis: Poli et al. 2023 (Hyena); Gu & Dao 2023/2024 (Mamba); Sun et al. 2023 (RetNet); Fedus et al. 2022 (MoE); vision conv→attention hierarchy literature.
Mini 02: arch_expert_blackmamba_moe_mini
Mamba (linear-time, high inference throughput) serves as the bulk mixer, with sparse attention as "anchors" for exact recall. MoE is applied on the non-attention (Mamba) layers — demonstrating that conditional compute works on SSM families too.
Pattern (12 layers):
(mamba_moe × 3, attn_anchor × 1) × 3 = 12 layers
→ 9 Mamba+MoE + 3 global attention anchors
d_model: 512
max_seq_len: 8192
layer_schedule:
- { template: mamba_moe, repeat: 3 }
- { template: attn_anchor, repeat: 1 }
- { template: mamba_moe, repeat: 3 }
- { template: attn_anchor, repeat: 1 }
- { template: mamba_moe, repeat: 3 }
- { template: attn_anchor, repeat: 1 }
Strengths
- Mamba's O(N) scaling helps even at small scale with long contexts
- 3 attention anchors provide the exact recall that Mamba alone struggles with
- MoE confined to Mamba layers — structurally clean
Weaknesses (small-scale specific)
- MoE (4 × top-1) specialization is weaker than at 2B+
- Router collapse is possible in the first few hundred steps — trust z_loss
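The z_loss mentioned above penalizes large router logits, which helps keep routing numerically stable. The sketch below is in the spirit of the ST-MoE router z-loss (Zoph et al. 2022), not eulerstack's actual loss implementation; the logit values are made up.

```python
# Illustrative router z-loss: mean squared log-sum-exp of per-token
# router logits. Large logits are penalized quadratically.
import math

def router_z_loss(logits: list[list[float]]) -> float:
    total = 0.0
    for token_logits in logits:
        lse = math.log(sum(math.exp(x) for x in token_logits))
        total += lse ** 2
    return total / len(logits)

calm  = [[0.1, -0.2, 0.0, 0.1]]   # small logits -> small penalty
spiky = [[12.0, -3.0, 0.5, 1.0]]  # one huge logit -> large penalty

print(router_z_loss(calm) < router_z_loss(spiky))  # True
```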
Basis: Anthony et al. 2024 (BlackMamba, Zyphra); Pióro et al. 2024 (MoE-Mamba); Gu & Dao 2023/2024 (Mamba2).
Mini 03: arch_expert_mixtral_moe_mini
The small-scale MoE baseline. Pure attention + MoE-FFN on every layer — the
textbook sparse configuration. Useful as a control against which partial-sparse
designs (blackmamba_mini, progressive_stack_mini) should be compared.
d_model: 512
max_seq_len: 4096
layer_schedule:
- { template: attn_moe, repeat: 12 } # attn + MoE (4 × top-2) every layer
Strengths
- Cleanest illustration of "conditional compute for capacity" at small scale
- Compatible with all HF tooling (attention-based)
Weaknesses (small-scale specific)
- Mixtral was 8 × top-2 @ 46.7B — shrinking uniformly to 4 × top-2 @ ~150M means expert specialization barely happens
- Every-layer MoE adds both FLOPs and optimization difficulty
- Likely slightly worse sample efficiency than an equivalent dense attention model of the same active params
Why include it: if "MoE everywhere" underperforms "MoE only on tail", that itself is useful ablation evidence. A classic MoE baseline is educationally essential.
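The total-vs-active distinction used throughout this comparison can be sketched numerically. The dimensions below are illustrative assumptions, not the preset's exact values; the structure (a 2-matrix FFN per expert, with only top_k experts running per token) is the standard MoE-FFN accounting.

```python
# Sketch: total vs. active parameters for one MoE FFN layer.
# d_model/d_ff below are assumed illustrative dimensions.

def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, top_k: int):
    per_expert = 2 * d_model * d_ff  # up- and down-projection matrices
    total = n_experts * per_expert   # all experts exist in memory
    active = top_k * per_expert      # only top_k experts run per token
    return total, active

total, active = moe_ffn_params(d_model=512, d_ff=2048, n_experts=4, top_k=2)
print(f"total FFN params: {total / 1e6:.1f}M")   # 8.4M
print(f"active per token: {active / 1e6:.1f}M")  # 4.2M
```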
Basis: Jiang et al. 2024 (Mixtral); Fedus et al. 2022 (Switch Transformers); Shazeer et al. 2017 (MoE foundations).
Mini 04: arch_expert_dilated_longnet_mini
Small-scale long-context experiment. Temporal pyramid:
- Mamba prefix (2 layers) for O(N) bulk processing
- Sliding-window attention widening stepwise (512 → 2048 → 8192)
- Global attention + MoE tail (2 layers)
RoPE uses scaling_factor=2.0 (linear) for safe positional encoding up to 16K.
d_model: 512
max_seq_len: 16384
rope_scaling: { type: linear, factor: 2.0 }
layer_schedule:
- { template: mamba_prefix, repeat: 2 }
- { template: sw_512, repeat: 2 }
- { template: sw_2048, repeat: 3 }
- { template: sw_8192, repeat: 3 }
- { template: global_moe, repeat: 2 }
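The rope_scaling entry above can be sketched as follows: linear RoPE scaling simply divides positions by the factor before computing rotary angles, so a 16K context reuses the angle range a model trained at 8K has already seen. The head dimension and base below are common defaults assumed for illustration.

```python
# Sketch of linear RoPE position scaling (illustrative, assumed
# d_head=64 and base=10000; not the preset's exact kernel).
import math

def rope_angle(pos: int, dim_pair: int, d_head: int = 64,
               base: float = 10000.0, factor: float = 1.0) -> float:
    inv_freq = base ** (-2 * dim_pair / d_head)
    return (pos / factor) * inv_freq  # linear scaling divides position

# With factor=2.0, position 16384 maps to the same angle as 8192 unscaled:
a = rope_angle(16384, dim_pair=4, factor=2.0)
b = rope_angle(8192, dim_pair=4, factor=1.0)
print(abs(a - b) < 1e-9)  # True
```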
Strengths
- Mimics LongNet's receptive-field expansion without dilated kernels — simpler
- Enables 16K-context experiments at small scale
Weaknesses (small-scale specific)
- 16K context + sliding-window KV cache still burdens VRAM; under 24 GB drop
max_seq_len to 4K but keep the window hierarchy
- Global MoE tail is only 2 layers × 4 experts — specialization gains are
limited (kept for pipeline parity)
Basis: Ding et al. 2023 (LongNet); Lieber et al. 2024 (Jamba); Gu & Dao 2023/2024.
Mini 05: arch_expert_deepseek_moe_mini (⚠ observe-failure preset)
Weakens DeepSeekMoE's fine-grained routing (32 × top-3 @ 2B) to 16 × top-2 @ 350M. This mini is positioned as a teaching vehicle for observing fine-grained MoE failure at small scale.
d_model: 384
max_seq_len: 4096
target_params: 350_000_000 # total (includes 16 experts × MLP); ~60M active
layer_schedule:
- { template: fine_grained_moe, repeat: 12 } # 16 experts × top-2 every layer
Claimed strengths at 2B+
- Splitting experts finely gives each a sharper, more specific function
- Higher top-k increases routing flexibility for complex compositions
Expected failure modes at mini scale (you will feel them)
- Splitting d_model=384 into 16 experts leaves each one too thin to learn
a meaningful subspace
- Top-2 router entropy stays high → slow convergence (budget ≥10k steps)
- Likely worse than a dense model of equivalent active params
- DeepSeek-V2/V3's shared-expert concept is NOT in the current schema — this
is an approximation
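Why the experts end up "too thin" can be shown with simple arithmetic. The sketch assumes the dense FFN hidden size (4 × d_model) is split evenly across experts, a common fine-grained MoE convention; this is not necessarily the preset's exact parameterization.

```python
# Sketch: per-expert hidden width under fine-grained MoE, assuming
# the dense 4*d_model FFN budget is divided evenly across experts.

def expert_hidden(d_model: int, n_experts: int, dense_mult: int = 4) -> int:
    return (dense_mult * d_model) // n_experts

print(expert_hidden(d_model=2048, n_experts=16))  # 512 -> workable at ~2B
print(expert_hidden(d_model=384,  n_experts=16))  # 96  -> very thin at mini scale
```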
Why include it
1. Cheapest reproduction of fine-grained MoE failure
2. Sanity-checks the MoE routing code when experts > heads
3. Contrast against blackmamba_mini (partial sparse) shows how to AVOID
over-finely-grained routing at small scale
Basis: Dai et al. 2024 (DeepSeekMoE); DeepSeek-AI 2024 (V2/V3).
Mini 06: arch_expert_frontier_full_moe_mini (⚠ most experimental — expected to fail)
The most speculative of the 6 minis. The full-scale version combines three frontier ideas at once — the mini keeps the same structure:
- No attention at all (Mamba + Hyena + RetNet only)
- MoE on every layer (4 × top-1)
- Three non-attention mixers rotating (mamba×2, hyena×1, retnet×1) × 3
d_model: 384
max_seq_len: 8192
layer_schedule:
- { template: mamba_moe, repeat: 2 }
- { template: hyena_moe, repeat: 1 }
- { template: retnet_moe, repeat: 1 }
# ... × 3 total → 12 layers, 0 attention
Why failure is expected at mini scale (this preset is a teaching tool)
1. Attention-free = no exact-recall anchor
   - Mamba/Hyena/RetNet approximate in-context recall — none do exact token-level matching as well as even one attention layer
   - At 2B+, parameter capacity compensates; at ~120M that margin is gone
   - Expect collapse on "find the token that appeared N positions ago" tasks
2. Every-layer MoE at d_model=384
   - Router specialization needs both width AND data — both are lacking
   - Router collapse is likely despite z_loss
3. Multi-mixer debugging hell
   - A loss spike could come from any of 3 mixer families — attribution is hard
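The exact-recall task this preset is expected to fail is easy to generate synthetically. Below is a hypothetical probe generator (not part of the eulerstack test suite): the model must reproduce the token that appeared exactly `lag` positions back, which requires copying rather than approximate summarization.

```python
# Hypothetical sketch: a "token N positions ago" recall probe.
# Data format is illustrative; not eulerstack's eval harness.
import random

def make_recall_example(vocab_size: int, seq_len: int, lag: int,
                        seed: int = 0) -> tuple[list[int], int]:
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size) for _ in range(seq_len)]
    target = tokens[seq_len - lag]  # the token exactly `lag` positions back
    return tokens, target

tokens, target = make_recall_example(vocab_size=256, seq_len=128, lag=32)
print(target == tokens[-32])  # True: the answer is a verbatim copy
```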
When to use this mini anyway
- Pipeline sanity check (if this runs without crashes, MoE routing + all 3 mixer implementations are numerically healthy)
- Counterexample of what NOT to attempt at ~100M
- Smoke test for numeric-stability regressions
Basis: Anthony et al. 2024 (BlackMamba); Pióro et al. 2024 (MoE-Mamba); Poli et al. 2023 (Hyena); Sun et al. 2023 (RetNet); Fedus et al. 2022 (Switch).
Comparison Table
| Preset | ~Total | ~Active | Layers | Mixer composition | MoE | Difficulty |
|---|---|---|---|---|---|---|
| progressive_stack_mini | ~86M | ~86M | 12 | Hyena→Mamba→RetNet→Attn+MoE | 2 layers | ⭐ (easy) |
| blackmamba_moe_mini | ~156M | ~90M | 12 | Mamba+MoE + Attn anchor | 9 layers | ⭐⭐ |
| mixtral_moe_mini | ~175M | ~90M | 12 | Attn+MoE (every layer) | 12 layers | ⭐⭐ |
| dilated_longnet_mini | ~83M | ~75M | 12 | Mamba + SW-attn pyramid + Global+MoE | 2 layers | ⭐⭐⭐ |
| deepseek_moe_mini | ~357M | ~60M | 12 | Attn + FG-MoE (16×top-2) | 12 layers | ⭐⭐⭐⭐ |
| frontier_full_moe_mini | ~106M | ~60M | 12 | Mamba/Hyena/RetNet + MoE (all) | 12 layers | ⭐⭐⭐⭐⭐ |
Integration Test
RUN_EXPERT_MINI_E2E=1 python -m pytest tests/integration/expert_mini_e2e/ \
--mini-presets=all -v -s
- Gate: RUN_EXPERT_MINI_E2E=1
- Defaults: 1000 steps, batch_size=1, max_tokens=512, dolma_10k
- Per-model recommended lr: conftest.py::RECOMMENDED_LR (lower for riskier minis)
- Log directory: outputs/expert_mini_e2e/
Details: docs/architectures/tutorials/arch_e2e_guide.md
Next Steps
- 07_arch_walkthrough.md: ~2B full-scale arch_expert_* comparison
- Mixer overview: principles and convergence behavior of each mixer
- Write your own mini: following the design principles (d_model 384–512, ~12 layers, weakened MoE), you can port any idea from 07_arch_walkthrough.md into a mini.