7. Skill-Level Architecture Walkthrough
This tutorial walks through all 20 arch_*.yml presets from beginner to expert
so you can see modern LLM architecture evolution as a sequence of concrete design
choices. Every preset targets roughly 1–2B parameters, so the differences between presets are architectural choices rather than size effects.
Why multiple presets per level
Each level (beginner / intermediate / advanced / expert) contains competing approaches to the same problem. To understand why a given design is good, you have to compare it against its alternatives at the same level. For example:
- Intermediate: Mistral 1:3 sliding vs Gemma 2 1:1 alternating vs Qwen RoPE scaling → three answers to "how to improve attention".
- Advanced: Jamba 3:1 vs Samba 1:1 vs Pure RetNet → three answers to "how to escape O(N²)".
Running and comparing presets within a level is where the learning happens.
Run commands:
```bash
eulerstack presets list
eulerstack explain --preset configs/presets/arch_beginner_gpt2.yml
eulerstack validate --preset configs/presets/arch_beginner_gpt2.yml --report
eulerstack compile --preset configs/presets/arch_beginner_gpt2.yml --output-dir ./out/gpt2
```
Level 1: BEGINNER — "Understand the starting point"
1-A. arch_beginner_gpt2 — Classic Transformer (2019 baseline)
The GPT-2-era choices, preserved:
- Multi-Head Attention (no GQA; n_kv_heads == n_heads)
- LayerNorm, post-norm (normalize after the residual add)
- MLP + GeLU (no gating)
- qkv_bias: true
Why start here? To understand what modern LLMs changed, you need to know what came before.
Refs: Vaswani et al. 2017; Radford et al. 2019 (GPT-2); Ba et al. 2016.
1-B. arch_beginner_llama — Modern baseline (2023 standard)
Llama 2/3 defaults:
- Grouped-Query Attention (4:1)
- RMSNorm, pre-norm
- Gated MLP + SwiGLU
- qkv_bias: false
Takeaway (1-A vs 1-B):
| Axis | Classic (GPT-2) | Modern (Llama) | Benefit |
|---|---|---|---|
| Norm type/position | LayerNorm, post | RMSNorm, pre | Training stability |
| FFN | MLP + GeLU | Gated MLP + SwiGLU | Expressiveness |
| Head sharing | MHA | GQA 4:1 | 75% KV cache savings |
| Bias | Yes | No | Fewer params |
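The 75% KV-cache figure falls straight out of head-count arithmetic. A minimal sketch (layer count, head count, and head_dim are illustrative, not preset values):

```python
# Back-of-envelope KV-cache comparison for MHA vs. 4:1 GQA.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Two cached tensors (K and V) per layer, one entry per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

mha = kv_cache_bytes(n_layers=24, n_kv_heads=16, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=24, n_kv_heads=4, head_dim=128, seq_len=8192)
print(f"GQA cache is {gqa / mha:.0%} of MHA")  # 4:1 sharing keeps 25%, i.e. 75% savings
```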
Level 2: INTERMEDIATE — "Push attention harder without replacing it"
Three competing ways to scale attention cheaply:
2-A. arch_intermediate_mistral — Sparse global + dense local (1:3)
One layer in four is global; the rest use a 4096-token sliding window. This yields roughly 4× KV-cache savings while the periodic global layers preserve full-context reasoning. Refs: Jiang et al. 2023 (Mistral 7B).
2-B. arch_intermediate_gemma2 — Alternating global:local (1:1)
Every other layer is global. Doubles global-layer density vs Mistral — more frequent full-context reasoning, less aggressive KV savings. Refs: Team Gemma, 2024 (Gemma 2, Google DeepMind).
2-C. arch_intermediate_qwen_longctx — RoPE scaling for long context
Leave attention structure alone; extend positional encoding instead
(rope_theta=1e6, linear scaling factor 4, max_seq_len=32K).
Refs: Chen et al. 2023 (Position Interpolation); Qwen 2/3 reports.
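Linear RoPE scaling can be sketched in a few lines: positions are divided by the scaling factor, so the extended context reuses the angle range the model already saw in training. The function below is illustrative (head_dim is assumed; theta matches the preset's rope_theta=1e6):

```python
import numpy as np

# Linear RoPE scaling (position interpolation): divide positions by the factor
# so a 32K context reuses the rotary angles trained at 8K.
def rope_angles(pos, head_dim=128, theta=1e6, factor=1.0):
    inv_freq = theta ** (-np.arange(0, head_dim, 2) / head_dim)
    return (pos / factor) * inv_freq

# Scaled position 32768 lands on exactly the angles of unscaled position 8192.
assert np.allclose(rope_angles(32768, factor=4.0), rope_angles(8192))
```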
Takeaway:
| Approach | Tradeoff | Best when |
|---|---|---|
| Mistral (1:3) | Most KV savings | Memory-bound inference |
| Gemma 2 (1:1) | More global reasoning | Quality-bound tasks |
| Qwen (RoPE scaling) | No structural change | Extending existing weights to long context |
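The memory tradeoff between the first two layouts can be made concrete with a rough KV-budget count, assuming 32 layers and the 4096-token window from the presets (illustrative arithmetic, not measured memory):

```python
# Tokens held in the KV cache per layout at 32K context.
def kv_tokens_cached(n_layers, n_global, window, seq_len):
    # Global layers cache the full sequence; sliding layers cap at the window.
    return n_global * seq_len + (n_layers - n_global) * min(window, seq_len)

seq = 32768
full = kv_tokens_cached(32, 32, seq, seq)      # baseline: every layer global
mistral = kv_tokens_cached(32, 8, 4096, seq)   # 1-in-4 global
gemma = kv_tokens_cached(32, 16, 4096, seq)    # 1-in-2 global
# Mistral-style layout caches ~34% of the full-attention KV; Gemma-style ~56%.
print(f"Mistral: {mistral / full:.0%}, Gemma 2: {gemma / full:.0%}")
```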
Level 3: ADVANCED — "Replace attention, partially or fully"
3-A. arch_advanced_jamba — Mamba + Attention 3:1
75% Mamba2 SSM (O(N) bulk processing), 25% attention (retrieval anchor). Refs: Lieber et al. 2024 (Jamba, AI21).
3-B. arch_advanced_samba — Mamba + Sliding Window 1:1
Alternate Mamba with sliding-window attention. Keeps attention in every layer but makes each one cheap. Refs: Ren et al. 2024 (Samba, Microsoft).
3-C. arch_advanced_retnet — Pure RetNet (attention-free)
All layers use retention (exponential decay replaces softmax). Three mathematically equivalent modes (parallel / recurrent / chunkwise) — train like Transformer, infer like RNN. Refs: Sun et al. 2023 (RetNet, Microsoft).
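The parallel/recurrent equivalence is easy to verify numerically. A minimal single-head retention sketch (no scaling or normalization, illustrative shapes only):

```python
import numpy as np

# Verify that parallel-mode and recurrent-mode retention give identical outputs.
rng = np.random.default_rng(0)
N, d = 6, 4
Q, K, V = rng.normal(size=(3, N, d))
gamma = 0.9

# Parallel mode: decay mask D[n, m] = gamma**(n-m) for n >= m, else 0.
n, m = np.arange(N)[:, None], np.arange(N)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)
out_parallel = ((Q @ K.T) * D) @ V

# Recurrent mode: a d x d state accumulates k^T v with exponential decay.
S = np.zeros((d, d))
out_recurrent = np.zeros((N, d))
for t in range(N):
    S = gamma * S + np.outer(K[t], V[t])
    out_recurrent[t] = Q[t] @ S

assert np.allclose(out_parallel, out_recurrent)  # train-mode == infer-mode
```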
Takeaway:
| Approach | Attention % | Strength | Weakness |
|---|---|---|---|
| Jamba (3:1) | 25% global | Throughput + recall | Three consecutive Mamba layers weaken exact recall |
| Samba (1:1) | 50% local | Local attention in every layer, bounded KV | Attention cost remains in every layer, even if bounded |
| Pure RetNet | 0% | Most consistent train/infer | Pure variant lags hybrids on quality |
These are three different answers to "how do we escape O(N²)?" — the right answer depends on whether you prioritize throughput, recall, or training simplicity.
Level 4: EXPERT — "Research-grade combinations + MoE × Mixer cross design"
Research-exploration territory. This level crosses two independent axes:
- Axis A — Mixer: attention / mamba / retnet / hyena / multi-mixer
- Axis B — MoE strategy: none / 1-in-4 dense+MoE / every-layer MoE / fine-grained
These axes are orthogonal: the same MoE strategy can be applied to different mixers and vice versa. The 9 expert presets are representative points in this 2D space.
MoE × Mixer cross summary
| Preset | Mixer | MoE strategy | Position |
|---|---|---|---|
| arch_expert_research | attn + mamba + retnet + hyena | 8 exp / top-2, attn layers only | Diversity × partial MoE |
| arch_expert_mixtral_moe | attn only | 8 exp / top-2, every layer | Single mixer × dense MoE |
| arch_expert_striped_hyena | hyena + attn 4:1 | None (dense FFN) | Sub-O(N²) × dense |
| arch_expert_blackmamba_moe | mamba (75%) + attn (25%) | 8 exp / top-2, mamba layers only | MoE on a non-attention mixer |
| arch_expert_deepseek_moe | attn only | 32 exp / top-3 (fine-grained), every layer | Single mixer × advanced MoE |
| arch_expert_retnet_moe | retnet only | 8 exp / top-2, 1-in-4 | RetNet × MoE (predicted, no paper yet) |
| arch_expert_frontier_full_moe | mamba + hyena + retnet (no attn) | 8 exp / top-2, every layer | Most speculative |
| arch_expert_progressive_stack | hyena→mamba→retnet→attn (depth-wise) | 8 exp / top-2, only the last 4 layers | Hierarchical progression (no paper) |
| arch_expert_dilated_longnet | mamba + sw(1K→4K→16K) + global (pyramid) | 8 exp / top-2, only the last 8 global layers | Temporal pyramid (no paper) |
4-A. arch_expert_research — 4 mixers + MoE, 3-phase
All four mixers (attention/mamba/retnet/hyena) plus MoE on attention layers. Three-phase schedule: bulk → reasoning → refinement. Refs: Arora et al. 2024; Jamba-1.5.
4-B. arch_expert_mixtral_moe — Pure attention + every-layer MoE
No other mixer types — MoE density maxed out. Isolates conditional-compute effect from mixer diversity.
vs llm_*_moe: the production variants use 1-in-4 MoE to fit realistic serving budgets; this expert preset applies MoE at every layer for research isolation.
Refs: Jiang et al. 2024 (Mixtral 8x7B); Shazeer et al. 2017; Zoph et al. 2022.
4-C. arch_expert_striped_hyena — Hyena + Attention, 128K context
Hyena long-convolution striped with attention anchors at 128K context. No MoE — focuses on Hyena's sub-O(N²) mixing efficiency alone. Refs: Poli et al. 2023; Together AI 2023 (StripedHyena); Nguyen et al. 2023.
4-D. arch_expert_blackmamba_moe — Mamba + MoE (MoE on a non-attention mixer)
Demonstrates that MoE is not attention-exclusive. Pattern: 3 × (mamba + MoE FFN) + 1 × (attention + dense FFN) repeated 8 times. MoE lives on the O(N) mixer, yielding better expected throughput than Mixtral-style all-attention MoE.
Key insight: MoE is a choice about FFN sparsity, orthogonal to mixer choice. Most MoE papers put it on attention; this preset shows it works on Mamba too.
```yaml
layer_templates:
  mamba_moe:
    mixer:
      type: mamba
      mamba: { variant: mamba2, d_state: 128 }
    ffn:
      type: moe
      moe: { experts: 8, top_k: 2 }
    state: { ssm_state: true }
```
Refs: Zyphra 2024 (BlackMamba); Pióro et al. 2024 (MoE-Mamba).
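The 8-expert / top-2 routing itself is tiny. A sketch of softmax top-k gating (the logits are made up; a real MoE layer would dispatch the token to the chosen expert MLPs rather than just return indices):

```python
import numpy as np

# Softmax top-k gating as used in the 8-expert / top-2 presets.
def top_k_route(logits, k=2):
    idx = np.argsort(logits)[-k:]            # indices of the k largest logits
    w = np.exp(logits[idx] - logits[idx].max())
    return idx, w / w.sum()                  # renormalized gate weights

logits = np.array([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
experts, weights = top_k_route(logits)
# Only 2 of 8 expert FFNs run for this token: 25% of FFN params are active.
print(experts, weights)
```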
4-E. arch_expert_deepseek_moe — Fine-Grained MoE (32 experts, top-3)
Most advanced published MoE design. Replaces Mixtral-style (8 experts × top-2) with 32 small experts × top-3 — finer-grained specialization at the same active budget.
```yaml
ffn:
  type: moe
  moe:
    experts: 32   # fine-grained
    top_k: 3      # each token → 3 experts
    router: softmax
    z_loss: 0.001
```
Mixtral vs DeepSeek:
| Aspect | Mixtral | DeepSeek |
|---|---|---|
| Experts | 8 large | 32 small |
| top_k | 2 | 3 |
| Specialization | Coarse | Fine |
| Router complexity | Lower | Higher |
Schema note: DeepSeek-V2/V3 also uses "shared experts" (always-on); EulerStack's v1 schema has no shared-expert field, so this preset approximates by raising top_k.
Refs: DeepSeek-AI 2024 (V2/V3); Dai et al. 2024 (DeepSeekMoE).
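One way to see why finer grain helps: routing combinatorics. At a similar active budget, the number of selectable expert subsets explodes, which is the DeepSeekMoE specialization argument in miniature:

```python
from math import comb

# Count the expert subsets each router can express.
mixtral_combos = comb(8, 2)     # 8 experts, top-2: 28 subsets
deepseek_combos = comb(32, 3)   # 32 experts, top-3: 4960 subsets
print(mixtral_combos, deepseek_combos)
```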
4-F. arch_expert_retnet_moe — RetNet × MoE (predicted, no paper)
This preset is an example of the new combinations EulerStack enables. No published RetNet+MoE model exists yet, but it's a logical prediction:
- Sun 2023 (RetNet): attention alternative with 3 equivalent modes.
- Pióro 2024 + Zyphra 2024 (MoE-Mamba, BlackMamba): attention-free mixer + MoE works.
- Therefore RetNet + MoE should also work (same reasoning).
Expected benefit: RetNet's chunkwise (parallel train) + recurrent (O(1) infer) composes with MoE's conditional compute for maximum long-context train/infer efficiency.
Why this is educational: Research is about filling such "logical gaps." EulerStack lets you explore them without writing any code — just edit the YAML.
Refs (composition inference): Sun 2023 + Zyphra 2024 + Pióro 2024.
4-G. arch_expert_frontier_full_moe — Attention-free, all-MoE, multi-mixer (most speculative)
The most speculative preset in EulerStack. No published paper matches this exactly, but the design follows a reasonable extrapolation of three trends:
- MoE expands capacity at fixed active FLOPs (Mixtral, DeepSeek-V3).
- Non-attention mixers match Transformer quality at O(N) cost (Mamba, Jamba, BlackMamba).
- Diverse mixers beat homogeneous stacks (Arora 2024, Jamba-1.5).
Combining all three: remove attention entirely + 3-way mixer ensemble (mamba+hyena+retnet) + MoE on every layer.
Schedule (8 groups × 4 layers = 32 layers):
```yaml
# group = [mamba_moe × 2, hyena_moe × 1, retnet_moe × 1]
# → 16 mamba + 8 hyena + 8 retnet, 0 attention
```
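Expanding the group pattern confirms the stated mixer distribution (a sketch; the group composition is taken from the schedule comment above):

```python
from collections import Counter

# Expand the 8-group pattern into a flat 32-layer list and count mixers.
group = ["mamba", "mamba", "hyena", "retnet"]
layers = group * 8
counts = Counter(layers)
print(counts)  # 16 mamba, 8 hyena, 8 retnet, and no attention anywhere
```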
Expected behavior:
- Training instability: MoE and non-attention mixers both have tricky gradients; expect more aggressive z_loss / capacity_factor / lr tuning.
- Weak recall: with no attention anchor, exact-match tasks will likely lag attention-based models.
- Excellent long-context scaling: all mixers are sub-quadratic, so contexts of hundreds of thousands of tokens become trainable.
Purpose: frontier exploration. Not designed to beat baselines — designed to answer "does this combination even work?" as a research starting point.
Refs (composition prediction, no single paper): BlackMamba + MoE-Mamba + Hyena + RetNet + Arora 2024.
4-H. arch_expert_progressive_stack — Hierarchical progression (no paper, predicted)
"Strictly increase mixer complexity with depth" — apply to LLMs the hierarchical division of labor seen in vision (early CNN → late attention) and some biology models. Early layers do cheap broad pattern capture; late layers do expensive precise reasoning.
Schedule (32 layers, monotonically more expensive mixers by depth):
| Zone | Layers | Mixer | Cost / role |
|---|---|---|---|
| Zone 1 | 1–8 | Hyena (dense FFN) | Cheapest (O(N log N) FFT conv) — broad structural patterns |
| Zone 2 | 9–20 | Mamba2 (dense FFN) | Moderate (O(N) selective SSM) — sequential summary of bulk tokens |
| Zone 3 | 21–28 | RetNet (dense FFN) | Moderate+ (O(N) chunkwise retention) — train/infer handoff |
| Zone 4 | 29–32 | Attention + MoE | Most expensive (O(N²) + conditional compute) — exact recall + capacity |
```yaml
layer_schedule:
  - { template: hyena_dense, repeat: 8 }
  - { template: mamba_dense, repeat: 12 }
  - { template: retnet_dense, repeat: 8 }
  - { template: attn_moe, repeat: 4 }
```
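A compiler would expand this schedule into a flat per-layer list; doing the expansion by hand verifies the zone boundaries in the table above (the table is 1-indexed):

```python
# Expand layer_schedule into a flat 32-layer list and check the zones.
schedule = [("hyena_dense", 8), ("mamba_dense", 12), ("retnet_dense", 8), ("attn_moe", 4)]
layers = [t for t, repeat in schedule for _ in range(repeat)]

assert len(layers) == 32
assert layers[:8] == ["hyena_dense"] * 8       # Zone 1: layers 1-8
assert layers[8:20] == ["mamba_dense"] * 12    # Zone 2: layers 9-20
assert layers[20:28] == ["retnet_dense"] * 8   # Zone 3: layers 21-28
assert layers[28:] == ["attn_moe"] * 4         # Zone 4: layers 29-32
```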
Expected effect: Hyena quickly sketches coarse context via FFT, Mamba summarizes the narrative flow, RetNet refines in chunks, and the final 4 Attn+MoE layers handle exact recall and conditional capacity expansion at the output head.
Why this is educational: the strict "cheap → expensive" ordering is deliberate and untested. A competitor ordering (reversed, interleaved, etc.) may win instead — and with EulerStack you can swap the YAML and measure both, at the same parameter budget.
Refs (composition, no single paper): Poli 2023 (Hyena) + Gu 2024 (Mamba2) + Sun 2023 (RetNet) + Fedus 2022 (MoE) + vision-literature conv→attention hierarchies.
4-I. arch_expert_dilated_longnet — Temporal pyramid (no paper, predicted)
"Grow receptive field exponentially by layer in a pyramid" — approximate Longnet's (Ding et al. 2023, MSR) dilated attention idea with layer-wise window widening, no custom dilated-attention kernel required.
Schedule (32 layers, 5-zone pyramid, max_seq_len 64K):
| Zone | Layers | Mixer | Receptive field |
|---|---|---|---|
| Zone 1 | 1–4 | Mamba2 (dense FFN) | O(N) bulk pre-processing |
| Zone 2 | 5–8 | Sliding window 1,024 | Grammar unit (tight local) |
| Zone 3 | 9–16 | Sliding window 4,096 | Paragraph unit |
| Zone 4 | 17–24 | Sliding window 16,384 | Document unit |
| Zone 5 | 25–32 | Global attention + MoE | Global recall + conditional capacity |
```yaml
layer_schedule:
  - { template: mamba_prefix, repeat: 4 }
  - { template: sw_1024, repeat: 4 }
  - { template: sw_4096, repeat: 8 }
  - { template: sw_16384, repeat: 8 }
  - { template: global_moe, repeat: 8 }
embedding:
  rope_theta: 1000000.0
  rope_scaling: { type: linear, factor: 4.0 }
```
Effective receptive field grows ~4× per zone (1K → 4K → 16K → global), yielding the Longnet-style temporal pyramid without needing a specialized dilation kernel. The Mamba prefix pre-processes raw tokens in O(N), and the MoE tail adds conditional capacity at the point where global context is finally resolved.
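The ~4× growth claim is just the window ratios (a sketch; "global" is treated as capped at the preset's 64K max_seq_len):

```python
# Per-zone window sizes of the pyramid; zone 1 is the O(N) Mamba prefix.
max_seq_len = 64 * 1024
windows = [1024, 4096, 16384, max_seq_len]   # zones 2-5
ratios = [b // a for a, b in zip(windows, windows[1:])]
print(ratios)  # the receptive field grows 4x at each zone boundary
```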
Why this is predictable: (a) Longnet used dilated attention to scale to 1B-token context, (b) Mistral used sliding windows to cheapen attention, (c) Jamba proposed a Mamba prefix with attention anchors, (d) Fedus/Shazeer showed MoE works. This preset composes the four ideas via the simplest possible mechanism: per-zone window growth.
Refs (composition prediction): Ding 2023 (Longnet) + Beltagy 2020 (Longformer) + Jiang 2023 (Mistral) + Lieber 2024 (Jamba) + Fedus 2022 (Switch).
Takeaway (4-A through 4-I)
| Experiment | Hypothesis isolated | Research question |
|---|---|---|
| research (multi-mixer + partial MoE) | "diversity > single-mixer optimization" | Which mixer-ratio combo is best? |
| mixtral_moe (all-MoE attn) | "maximize conditional compute" | Active/total ratio → quality? |
| striped_hyena (128K) | "Hyena + little attn for extreme context" | Recall degradation without attention? |
| blackmamba_moe (mamba + MoE) | "MoE is not attention-exclusive" | Throughput gain from MoE on O(N) mixer? |
| deepseek_moe (fine-grained) | "many small experts > few large" | Specialization vs router cost tradeoff? |
| retnet_moe (predicted) | "RetNet also takes MoE" | Chunkwise + conditional compute interplay? |
| frontier_full_moe (attn-free) | "MoE + diversity without attention" | Training stability and recall limits? |
| progressive_stack (hierarchical) | "depth-wise monotonic mixer cost wins" | cheap→expensive vs reversed / interleaved? |
| dilated_longnet (temporal pyramid) | "per-zone window growth approximates dilation" | Zone count / ratios effect on quality? |
Big picture: 4-A through 4-C reproduce/slightly extend existing research. 4-D through 4-I are new combinations that EulerStack makes trivial to try — including the depth-axis (4-H) and receptive-field-axis (4-I) experiments that do not have published models. You can systematically explore the MoE × mixer × depth-structure 3D space by editing YAML.
Run the Full Comparison
A script compares all 20 arch_ presets at once — parameter counts, layer layouts,
and mixer distributions in a single table.
```bash
python examples/03_architecture_evolution.py
```
Next Steps
- 02 Use presets — production llm_* presets
- 04 Compile and explain — HF model export
- 05 Prepare data — tokenization pipeline
- 06 Sanity train — confirm models actually learn
- 08 Expert mini walkthrough — small-scale expert presets
- 09 v1 new primitives walkthrough — YAML syntax for arch_advanced_mla, arch_advanced_mod, arch_expert_reasoning_r1, arch_expert_titans_memory, arch_expert_dual_stream
- Mixer deep dives: attention (+ MLA), mamba, retnet, hyena