7. Skill-Level Architecture Walkthrough
This tutorial walks through all 20 arch_*.yml presets from beginner to expert
so you can see modern LLM architecture evolution as a sequence of concrete design
choices. Every preset targets roughly 1–2B parameters, so the differences between presets are architectural choices rather than size effects.
Why multiple presets per level
Each level (beginner / intermediate / advanced / expert) contains competing approaches to the same problem. To understand why a given design is good, you have to compare it against its alternatives at the same level. For example:
- Intermediate: Mistral 1:3 sliding vs Gemma 2 1:1 alternating vs Qwen RoPE scaling → three answers to "how to improve attention".
- Advanced: Jamba 3:1 vs Samba 1:1 vs Pure RetNet → three answers to "how to escape O(N²)".
Running and comparing presets within a level is where the learning happens.
Run commands:
```bash
eulerstack presets list
eulerstack explain --preset configs/presets/arch_beginner_gpt2.yml
eulerstack validate --preset configs/presets/arch_beginner_gpt2.yml --report
eulerstack compile --preset configs/presets/arch_beginner_gpt2.yml --output-dir ./out/gpt2
```
Level 1: BEGINNER — "Understand the starting point"
1-A. arch_beginner_gpt2 — Classic Transformer (2019 baseline)
The GPT-2-era choices, preserved:
- Multi-Head Attention (no GQA; n_kv_heads == n_heads)
- LayerNorm, post-norm (normalize after the residual add)
- MLP + GeLU (no gating)
- qkv_bias: true
Why start here? To understand what modern LLMs changed, you need to know what came before.
Refs: Vaswani et al. 2017; Radford et al. 2019 (GPT-2); Ba et al. 2016.
1-B. arch_beginner_llama — Modern baseline (2023 standard)
Llama 2/3 defaults:
- Grouped-Query Attention (4:1)
- RMSNorm, pre-norm
- Gated MLP + SwiGLU
- qkv_bias: false
Takeaway (1-A vs 1-B):
| Axis | Classic (GPT-2) | Modern (Llama) | Benefit |
|---|---|---|---|
| Norm type/position | LayerNorm, post | RMSNorm, pre | Training stability |
| FFN | MLP + GeLU | Gated MLP + SwiGLU | Expressiveness |
| Head sharing | MHA | GQA 4:1 | 75% KV cache savings |
| Bias | Yes | No | Fewer params |
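The 75% KV-cache figure falls straight out of head-count arithmetic. A minimal sketch (layer count, head count, and head_dim are illustrative, not preset values):

```python
# Back-of-envelope KV-cache comparison for MHA vs. 4:1 GQA.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Two cached tensors (K and V) per layer, one entry per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

mha = kv_cache_bytes(n_layers=24, n_kv_heads=16, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=24, n_kv_heads=4, head_dim=128, seq_len=8192)
print(f"GQA cache is {gqa / mha:.0%} of MHA")  # 4:1 sharing keeps 25%, i.e. 75% savings
```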
Level 2: INTERMEDIATE — "Push attention harder without replacing it"
Three competing ways to scale attention cheaply:
2-A. arch_intermediate_mistral — Sparse global + dense local (1:3)
One layer in four is global; the rest use a 4096-token sliding window. This yields roughly 4× KV-cache savings while the periodic global layers preserve full-context reasoning. Refs: Jiang et al. 2023 (Mistral 7B).
2-B. arch_intermediate_gemma2 — Alternating global:local (1:1)
Every other layer is global. Doubles global-layer density vs Mistral — more frequent full-context reasoning, less aggressive KV savings. Refs: Team Gemma, 2024 (Gemma 2, Google DeepMind).
2-C. arch_intermediate_qwen_longctx — RoPE scaling for long context
Leave attention structure alone; extend positional encoding instead
(rope_theta=1e6, linear scaling factor 4, max_seq_len=32K).
Refs: Chen et al. 2023 (Position Interpolation); Qwen 2/3 reports.
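Linear RoPE scaling can be sketched in a few lines: positions are divided by the scaling factor, so the extended context reuses the angle range the model already saw in training. The function below is illustrative (head_dim is assumed; theta matches the preset's rope_theta=1e6):

```python
import numpy as np

# Linear RoPE scaling (position interpolation): divide positions by the factor
# so a 32K context reuses the rotary angles trained at 8K.
def rope_angles(pos, head_dim=128, theta=1e6, factor=1.0):
    inv_freq = theta ** (-np.arange(0, head_dim, 2) / head_dim)
    return (pos / factor) * inv_freq

# Scaled position 32768 lands on exactly the angles of unscaled position 8192.
assert np.allclose(rope_angles(32768, factor=4.0), rope_angles(8192))
```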
Takeaway:
| Approach | Tradeoff | Best when |
|---|---|---|
| Mistral (1:3) | Most KV savings | Memory-bound inference |
| Gemma 2 (1:1) | More global reasoning | Quality-bound tasks |
| Qwen (RoPE scaling) | No structural change | Extending existing weights to long context |
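The memory tradeoff between the first two layouts can be made concrete with a rough KV-budget count, assuming 32 layers and the 4096-token window from the presets (illustrative arithmetic, not measured memory):

```python
# Tokens held in the KV cache per layout at 32K context.
def kv_tokens_cached(n_layers, n_global, window, seq_len):
    # Global layers cache the full sequence; sliding layers cap at the window.
    return n_global * seq_len + (n_layers - n_global) * min(window, seq_len)

seq = 32768
full = kv_tokens_cached(32, 32, seq, seq)      # baseline: every layer global
mistral = kv_tokens_cached(32, 8, 4096, seq)   # 1-in-4 global
gemma = kv_tokens_cached(32, 16, 4096, seq)    # 1-in-2 global
# Mistral-style layout caches ~34% of the full-attention KV; Gemma-style ~56%.
print(f"Mistral: {mistral / full:.0%}, Gemma 2: {gemma / full:.0%}")
```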
Level 3: ADVANCED — "Replace attention, partially or fully"
3-A. arch_advanced_jamba — Mamba + Attention 3:1
75% Mamba2 SSM (O(N) bulk processing), 25% attention (retrieval anchor). Refs: Lieber et al. 2024 (Jamba, AI21).
3-B. arch_advanced_samba — Mamba + Sliding Window 1:1
Alternate Mamba with sliding-window attention. Keeps attention in every layer but makes each one cheap. Refs: Ren et al. 2024 (Samba, Microsoft).
3-C. arch_advanced_retnet — Pure RetNet (attention-free)
All layers use retention (exponential decay replaces softmax). Three mathematically equivalent modes (parallel / recurrent / chunkwise) — train like Transformer, infer like RNN. Refs: Sun et al. 2023 (RetNet, Microsoft).
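The parallel/recurrent equivalence is easy to verify numerically. A minimal single-head retention sketch (no scaling or normalization, illustrative shapes only):

```python
import numpy as np

# Verify that parallel-mode and recurrent-mode retention give identical outputs.
rng = np.random.default_rng(0)
N, d = 6, 4
Q, K, V = rng.normal(size=(3, N, d))
gamma = 0.9

# Parallel mode: decay mask D[n, m] = gamma**(n-m) for n >= m, else 0.
n, m = np.arange(N)[:, None], np.arange(N)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)
out_parallel = ((Q @ K.T) * D) @ V

# Recurrent mode: a d x d state accumulates k^T v with exponential decay.
S = np.zeros((d, d))
out_recurrent = np.zeros((N, d))
for t in range(N):
    S = gamma * S + np.outer(K[t], V[t])
    out_recurrent[t] = Q[t] @ S

assert np.allclose(out_parallel, out_recurrent)  # train-mode == infer-mode
```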
Takeaway:
| Approach | Attention % | Strength | Weakness |
|---|---|---|---|
| Jamba (3:1) | 25% global | Throughput + recall | Three consecutive Mamba layers weaken exact recall |
| Samba (1:1) | 50% local | Local attention in every layer, bounded KV | Attention cost remains in every layer, even if bounded |
| Pure RetNet | 0% | Most consistent train/infer | Pure variant lags hybrids on quality |
These are three different answers to "how do we escape O(N²)?" — the right answer depends on whether you prioritize throughput, recall, or training simplicity.
Level 4: EXPERT — "Research-grade combinations + MoE × Mixer cross design"
Research-exploration territory. This level crosses two independent axes:
- Axis A — Mixer: attention / mamba / retnet / hyena / multi-mixer
- Axis B — MoE strategy: none / 1-in-4 dense+MoE / every-layer MoE / fine-grained
These axes are orthogonal: the same MoE strategy can be applied to different mixers and vice versa. The 9 expert presets are representative points in this 2D space.
MoE × Mixer cross summary
| Preset | Mixer | MoE strategy | Position |
|---|---|---|---|
| arch_expert_research | attn + mamba + retnet + hyena | 8 exp / top-2, attn layers only | Diversity × partial MoE |
| arch_expert_mixtral_moe | attn only | 8 exp / top-2, every layer | Single mixer × dense MoE |
| arch_expert_striped_hyena | hyena + attn 4:1 | None (dense FFN) | Sub-O(N²) × dense |
| arch_expert_blackmamba_moe | mamba (75%) + attn (25%) | 8 exp / top-2, mamba layers only | MoE on a non-attention mixer |
| arch_expert_deepseek_moe | attn only | 32 exp / top-3 (fine-grained), every layer | Single mixer × advanced MoE |
| arch_expert_retnet_moe | retnet only | 8 exp / top-2, 1-in-4 | RetNet × MoE (predicted, no paper yet) |
| arch_expert_frontier_full_moe | mamba + hyena + retnet (no attn) | 8 exp / top-2, every layer | Most speculative |
| arch_expert_progressive_stack | hyena→mamba→retnet→attn (depth-wise) | 8 exp / top-2, only the last 4 layers | Hierarchical progression (no paper) |
| arch_expert_dilated_longnet | mamba + sw(1K→4K→16K) + global (pyramid) | 8 exp / top-2, only the last 8 global layers | Temporal pyramid (no paper) |
4-A. arch_expert_research — 4 mixers + MoE, 3-phase
All four mixers (attention/mamba/retnet/hyena) plus MoE on attention layers. Three-phase schedule: bulk → reasoning → refinement. Refs: Arora et al. 2024; Jamba-1.5.
4-B. arch_expert_mixtral_moe — Pure attention + every-layer MoE
No other mixer types — MoE density maxed out. Isolates conditional-compute effect from mixer diversity.
vs llm_*_moe: the production variants use 1-in-4 MoE to fit realistic serving budgets; this expert preset applies MoE at every layer for research isolation.
Refs: Jiang et al. 2024 (Mixtral 8x7B); Shazeer et al. 2017; Zoph et al. 2022.
4-C. arch_expert_striped_hyena — Hyena + Attention, 128K context
Hyena long-convolution striped with attention anchors at 128K context. No MoE — focuses on Hyena's sub-O(N²) mixing efficiency alone. Refs: Poli et al. 2023; Together AI 2023 (StripedHyena); Nguyen et al. 2023.
4-D. arch_expert_blackmamba_moe — Mamba + MoE (MoE on a non-attention mixer)
Demonstrates that MoE is not attention-exclusive. Pattern: 3 × (mamba + MoE FFN) + 1 × (attention + dense FFN) repeated 8 times. MoE lives on the O(N) mixer, yielding better expected throughput than Mixtral-style all-attention MoE.
Key insight: MoE is a choice about FFN sparsity, orthogonal to mixer choice. Most MoE papers put it on attention; this preset shows it works on Mamba too.
```yaml
layer_templates:
  mamba_moe:
    mixer:
      type: mamba
      mamba: { variant: mamba2, d_state: 128 }
    ffn:
      type: moe
      moe: { experts: 8, top_k: 2 }
    state: { ssm_state: true }
```
Refs: Zyphra 2024 (BlackMamba); Pióro et al. 2024 (MoE-Mamba).
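The 8-expert / top-2 routing itself is tiny. A sketch of softmax top-k gating (the logits are made up; a real MoE layer would dispatch the token to the chosen expert MLPs rather than just return indices):

```python
import numpy as np

# Softmax top-k gating as used in the 8-expert / top-2 presets.
def top_k_route(logits, k=2):
    idx = np.argsort(logits)[-k:]            # indices of the k largest logits
    w = np.exp(logits[idx] - logits[idx].max())
    return idx, w / w.sum()                  # renormalized gate weights

logits = np.array([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
experts, weights = top_k_route(logits)
# Only 2 of 8 expert FFNs run for this token: 25% of FFN params are active.
print(experts, weights)
```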
4-E. arch_expert_deepseek_moe — Fine-Grained MoE (32 experts, top-3)
Most advanced published MoE design. Replaces Mixtral-style (8 experts × top-2) with 32 small experts × top-3 — finer-grained specialization at the same active budget.
```yaml
ffn:
  type: moe
  moe:
    experts: 32   # fine-grained
    top_k: 3      # each token → 3 experts
    router: softmax
    z_loss: 0.001
```
Mixtral vs DeepSeek:
| Aspect | Mixtral | DeepSeek |
|---|---|---|
| Experts | 8 large | 32 small |
| top_k | 2 | 3 |
| Specialization | Coarse | Fine |
| Router complexity | Lower | Higher |
Schema note: DeepSeek-V2/V3 also uses "shared experts" (always-on); EulerStack's v1 schema has no shared-expert field, so this preset approximates by raising top_k.
Refs: DeepSeek-AI 2024 (V2/V3); Dai et al. 2024 (DeepSeekMoE).
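One way to see why finer grain helps: routing combinatorics. At a similar active budget, the number of selectable expert subsets explodes, which is the DeepSeekMoE specialization argument in miniature:

```python
from math import comb

# Count the expert subsets each router can express.
mixtral_combos = comb(8, 2)     # 8 experts, top-2: 28 subsets
deepseek_combos = comb(32, 3)   # 32 experts, top-3: 4960 subsets
print(mixtral_combos, deepseek_combos)
```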
4-F. arch_expert_retnet_moe — RetNet × MoE (predicted, no paper)
This preset is an example of the new combinations EulerStack enables. No published RetNet+MoE model exists yet, but it's a logical prediction:
- Sun 2023 (RetNet): attention alternative with 3 equivalent modes.
- Pióro 2024 + Zyphra 2024 (MoE-Mamba, BlackMamba): attention-free mixer + MoE works.
- Therefore RetNet + MoE should also work (same reasoning).
Expected benefit: RetNet's chunkwise (parallel train) + recurrent (O(1) infer) composes with MoE's conditional compute for maximum long-context train/infer efficiency.
Why this is educational: Research is about filling such "logical gaps." EulerStack lets you explore them without writing any code — just edit the YAML.
Refs (composition inference): Sun 2023 + Zyphra 2024 + Pióro 2024.
4-G. arch_expert_frontier_full_moe — Attention-free, all-MoE, multi-mixer (most speculative)
The most speculative preset in EulerStack. No published paper matches this exactly, but the design follows a reasonable extrapolation of three trends:
- MoE expands capacity at fixed active FLOPs (Mixtral, DeepSeek-V3).
- Non-attention mixers match Transformer quality at O(N) cost (Mamba, Jamba, BlackMamba).
- Diverse mixers beat homogeneous stacks (Arora 2024, Jamba-1.5).
Combining all three: remove attention entirely + 3-way mixer ensemble (mamba+hyena+retnet) + MoE on every layer.
Schedule (8 groups × 4 layers = 32 layers):
```yaml
# group = [mamba_moe × 2, hyena_moe × 1, retnet_moe × 1]
# → 16 mamba + 8 hyena + 8 retnet, 0 attention
```
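Expanding the group pattern confirms the stated mixer distribution (a sketch; the group composition is taken from the schedule comment above):

```python
from collections import Counter

# Expand the 8-group pattern into a flat 32-layer list and count mixers.
group = ["mamba", "mamba", "hyena", "retnet"]
layers = group * 8
counts = Counter(layers)
print(counts)  # 16 mamba, 8 hyena, 8 retnet, and no attention anywhere
```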
Expected behavior:
- Training instability: MoE and non-attention mixers both have tricky gradients; expect more aggressive z_loss / capacity_factor / lr tuning.
- Weak recall: with no attention anchor, exact-match tasks will likely lag attention-based models.
- Excellent long-context scaling: all mixers are sub-quadratic, so contexts of hundreds of thousands of tokens become trainable.
Purpose: frontier exploration. Not designed to beat baselines — designed to answer "does this combination even work?" as a research starting point.
Refs (composition prediction, no single paper): BlackMamba + MoE-Mamba + Hyena + RetNet + Arora 2024.
4-H. arch_expert_progressive_stack — Hierarchical progression (no paper, predicted)
"Strictly increase mixer complexity with depth" — apply to LLMs the hierarchical division of labor seen in vision (early CNN → late attention) and some biology models. Early layers do cheap broad pattern capture; late layers do expensive precise reasoning.
Schedule (32 layers, monotonically more expensive mixers by depth):
| Zone | Layers | Mixer | Cost / role |
|---|---|---|---|
| Zone 1 | 1–8 | Hyena (dense FFN) | Cheapest (O(N log N) FFT conv) — broad structural patterns |
| Zone 2 | 9–20 | Mamba2 (dense FFN) | Moderate (O(N) selective SSM) — sequential summary of bulk tokens |
| Zone 3 | 21–28 | RetNet (dense FFN) | Moderate+ (O(N) chunkwise retention) — train/infer handoff |
| Zone 4 | 29–32 | Attention + MoE | Most expensive (O(N²) + conditional compute) — exact recall + capacity |
```yaml
layer_schedule:
  - { template: hyena_dense, repeat: 8 }
  - { template: mamba_dense, repeat: 12 }
  - { template: retnet_dense, repeat: 8 }
  - { template: attn_moe, repeat: 4 }
```
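A compiler would expand this schedule into a flat per-layer list; doing the expansion by hand verifies the zone boundaries in the table above (the table is 1-indexed):

```python
# Expand layer_schedule into a flat 32-layer list and check the zones.
schedule = [("hyena_dense", 8), ("mamba_dense", 12), ("retnet_dense", 8), ("attn_moe", 4)]
layers = [t for t, repeat in schedule for _ in range(repeat)]

assert len(layers) == 32
assert layers[:8] == ["hyena_dense"] * 8       # Zone 1: layers 1-8
assert layers[8:20] == ["mamba_dense"] * 12    # Zone 2: layers 9-20
assert layers[20:28] == ["retnet_dense"] * 8   # Zone 3: layers 21-28
assert layers[28:] == ["attn_moe"] * 4         # Zone 4: layers 29-32
```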
Expected effect: Hyena quickly sketches coarse context via FFT, Mamba summarizes the narrative flow, RetNet refines in chunks, and the final 4 Attn+MoE layers handle exact recall and conditional capacity expansion at the output head.
Why this is educational: the strict "cheap → expensive" ordering is deliberate and untested. A competitor ordering (reversed, interleaved, etc.) may win instead — and with EulerStack you can swap the YAML and measure both, at the same parameter budget.
Refs (composition, no single paper): Poli 2023 (Hyena) + Gu 2024 (Mamba2) + Sun 2023 (RetNet) + Fedus 2022 (MoE) + vision-literature conv→attention hierarchies.
4-I. arch_expert_dilated_longnet — Temporal pyramid (no paper, predicted)
"Grow receptive field exponentially by layer in a pyramid" — approximate Longnet's (Ding et al. 2023, MSR) dilated attention idea with layer-wise window widening, no custom dilated-attention kernel required.
Schedule (32 layers, 5-zone pyramid, max_seq_len 64K):
| Zone | Layers | Mixer | Receptive field |
|---|---|---|---|
| Zone 1 | 1–4 | Mamba2 (dense FFN) | O(N) bulk pre-processing |
| Zone 2 | 5–8 | Sliding window 1,024 | Grammar unit (tight local) |
| Zone 3 | 9–16 | Sliding window 4,096 | Paragraph unit |
| Zone 4 | 17–24 | Sliding window 16,384 | Document unit |
| Zone 5 | 25–32 | Global attention + MoE | Global recall + conditional capacity |
```yaml
layer_schedule:
  - { template: mamba_prefix, repeat: 4 }
  - { template: sw_1024, repeat: 4 }
  - { template: sw_4096, repeat: 8 }
  - { template: sw_16384, repeat: 8 }
  - { template: global_moe, repeat: 8 }
embedding:
  rope_theta: 1000000.0
  rope_scaling: { type: linear, factor: 4.0 }
```
Effective receptive field grows ~4× per zone (1K → 4K → 16K → global), yielding the Longnet-style temporal pyramid without needing a specialized dilation kernel. The Mamba prefix pre-processes raw tokens in O(N), and the MoE tail adds conditional capacity at the point where global context is finally resolved.
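The ~4× growth claim is just the window ratios (a sketch; "global" is treated as capped at the preset's 64K max_seq_len):

```python
# Per-zone window sizes of the pyramid; zone 1 is the O(N) Mamba prefix.
max_seq_len = 64 * 1024
windows = [1024, 4096, 16384, max_seq_len]   # zones 2-5
ratios = [b // a for a, b in zip(windows, windows[1:])]
print(ratios)  # the receptive field grows 4x at each zone boundary
```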
Why this is predictable: (a) Longnet used dilated attention to scale to 1B-token context, (b) Mistral used sliding windows to cheapen attention, (c) Jamba proposed a Mamba prefix with attention anchors, (d) Fedus/Shazeer showed MoE works. This preset composes the four ideas via the simplest possible mechanism: per-zone window growth.
Refs (composition prediction): Ding 2023 (Longnet) + Beltagy 2020 (Longformer) + Jiang 2023 (Mistral) + Lieber 2024 (Jamba) + Fedus 2022 (Switch).
Takeaway (4-A through 4-I)
| Experiment | Hypothesis isolated | Research question |
|---|---|---|
| research (multi-mixer + partial MoE) | "diversity > single-mixer optimization" | Which mixer-ratio combo is best? |
| mixtral_moe (all-MoE attn) | "maximize conditional compute" | Active/total ratio → quality? |
| striped_hyena (128K) | "Hyena + little attn for extreme context" | Recall degradation without attention? |
| blackmamba_moe (mamba + MoE) | "MoE is not attention-exclusive" | Throughput gain from MoE on O(N) mixer? |
| deepseek_moe (fine-grained) | "many small experts > few large" | Specialization vs router cost tradeoff? |
| retnet_moe (predicted) | "RetNet also takes MoE" | Chunkwise + conditional compute interplay? |
| frontier_full_moe (attn-free) | "MoE + diversity without attention" | Training stability and recall limits? |
| progressive_stack (hierarchical) | "depth-wise monotonic mixer cost wins" | cheap→expensive vs reversed / interleaved? |
| dilated_longnet (temporal pyramid) | "per-zone window growth approximates dilation" | Zone count / ratios effect on quality? |
Big picture: 4-A through 4-C reproduce/slightly extend existing research. 4-D through 4-I are new combinations that EulerStack makes trivial to try — including the depth-axis (4-H) and receptive-field-axis (4-I) experiments that do not have published models. You can systematically explore the MoE × mixer × depth-structure 3D space by editing YAML.
Run the Full Comparison
A script compares all 20 arch_ presets at once — parameter counts, layer layouts,
and mixer distributions in a single table.
```bash
python examples/03_architecture_evolution.py
```
Next Steps
- 02 Use presets — production llm_* presets
- 04 Compile and explain — HF model export
- 05 Prepare data — tokenization pipeline
- 06 Sanity train — confirm models actually learn
- 08 Expert mini walkthrough — small-scale expert presets
- 09 v1 new primitives walkthrough — YAML syntax for arch_advanced_mla, arch_advanced_mod, arch_expert_reasoning_r1, arch_expert_titans_memory, arch_expert_dual_stream
- Mixer deep dives: attention (+ MLA), mamba, retnet, hyena