
7. Skill-Level Architecture Walkthrough

This tutorial walks through all 17 arch_*.yml presets, from beginner to expert, so you can see modern LLM architecture evolution as a sequence of concrete design choices. Every preset targets roughly 1–2B parameters, so differences between presets are purely architectural rather than effects of scale.

Why multiple presets per level

Each level (beginner / intermediate / advanced / expert) contains competing approaches to the same problem. To understand why a given design is good, you have to compare it against its alternatives at the same level: Mistral's 1:3 sparse-global pattern only makes sense next to Gemma 2's 1:1 alternation, and Jamba's 3:1 Mamba hybrid only makes sense next to pure RetNet.

Running and comparing presets within a level is where the learning happens.

Run commands:

eulerstack presets list
eulerstack explain --preset configs/presets/arch_beginner_gpt2.yml
eulerstack validate --preset configs/presets/arch_beginner_gpt2.yml --report
eulerstack compile --preset configs/presets/arch_beginner_gpt2.yml --output-dir ./out/gpt2

Level 1: BEGINNER — "Understand the starting point"

1-A. arch_beginner_gpt2 — Classic Transformer (2019 baseline)

The GPT-2-era choices, preserved:

- LayerNorm, applied post-attention/post-FFN (post-norm)
- Standard MLP with GeLU activation
- Full multi-head attention (MHA)
- Bias terms on linear layers
- Learned absolute position embeddings

Why start here? To understand what modern LLMs changed, you need to know what came before.

Refs: Vaswani et al. 2017; Radford et al. 2019 (GPT-2); Ba et al. 2016.

1-B. arch_beginner_llama — Modern baseline (2023 standard)

Llama 2/3 defaults:

- RMSNorm, applied pre-attention/pre-FFN (pre-norm)
- Gated MLP with SwiGLU activation
- Grouped-query attention (GQA) with a 4:1 query:KV head ratio
- No bias terms
- Rotary position embeddings (RoPE)

Takeaway (1-A vs 1-B):

Axis               | Classic (GPT-2) | Modern (Llama)     | Benefit
-------------------|-----------------|--------------------|---------------------
Norm type/position | LayerNorm, post | RMSNorm, pre       | Training stability
FFN                | MLP + GeLU      | Gated MLP + SwiGLU | Expressiveness
Head sharing       | MHA             | GQA 4:1            | 75% KV cache savings
Bias               | Yes             | No                 | Fewer params
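The 75% KV-cache saving from 4:1 GQA follows directly from the KV head count. A minimal sketch (the layer/head/dim values below are hypothetical, not taken from the presets):

```python
# KV cache size: 2 tensors (K and V) per layer, one entry per KV head per token.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 1-2B-scale dims: 24 layers, 16 query heads, head_dim 64, fp16.
mha = kv_cache_bytes(n_layers=24, n_kv_heads=16, head_dim=64, seq_len=4096)  # MHA: 16 KV heads
gqa = kv_cache_bytes(n_layers=24, n_kv_heads=4,  head_dim=64, seq_len=4096)  # GQA 4:1: 4 KV heads

print(f"MHA: {mha/2**20:.0f} MiB, GQA: {gqa/2**20:.0f} MiB, savings: {1 - gqa/mha:.0%}")
# -> MHA: 384 MiB, GQA: 96 MiB, savings: 75%
```

The saving depends only on the KV head ratio, which is why it holds at any sequence length.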

Level 2: INTERMEDIATE — "Push attention harder without replacing it"

Three competing ways to scale attention cheaply:

2-A. arch_intermediate_mistral — Sparse global + dense local (1:3)

1 in 4 layers is global attention; the rest use a 4096-token sliding window. Roughly 4x KV-cache savings at long context, while the periodic global layers preserve full-context reasoning. Refs: Jiang et al. 2023 (Mistral 7B).

2-B. arch_intermediate_gemma2 — Alternating global:local (1:1)

Every other layer is global. Doubles global-layer density vs Mistral — more frequent full-context reasoning, less aggressive KV savings. Refs: Team Gemma, 2024 (Gemma 2, Google DeepMind).

2-C. arch_intermediate_qwen_longctx — RoPE scaling for long context

Leave attention structure alone; extend positional encoding instead (rope_theta=1e6, linear scaling factor 4, max_seq_len=32K). Refs: Chen et al. 2023 (Position Interpolation); Qwen 2/3 reports.
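In EulerStack YAML this is an embedding-level change only. A sketch of just the long-context knobs (the rope_theta / rope_scaling keys follow the schema shown in the dilated_longnet preset; the top-level placement of max_seq_len is illustrative):

```yaml
max_seq_len: 32768             # extended target context (32K)
embedding:
  rope_theta: 1000000.0        # raised RoPE base frequency (1e6)
  rope_scaling:
    type: linear               # position interpolation (Chen et al. 2023)
    factor: 4.0                # 4x linear scaling of positions
```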

Takeaway:

Approach            | Tradeoff              | Best when
--------------------|-----------------------|-------------------------------------------
Mistral (1:3)       | Most KV savings       | Memory-bound inference
Gemma 2 (1:1)       | More global reasoning | Quality-bound tasks
Qwen (RoPE scaling) | No structural change  | Extending existing weights to long context
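The KV-cache tradeoff between the first two patterns can be sketched with a rough count of cached token positions (illustrative numbers; ignores heads, head_dim, and dtype — the savings approach 4x for Mistral and 2x for Gemma 2 as sequence length grows relative to the window):

```python
# Count cached token positions per repeating layer group.
def kv_tokens(pattern, window, seq_len):
    """pattern: list of 'global' / 'local' layer kinds for one repeating group."""
    return sum(seq_len if kind == "global" else min(window, seq_len) for kind in pattern)

seq_len, window = 32768, 4096
dense   = kv_tokens(["global"] * 4,             window, seq_len)  # all-global baseline
mistral = kv_tokens(["global"] + ["local"] * 3, window, seq_len)  # 1:3 sparse global
gemma2  = kv_tokens(["global", "local"] * 2,    window, seq_len)  # 1:1 alternating

print(f"per-4-layer KV tokens: dense={dense} mistral={mistral} gemma2={gemma2}")
print(f"mistral saves {1 - mistral/dense:.0%}, gemma2 saves {1 - gemma2/dense:.0%}")
```

At a 32K context the 1:3 pattern already saves about two thirds of the cache; the 1:1 pattern trades half of that back for more frequent global reasoning.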

Level 3: ADVANCED — "Replace attention, partially or fully"

3-A. arch_advanced_jamba — Mamba + Attention 3:1

75% Mamba2 SSM (O(N) bulk processing), 25% attention (retrieval anchor). Refs: Lieber et al. 2024 (Jamba, AI21).

3-B. arch_advanced_samba — Mamba + Sliding Window 1:1

Alternate Mamba with sliding-window attention. Keeps attention in every layer but makes each one cheap. Refs: Ren et al. 2024 (Samba, Microsoft).

3-C. arch_advanced_retnet — Pure RetNet (attention-free)

All layers use retention (exponential decay replaces softmax). Three mathematically equivalent modes (parallel / recurrent / chunkwise) — train like Transformer, infer like RNN. Refs: Sun et al. 2023 (RetNet, Microsoft).
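The parallel/recurrent equivalence can be checked numerically. A minimal single-head sketch of simplified retention (scalar decay gamma, no rotations or normalization, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 4                       # sequence length, head dim
gamma = 0.9                       # retention decay
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))

# Parallel mode (train like a Transformer): O = (Q K^T * D) V,
# with D[n, m] = gamma^(n-m) for n >= m, else 0 (causal decay mask).
n_idx, m_idx = np.arange(N)[:, None], np.arange(N)[None, :]
D = np.where(n_idx >= m_idx, gamma ** (n_idx - m_idx), 0.0)
out_parallel = (q @ k.T * D) @ v

# Recurrent mode (infer like an RNN): S_n = gamma * S_{n-1} + k_n^T v_n;  o_n = q_n S_n.
S = np.zeros((d, d))
out_recurrent = np.empty((N, d))
for n in range(N):
    S = gamma * S + np.outer(k[n], v[n])
    out_recurrent[n] = q[n] @ S

assert np.allclose(out_parallel, out_recurrent)  # same outputs, different cost profiles
```

The parallel form costs O(N²) but parallelizes across tokens; the recurrent form carries only a fixed d×d state per head, which is the O(1)-memory inference claim.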

Takeaway:

Approach    | Attention % | Strength                             | Weakness
------------|-------------|--------------------------------------|---------------------------------------------------
Jamba (3:1) | 25% global  | Throughput + recall                  | Three consecutive Mamba layers weaken exact recall
Samba (1:1) | 50% local   | Attention in every layer, bounded KV | Still attention in every layer
Pure RetNet | 0%          | Most consistent train/infer          | Pure variant lags hybrids on quality

These are three different answers to "how do we escape O(N²)?" — the right answer depends on whether you prioritize throughput, recall, or training simplicity.


Level 4: EXPERT — "Research-grade combinations + MoE × Mixer cross design"

Research-exploration territory. This level crosses two independent axes:

- Mixer choice: attention, mamba, retnet, hyena, or combinations of them
- MoE strategy: none (dense FFN), partial (MoE on a subset of layers), or every layer, plus expert count and top_k

These axes are orthogonal: the same MoE strategy can be applied to different mixers and vice versa. The 9 expert presets are representative points in this 2D space.

MoE × Mixer cross summary

Preset                        | Mixer                                     | MoE strategy                                 | Position
------------------------------|-------------------------------------------|----------------------------------------------|-----------------------------------------
arch_expert_research          | attn + mamba + retnet + hyena             | 8 exp / top-2, attn layers only              | Diversity × partial MoE
arch_expert_mixtral_moe       | attn only                                 | 8 exp / top-2, every layer                   | Single mixer × dense MoE
arch_expert_striped_hyena     | hyena + attn 4:1                          | None (dense FFN)                             | sub-O(N²) × dense
arch_expert_blackmamba_moe    | mamba (75%) + attn (25%)                  | 8 exp / top-2, mamba layers only             | MoE on a non-attention mixer
arch_expert_deepseek_moe      | attn only                                 | 32 exp / top-3 (fine-grained), every layer   | Single mixer × advanced MoE
arch_expert_retnet_moe        | retnet only                               | 8 exp / top-2, 1-in-4                        | RetNet × MoE (no paper yet / predicted)
arch_expert_frontier_full_moe | mamba + hyena + retnet (no attn)          | 8 exp / top-2, every layer                   | Most speculative
arch_expert_progressive_stack | hyena→mamba→retnet→attn (depth-wise)      | 8 exp / top-2, only the last 4 layers        | Hierarchical progression (no paper)
arch_expert_dilated_longnet   | mamba + sw(1K→4K→16K) + global (pyramid)  | 8 exp / top-2, only the last 8 global layers | Temporal pyramid (no paper)

4-A. arch_expert_research — 4 mixers + MoE, 3-phase

All four mixers (attention/mamba/retnet/hyena) plus MoE on attention layers. Three-phase schedule: bulk → reasoning → refinement. Refs: Arora et al. 2024; Jamba-1.5.

4-B. arch_expert_mixtral_moe — Pure attention + every-layer MoE

No other mixer types — MoE density maxed out. Isolates conditional-compute effect from mixer diversity.

vs llm_*_moe: production variants use 1-in-4 MoE for realistic serving budgets. This expert preset goes every layer for research isolation. Refs: Jiang et al. 2024 (Mixtral 8x7B); Shazeer et al. 2017; Zoph et al. 2022.

4-C. arch_expert_striped_hyena — Hyena + Attention, 128K context

Hyena long-convolution striped with attention anchors at 128K context. No MoE — focuses on Hyena's sub-O(N²) mixing efficiency alone. Refs: Poli et al. 2023; Together AI 2023 (StripedHyena); Nguyen et al. 2023.

4-D. arch_expert_blackmamba_moe — Mamba + MoE (MoE on a non-attention mixer)

Demonstrates that MoE is not attention-exclusive. Pattern: 3 × (mamba + MoE FFN) + 1 × (attention + dense FFN) repeated 8 times. MoE lives on the O(N) mixer, yielding better expected throughput than Mixtral-style all-attention MoE.

Key insight: MoE is a choice about FFN sparsity, orthogonal to mixer choice. Most MoE papers put it on attention; this preset shows it works on Mamba too.

layer_templates:
  mamba_moe:
    mixer:
      type: mamba
      mamba: { variant: mamba2, d_state: 128 }
    ffn:
      type: moe
      moe: { experts: 8, top_k: 2 }
    state: { ssm_state: true }
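
The 3:1 repeating pattern can be expressed as a layer_schedule in the same style as the other presets (the attn_dense template name and the unrolled group expansion are illustrative, not taken from the shipped preset):

```yaml
layer_schedule:                        # one group = 3x mamba_moe + 1x attn_dense
  - { template: mamba_moe,  repeat: 3 }
  - { template: attn_dense, repeat: 1 }
  # ... the two-entry group repeats 8 times in total -> 32 layers
```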

Refs: Zyphra 2024 (BlackMamba); Pióro et al. 2024 (MoE-Mamba).

4-E. arch_expert_deepseek_moe — Fine-Grained MoE (32 experts, top-3)

Among the most advanced published MoE designs. Replaces the Mixtral-style configuration (8 experts × top-2) with 32 small experts × top-3: finer-grained specialization at a comparable active budget.

ffn:
  type: moe
  moe:
    experts: 32          # fine-grained
    top_k: 3             # each token → 3 experts
    router: softmax
    z_loss: 0.001

Mixtral vs DeepSeek:

Aspect            | Mixtral | DeepSeek
------------------|---------|---------
Experts           | 8 large | 32 small
top_k             | 2       | 3
Specialization    | Coarse  | Fine
Router complexity | Lower   | Higher
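One way to quantify "finer-grained specialization" is the number of distinct expert combinations the router can select per token; this is pure counting, independent of expert sizes:

```python
from math import comb

# Number of possible expert subsets a token can be routed to.
mixtral_combos  = comb(8, 2)    # 8 experts, top-2
deepseek_combos = comb(32, 3)   # 32 experts, top-3

print(mixtral_combos, deepseek_combos)  # -> 28 4960
print(f"{deepseek_combos / mixtral_combos:.0f}x more routing combinations")
```

More combinations means more ways to compose specialists per token, which is the upside; the cost is the higher router complexity noted in the table.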

Schema note: DeepSeek-V2/V3 also uses "shared experts" (always-on); EulerStack's v1 schema has no shared-expert field, so this preset approximates by raising top_k.

Refs: DeepSeek-AI 2024 (V2/V3); Dai et al. 2024 (DeepSeekMoE).

4-F. arch_expert_retnet_moe — RetNet × MoE (predicted, no paper)

This preset is an example of the new combinations EulerStack enables. No published RetNet+MoE model exists yet, but it's a logical prediction:

  1. Sun 2023 (RetNet): attention alternative with 3 equivalent modes.
  2. Pióro 2024 + Zyphra 2024 (MoE-Mamba, BlackMamba): attention-free mixer + MoE works.
  3. Therefore RetNet + MoE should also work (same reasoning).

Expected benefit: RetNet's chunkwise (parallel train) + recurrent (O(1) infer) composes with MoE's conditional compute for maximum long-context train/infer efficiency.
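A hypothetical layer template for this preset, mirroring the BlackMamba mamba_moe template above (the retnet mixer sub-keys are illustrative; no published reference config exists):

```yaml
layer_templates:
  retnet_moe:
    mixer:
      type: retnet
      retnet: { heads: 8 }             # illustrative mixer options
    ffn:
      type: moe
      moe: { experts: 8, top_k: 2 }
```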

Why this is educational: Research is about filling such "logical gaps." EulerStack lets you explore them without writing any code — just edit the YAML.

Refs (composition inference): Sun 2023 + Zyphra 2024 + Pióro 2024.

4-G. arch_expert_frontier_full_moe — Attention-free, all-MoE, multi-mixer (most speculative)

The most speculative preset in EulerStack. No published paper matches this exactly, but the design follows a reasonable extrapolation of three trends:

  1. MoE expands capacity at fixed active FLOPs (Mixtral, DeepSeek-V3).
  2. Non-attention mixers match Transformer quality at O(N) cost (Mamba, Jamba, BlackMamba).
  3. Diverse mixers beat homogeneous stacks (Arora 2024, Jamba-1.5).

Combining all three: remove attention entirely + 3-way mixer ensemble (mamba+hyena+retnet) + MoE on every layer.

Schedule (8 groups × 4 layers = 32 layers):

# group = [mamba_moe × 2, hyena_moe × 1, retnet_moe × 1]
# → 16 mamba + 8 hyena + 8 retnet, 0 attention

Expected challenges:

- Training instability: MoE and non-attention mixers both have tricky gradients. Expect more aggressive z_loss / capacity_factor / lr tuning.
- Weak recall: with no attention anchor, exact-match tasks will likely lag attention-based models.

Expected strength:

- Excellent long-context scaling: all mixers are sub-quadratic, so contexts of hundreds of thousands of tokens become trainable.

Purpose: frontier exploration. Not designed to beat baselines — designed to answer "does this combination even work?" as a research starting point.

Refs (composition prediction, no single paper): BlackMamba + MoE-Mamba + Hyena + RetNet + Arora 2024.

4-H. arch_expert_progressive_stack — Hierarchical progression (no paper, predicted)

"Strictly increase mixer complexity with depth" — apply to LLMs the hierarchical division of labor seen in vision (early CNN → late attention) and some biology models. Early layers do cheap broad pattern capture; late layers do expensive precise reasoning.

Schedule (32 layers, monotonically more expensive mixers by depth):

Zone   | Layers | Mixer              | Cost / role
-------|--------|--------------------|------------------------------------------------------------------
Zone 1 | 1–8    | Hyena (dense FFN)  | Cheapest (O(N log N) FFT conv) — broad structural patterns
Zone 2 | 9–20   | Mamba2 (dense FFN) | Moderate (O(N) selective SSM) — sequential summary of bulk tokens
Zone 3 | 21–28  | RetNet (dense FFN) | Moderate+ (O(N) chunkwise retention) — train/infer handoff
Zone 4 | 29–32  | Attention + MoE    | Most expensive (O(N²) + conditional compute) — exact recall + capacity

layer_schedule:
  - { template: hyena_dense, repeat: 8 }
  - { template: mamba_dense, repeat: 12 }
  - { template: retnet_dense, repeat: 8 }
  - { template: attn_moe, repeat: 4 }

Expected effect: Hyena quickly sketches coarse context via FFT, Mamba summarizes the narrative flow, RetNet refines in chunks, and the final 4 Attn+MoE layers handle exact recall and conditional capacity expansion at the output head.

Why this is educational: the strict "cheap → expensive" ordering is deliberate and untested. A competitor ordering (reversed, interleaved, etc.) may win instead — and with EulerStack you can swap the YAML and measure both, at the same parameter budget.

Refs (composition, no single paper): Poli 2023 (Hyena) + Gu 2024 (Mamba2) + Sun 2023 (RetNet) + Fedus 2022 (MoE) + vision-literature conv→attention hierarchies.

4-I. arch_expert_dilated_longnet — Temporal pyramid (no paper, predicted)

"Grow receptive field exponentially by layer in a pyramid" — approximate Longnet's (Ding et al. 2023, MSR) dilated attention idea with layer-wise window widening, no custom dilated-attention kernel required.

Schedule (32 layers, 5-zone pyramid, max_seq_len 64K):

Zone   | Layers | Mixer                  | Receptive field
-------|--------|------------------------|-------------------------------------
Zone 1 | 1–4    | Mamba2 (dense FFN)     | O(N) bulk pre-processing
Zone 2 | 5–8    | Sliding window 1,024   | Grammar unit (tight local)
Zone 3 | 9–16   | Sliding window 4,096   | Paragraph unit
Zone 4 | 17–24  | Sliding window 16,384  | Document unit
Zone 5 | 25–32  | Global attention + MoE | Global recall + conditional capacity

layer_schedule:
  - { template: mamba_prefix, repeat: 4 }
  - { template: sw_1024,     repeat: 4 }
  - { template: sw_4096,     repeat: 8 }
  - { template: sw_16384,    repeat: 8 }
  - { template: global_moe,  repeat: 8 }
embedding:
  rope_theta: 1000000.0
  rope_scaling: { type: linear, factor: 4.0 }

Effective receptive field grows ~4× per zone (1K → 4K → 16K → global), yielding the Longnet-style temporal pyramid without needing a specialized dilation kernel. The Mamba prefix pre-processes raw tokens in O(N), and the MoE tail adds conditional capacity at the point where global context is finally resolved.
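The memory argument can be sanity-checked with a rough count of cached token positions at the full 64K context (sketch only: ignores heads, head_dim, dtype, and Mamba's fixed-size state):

```python
SEQ_LEN = 65536   # 64K target context

# (layer count, cached tokens per layer); Zone 1 is Mamba, which keeps no KV cache.
zones = [
    (4, 0),                      # Zone 1: Mamba2 prefix
    (4, min(1024,  SEQ_LEN)),    # Zone 2: sliding window 1K
    (8, min(4096,  SEQ_LEN)),    # Zone 3: sliding window 4K
    (8, min(16384, SEQ_LEN)),    # Zone 4: sliding window 16K
    (8, SEQ_LEN),                # Zone 5: global attention
]
pyramid_kv = sum(n_layers * tokens for n_layers, tokens in zones)
dense_kv   = 32 * SEQ_LEN        # baseline: all 32 layers global

print(f"pyramid={pyramid_kv:,} dense={dense_kv:,} savings={1 - pyramid_kv/dense_kv:.0%}")
```

Under these assumptions the pyramid caches roughly a third of the all-global baseline, with the eight global layers accounting for most of what remains.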

Why this is predictable: (a) Longnet used dilated attention to scale to 1B-token context, (b) Mistral used sliding windows to cheapen attention, (c) Jamba proposed a Mamba prefix with attention anchors, (d) Fedus/Shazeer showed MoE works. This preset composes the four ideas via the simplest possible mechanism: per-zone window growth.

Refs (composition prediction): Ding 2023 (Longnet) + Beltagy 2020 (Longformer) + Jiang 2023 (Mistral) + Lieber 2024 (Jamba) + Fedus 2022 (Switch).


Takeaway (4-A through 4-I)

Experiment                           | Hypothesis isolated                            | Research question
-------------------------------------|------------------------------------------------|--------------------------------------------
research (multi-mixer + partial MoE) | "diversity > single-mixer optimization"        | Which mixer-ratio combo is best?
mixtral_moe (all-MoE attn)           | "maximize conditional compute"                 | Active/total ratio → quality?
striped_hyena (128K)                 | "Hyena + little attn for extreme context"      | Recall degradation without attention?
blackmamba_moe (mamba + MoE)         | "MoE is not attention-exclusive"               | Throughput gain from MoE on O(N) mixer?
deepseek_moe (fine-grained)          | "many small experts > few large"               | Specialization vs router cost tradeoff?
retnet_moe (predicted)               | "RetNet also takes MoE"                        | Chunkwise + conditional compute interplay?
frontier_full_moe (attn-free)        | "MoE + diversity without attention"            | Training stability and recall limits?
progressive_stack (hierarchical)     | "depth-wise monotonic mixer cost wins"         | cheap→expensive vs reversed / interleaved?
dilated_longnet (temporal pyramid)   | "per-zone window growth approximates dilation" | Zone count / ratios effect on quality?

Big picture: 4-A through 4-E reproduce or slightly extend existing research. 4-F through 4-I are new combinations that EulerStack makes trivial to try — including the depth-axis (4-H) and receptive-field-axis (4-I) experiments that have no published models. You can systematically explore the MoE × mixer × depth-structure 3D space by editing YAML.


Run the Full Comparison

A script compares all 17 arch_ presets at once — parameter counts, layer layouts, and mixer distributions in a single table.

python examples/03_architecture_evolution.py

Next Steps