# 0. Mixers Overview — why mix them
## What is a Mixer?
An LLM boils down to "given N tokens, predict the next one." The critical question at every layer is: "which past tokens should the current token attend to?" The layer that answers this is called a token mixer.
In plain terms, a mixer is "the operation that blends information along the sequence axis" — the layer that lets earlier tokens influence later ones.
Before Transformers, RNNs and LSTMs played this role. Since 2017, Attention has been the de facto standard. But Attention has a fatal flaw: O(N²) compute and memory cost. Recent research has exploded around alternative mixers that relax this cost or complement Attention.
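To see where the O(N²) comes from, here is a minimal single-head causal attention in NumPy (a toy sketch, not EulerStack's implementation): every token scores itself against every earlier token, so the score matrix is N × N.

```python
import numpy as np

def naive_attention(x, Wq, Wk, Wv):
    """Single-head causal attention. The (N, N) score matrix is
    where the O(N^2) time and memory cost comes from."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv               # each (N, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (N, N) <- quadratic in N
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[mask] = -np.inf                          # causal: no peeking ahead
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # softmax over the past
    return w @ v                                    # (N, d)

rng = np.random.default_rng(0)
N, d = 8, 4
x = rng.standard_normal((N, d))
W = [rng.standard_normal((d, d)) for _ in range(3)]
out = naive_attention(x, *W)
print(out.shape)  # (8, 4)
```

Doubling N quadruples the score matrix — that scaling is exactly what the alternative mixers below try to avoid.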
## The 6 Mixers EulerStack v1 Supports
Core four (staple of Tier 1/2 presets):
| Mixer | Complexity | State | Strength |
|---|---|---|---|
| `attention` | O(N²) | KV cache | Exact recall, search, in-context learning |
| `mamba` | O(N) | SSM state | Long sequences, linear scaling, fast inference |
| `retnet` | O(N) train / O(1) infer | Chunkwise retention | Parallel training + efficient inference |
| `hyena` | O(N log N) | (stateless) | Very long convolution kernels, long-range deps |
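What "O(N) with O(1) state" means in practice: the recurrent mixers carry a fixed-size summary forward instead of looking back at all tokens. A deliberately simplified sketch (a plain exponential-decay recurrence — real Mamba/RetNet use learned, input-dependent updates, but share this shape):

```python
import numpy as np

def linear_mixer(x, decay=0.9):
    """Toy linear-time mixer: a running exponential-decay summary.
    O(N) total work, O(1) state carried across the sequence."""
    state = np.zeros(x.shape[-1])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        state = decay * state + xt   # fold token t into the fixed-size state
        out[t] = state               # each position sees a summary of the past
    return out

x = np.eye(3)                        # three one-hot "tokens"
print(linear_mixer(x))
```

The state never grows with sequence length — which is why these mixers scale to very long inputs, and also why they trade away the exact token-level recall that attention's full KV cache provides.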
v1 Phase-B additions (advanced):
| Mixer | Complexity | State | Strength | Runtime |
|---|---|---|---|---|
| `branched` | Weighted sum over branches | Per-branch | Per-token routing across sub-mixers (Jamba × per-token) | 🟡 Fallback |
| `ttt_layer` | O(N) + inner opt step | Learnable inner MLP | Test-Time Training (Sun et al. 2024). Weights update during inference. | 🟡 Fallback (Mamba path) |
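The TTT idea is easiest to see in miniature. In this sketch (names like `ttt_step` are illustrative, not EulerStack's API, and the inner model is a linear map rather than the MLP the real layer uses), the mixer's "state" is a small model whose weights take a gradient step on a self-supervised loss at every token:

```python
import numpy as np

def ttt_step(W, xt, lr=0.1):
    """One Test-Time Training step (toy sketch of the Sun et al. 2024 idea):
    the state is an inner model W, nudged by a gradient step on a
    self-supervised reconstruction loss for the current token."""
    grad = np.outer(W @ xt - xt, xt)   # d/dW of 0.5 * ||W @ xt - xt||^2
    W = W - lr * grad                  # the state update IS a training step
    return W, W @ xt                   # (updated inner weights, token output)

d = 4
W = np.zeros((d, d))
for xt in np.random.default_rng(1).standard_normal((6, d)):
    W, yt = ttt_step(W, xt)            # weights keep adapting at inference time
```

This is the "O(N) + inner opt step" in the table: linear in sequence length, but each step pays for an extra inner-loop gradient update.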
In addition, `attention` gained a Phase-B2.1 sub-setting,
`latent_dim` (MLA — Multi-head Latent Attention, DeepSeek-V3), which
compresses the KV cache through a shared latent. This is not a separate
mixer — it's a sub-option of `attention` (Core runtime).
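A numerical sketch of that latent compression (matrix names like `W_down` and the dimensions are illustrative, not EulerStack's internals): instead of caching full per-head K and V for every token, MLA caches one shared low-dimensional latent per token and re-expands it into K/V when attending.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head, latent_dim, N = 64, 8, 8, 16, 1024

# Standard cache: full K and V per token -> 2 * n_heads * d_head floats each
full_cache_per_token = 2 * n_heads * d_head             # 128 floats

# MLA-style cache: one shared latent per token, up-projected on use
W_down = rng.standard_normal((d_model, latent_dim))     # compress
W_up_k = rng.standard_normal((latent_dim, n_heads * d_head))
W_up_v = rng.standard_normal((latent_dim, n_heads * d_head))

x = rng.standard_normal((N, d_model))
latent = x @ W_down            # (N, latent_dim) -- only this is cached
k = latent @ W_up_k            # K reconstructed at attention time
v = latent @ W_up_v            # V reconstructed at attention time

print(full_cache_per_token / latent_dim)  # -> 8.0x smaller cache here
```

The trade-off: a smaller cache (and less memory bandwidth at decode time) in exchange for the extra up-projection matmuls and whatever fidelity the low-rank bottleneck loses.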
Detailed docs:
- 01_attention.md — Standard Attention + MLA (latent_dim)
- 02_mamba.md — Mamba2 SSM
- 03_retnet.md — RetNet chunkwise retention
- 04_hyena.md — Hyena long convolution
`branched` and `ttt_layer` do not yet have dedicated mixer tutorials —
see 09_new_primitives_walkthrough.md §5 and §6 for YAML syntax and usage.
## Why Mix Mixers? (Hybrid Architectures)
A key 2024 finding: "Mixing mixers with different inductive biases outperforms any single mixer at the same parameter budget."
- Attention excels at exact recall/search.
- Mamba excels at summarizing long context flow.
- RetNet excels at parallel training + RNN-like inference.
- Hyena excels at very long-range patterns via a single convolution.
Jamba (AI21, 2024), Zamba (Zyphra, 2024), and Samba (Microsoft, 2024) showed hybrid stacks
matching or beating pure Transformers at comparable scale. EulerStack's presets, arch_beginner_llama → arch_expert_research, walk you
through this evolution.
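The common recipe in those models is structural, not exotic: mostly linear-time layers, with attention interleaved sparsely for exact recall. A sketch of such a layer pattern (the `hybrid_stack` helper and the 1-in-4 ratio are illustrative, not EulerStack defaults):

```python
def hybrid_stack(n_layers, attention_every=4):
    """Jamba-style interleaving: attention once every `attention_every`
    layers, linear-time mixers everywhere else."""
    return ["attention" if (i + 1) % attention_every == 0 else "mamba"
            for i in range(n_layers)]

layers = hybrid_stack(8)
print(layers)
# ['mamba', 'mamba', 'mamba', 'attention',
#  'mamba', 'mamba', 'mamba', 'attention']
```

With a 3:1 ratio, 75% of layers scale linearly with context while the periodic attention layers keep precise retrieval available — the same split the Jamba-like row in the table below recommends.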
## Which Mixer When?
| Use case | Recommended |
|---|---|
| Short chat (≤4K), highest quality | Attention-heavy (Llama-like) |
| Very long docs (≥32K), fast inference | Mamba 75% + Attention 25% (Jamba-like) |
| Parallel training + efficient inference | RetNet-heavy |
| Very long-range patterns | Hyena + Attention |
| Research / frontier exploration | All 4 + MoE (Stage 5) |
## Next Steps
- Read each mixer doc starting with 01_attention.md
- 02_use_presets.md — see real combinations
- 04_compile_and_explain.md — export to HF model