# 0. Mixers Overview — why mix them
## What is a Mixer?
An LLM boils down to "given N tokens, predict the next one." The critical question at every layer is: "which past tokens should the current token attend to?" The layer that answers this is called a token mixer.
In plain terms, a mixer is "the operation that blends information along the sequence axis" — the layer that lets earlier tokens influence later ones.
Before Transformers, RNNs and LSTMs played this role. Since 2017, Attention has been the de facto standard. But Attention has a fatal flaw: O(N²) compute and memory cost. Recent research has exploded around alternative mixers that relax this cost or complement Attention.
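To see where the O(N²) comes from, here is a minimal single-head causal attention in NumPy (a toy sketch, not EulerStack's implementation): every token scores itself against every earlier token, so the score matrix is N × N.

```python
import numpy as np

def naive_attention(x, Wq, Wk, Wv):
    """Single-head causal attention. The (N, N) score matrix is
    where the O(N^2) time and memory cost comes from."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv               # each (N, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (N, N) <- quadratic in N
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[mask] = -np.inf                          # causal: no peeking ahead
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # softmax over the past
    return w @ v                                    # (N, d)

rng = np.random.default_rng(0)
N, d = 8, 4
x = rng.standard_normal((N, d))
W = [rng.standard_normal((d, d)) for _ in range(3)]
out = naive_attention(x, *W)
print(out.shape)  # (8, 4)
```

Doubling N quadruples the score matrix — that scaling is exactly what the alternative mixers below try to avoid.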
## The 6 Mixers EulerStack v1 Supports
Core four (staple of Tier 1/2 presets):
| Mixer | Complexity | State | Strength |
|---|---|---|---|
| `attention` | O(N²) | KV cache | Exact recall, search, in-context learning |
| `mamba` | O(N) | SSM state | Long sequences, linear scaling, fast inference |
| `retnet` | O(N) train / O(1) infer | Chunkwise retention | Parallel training + efficient inference |
| `hyena` | O(N log N) | (stateless) | Very long convolution kernels, long-range deps |
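What "O(N) with O(1) state" means in practice: the recurrent mixers carry a fixed-size summary forward instead of looking back at all tokens. A deliberately simplified sketch (a plain exponential-decay recurrence — real Mamba/RetNet use learned, input-dependent updates, but share this shape):

```python
import numpy as np

def linear_mixer(x, decay=0.9):
    """Toy linear-time mixer: a running exponential-decay summary.
    O(N) total work, O(1) state carried across the sequence."""
    state = np.zeros(x.shape[-1])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        state = decay * state + xt   # fold token t into the fixed-size state
        out[t] = state               # each position sees a summary of the past
    return out

x = np.eye(3)                        # three one-hot "tokens"
print(linear_mixer(x))
```

The state never grows with sequence length — which is why these mixers scale to very long inputs, and also why they trade away the exact token-level recall that attention's full KV cache provides.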
v1 Phase-B additions (advanced):
| Mixer | Complexity | State | Strength | Runtime |
|---|---|---|---|---|
| `branched` | Weighted sum over branches | Per-branch | Per-token routing across sub-mixers (Jamba × per-token) | 🟡 Fallback |
| `ttt_layer` | O(N) + inner opt step | Learnable inner MLP | Test-Time Training (Sun et al. 2024). Weights update during inference. | 🟡 Fallback (Mamba path) |
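The TTT idea is easiest to see in miniature. In this sketch (names like `ttt_step` are illustrative, not EulerStack's API, and the inner model is a linear map rather than the MLP the real layer uses), the mixer's "state" is a small model whose weights take a gradient step on a self-supervised loss at every token:

```python
import numpy as np

def ttt_step(W, xt, lr=0.1):
    """One Test-Time Training step (toy sketch of the Sun et al. 2024 idea):
    the state is an inner model W, nudged by a gradient step on a
    self-supervised reconstruction loss for the current token."""
    grad = np.outer(W @ xt - xt, xt)   # d/dW of 0.5 * ||W @ xt - xt||^2
    W = W - lr * grad                  # the state update IS a training step
    return W, W @ xt                   # (updated inner weights, token output)

d = 4
W = np.zeros((d, d))
for xt in np.random.default_rng(1).standard_normal((6, d)):
    W, yt = ttt_step(W, xt)            # weights keep adapting at inference time
```

This is the "O(N) + inner opt step" in the table: linear in sequence length, but each step pays for an extra inner-loop gradient update.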
In addition, `attention` gained a Phase-B2.1 sub-setting,
`latent_dim` (MLA — Multi-head Latent Attention, DeepSeek-V3), which
compresses the KV cache through a shared latent. This is not a separate
mixer — it's a sub-option of `attention` (Core runtime).
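A numerical sketch of that latent compression (matrix names like `W_down` and the dimensions are illustrative, not EulerStack's internals): instead of caching full per-head K and V for every token, MLA caches one shared low-dimensional latent per token and re-expands it into K/V when attending.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head, latent_dim, N = 64, 8, 8, 16, 1024

# Standard cache: full K and V per token -> 2 * n_heads * d_head floats each
full_cache_per_token = 2 * n_heads * d_head             # 128 floats

# MLA-style cache: one shared latent per token, up-projected on use
W_down = rng.standard_normal((d_model, latent_dim))     # compress
W_up_k = rng.standard_normal((latent_dim, n_heads * d_head))
W_up_v = rng.standard_normal((latent_dim, n_heads * d_head))

x = rng.standard_normal((N, d_model))
latent = x @ W_down            # (N, latent_dim) -- only this is cached
k = latent @ W_up_k            # K reconstructed at attention time
v = latent @ W_up_v            # V reconstructed at attention time

print(full_cache_per_token / latent_dim)  # -> 8.0x smaller cache here
```

The trade-off: a smaller cache (and less memory bandwidth at decode time) in exchange for the extra up-projection matmuls and whatever fidelity the low-rank bottleneck loses.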
Detailed docs:
- 01_attention.md — Standard Attention + MLA (latent_dim)
- 02_mamba.md — Mamba2 SSM
- 03_retnet.md — RetNet chunkwise retention
- 04_hyena.md — Hyena long convolution
`branched` and `ttt_layer` do not yet have dedicated mixer tutorials —
see 09_new_primitives_walkthrough.md §5 and §6 for YAML syntax and usage.
## Why Mix Mixers? (Hybrid Architectures)
A key 2024 finding: "Mixing mixers with different inductive biases outperforms any single mixer at the same parameter budget."
- Attention excels at exact recall/search.
- Mamba excels at summarizing long context flow.
- RetNet excels at parallel training + RNN-like inference.
- Hyena excels at very long-range patterns via a single convolution.
Jamba (AI21, 2024), Zamba (Zyphra, 2024), and Samba (Microsoft, 2024) showed hybrid stacks
matching or beating pure Transformers at comparable scale. EulerStack's presets, arch_beginner_llama → arch_expert_research, walk you
through this evolution.
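The common recipe in those models is structural, not exotic: mostly linear-time layers, with attention interleaved sparsely for exact recall. A sketch of such a layer pattern (the `hybrid_stack` helper and the 1-in-4 ratio are illustrative, not EulerStack defaults):

```python
def hybrid_stack(n_layers, attention_every=4):
    """Jamba-style interleaving: attention once every `attention_every`
    layers, linear-time mixers everywhere else."""
    return ["attention" if (i + 1) % attention_every == 0 else "mamba"
            for i in range(n_layers)]

layers = hybrid_stack(8)
print(layers)
# ['mamba', 'mamba', 'mamba', 'attention',
#  'mamba', 'mamba', 'mamba', 'attention']
```

With a 3:1 ratio, 75% of layers scale linearly with context while the periodic attention layers keep precise retrieval available — the same split the Jamba-like row in the table below recommends.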
## Which Mixer When?
| Use case | Recommended |
|---|---|
| Short chat (≤4K), highest quality | Attention-heavy (Llama-like) |
| Very long docs (≥32K), fast inference | Mamba 75% + Attention 25% (Jamba-like) |
| Parallel training + efficient inference | RetNet-heavy |
| Very long-range patterns | Hyena + Attention |
| Research / frontier exploration | All 4 + MoE (Stage 5) |
## Next Steps
- Read each mixer doc starting with 01_attention.md
- 02_use_presets.md — see real combinations
- 04_compile_and_explain.md — export to HF model