0. Mixers Overview — why mix them

What is a Mixer?

An LLM boils down to "given N tokens, predict the next one." The critical question at every layer is: "which past tokens should the current token attend to?" The layer that answers this is called a token mixer.

In plain terms, a mixer is "the operation that blends information along the sequence axis" — the layer that lets earlier tokens influence later ones.
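To make "blends information along the sequence axis" concrete, here is a toy mixer (not from EulerStack — a minimal illustration): each position becomes the running mean of all tokens up to and including itself, so information flows only forward.

```python
import numpy as np

def causal_mean_mixer(x: np.ndarray) -> np.ndarray:
    """Toy token mixer: position t outputs the mean of tokens 0..t.
    Earlier tokens influence later ones; the future is never visible."""
    counts = np.arange(1, x.shape[0] + 1)[:, None]   # 1, 2, ..., N
    return np.cumsum(x, axis=0) / counts             # running mean along seq axis

x = np.random.randn(6, 4)      # (seq_len=6, d_model=4)
y = causal_mean_mixer(x)
# y[0] is just x[0]; y[3] blends x[0..3]; y[t] never peeks at x[t+1:]
```

Every mixer below is a (much smarter) version of this same shape-preserving, causal, sequence-axis operation.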

Before Transformers, RNNs and LSTMs played this role. Since 2017, Attention has been the de facto standard. But Attention has a fundamental cost: O(N²) compute and memory in the sequence length N. Recent research has exploded around alternative mixers that reduce this cost or complement Attention.
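The quadratic cost is easy to see in a naive single-head implementation (a sketch, not any library's API): the scores matrix is N × N, so doubling the sequence length quadruples both its memory and the matmul FLOPs.

```python
import numpy as np

def naive_causal_attention(q, k, v):
    """Single-head causal attention. `scores` is (N, N): this is the
    O(N^2) term in both compute and memory."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # (N, N)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf                           # causal: no attending to the future
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
out = naive_causal_attention(q, k, v)
```

Position 0 can only attend to itself, so `out[0]` equals `v[0]` exactly — a handy sanity check for any causal mixer.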

The 6 Mixers EulerStack v1 Supports

Core four (staple of Tier 1/2 presets):

| Mixer | Complexity | State | Strength |
|---|---|---|---|
| attention | O(N²) | KV cache | Exact recall, search, in-context learning |
| mamba | O(N) | SSM state | Long sequences, linear scaling, fast inference |
| retnet | O(N) train / O(1) infer | Chunkwise retention | Parallel training + efficient inference |
| hyena | O(N log N) | (stateless) | Very long convolution kernels, long-range deps |
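The O(N) / O(1)-per-step rows come from a shared trick: instead of re-reading a growing KV cache, the mixer folds each token into a fixed-size state. A retention/SSM-flavored sketch (illustrative only — real Mamba/RetNet layers add gating, decay schedules, and chunkwise parallel forms):

```python
import numpy as np

def linear_recurrent_mixer(q, k, v, decay=0.9):
    """Fixed-size state matrix, updated once per token: O(N) total work,
    O(1) memory per inference step, regardless of sequence length."""
    n, d = q.shape
    state = np.zeros((d, d))                 # size independent of N
    out = np.empty_like(v)
    for t in range(n):
        state = decay * state + np.outer(k[t], v[t])   # fold token t in
        out[t] = q[t] @ state                          # read out
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
y = linear_recurrent_mixer(q, k, v)
```

The trade-off versus attention is visible in the code: the past is compressed into one (d, d) matrix, so exact recall of an arbitrary earlier token is no longer guaranteed.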

v1 Phase-B additions (advanced):

| Mixer | Complexity | State | Strength | Runtime |
|---|---|---|---|---|
| branched | Weighted sum over branches | Per-branch | Per-token routing across sub-mixers (Jamba × per-token) | 🟡 Fallback |
| ttt_layer | O(N) + inner opt step | Learnable inner MLP | Test-Time Training (Sun et al. 2024); weights update during inference | 🟡 Fallback (Mamba path) |
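The ttt_layer idea — "weights update during inference" — can be sketched in a few lines. This is a drastically simplified caricature of Sun et al. 2024 (the real layer uses learned projections for the inner task and an inner MLP rather than a bare linear map), not EulerStack's implementation:

```python
import numpy as np

def ttt_linear_sketch(x, lr=0.1):
    """Test-Time Training sketch: the 'state' is the weight matrix of an
    inner linear model, and each incoming token triggers one gradient step
    on a self-supervised reconstruction loss -- the layer keeps learning
    at inference time."""
    n, d = x.shape
    W = np.zeros((d, d))                     # inner model weights = the state
    out = np.empty_like(x)
    for t in range(n):
        pred = x[t] @ W
        grad = np.outer(x[t], pred - x[t])   # d/dW of 0.5 * ||x W - x||^2
        W -= lr * grad                       # one inner optimization step
        out[t] = x[t] @ W                    # output of the updated model
    return out

x = np.random.default_rng(1).standard_normal((5, 8))
y = ttt_linear_sketch(x)
```

Like the recurrent sketch above, the per-token update keeps the state fixed-size, which is why the table lists it at O(N) plus the inner optimization step.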

In addition, attention gained a Phase-B2.1 sub-setting latent_dim (MLA — Multi-head Latent Attention, DeepSeek-V3) which compresses the KV cache through a shared latent. This is not a separate mixer — it's a sub-option of attention (Core runtime).
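The idea behind latent_dim can be shown with plain numpy (the projection names below are illustrative, not EulerStack's or DeepSeek's API): cache one shared low-dimensional latent per token and re-expand it to K and V on the fly, instead of caching full K and V.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, latent_dim, n = 512, 64, 1024

# Hypothetical learned projections: one shared down-projection,
# separate up-projections for K and V.
W_down = rng.standard_normal((d_model, latent_dim)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((latent_dim, d_model)) / np.sqrt(latent_dim)
W_up_v = rng.standard_normal((latent_dim, d_model)) / np.sqrt(latent_dim)

h = rng.standard_normal((n, d_model))     # hidden states for n cached tokens
latent = h @ W_down                       # this (n, latent_dim) array is all we cache
k, v = latent @ W_up_k, latent @ W_up_v   # reconstructed on the fly each step

full_cache = 2 * n * d_model              # K + V floats in a vanilla KV cache
mla_cache = n * latent_dim                # latent floats in the compressed cache
# with these toy sizes: 64 / (2 * 512) = 1/16 of the vanilla cache
```

The cache shrinks from 2·N·d_model to N·latent_dim floats, at the cost of the up-projection matmuls at decode time.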

Detailed docs: each core mixer has its own tutorial, starting with 01_attention.md. branched and ttt_layer do not yet have dedicated mixer tutorials — see 09_new_primitives_walkthrough.md §5 and §6 for YAML syntax and usage.

Why Mix Mixers? (Hybrid Architectures)

A key 2024 finding: "Mixing mixers with different inductive biases outperforms any single mixer at the same parameter budget."

Jamba (AI21, 2024), Zamba (Zyphra, 2024), and Samba (Microsoft, 2024) all showed hybrid stacks beating pure Transformers at comparable scale. EulerStack's presets, from arch_beginner_llama to arch_expert_research, walk you through this evolution.

Which Mixer When?

| Use case | Recommended |
|---|---|
| Short chat (≤4K), highest quality | Attention-heavy (Llama-like) |
| Very long docs (≥32K), fast inference | Mamba 75% + Attention 25% (Jamba-like) |
| Parallel training + efficient inference | RetNet-heavy |
| Very long-range patterns | Hyena + Attention |
| Research / frontier exploration | All 4 + MoE (Stage 5) |

Next Steps

  1. 01_attention.md — read each mixer doc in order, starting here
  2. 02_use_presets.md — see real combinations
  3. 04_compile_and_explain.md — export to a Hugging Face model