2. Mamba in detail
One-Line Summary
"Maintains a dynamic state per token, sweeping left-to-right once, with O(N) cost."
How Does It Work?
Where Attention "looks at all past tokens simultaneously," Mamba "carries a small memory (state) and sweeps left to right, updating it" — like an RNN. But unlike a classical RNN/LSTM, Mamba's state update rule changes dynamically with each input (the selective SSM). This is what makes it powerful.
Core idea:
1. State Space Model (SSM): discretize the continuous system h'(t) = A·h(t) + B·x(t), y(t) = C·h(t) to
h[k] = Ā·h[k-1] + B̄·x[k], y[k] = C·h[k].
2. Selective: B, C, and the step size Δ are computed from the input x (not fixed); and since Ā = exp(Δ·A), the state transition itself becomes input-dependent.
3. Parallel scan: the recurrence looks inherently sequential, but its update is associative, so the left-to-right dependency can be computed with a parallel prefix scan on GPU (see the sketch below).
4. Hardware-aware: fused CUDA kernels keep the recurrent state in fast on-chip SRAM instead of materializing it in slower HBM.
Result: O(N) linear scaling, fixed-size state (d_state), replaces KV cache with a small per-layer state at inference.
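Below is a minimal NumPy sketch of steps 1–3, assuming a single input channel, a diagonal A, and the common B̄·x ≈ Δ·B·x simplification; all parameter values are random stand-ins, and this is the readable form of the computation, not the fused kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_state = 12, 4                     # toy sequence length and state size
x = rng.standard_normal(N)             # one input channel, for readability

# Fixed learned parameters (random stand-ins here); diagonal, negative A.
A = -np.exp(rng.standard_normal(d_state))
w_dt, b_dt = rng.standard_normal(), 0.5          # produce the step size dt
W_B = rng.standard_normal(d_state)               # produce input-dependent B
W_C = rng.standard_normal(d_state)               # produce input-dependent C

# "Selective": dt, B, C are all functions of the current input x[k].
dt = np.logaddexp(0.0, w_dt * x + b_dt)          # softplus -> positive steps
B = np.outer(x, W_B)                             # (N, d_state)
C = np.outer(x, W_C)                             # (N, d_state)

# Discretize: Abar = exp(dt*A); Bbar*x approximated as dt*B*x.
Abar = np.exp(dt[:, None] * A)                   # (N, d_state)
Bx = dt[:, None] * B * x[:, None]                # (N, d_state)

def sequential_scan(Abar, Bx):
    """Reference recurrence: h[k] = Abar[k]*h[k-1] + Bx[k]."""
    h, hs = np.zeros(Abar.shape[1]), []
    for a, bx in zip(Abar, Bx):
        h = a * h + bx
        hs.append(h)
    return np.stack(hs)

def parallel_scan(Abar, Bx):
    """Same result via an associative combine, O(log N) depth:
    (a1, b1) followed by (a2, b2) composes to (a2*a1, a2*b1 + b2)."""
    a, b = Abar.copy(), Bx.copy()
    shift = 1
    while shift < len(a):
        a_new, b_new = a.copy(), b.copy()
        a_new[shift:] = a[shift:] * a[:-shift]
        b_new[shift:] = a[shift:] * b[:-shift] + b[shift:]
        a, b, shift = a_new, b_new, shift * 2
    return b

h = sequential_scan(Abar, Bx)
assert np.allclose(h, parallel_scan(Abar, Bx))   # both routes agree
y = (C * h).sum(axis=-1)                         # readout y[k] = C[k]·h[k]
print(y.round(3))
```

The associative combine is what lets the sequential-looking recurrence run in O(log N) depth on a GPU.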
Strengths
- Linear complexity: 2x sequence → 2x cost (not 4x).
- Tiny inference memory: state size is constant in sequence length, unlike a KV cache that grows with every token (see the sizing sketch after this list).
- Very long sequences: 32K, 128K, 1M tokens scale linearly.
- Selective mechanism: remembers what matters, forgets noise.
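To make the KV-cache contrast concrete, here is a back-of-the-envelope sizing sketch. Every dimension below (layer count, model width, KV heads, dtype) is an illustrative assumption, not a measurement.

```python
# Back-of-the-envelope decode-memory comparison. All shapes are
# illustrative assumptions, not measurements of any real model.
BYTES = 2                                # fp16 / bf16
layers, d_model = 32, 4096
n_kv_heads, head_dim = 8, 128            # grouped-query attention, assumed
d_state, d_conv, expand = 128, 4, 2
d_inner = expand * d_model

def kv_cache_bytes(seq_len: int) -> int:
    # K and V, each (seq_len, n_kv_heads * head_dim), per layer
    return 2 * seq_len * n_kv_heads * head_dim * layers * BYTES

def mamba_state_bytes() -> int:
    # SSM state (d_inner, d_state) + conv buffer (d_inner, d_conv),
    # per layer; note: independent of sequence length
    return d_inner * (d_state + d_conv) * layers * BYTES

for n in (4_096, 32_768, 1_048_576):
    print(f"{n:>9} tokens: KV cache {kv_cache_bytes(n) / 2**30:7.2f} GiB"
          f"  vs  Mamba state {mamba_state_bytes() / 2**20:5.1f} MiB")
```

With these assumed shapes, the KV cache reaches about 128 GiB at 1M tokens while the Mamba state stays at about 66 MiB; the exact numbers depend entirely on the assumptions.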
Weaknesses
- Weaker in-context recall: compressing the whole history into a small fixed state hurts exact item lookup (e.g., needle-in-a-haystack retrieval of a specific earlier token), so hybrid Mamba + Attention stacks are common.
- Newer kernels: less mature optimization than FlashAttention.
- Complex implementation: selective scan CUDA kernels are intricate.
Mamba1 vs Mamba2
- Mamba1 (Gu & Dao, 2023): first selective SSM.
- Mamba2 (Dao & Gu, 2024): SSD (structured state space duality) connects selective SSMs to masked attention through structured (semiseparable) matrices (see the sketch after this list). Larger states, multi-head structure, better hardware utilization. EulerStack default.
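The identity SSD builds on is easy to check numerically on a toy example: the scan's output equals multiplying the inputs by a lower-triangular, data-dependent weight matrix, which is the attention-like view. All values below are random stand-ins and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d_state = 6, 3                              # toy sizes
Abar = rng.uniform(0.5, 1.0, (N, d_state))     # per-token decays (stand-ins)
Bx = rng.standard_normal((N, d_state))         # Bbar[k]*x[k], precomputed
C = rng.standard_normal((N, d_state))

# Recurrent (SSM) view: h[k] = Abar[k]*h[k-1] + Bx[k], y[k] = C[k]·h[k]
h, y_rec = np.zeros(d_state), np.empty(N)
for k in range(N):
    h = Abar[k] * h + Bx[k]
    y_rec[k] = C[k] @ h

# Matrix ("attention") view:
#   y[t] = sum_{s<=t} C[t] · (Abar[s+1] * ... * Abar[t]) · Bx[s]
# i.e. a lower-triangular, data-dependent weight matrix applied to inputs.
y_mat = np.empty(N)
for t in range(N):
    y_mat[t] = sum(
        C[t] @ (np.prod(Abar[s + 1 : t + 1], axis=0) * Bx[s])
        for s in range(t + 1)
    )

assert np.allclose(y_rec, y_mat)               # the two views coincide
```

SSD formalizes and exploits this semiseparable structure for larger states and better hardware utilization; the sketch only verifies the identity numerically.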
Real-World Use
- Mamba (Gu & Dao, 2023) — standalone SSM matching Transformer quality
- Jamba-1.5 (AI21, 2024) — Mamba + Attention + MoE at 398B/94B active
- Zamba-2 (Zyphra, 2024) — Mamba-heavy with shared attention
- Samba (Microsoft, 2024) — Mamba + Sliding Window Attention
- Falcon-Mamba (TII, 2024) — 7B pure Mamba
- Codestral Mamba (Mistral, 2024) — Mamba-based code model
When Is Mamba Good?
| Scenario | Mamba quality |
|---|---|
| Long doc summarization (≥ 32K) | ★★★★★ linear cost |
| Real-time streaming | ★★★★★ tiny state |
| Time series / DNA | ★★★★★ long sequential input |
| Coding (exact symbol match) | ★★★ (hybrid with Attention recommended) |
| Short chat (≤ 4K) | ★★★ (less benefit) |
EulerStack YAML
```yaml
layer_templates:
  mamba_layer:
    mixer:
      type: mamba
      mamba:
        variant: mamba2   # SSD formulation (Dao & Gu, 2024)
        d_state: 128      # size of the SSM state per channel
        d_conv: 4         # local depthwise-conv window
        expand: 2         # d_inner = expand * d_model
    ffn:
      type: gated_mlp
      activation: swiglu
    state:
      ssm_state: true
```
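To connect this template to the "tiny inference memory" point above: assuming the reference Mamba cache layout (an SSM state of shape (d_inner, d_state) plus a rolling conv buffer of shape (d_inner, d_conv)) and an illustrative d_model of 4096, which the template does not specify, the per-layer decode state works out to roughly 2 MiB.

```python
# Per-layer decode-time state implied by the template above.
# Assumes the reference Mamba cache layout and fp16; d_model is illustrative.
d_model = 4096                        # assumption: not set by the template
d_state, d_conv, expand = 128, 4, 2   # values from the YAML
d_inner = expand * d_model            # 8192

ssm_state = d_inner * d_state         # recurrent state, (d_inner, d_state)
conv_state = d_inner * d_conv         # rolling conv buffer, (d_inner, d_conv)
print(f"{(ssm_state + conv_state) * 2 / 2**20:.1f} MiB per layer")  # ~2.1
```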
Papers
- Gu & Dao, 2023. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces."
- Dao & Gu, 2024. "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." (Mamba2)
- Lieber et al., 2024. "Jamba: A Hybrid Transformer-Mamba Language Model."
- Glorioso et al., 2024. "Zamba: A Compact 7B SSM Hybrid Model."