2. Mamba in detail
One-Line Summary
"Maintains a dynamic state per token, sweeping left-to-right once, with O(N) cost."
How Does It Work?
Where Attention "looks at all past tokens simultaneously," Mamba "carries a small memory (state) and sweeps left to right, updating it" — like an RNN. But unlike a classical RNN/LSTM, Mamba's state update rule changes dynamically with each input (the selective SSM). This is what makes it powerful.
Core idea:
1. State Space Model (SSM): discretize the continuous system h'(t) = A·h(t) + B·x(t), y(t) = C·h(t) to
h[k] = Ā·h[k-1] + B̄·x[k], y[k] = C·h[k].
2. Selective: B, C, and the step size Δ are computed from the input x (not fixed); and since Ā = exp(Δ·A), the state transition itself becomes input-dependent.
3. Parallel scan: the recurrence looks inherently sequential, but its update is associative, so the left-to-right dependency can be computed with a parallel prefix scan on GPU (see the sketch below).
4. Hardware-aware: fused CUDA kernels keep the recurrent state in fast on-chip SRAM instead of materializing it in slower HBM.
Result: O(N) linear scaling, fixed-size state (d_state), replaces KV cache with a small per-layer state at inference.
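Below is a minimal NumPy sketch of steps 1–3, assuming a single input channel, a diagonal A, and the common B̄·x ≈ Δ·B·x simplification; all parameter values are random stand-ins, and this is the readable form of the computation, not the fused kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_state = 12, 4                     # toy sequence length and state size
x = rng.standard_normal(N)             # one input channel, for readability

# Fixed learned parameters (random stand-ins here); diagonal, negative A.
A = -np.exp(rng.standard_normal(d_state))
w_dt, b_dt = rng.standard_normal(), 0.5          # produce the step size dt
W_B = rng.standard_normal(d_state)               # produce input-dependent B
W_C = rng.standard_normal(d_state)               # produce input-dependent C

# "Selective": dt, B, C are all functions of the current input x[k].
dt = np.logaddexp(0.0, w_dt * x + b_dt)          # softplus -> positive steps
B = np.outer(x, W_B)                             # (N, d_state)
C = np.outer(x, W_C)                             # (N, d_state)

# Discretize: Abar = exp(dt*A); Bbar*x approximated as dt*B*x.
Abar = np.exp(dt[:, None] * A)                   # (N, d_state)
Bx = dt[:, None] * B * x[:, None]                # (N, d_state)

def sequential_scan(Abar, Bx):
    """Reference recurrence: h[k] = Abar[k]*h[k-1] + Bx[k]."""
    h, hs = np.zeros(Abar.shape[1]), []
    for a, bx in zip(Abar, Bx):
        h = a * h + bx
        hs.append(h)
    return np.stack(hs)

def parallel_scan(Abar, Bx):
    """Same result via an associative combine, O(log N) depth:
    (a1, b1) followed by (a2, b2) composes to (a2*a1, a2*b1 + b2)."""
    a, b = Abar.copy(), Bx.copy()
    shift = 1
    while shift < len(a):
        a_new, b_new = a.copy(), b.copy()
        a_new[shift:] = a[shift:] * a[:-shift]
        b_new[shift:] = a[shift:] * b[:-shift] + b[shift:]
        a, b, shift = a_new, b_new, shift * 2
    return b

h = sequential_scan(Abar, Bx)
assert np.allclose(h, parallel_scan(Abar, Bx))   # both routes agree
y = (C * h).sum(axis=-1)                         # readout y[k] = C[k]·h[k]
print(y.round(3))
```

The associative combine is what lets the sequential-looking recurrence run in O(log N) depth on a GPU.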
Strengths
- Linear complexity: 2x sequence → 2x cost (not 4x).
- Tiny inference memory: state size is constant in sequence length, unlike a KV cache that grows with every token (see the sizing sketch after this list).
- Very long sequences: 32K, 128K, 1M tokens scale linearly.
- Selective mechanism: remembers what matters, forgets noise.
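To make the KV-cache contrast concrete, here is a back-of-the-envelope sizing sketch. Every dimension below (layer count, model width, KV heads, dtype) is an illustrative assumption, not a measurement.

```python
# Back-of-the-envelope decode-memory comparison. All shapes are
# illustrative assumptions, not measurements of any real model.
BYTES = 2                                # fp16 / bf16
layers, d_model = 32, 4096
n_kv_heads, head_dim = 8, 128            # grouped-query attention, assumed
d_state, d_conv, expand = 128, 4, 2
d_inner = expand * d_model

def kv_cache_bytes(seq_len: int) -> int:
    # K and V, each (seq_len, n_kv_heads * head_dim), per layer
    return 2 * seq_len * n_kv_heads * head_dim * layers * BYTES

def mamba_state_bytes() -> int:
    # SSM state (d_inner, d_state) + conv buffer (d_inner, d_conv),
    # per layer; note: independent of sequence length
    return d_inner * (d_state + d_conv) * layers * BYTES

for n in (4_096, 32_768, 1_048_576):
    print(f"{n:>9} tokens: KV cache {kv_cache_bytes(n) / 2**30:7.2f} GiB"
          f"  vs  Mamba state {mamba_state_bytes() / 2**20:5.1f} MiB")
```

With these assumed shapes, the KV cache reaches about 128 GiB at 1M tokens while the Mamba state stays at about 66 MiB; the exact numbers depend entirely on the assumptions.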
Weaknesses
- Weaker in-context recall: compressing the whole history into a small fixed state hurts exact item lookup (e.g., needle-in-a-haystack retrieval of a specific earlier token), so hybrid Mamba + Attention stacks are common.
- Newer kernels: less mature optimization than FlashAttention.
- Complex implementation: selective scan CUDA kernels are intricate.
Mamba1 vs Mamba2
- Mamba1 (Gu & Dao, 2023): first selective SSM.
- Mamba2 (Dao & Gu, 2024): SSD (structured state space duality) connects selective SSMs to masked attention through structured (semiseparable) matrices (see the sketch after this list). Larger states, multi-head structure, better hardware utilization. EulerStack default.
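The identity SSD builds on is easy to check numerically on a toy example: the scan's output equals multiplying the inputs by a lower-triangular, data-dependent weight matrix, which is the attention-like view. All values below are random stand-ins and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d_state = 6, 3                              # toy sizes
Abar = rng.uniform(0.5, 1.0, (N, d_state))     # per-token decays (stand-ins)
Bx = rng.standard_normal((N, d_state))         # Bbar[k]*x[k], precomputed
C = rng.standard_normal((N, d_state))

# Recurrent (SSM) view: h[k] = Abar[k]*h[k-1] + Bx[k], y[k] = C[k]·h[k]
h, y_rec = np.zeros(d_state), np.empty(N)
for k in range(N):
    h = Abar[k] * h + Bx[k]
    y_rec[k] = C[k] @ h

# Matrix ("attention") view:
#   y[t] = sum_{s<=t} C[t] · (Abar[s+1] * ... * Abar[t]) · Bx[s]
# i.e. a lower-triangular, data-dependent weight matrix applied to inputs.
y_mat = np.empty(N)
for t in range(N):
    y_mat[t] = sum(
        C[t] @ (np.prod(Abar[s + 1 : t + 1], axis=0) * Bx[s])
        for s in range(t + 1)
    )

assert np.allclose(y_rec, y_mat)               # the two views coincide
```

SSD formalizes and exploits this semiseparable structure for larger states and better hardware utilization; the sketch only verifies the identity numerically.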
Real-World Use
- Mamba (Gu & Dao, 2023) — standalone SSM matching Transformer quality
- Jamba-1.5 (AI21, 2024) — Mamba + Attention + MoE at 398B/94B active
- Zamba-2 (Zyphra, 2024) — Mamba-heavy with shared attention
- Samba (Microsoft, 2024) — Mamba + Sliding Window Attention
- Falcon-Mamba (TII, 2024) — 7B pure Mamba
- Codestral Mamba (Mistral, 2024) — Mamba-based code model
When Is Mamba Good?
| Scenario | Mamba quality |
|---|---|
| Long doc summarization (≥ 32K) | ★★★★★ linear cost |
| Real-time streaming | ★★★★★ tiny state |
| Time series / DNA | ★★★★★ long sequential input |
| Coding (exact symbol match) | ★★★ (hybrid with Attention recommended) |
| Short chat (≤ 4K) | ★★★ (less benefit) |
EulerStack YAML
```yaml
layer_templates:
  mamba_layer:
    mixer:
      type: mamba
      mamba:
        variant: mamba2   # SSD formulation (Dao & Gu, 2024)
        d_state: 128      # size of the SSM state per channel
        d_conv: 4         # local depthwise-conv window
        expand: 2         # d_inner = expand * d_model
    ffn:
      type: gated_mlp
      activation: swiglu
    state:
      ssm_state: true
```
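To connect this template to the "tiny inference memory" point above: assuming the reference Mamba cache layout (an SSM state of shape (d_inner, d_state) plus a rolling conv buffer of shape (d_inner, d_conv)) and an illustrative d_model of 4096, which the template does not specify, the per-layer decode state works out to roughly 2 MiB.

```python
# Per-layer decode-time state implied by the template above.
# Assumes the reference Mamba cache layout and fp16; d_model is illustrative.
d_model = 4096                        # assumption: not set by the template
d_state, d_conv, expand = 128, 4, 2   # values from the YAML
d_inner = expand * d_model            # 8192

ssm_state = d_inner * d_state         # recurrent state, (d_inner, d_state)
conv_state = d_inner * d_conv         # rolling conv buffer, (d_inner, d_conv)
print(f"{(ssm_state + conv_state) * 2 / 2**20:.1f} MiB per layer")  # ~2.1
```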
Papers
- Gu & Dao, 2023. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces."
- Dao & Gu, 2024. "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." (Mamba2)
- Lieber et al., 2024. "Jamba: A Hybrid Transformer-Mamba Language Model."
- Glorioso et al., 2024. "Zamba: A Compact 7B SSM Hybrid Model."