3. RetNet in detail
One-Line Summary
"Train in parallel like a Transformer, infer in O(1) like an RNN, handle long sequences in chunks — all three modes compute the same result."
How Does It Work?
Transformer vs RNN tradeoffs, one line each:
- Transformer: parallel training (fast) / per-step inference cost and KV-cache memory grow with sequence length (slow)
- RNN: sequential training (slow) / O(1) state update (fast)
RetNet's bet: "Do both well." The key is the retention operator — replace Attention's softmax with fixed exponential decay. The math then supports three equivalent computation modes:
- Parallel mode: like Transformer, (Q·Kᵀ)·V — used for training.
- Recurrent mode: like RNN, one state update per step — used for inference.
- Chunkwise mode: split sequence into chunks, parallel inside chunks, recurrent across chunks — training + long-context inference compromise.
Each head gets a different decay rate γ, giving multi-scale temporal resolution.
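The claim that the modes compute the same result can be checked directly for a single head. A minimal NumPy sketch (illustrative only; the full retention operator also applies a RoPE-style rotation and a normalization, both omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 4          # sequence length, head dimension
gamma = 0.9          # one head with a fixed decay; RetNet gives each head its own γ
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# Parallel mode (training): (Q·Kᵀ ∘ D)·V with causal decay mask D[n, m] = γ^(n-m)
n, m = np.arange(N)[:, None], np.arange(N)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)
out_parallel = (Q @ K.T * D) @ V

# Recurrent mode (inference): one d×d state update per step, O(1) in sequence length
S = np.zeros((d, d))
out_recurrent = np.zeros((N, d))
for t in range(N):
    S = gamma * S + np.outer(K[t], V[t])   # S_t = γ·S_{t-1} + K_tᵀ·V_t
    out_recurrent[t] = Q[t] @ S            # O_t = Q_t·S_t

assert np.allclose(out_parallel, out_recurrent)
```

The assertion holds because expanding the recurrence gives O_n = Σ_{m≤n} γ^(n-m) (Q_n·K_mᵀ)·V_m, which is exactly the masked parallel product.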
Strengths
- Parallel training: Transformer-speed training.
- O(1) inference: RNN-like constant memory + time per step.
- Chunkwise mode: parallel + recurrent hybrid for long context.
- Train/infer consistency: all three modes produce identical outputs.
- RoPE-compatible: natural positional encoding integration.
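The RoPE compatibility can be sketched as rotating Q and K by position-dependent angles before applying the decay mask: the rotation encodes relative position, the decay encodes recency. A NumPy sketch (the frequency schedule follows standard RoPE and is an assumption here, not taken from the RetNet paper):

```python
import numpy as np

def rope(x, theta):
    """Rotate consecutive feature pairs of x by position-dependent angles n·theta."""
    N, d = x.shape
    ang = np.arange(N)[:, None] * theta[None, :]       # (N, d//2) angles
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[:, 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

rng = np.random.default_rng(2)
N, d = 6, 4
theta = 1.0 / (10000 ** (np.arange(d // 2) / (d // 2)))  # standard RoPE frequencies
gamma = 0.9
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Rotation handles relative position; the γ mask handles exponential recency decay
Qr, Kr = rope(Q, theta), rope(K, theta)
n, m = np.arange(N)[:, None], np.arange(N)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)
out = (Qr @ Kr.T * D) @ V
```

Because the rotation is a per-position multiplicative map, the parallel/recurrent equivalence is preserved when it is folded into the state update.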
Weaknesses
- Quality cost of removing softmax: the original RetNet slightly underperformed a parameter-matched Transformer (recent variants have closed this gap).
- Kernel maturity: fewer FlashAttention-level optimized kernels exist.
- Chunk size tuning: chunk_size governs the throughput/memory tradeoff in chunkwise mode.
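To make the chunk_size tradeoff concrete, chunkwise retention can be sketched as the parallel form within each chunk plus a carried state across chunks. A minimal NumPy sketch (single head, sequence length divisible by the chunk size, rotation and normalization omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, B = 8, 4, 4     # sequence length, head dim, chunk size (N must divide by B here)
gamma = 0.9
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# Reference: full parallel retention over the whole sequence
n, m = np.arange(N)[:, None], np.arange(N)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)
ref = (Q @ K.T * D) @ V

# Chunkwise: parallel inside each chunk, recurrent state R across chunks
i = np.arange(B)
Db = np.where(i[:, None] >= i[None, :], gamma ** (i[:, None] - i[None, :]), 0.0)
R = np.zeros((d, d))
out = np.zeros((N, d))
for s in range(0, N, B):
    Qc, Kc, Vc = Q[s:s+B], K[s:s+B], V[s:s+B]
    inner = (Qc @ Kc.T * Db) @ Vc                      # within-chunk, fully parallel
    cross = (gamma ** (i + 1))[:, None] * (Qc @ R)     # contribution from past chunks
    out[s:s+B] = inner + cross
    R = gamma ** B * R + Kc.T @ ((gamma ** (B - 1 - i))[:, None] * Vc)  # carry state

assert np.allclose(out, ref)
```

A larger chunk_size does more work in the parallel branch (more memory, higher throughput); a smaller one leans on the recurrent state. The output is the same either way.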
Real-World Use
- RetNet (Sun et al., Microsoft, 2023) — original paper, 6.7B model.
- Inspiration in Jamba-1.5 — chunkwise retention concept informed hybrid designs.
- Often used as the "third axis" in research hybrid stacks alongside attention/mamba.
Attention / Mamba / RetNet Side-by-Side
| Aspect | Attention | Mamba | RetNet |
|---|---|---|---|
| Training | O(N²) | O(N) | O(N) (parallel mode) |
| Inference | O(N) per step + growing KV cache | O(1) state | O(1) state (recurrent mode) |
| Positional | External RoPE | Inside conv | RoPE + exp decay |
| Strength | Exact ICL recall | Long context summary | Train+infer both efficient |
When Is RetNet Good?
| Scenario | RetNet quality |
|---|---|
| Fast training + long context inference | ★★★★★ chunkwise mode |
| Real-time streaming (O(1) infer) | ★★★★★ recurrent mode |
| Third axis in hybrid stack | ★★★★★ complements Attn/Mamba |
| Pure RetNet for SOTA | ★★★ (hybrids usually win) |
EulerStack YAML
```yaml
layer_templates:
  retnet_layer:
    mixer:
      type: retnet
      retnet:
        chunkwise: true
        chunk_size: 128
        rope: true
    ffn:
      type: gated_mlp
      activation: swiglu
    state:
      kv_cache: true
```
Tri-mixer (attention + mamba + retnet) schedule:
```yaml
layer_schedule:
  - template: mamba_layer
    repeat: 1
  - template: retnet_layer
    repeat: 1
  - template: attn_layer
    repeat: 1
  # ... → three inductive biases cooperating
```
EulerStack's arch_expert_research preset uses exactly this pattern.
Papers
- Sun et al., 2023. "Retentive Network: A Successor to Transformer for Large Language Models." Microsoft Research.
- Arora et al., 2024. "Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff." (analysis of "mixture of sequence models")