3. RetNet in detail
One-Line Summary
"Train in parallel like a Transformer, infer in O(1) like an RNN, handle long sequences in chunks — all three modes compute the same result."
How Does It Work?
Transformer vs RNN tradeoffs, one line each:
- Transformer: parallel training (fast) / per-step inference cost and KV-cache memory grow with sequence length (slow)
- RNN: sequential training (slow) / O(1) state update (fast)
RetNet's bet: "Do both well." The key is the retention operator — replace Attention's softmax with fixed exponential decay. The math then supports three equivalent computation modes:
- Parallel mode: like Transformer, (Q·Kᵀ)·V — used for training.
- Recurrent mode: like RNN, one state update per step — used for inference.
- Chunkwise mode: split sequence into chunks, parallel inside chunks, recurrent across chunks — training + long-context inference compromise.
Each head gets a different decay rate γ, giving multi-scale temporal resolution.
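The claim that the modes compute the same result can be checked directly for a single head. A minimal NumPy sketch (illustrative only; the full retention operator also applies a RoPE-style rotation and a normalization, both omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 4          # sequence length, head dimension
gamma = 0.9          # one head with a fixed decay; RetNet gives each head its own γ
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# Parallel mode (training): (Q·Kᵀ ∘ D)·V with causal decay mask D[n, m] = γ^(n-m)
n, m = np.arange(N)[:, None], np.arange(N)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)
out_parallel = (Q @ K.T * D) @ V

# Recurrent mode (inference): one d×d state update per step, O(1) in sequence length
S = np.zeros((d, d))
out_recurrent = np.zeros((N, d))
for t in range(N):
    S = gamma * S + np.outer(K[t], V[t])   # S_t = γ·S_{t-1} + K_tᵀ·V_t
    out_recurrent[t] = Q[t] @ S            # O_t = Q_t·S_t

assert np.allclose(out_parallel, out_recurrent)
```

The assertion holds because expanding the recurrence gives O_n = Σ_{m≤n} γ^(n-m) (Q_n·K_mᵀ)·V_m, which is exactly the masked parallel product.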
Strengths
- Parallel training: Transformer-speed training.
- O(1) inference: RNN-like constant memory + time per step.
- Chunkwise mode: parallel + recurrent hybrid for long context.
- Train/infer consistency: all three modes produce identical outputs.
- RoPE-compatible: natural positional encoding integration.
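The RoPE compatibility can be sketched as rotating Q and K by position-dependent angles before applying the decay mask: the rotation encodes relative position, the decay encodes recency. A NumPy sketch (the frequency schedule follows standard RoPE and is an assumption here, not taken from the RetNet paper):

```python
import numpy as np

def rope(x, theta):
    """Rotate consecutive feature pairs of x by position-dependent angles n·theta."""
    N, d = x.shape
    ang = np.arange(N)[:, None] * theta[None, :]       # (N, d//2) angles
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[:, 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

rng = np.random.default_rng(2)
N, d = 6, 4
theta = 1.0 / (10000 ** (np.arange(d // 2) / (d // 2)))  # standard RoPE frequencies
gamma = 0.9
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Rotation handles relative position; the γ mask handles exponential recency decay
Qr, Kr = rope(Q, theta), rope(K, theta)
n, m = np.arange(N)[:, None], np.arange(N)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)
out = (Qr @ Kr.T * D) @ V
```

Because the rotation is a per-position multiplicative map, the parallel/recurrent equivalence is preserved when it is folded into the state update.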
Weaknesses
- Quality cost of removing softmax: the original RetNet slightly underperformed a parameter-matched Transformer (recent variants have closed this gap).
- Kernel maturity: fewer FlashAttention-level optimized kernels exist.
- Chunk size tuning: chunk_size governs the throughput/memory tradeoff in chunkwise mode.
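To make the chunk_size tradeoff concrete, chunkwise retention can be sketched as the parallel form within each chunk plus a carried state across chunks. A minimal NumPy sketch (single head, sequence length divisible by the chunk size, rotation and normalization omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, B = 8, 4, 4     # sequence length, head dim, chunk size (N must divide by B here)
gamma = 0.9
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# Reference: full parallel retention over the whole sequence
n, m = np.arange(N)[:, None], np.arange(N)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)
ref = (Q @ K.T * D) @ V

# Chunkwise: parallel inside each chunk, recurrent state R across chunks
i = np.arange(B)
Db = np.where(i[:, None] >= i[None, :], gamma ** (i[:, None] - i[None, :]), 0.0)
R = np.zeros((d, d))
out = np.zeros((N, d))
for s in range(0, N, B):
    Qc, Kc, Vc = Q[s:s+B], K[s:s+B], V[s:s+B]
    inner = (Qc @ Kc.T * Db) @ Vc                      # within-chunk, fully parallel
    cross = (gamma ** (i + 1))[:, None] * (Qc @ R)     # contribution from past chunks
    out[s:s+B] = inner + cross
    R = gamma ** B * R + Kc.T @ ((gamma ** (B - 1 - i))[:, None] * Vc)  # carry state

assert np.allclose(out, ref)
```

A larger chunk_size does more work in the parallel branch (more memory, higher throughput); a smaller one leans on the recurrent state. The output is the same either way.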
Real-World Use
- RetNet (Sun et al., Microsoft, 2023) — original paper, 6.7B model.
- Inspiration in Jamba-1.5 — chunkwise retention concept informed hybrid designs.
- Often used as the "third axis" in research hybrid stacks alongside attention/mamba.
Attention / Mamba / RetNet Side-by-Side
| Aspect | Attention | Mamba | RetNet |
|---|---|---|---|
| Training | O(N²) | O(N) | O(N) (parallel mode) |
| Inference | O(N) per step + growing KV cache | O(1) state | O(1) state (recurrent mode) |
| Positional | External RoPE | Inside conv | RoPE + exp decay |
| Strength | Exact ICL recall | Long context summary | Train+infer both efficient |
When Is RetNet Good?
| Scenario | RetNet quality |
|---|---|
| Fast training + long context inference | ★★★★★ chunkwise mode |
| Real-time streaming (O(1) infer) | ★★★★★ recurrent mode |
| Third axis in hybrid stack | ★★★★★ complements Attn/Mamba |
| Pure RetNet for SOTA | ★★★ (hybrids usually win) |
EulerStack YAML
```yaml
layer_templates:
  retnet_layer:
    mixer:
      type: retnet
      retnet:
        chunkwise: true
        chunk_size: 128
        rope: true
    ffn:
      type: gated_mlp
      activation: swiglu
    state:
      kv_cache: true
```
Tri-mixer (attention + mamba + retnet) schedule:
```yaml
layer_schedule:
  - template: mamba_layer
    repeat: 1
  - template: retnet_layer
    repeat: 1
  - template: attn_layer
    repeat: 1
  # ... → three inductive biases cooperating
```
EulerStack's arch_expert_research preset uses exactly this pattern.
Papers
- Sun et al., 2023. "Retentive Network: A Successor to Transformer for Large Language Models." Microsoft Research.
- Arora et al., 2024. "Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff." (analysis of "mixture of sequence models")