3. RetNet in detail

One-Line Summary

"Train in parallel like a Transformer, infer in O(1) like an RNN, handle long sequences in chunks — all three modes compute the same result."

How Does It Work?

Transformer vs RNN tradeoffs in one line: Transformers train in parallel but pay a per-token inference cost that grows with context (the KV cache reaches O(N)); RNNs infer in O(1) per step but cannot train in parallel.

RetNet's bet: "do both well." The key is the retention operator: it replaces Attention's softmax with a fixed exponential decay. The math then supports three equivalent computation modes:

  1. Parallel mode: like a Transformer, (Q·Kᵀ ⊙ D)·V, where D is a causal decay mask with D[n,m] = γⁿ⁻ᵐ for n ≥ m; used for training.
  2. Recurrent mode: like an RNN, one constant-size state update per step (Sₙ = γ·Sₙ₋₁ + Kₙᵀ·Vₙ); used for inference.
  3. Chunkwise mode: split the sequence into chunks, parallel inside each chunk, recurrent across chunks; a compromise for training and long-context inference.
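The equivalence of the three modes can be checked directly. Below is a minimal NumPy sketch for a single head with a scalar decay γ; it omits RetNet's per-head decays, xPos-style rotation, and group norm, and all function names are illustrative, not EulerStack APIs:

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel (training) mode: (Q K^T * D) V with decay mask D[n, m] = gamma^(n-m) for n >= m."""
    N = Q.shape[0]
    n, m = np.arange(N)[:, None], np.arange(N)[None, :]
    D = np.where(n >= m, gamma ** (n - m), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent (inference) mode: one O(1) state update per step, S_n = gamma*S_{n-1} + K_n^T V_n."""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)   # constant-size state update
        out.append(q @ S)
    return np.stack(out)

def retention_chunkwise(Q, K, V, gamma, B):
    """Chunkwise mode: parallel inside each chunk of size B, recurrent state across chunks."""
    N = Q.shape[0]
    S = np.zeros((K.shape[1], V.shape[1]))   # cross-chunk state
    out = np.empty((N, V.shape[1]))
    for s in range(0, N, B):
        q, k, v = Q[s:s+B], K[s:s+B], V[s:s+B]
        L = q.shape[0]
        n, m = np.arange(L)[:, None], np.arange(L)[None, :]
        D = np.where(n >= m, gamma ** (n - m), 0.0)
        inner = (q @ k.T * D) @ v                                  # within-chunk, parallel
        cross = (gamma ** (np.arange(L) + 1))[:, None] * (q @ S)   # contribution of earlier chunks
        out[s:s+L] = inner + cross
        # decay the old state across this chunk, then absorb the chunk's keys/values
        S = gamma ** L * S + k.T @ ((gamma ** (L - 1 - np.arange(L)))[:, None] * v)
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 12, 4))
gamma = 0.9
out_p = retention_parallel(Q, K, V, gamma)
out_r = retention_recurrent(Q, K, V, gamma)
out_c = retention_chunkwise(Q, K, V, gamma, B=4)
assert np.allclose(out_p, out_r) and np.allclose(out_p, out_c)
```

All three functions produce the same output up to floating-point error; they differ only in how the γⁿ⁻ᵐ decay is factored across the computation.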

Each head gets a different decay rate γ, giving multi-scale temporal resolution.

Strengths

Weaknesses

Real-World Use

Attention / Mamba / RetNet Side-by-Side

| Aspect     | Attention                        | Mamba                | RetNet                        |
|------------|----------------------------------|----------------------|-------------------------------|
| Training   | O(N²)                            | O(N)                 | O(N) (parallel mode)          |
| Inference  | O(N)/token, KV cache grows       | O(1) state           | O(1) state (recurrent mode)   |
| Positional | External RoPE                    | Inside conv          | RoPE + exp decay              |
| Strength   | Exact ICL recall                 | Long-context summary | Efficient train and infer     |

When Is RetNet Good?

| Scenario                                | RetNet quality                  |
|-----------------------------------------|---------------------------------|
| Fast training + long-context inference  | ★★★★★ (chunkwise mode)          |
| Real-time streaming (O(1) inference)    | ★★★★★ (recurrent mode)          |
| Third axis in a hybrid stack            | ★★★★★ (complements Attn/Mamba)  |
| Pure RetNet for SOTA                    | ★★★ (hybrids usually win)       |

EulerStack YAML

layer_templates:
  retnet_layer:
    mixer:
      type: retnet
      retnet:
        chunkwise: true
        chunk_size: 128
        rope: true
    ffn:
      type: gated_mlp
      activation: swiglu
    state:
      kv_cache: true

A tri-mixer (attention + mamba + retnet) schedule:

layer_schedule:
  - template: mamba_layer
    repeat: 1
  - template: retnet_layer
    repeat: 1
  - template: attn_layer
    repeat: 1
  # ... → three inductive biases cooperating

EulerStack's arch_expert_research preset uses exactly this pattern.

Papers