1. Attention in detail
One-Line Summary
"Every token looks at every other token, scores their relevance, and blends them as a weighted average."
How Does It Work?
Consider the sentence "I went to Paris. There I ate __". To predict the blank, the model looks at "Paris" as the most relevant past word. Attention formalizes this mathematically:
- Query (Q): "what am I looking for?"
- Key (K): "here's what I am" (index per past token)
- Value (V): the actual payload
- score = Q·Kᵀ / √dₖ (dot product over every query–key pair, scaled by the key dimension)
- softmax → weights
- weights × V = new representation
All pairs means N² dot products — cost grows quadratically with sequence length.
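The steps above can be sketched in a few lines of numpy. This is a minimal illustration, not an optimized kernel; the shapes and the causal mask are assumptions for a decoder-style model:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V: (N, d) arrays, one query/key/value vector per token.
    Returns the (N, d) blended representations.
    """
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)             # (N, N): every token scores every token
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)
    scores[mask] = -np.inf                    # causal: no looking at future tokens
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted average of values

rng = np.random.default_rng(0)
N, d = 6, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (6, 8)
```

Note that the `scores` matrix is exactly the N² pairwise dot products mentioned above; it is this matrix that makes cost grow quadratically with sequence length.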
Strengths
- Perfect information access: any past token, any distance.
- In-context learning (ICL): unmatched ability to learn from examples in the prompt.
- Search/reasoning: exact keyword matching, strongest on such tasks.
- Hardware-optimized: FlashAttention, xFormers, countless kernels.
Weaknesses
- O(N²) compute/memory: 2× sequence length → 4× cost.
- KV cache memory: inference must store all past K,V — memory balloons at long context.
- No position info: attention itself is permutation-invariant, so it requires RoPE or a similar positional encoding.
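To make the KV-cache weakness concrete, here is the arithmetic for a hypothetical 7B-class model (32 layers, 32 KV heads, head dim 128, fp16; illustrative numbers, not EulerStack defaults):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache K and V for every past token (the leading 2)."""
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# 32 layers, 32 KV heads, head_dim 128, fp16, one 32K-token sequence
gb = kv_cache_bytes(32, 32, 128, seq_len=32_768) / 2**30
print(f"{gb:.1f} GiB")  # 16.0 GiB for a single sequence
```

At batch size 8 the same math gives 128 GiB, which is why long-context serving is dominated by KV cache memory rather than compute.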
Modern Improvements Supported by EulerStack v1
- GQA (Grouped-Query Attention): `n_kv_heads < n_heads` shrinks the KV cache (Llama 2/3, Mistral, Qwen). Ainslie et al., 2023.
- Sliding Window: the `window` option restricts attention to the most recent N tokens (Mistral).
- RoPE: rotary positional embedding, a natural relative-position encoding (Su et al., 2021).
- MLA (Multi-head Latent Attention, v1 Phase B2.1): the `latent_dim` option projects K/V through a shared latent (typically `d_model / 2`), then up-projects per head. Shrinks the KV cache by ~50% with near-zero quality loss. Introduced in DeepSeek-V2/V3 (2024). Runtime is Core — forward, backward, and KV cache all work out of the box.
MLA usage guide:
- latent_dim must satisfy 0 < latent_dim < d_model
- Typical starting point: latent_dim = d_model / 2
- Biggest win at long-context (≥ 16K) serving, where KV cache memory is the bottleneck
- See the llm_*_mla presets (0.1B / 0.8B / 2B / 4B / 16B) and arch_advanced_mla for working examples.
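The projection path described above can be sketched in numpy. The weights are random stand-ins for learned matrices, and DeepSeek's production variant also decouples RoPE dimensions, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, head_dim = 1536, 12, 128   # n_heads * head_dim == d_model
latent_dim = d_model // 2                    # 768, the typical starting point

# One shared down-projection, then per-head up-projections for K and V
# (random here; learned in a real model).
W_down = rng.standard_normal((d_model, latent_dim)) * 0.02
W_up_k = rng.standard_normal((latent_dim, n_heads * head_dim)) * 0.02
W_up_v = rng.standard_normal((latent_dim, n_heads * head_dim)) * 0.02

x = rng.standard_normal((10, d_model))       # hidden states for 10 tokens

latent = x @ W_down                          # (10, 768): only this is cached
K = (latent @ W_up_k).reshape(10, n_heads, head_dim)   # rebuilt on the fly
V = (latent @ W_up_v).reshape(10, n_heads, head_dim)

print(latent.shape, K.shape)  # (10, 768) (10, 12, 128)
```

The cache stores one `latent_dim`-sized vector per token instead of full per-head K and V tensors; the up-projections are recomputed at attention time, trading a little extra compute for the memory savings.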
Real-World Use
- Llama 2/3 (Meta) — Attention + GQA + RoPE
- Mistral 7B / Mixtral — Attention + Sliding Window + GQA
- Gemma 2 (Google) — alternating Sliding-Window and global Attention layers
- Qwen 2/3 (Alibaba) — Attention + GQA + RoPE
When Is Attention Good?
| Scenario | Attention quality |
|---|---|
| Short chat (≤ 4K) | ★★★★★ perfect |
| Coding/reasoning | ★★★★★ exact matching |
| Document QA (RAG) | ★★★★★ pinpoint retrieval |
| Very long docs (≥ 32K) | ★★ (KV cache + O(N²) inefficient) |
| Real-time streaming | ★★★ (KV cache required) |
EulerStack YAML
```yaml
layer_templates:
  attn_layer:
    mixer:
      type: attention
      attention:
        qkv_bias: false
        attn_drop: 0.0
        window: null      # global attention
    ffn:
      type: gated_mlp
      activation: swiglu
    state:
      kv_cache: true
```
Sliding-window variant:
```yaml
attention:
  window: 4096            # attend to the last 4096 tokens only
```
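A sketch of the mask this option implies, assuming a causal decoder (the exact mask EulerStack builds may differ):

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where query i may attend key j: causal, and j within the last `window` tokens."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

m = sliding_window_mask(6, window=3)
print(m.astype(int))   # each row has at most 3 ones, all at or before the diagonal
```

Each query attends to at most `window` keys, so attention cost becomes O(N·window) instead of O(N²), and the KV cache can be capped at `window` entries.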
GQA (in the model section):
```yaml
model:
  n_heads: 16
  n_kv_heads: 4           # 16 heads → 4 KV heads (4× compression)
```
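What `n_kv_heads: 4` means mechanically, sketched in numpy (an illustration of KV-head sharing, not EulerStack's actual implementation):

```python
import numpy as np

n_heads, n_kv_heads, head_dim, N = 16, 4, 64, 10
group = n_heads // n_kv_heads          # 4 query heads share each KV head

rng = np.random.default_rng(0)
k = rng.standard_normal((n_kv_heads, N, head_dim))  # only 4 K heads are cached

# Before the usual attention, each cached KV head is broadcast to its
# group of query heads (the same is done for V).
k_full = np.repeat(k, group, axis=0)   # (16, N, head_dim)

print(k_full.shape)   # (16, 10, 64): 16 query heads, 4x smaller cache
```

Query heads stay at full count, so modeling quality is close to standard multi-head attention, while the cached K/V tensors shrink by `n_heads / n_kv_heads`.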
MLA (Phase B2.1):
```yaml
layer_templates:
  mla_attn:
    mixer:
      type: attention
      attention:
        latent_dim: 768   # d_model=1536 / 2 → ~50% KV cache savings
    ffn:
      type: gated_mlp
    state:
      kv_cache: true
```
MLA composes freely with GQA and sliding window (orthogonal dimensions).
Papers
- Vaswani et al., 2017. "Attention Is All You Need." NeurIPS.
- Su et al., 2021. "RoFormer: Enhanced Transformer with Rotary Position Embedding."
- Ainslie et al., 2023. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." EMNLP.
- Beltagy et al., 2020. "Longformer: The Long-Document Transformer." (sliding window)
- Dao et al., 2022. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS.
- DeepSeek-AI, 2024. "DeepSeek-V2 / DeepSeek-V3 Technical Report." (MLA)