1. Attention in detail
One-Line Summary
"Every token looks at every other token, scores their relevance, and blends them as a weighted average."
How Does It Work?
Consider the sentence "I went to Paris. There I ate __". To predict the blank, the model looks at "Paris" as the most relevant past word. Attention formalizes this mathematically:
- Query (Q): "what am I looking for?"
- Key (K): "here's what I am" (index per past token)
- Value (V): the actual payload
- score = Q·Kᵀ / √dₖ (dot product over every query–key pair, scaled by the key dimension)
- softmax → weights
- weights × V = new representation
All pairs means N² dot products — cost grows quadratically with sequence length.
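The steps above can be sketched in a few lines of numpy. This is a minimal illustration, not an optimized kernel; the shapes and the causal mask are assumptions for a decoder-style model:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V: (N, d) arrays, one query/key/value vector per token.
    Returns the (N, d) blended representations.
    """
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)             # (N, N): every token scores every token
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)
    scores[mask] = -np.inf                    # causal: no looking at future tokens
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted average of values

rng = np.random.default_rng(0)
N, d = 6, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (6, 8)
```

Note that the `scores` matrix is exactly the N² pairwise dot products mentioned above; it is this matrix that makes cost grow quadratically with sequence length.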
Strengths
- Perfect information access: any past token, any distance.
- In-context learning (ICL): unmatched ability to learn from examples in the prompt.
- Search/reasoning: exact keyword matching, strongest on such tasks.
- Hardware-optimized: FlashAttention, xFormers, countless kernels.
Weaknesses
- O(N²) compute/memory: 2× sequence length → 4× cost.
- KV cache memory: inference must store all past K,V — memory balloons at long context.
- No position info: attention itself is permutation-invariant, so it requires RoPE or a similar positional encoding.
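To make the KV-cache weakness concrete, here is the arithmetic for a hypothetical 7B-class model (32 layers, 32 KV heads, head dim 128, fp16; illustrative numbers, not EulerStack defaults):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache K and V for every past token (the leading 2)."""
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# 32 layers, 32 KV heads, head_dim 128, fp16, one 32K-token sequence
gb = kv_cache_bytes(32, 32, 128, seq_len=32_768) / 2**30
print(f"{gb:.1f} GiB")  # 16.0 GiB for a single sequence
```

At batch size 8 the same math gives 128 GiB, which is why long-context serving is dominated by KV cache memory rather than compute.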
Modern Improvements Supported by EulerStack v1
- GQA (Grouped-Query Attention): `n_kv_heads < n_heads` shrinks the KV cache (Llama 2/3, Mistral, Qwen). Ainslie et al., 2023.
- Sliding Window: the `window` option restricts attention to the most recent N tokens (Mistral).
- RoPE: rotary positional embedding, a natural relative-position encoding (Su et al., 2021).
- MLA (Multi-head Latent Attention, v1 Phase B2.1): the `latent_dim` option projects K/V through a shared latent (typically `d_model / 2`), then up-projects per head. Shrinks the KV cache by ~50% with near-zero quality loss. Introduced in DeepSeek-V2/V3 (2024). Runtime is Core — forward, backward, and KV cache all work out of the box.
MLA usage guide:
- latent_dim must satisfy 0 < latent_dim < d_model
- Typical starting point: latent_dim = d_model / 2
- Biggest win at long-context (≥ 16K) serving, where KV cache memory is the bottleneck
- See the llm_*_mla presets (0.1B / 0.8B / 2B / 4B / 16B) and arch_advanced_mla for working examples.
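The projection path described above can be sketched in numpy. The weights are random stand-ins for learned matrices, and DeepSeek's production variant also decouples RoPE dimensions, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, head_dim = 1536, 12, 128   # n_heads * head_dim == d_model
latent_dim = d_model // 2                    # 768, the typical starting point

# One shared down-projection, then per-head up-projections for K and V
# (random here; learned in a real model).
W_down = rng.standard_normal((d_model, latent_dim)) * 0.02
W_up_k = rng.standard_normal((latent_dim, n_heads * head_dim)) * 0.02
W_up_v = rng.standard_normal((latent_dim, n_heads * head_dim)) * 0.02

x = rng.standard_normal((10, d_model))       # hidden states for 10 tokens

latent = x @ W_down                          # (10, 768): only this is cached
K = (latent @ W_up_k).reshape(10, n_heads, head_dim)   # rebuilt on the fly
V = (latent @ W_up_v).reshape(10, n_heads, head_dim)

print(latent.shape, K.shape)  # (10, 768) (10, 12, 128)
```

The cache stores one `latent_dim`-sized vector per token instead of full per-head K and V tensors; the up-projections are recomputed at attention time, trading a little extra compute for the memory savings.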
Real-World Use
- Llama 2/3 (Meta) — Attention + GQA + RoPE
- Mistral 7B / Mixtral — Attention + Sliding Window + GQA
- Gemma 2 (Google) — alternating Sliding-Window and global Attention layers
- Qwen 2/3 (Alibaba) — Attention + GQA + RoPE
When Is Attention Good?
| Scenario | Attention quality |
|---|---|
| Short chat (≤ 4K) | ★★★★★ perfect |
| Coding/reasoning | ★★★★★ exact matching |
| Document QA (RAG) | ★★★★★ pinpoint retrieval |
| Very long docs (≥ 32K) | ★★ (KV cache + O(N²) inefficient) |
| Real-time streaming | ★★★ (KV cache required) |
EulerStack YAML
```yaml
layer_templates:
  attn_layer:
    mixer:
      type: attention
      attention:
        qkv_bias: false
        attn_drop: 0.0
        window: null      # global attention
    ffn:
      type: gated_mlp
      activation: swiglu
    state:
      kv_cache: true
```
Sliding-window variant:
```yaml
attention:
  window: 4096            # attend to the last 4096 tokens only
```
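A sketch of the mask this option implies, assuming a causal decoder (the exact mask EulerStack builds may differ):

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where query i may attend key j: causal, and j within the last `window` tokens."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

m = sliding_window_mask(6, window=3)
print(m.astype(int))   # each row has at most 3 ones, all at or before the diagonal
```

Each query attends to at most `window` keys, so attention cost becomes O(N·window) instead of O(N²), and the KV cache can be capped at `window` entries.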
GQA (in the model section):
```yaml
model:
  n_heads: 16
  n_kv_heads: 4           # 16 heads → 4 KV heads (4× compression)
```
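What `n_kv_heads: 4` means mechanically, sketched in numpy (an illustration of KV-head sharing, not EulerStack's actual implementation):

```python
import numpy as np

n_heads, n_kv_heads, head_dim, N = 16, 4, 64, 10
group = n_heads // n_kv_heads          # 4 query heads share each KV head

rng = np.random.default_rng(0)
k = rng.standard_normal((n_kv_heads, N, head_dim))  # only 4 K heads are cached

# Before the usual attention, each cached KV head is broadcast to its
# group of query heads (the same is done for V).
k_full = np.repeat(k, group, axis=0)   # (16, N, head_dim)

print(k_full.shape)   # (16, 10, 64): 16 query heads, 4x smaller cache
```

Query heads stay at full count, so modeling quality is close to standard multi-head attention, while the cached K/V tensors shrink by `n_heads / n_kv_heads`.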
MLA (Phase B2.1):
```yaml
layer_templates:
  mla_attn:
    mixer:
      type: attention
      attention:
        latent_dim: 768   # d_model=1536 / 2 → ~50% KV cache savings
    ffn:
      type: gated_mlp
    state:
      kv_cache: true
```
MLA composes freely with GQA and sliding window (orthogonal dimensions).
Papers
- Vaswani et al., 2017. "Attention Is All You Need." NeurIPS.
- Su et al., 2021. "RoFormer: Enhanced Transformer with Rotary Position Embedding."
- Ainslie et al., 2023. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." EMNLP.
- Beltagy et al., 2020. "Longformer: The Long-Document Transformer." (sliding window)
- Dao et al., 2022. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS.
- DeepSeek-AI, 2024. "DeepSeek-V2 / DeepSeek-V3 Technical Report." (MLA)