
1. Attention in detail

One-Line Summary

"Every token looks at every other token, scores their relevance, and blends them as a weighted average."

How Does It Work?

Consider the sentence "I went to Paris. There I ate __". To predict the blank, the model looks at "Paris" as the most relevant past word. Attention formalizes this mathematically:

  1. Query (Q): "what am I looking for?"
  2. Key (K): "here's what I contain" (an index entry per past token)
  3. Value (V): the actual payload to blend
  4. score = Q·Kᵀ (a dot product for every query–key pair)
  5. softmax turns the scores into weights that sum to 1
  6. weights × V = the new representation

Scoring all pairs means N² dot products, so cost grows quadratically with sequence length.
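The six steps above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention for a single head, not EulerStack's implementation; all names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Step 4: score every query against every key -> (N, N) dot products
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Step 5: softmax turns each row of scores into weights summing to 1
    weights = softmax(scores, axis=-1)
    # Step 6: each output token is a weighted average of the values
    return weights @ V

rng = np.random.default_rng(0)
N, d = 4, 8                      # 4 tokens, head dimension 8
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note that `scores` is an N×N matrix: this is exactly where the quadratic cost comes from.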

Strengths

Weaknesses

Modern Improvements Supported by EulerStack v1

MLA usage guide:

- latent_dim must satisfy 0 < latent_dim < d_model
- Typical starting point: latent_dim = d_model / 2
- Biggest win at long-context (≥ 16K) serving where KV cache memory is the bottleneck
- See the llm_*_mla presets (0.1B / 0.8B / 2B / 4B / 16B) and arch_advanced_mla for working examples.

Real-World Use

When Is Attention Good?

Scenario                  Attention quality
Short chat (≤ 4K)         ★★★★★ perfect
Coding/reasoning          ★★★★★ exact matching
Document QA (RAG)         ★★★★★ pinpoint retrieval
Very long docs (≥ 32K)    ★★ (KV cache + O(N²) inefficient)
Real-time streaming       ★★★ (KV cache required)

EulerStack YAML

layer_templates:
  attn_layer:
    mixer:
      type: attention
      attention:
        qkv_bias: false
        attn_drop: 0.0
        window: null         # global attention
    ffn:
      type: gated_mlp
      activation: swiglu
    state:
      kv_cache: true

Sliding-window variant:

      attention:
        window: 4096         # attend to the last 4096 tokens only

GQA (in the model section):

model:
  n_heads: 16
  n_kv_heads: 4              # 16 heads → 4 KV heads (4× compression)
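The 4× compression works because only the 4 KV heads are cached; each KV head is shared by a group of 4 query heads at compute time. A minimal NumPy sketch of that sharing (illustrative only, not EulerStack's code):

```python
import numpy as np

n_heads, n_kv_heads, d_head, N = 16, 4, 8, 4
group = n_heads // n_kv_heads        # 4 query heads per KV head

rng = np.random.default_rng(0)
q = rng.normal(size=(n_heads, N, d_head))
k = rng.normal(size=(n_kv_heads, N, d_head))  # only 4 KV heads are cached
v = rng.normal(size=(n_kv_heads, N, d_head))

# Expand KV heads at compute time; the KV cache itself stays 4x smaller
k_full = np.repeat(k, group, axis=0)          # (16, N, d_head)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_full
print(out.shape)  # (16, 4, 8)
```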

MLA (Phase B2.1):

layer_templates:
  mla_attn:
    mixer:
      type: attention
      attention:
        latent_dim: 768      # d_model=1536 / 2 → ~50 % KV cache savings
    ffn:
      type: gated_mlp
    state:
      kv_cache: true

MLA composes freely with GQA and sliding window (orthogonal dimensions).
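As an illustration, combining the three knobs from the snippets above could look like the following. The template name `mla_gqa_swa` is hypothetical; the keys are the ones shown earlier on this page.

```yaml
layer_templates:
  mla_gqa_swa:                 # hypothetical template name
    mixer:
      type: attention
      attention:
        latent_dim: 768        # MLA
        window: 4096           # sliding window
    ffn:
      type: gated_mlp
    state:
      kv_cache: true

model:
  n_heads: 16
  n_kv_heads: 4                # GQA
```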

Papers