4. Hyena in detail
One-Line Summary
"A very long convolution kernel, generated implicitly by a small network and applied via FFT, mixes the sequence sub-quadratically, capturing long-range patterns without attention."
How Does It Work?
Standard 1D convolution: a kernel of size K lets each token mix with its K neighbors. Hyena's insight: "Make the kernel as long as the full sequence length N so each token sees all past tokens. But don't learn N parameters directly — have a small network generate them implicitly."
Concretely:
- Filter generator: a small MLP over sinusoidal positional features, modulated by an exponential decay window, maps each position t to a filter value h(t).
- FFT convolution: the long filter and input are FFT'd → elementwise multiply → inverse FFT = convolution in O(N log N).
- Gating / multi-order: the Hyena operator interleaves these long convolutions with element-wise gating (multiplication by learned projections of the input); stacking several such stages — the operator's "order" — adds expressiveness.
Result: full-sequence long-range dependencies without Attention, at O(N log N).
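The two core pieces — an implicitly generated full-length filter and a causal FFT convolution — can be sketched in a few lines of NumPy. This is a minimal illustration, not the real Hyena implementation: the feature dimensions, hidden width, decay rate, and tanh nonlinearity below are all illustrative assumptions.

```python
# Minimal sketch of Hyena's core trick: a tiny MLP maps sinusoidal position
# features to a length-N filter, which is applied as a *causal* convolution
# via FFT in O(N log N). All hyperparameters here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N = 1024         # sequence length
D_FEAT = 8       # sinusoidal feature dimension (assumption)
D_HID = 16       # hidden width of the filter-generator MLP (assumption)

def pos_features(n, d=D_FEAT):
    """Sinusoidal features of normalized positions t in [0, 1)."""
    t = np.arange(n)[:, None] / n                       # (n, 1)
    freqs = 2.0 ** np.arange(d // 2)[None, :]           # (1, d/2)
    return np.concatenate([np.sin(2 * np.pi * freqs * t),
                           np.cos(2 * np.pi * freqs * t)], axis=1)  # (n, d)

# Tiny 2-layer MLP: the "implicit" parameterization — O(d*h) params, not O(N)
W1 = rng.normal(size=(D_FEAT, D_HID)) / np.sqrt(D_FEAT)
W2 = rng.normal(size=(D_HID, 1)) / np.sqrt(D_HID)

def make_filter(n, decay=4.0):
    z = pos_features(n)
    h = np.tanh(z @ W1) @ W2                            # (n, 1)
    window = np.exp(-decay * np.arange(n) / n)          # exponential decay
    return h[:, 0] * window                             # (n,)

def causal_fft_conv(x, h):
    """y[t] = sum_{s<=t} h[s] * x[t-s]: zero-pad to 2N to avoid wrap-around."""
    L = 2 * len(x)
    y = np.fft.irfft(np.fft.rfft(x, L) * np.fft.rfft(h, L), L)
    return y[:len(x)]

x = rng.normal(size=N)
h = make_filter(N)
y = causal_fft_conv(x, h)

# Sanity check against the direct O(N^2) causal convolution
assert np.allclose(y, np.convolve(x, h)[:N])
# (In the full operator, y would then be gated element-wise by another
#  learned projection of the input, and the conv+gate stage repeated
#  "order" times.)
```

The zero-padding to 2N is what turns circular FFT convolution into the causal linear convolution the model needs.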
Strengths
- Sub-quadratic: O(N log N), faster than O(N²) Attention at large N.
- Attention-free long range: no softmax, no KV cache.
- Excels on very long sequences: strong in DNA, audio, music, long vision patches.
- Parameter efficient: small filter generator vs learning N weights directly.
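To make the sub-quadratic claim concrete, a back-of-envelope comparison of token-mixing operation counts (constants ignored — real speedups depend heavily on kernels and hardware):

```python
# Rough ratio of attention's O(N^2) mixing cost to Hyena's O(N log N),
# ignoring constant factors. Purely illustrative arithmetic.
import math

for n in [1024, 8192, 131072, 1048576]:
    ratio = n**2 / (n * math.log2(n))
    print(f"N={n:>8}: N^2 / (N log2 N) = {ratio:,.0f}x")
```

At N = 1K the asymptotic gap is around 100x; at N = 1M it is over 50,000x, which is why the advantage only really pays off at long sequence lengths.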
Weaknesses
- Weaker in-context learning (ICL): can't match Attention's exact token-to-token recall and copying.
- Training stability: long convolutions are sensitive to init/normalization.
- Kernel maturity: less general-purpose optimization than FlashAttention / Mamba.
- Fewer LLM deployments: mostly a research/auxiliary mixer in LLM contexts.
Where Does It Shine?
Hyena is strongest on "non-text, extremely long sequences":
- HyenaDNA (Nguyen et al., 2023): DNA modeling at up to 1M-nucleotide context without attention.
- Audio / vision variants: Hyena-style operators applied to long audio and long image-patch sequences.
- LLM hybrid auxiliary: complement attention in early layers.
In EulerStack, arch_expert_research's Phase 1 uses Hyena alongside Mamba for
"bulk token processing" — early layers benefit from capturing broad structural
patterns rather than exact matching.
Real-World Use
- Hyena Hierarchy (Poli et al., Stanford, 2023) — original paper.
- HyenaDNA (Nguyen et al., 2023) — 1M-token DNA foundation model.
- StripedHyena 7B (Together AI, 2023) — Attention + Hyena hybrid 7B.
- Evo (Arc Institute, 2024) — StripedHyena-based DNA foundation model.
When Is Hyena Good?
| Scenario | Hyena quality |
|---|---|
| DNA / audio / long sensor data | ★★★★★ original domain |
| LLM early layers (bulk processing) | ★★★★ (in a hybrid) |
| Very long context (≥128K) | ★★★★ sub-quadratic O(N log N) scaling |
| Short chat / ICL-centric | ★★ (Attention wins) |
| Coding (exact symbol recall) | ★★ (Attention + Mamba better) |
EulerStack YAML
```yaml
layer_templates:
  hyena_layer:
    mixer:
      type: hyena
      hyena:
        depth: 2
        filter_hidden: 64
        filter_decay: 0.0
    ffn:
      type: gated_mlp
      activation: swiglu
    # Note: Hyena is stateless — no state section.
```
Stage 5 Phase 1 example (mamba + hyena):
```yaml
layer_schedule:
  - template: mamba_layer
    repeat: 2
  - template: hyena_layer
    repeat: 1
  - template: mamba_layer
    repeat: 2
  - template: hyena_layer
    repeat: 1
```
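Assuming `repeat: k` simply duplicates a template k times in order (the EulerStack loader's actual semantics are an assumption here), the Phase 1 schedule above flattens to a 6-layer stack. A minimal sketch:

```python
# Hypothetical expansion of a layer_schedule: `repeat: k` duplicates the
# template k times, preserving order. Names mirror the YAML above; the
# real EulerStack loader's behavior is assumed, not documented here.
schedule = [
    {"template": "mamba_layer", "repeat": 2},
    {"template": "hyena_layer", "repeat": 1},
    {"template": "mamba_layer", "repeat": 2},
    {"template": "hyena_layer", "repeat": 1},
]

def expand(schedule):
    return [entry["template"]
            for entry in schedule
            for _ in range(entry.get("repeat", 1))]

layers = expand(schedule)
# → ['mamba_layer', 'mamba_layer', 'hyena_layer',
#    'mamba_layer', 'mamba_layer', 'hyena_layer']
```

The resulting 2:1 mamba-to-hyena interleave matches the "bulk token processing" intent described above.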
Papers
- Poli et al., 2023. "Hyena Hierarchy: Towards Larger Convolutional Language Models." ICML.
- Nguyen et al., 2023. "HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution." NeurIPS.
- Massaroli et al., 2023. "Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions." NeurIPS.
- Arc Institute, 2024. "Evo: DNA foundation modeling from molecular to genome scale."