4. Hyena in detail
One-Line Summary
"A very long convolution kernel, generated implicitly by a small network and applied via FFT, mixes the sequence sub-quadratically, capturing long-range patterns without attention."
How Does It Work?
Standard 1D convolution: a kernel of size K lets each token mix with its K neighbors. Hyena's insight: "Make the kernel as long as the full sequence length N so each token sees all past tokens. But don't learn N parameters directly — have a small network generate them implicitly."
Concretely:
- Filter generator: a small MLP over sinusoidal positional features, modulated by an exponential decay window, maps each position t to a filter value h(t).
- FFT convolution: the long filter and input are FFT'd → elementwise multiply → inverse FFT = convolution in O(N log N).
- Gating / multi-order: the Hyena operator interleaves these long convolutions with element-wise gating (multiplication by learned projections of the input); stacking several such stages — the operator's "order" — adds expressiveness.
Result: full-sequence long-range dependencies without Attention, at O(N log N).
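The two core pieces — an implicitly generated full-length filter and a causal FFT convolution — can be sketched in a few lines of NumPy. This is a minimal illustration, not the real Hyena implementation: the feature dimensions, hidden width, decay rate, and tanh nonlinearity below are all illustrative assumptions.

```python
# Minimal sketch of Hyena's core trick: a tiny MLP maps sinusoidal position
# features to a length-N filter, which is applied as a *causal* convolution
# via FFT in O(N log N). All hyperparameters here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N = 1024         # sequence length
D_FEAT = 8       # sinusoidal feature dimension (assumption)
D_HID = 16       # hidden width of the filter-generator MLP (assumption)

def pos_features(n, d=D_FEAT):
    """Sinusoidal features of normalized positions t in [0, 1)."""
    t = np.arange(n)[:, None] / n                       # (n, 1)
    freqs = 2.0 ** np.arange(d // 2)[None, :]           # (1, d/2)
    return np.concatenate([np.sin(2 * np.pi * freqs * t),
                           np.cos(2 * np.pi * freqs * t)], axis=1)  # (n, d)

# Tiny 2-layer MLP: the "implicit" parameterization — O(d*h) params, not O(N)
W1 = rng.normal(size=(D_FEAT, D_HID)) / np.sqrt(D_FEAT)
W2 = rng.normal(size=(D_HID, 1)) / np.sqrt(D_HID)

def make_filter(n, decay=4.0):
    z = pos_features(n)
    h = np.tanh(z @ W1) @ W2                            # (n, 1)
    window = np.exp(-decay * np.arange(n) / n)          # exponential decay
    return h[:, 0] * window                             # (n,)

def causal_fft_conv(x, h):
    """y[t] = sum_{s<=t} h[s] * x[t-s]: zero-pad to 2N to avoid wrap-around."""
    L = 2 * len(x)
    y = np.fft.irfft(np.fft.rfft(x, L) * np.fft.rfft(h, L), L)
    return y[:len(x)]

x = rng.normal(size=N)
h = make_filter(N)
y = causal_fft_conv(x, h)

# Sanity check against the direct O(N^2) causal convolution
assert np.allclose(y, np.convolve(x, h)[:N])
# (In the full operator, y would then be gated element-wise by another
#  learned projection of the input, and the conv+gate stage repeated
#  "order" times.)
```

The zero-padding to 2N is what turns circular FFT convolution into the causal linear convolution the model needs.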
Strengths
- Sub-quadratic: O(N log N), faster than O(N²) Attention at large N.
- Attention-free long range: no softmax, no KV cache.
- Excels on very long sequences: strong in DNA, audio, music, long vision patches.
- Parameter efficient: small filter generator vs learning N weights directly.
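To make the sub-quadratic claim concrete, a back-of-envelope comparison of token-mixing operation counts (constants ignored — real speedups depend heavily on kernels and hardware):

```python
# Rough ratio of attention's O(N^2) mixing cost to Hyena's O(N log N),
# ignoring constant factors. Purely illustrative arithmetic.
import math

for n in [1024, 8192, 131072, 1048576]:
    ratio = n**2 / (n * math.log2(n))
    print(f"N={n:>8}: N^2 / (N log2 N) = {ratio:,.0f}x")
```

At N = 1K the asymptotic gap is around 100x; at N = 1M it is over 50,000x, which is why the advantage only really pays off at long sequence lengths.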
Weaknesses
- Weaker in-context learning (ICL): can't match Attention's exact token-to-token recall and copying.
- Training stability: long convolutions are sensitive to init/normalization.
- Kernel maturity: less general-purpose optimization than FlashAttention / Mamba.
- Fewer LLM deployments: mostly a research/auxiliary mixer in LLM contexts.
Where Does It Shine?
Hyena is strongest on "non-text, extremely long sequences":
- HyenaDNA (Nguyen et al., 2023): DNA modeling at up to 1M-nucleotide context without attention.
- Audio / vision variants: Hyena-style operators applied to long audio and long image-patch sequences.
- LLM hybrid auxiliary: complement attention in early layers.
In EulerStack, arch_expert_research's Phase 1 uses Hyena alongside Mamba for
"bulk token processing" — early layers benefit from capturing broad structural
patterns rather than exact matching.
Real-World Use
- Hyena Hierarchy (Poli et al., Stanford, 2023) — original paper.
- HyenaDNA (Nguyen et al., 2023) — 1M-token DNA foundation model.
- StripedHyena 7B (Together AI, 2023) — Attention + Hyena hybrid 7B.
- Evo (Arc Institute, 2024) — StripedHyena-based DNA foundation model.
When Is Hyena Good?
| Scenario | Hyena quality |
|---|---|
| DNA / audio / long sensor data | ★★★★★ original domain |
| LLM early layers (bulk processing) | ★★★★ (in a hybrid) |
| Very long context (≥128K) | ★★★★ sub-quadratic O(N log N) scaling |
| Short chat / ICL-centric | ★★ (Attention wins) |
| Coding (exact symbol recall) | ★★ (Attention + Mamba better) |
EulerStack YAML
```yaml
layer_templates:
  hyena_layer:
    mixer:
      type: hyena
      hyena:
        depth: 2
        filter_hidden: 64
        filter_decay: 0.0
    ffn:
      type: gated_mlp
      activation: swiglu
    # Note: Hyena is stateless — no state section.
```
Stage 5 Phase 1 example (mamba + hyena):
```yaml
layer_schedule:
  - template: mamba_layer
    repeat: 2
  - template: hyena_layer
    repeat: 1
  - template: mamba_layer
    repeat: 2
  - template: hyena_layer
    repeat: 1
```
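Assuming `repeat: k` simply duplicates a template k times in order (the EulerStack loader's actual semantics are an assumption here), the Phase 1 schedule above flattens to a 6-layer stack. A minimal sketch:

```python
# Hypothetical expansion of a layer_schedule: `repeat: k` duplicates the
# template k times, preserving order. Names mirror the YAML above; the
# real EulerStack loader's behavior is assumed, not documented here.
schedule = [
    {"template": "mamba_layer", "repeat": 2},
    {"template": "hyena_layer", "repeat": 1},
    {"template": "mamba_layer", "repeat": 2},
    {"template": "hyena_layer", "repeat": 1},
]

def expand(schedule):
    return [entry["template"]
            for entry in schedule
            for _ in range(entry.get("repeat", 1))]

layers = expand(schedule)
# → ['mamba_layer', 'mamba_layer', 'hyena_layer',
#    'mamba_layer', 'mamba_layer', 'hyena_layer']
```

The resulting 2:1 mamba-to-hyena interleave matches the "bulk token processing" intent described above.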
Papers
- Poli et al., 2023. "Hyena Hierarchy: Towards Larger Convolutional Language Models." ICML.
- Nguyen et al., 2023. "HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution." NeurIPS.
- Massaroli et al., 2023. "Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions." NeurIPS.
- Arc Institute, 2024. "Evo: DNA foundation modeling from molecular to genome scale."