2. Mixture-of-LoRAs

Overview

mixture_lora is a strategy that replaces each FFN Linear layer with multiple LoRA experts plus a router. For each token, the router selects the top-k LoRA experts and adds their weighted sum to the frozen base output. If dense_lora is "one LoRA," then mixture_lora is "E LoRAs plus a selection mechanism."


1. Where: Where Is It Injected?

Injected at the same locations as dense_lora. The difference is what replaces them.

Discovery Process

  1. BackboneAdapter.find_transformer_layers(model) -- discovers transformer blocks
  2. Within each block, searches for FFN nn.Linear modules matching target_keywords
  3. The injection strategy calls build_mixture_lora_for_ffn_layers() to perform the replacement
  4. (Optional) Attention projections get regular LoRALinear (single LoRA, not MoE)
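The keyword-matching step above can be sketched in a few lines of PyTorch. This is a hypothetical helper, not the real BackboneAdapter API; the actual discovery also restricts the search to the transformer blocks found by find_transformer_layers() and honors start_layer / num_layers:

```python
import torch.nn as nn

def find_ffn_targets(model, target_keywords=("gate_proj", "up_proj", "down_proj")):
    """Collect the names of nn.Linear modules whose name matches a target
    keyword. Sketch only: the real discovery walks per-block, not globally."""
    return [
        name
        for name, mod in model.named_modules()
        if isinstance(mod, nn.Linear) and any(k in name for k in target_keywords)
    ]
```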

Target Modules

Area        Target Keywords                  Transformation Result
FFN         gate_proj, up_proj, down_proj    MixtureLoRALinear (router + E LoRAs)
Attention   q_proj, v_proj                   LoRALinear (single LoRA, not MoE)

Example configuration:

backbone: qwen3

injection:
  strategy: mixture_lora
  target_keywords: [gate_proj, up_proj, down_proj]
  start_layer: 0
  num_layers: 0
  attn_lora:
    enabled: true
    keywords: [q_proj, v_proj]

2. What: What Is Injected?

Each target nn.Linear in the FFN is replaced with MixtureLoRALinear.

Transformation Process

Before: nn.Linear(in_features, out_features)

After:  MixtureLoRALinear
        +-- base_layer: nn.Linear (frozen, original weights)
        +-- router: nn.Linear(in_features -> num_experts)   <- trainable
        +-- experts: [LoRABranch x num_experts]              <- trainable
        |    +-- expert[0]: lora_A(r, in) + lora_B(out, r)
        |    +-- expert[1]: lora_A(r, in) + lora_B(out, r)
        |    +-- expert[2]: ...
        |    +-- expert[3]: ...
        +-- scaling: alpha / r

Forward Operation

[Input x] ----------+-- base_layer(x) ------------------- base_out
                     |
                     +-- router(x) -> logits (batch, E)
                     |   +-- softmax -> gate_prob (batch, E)
                     |       +-- top-k selection -> weights w_k, indices idx_k
                     |
                     +-- Execute only selected experts:
                         delta = sum(w_k * expert[idx_k](x))

Final output = base_out + delta

Example (num_experts=4, top_k=2): For each token, 2 out of 4 experts are selected and their weighted sum is computed. Unselected experts are not executed.
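The forward operation above can be sketched as a minimal PyTorch module. This is a simplified, hypothetical version of MixtureLoRALinear: the real implementation also handles dropout, dtype casting, and batched per-expert dispatch rather than the masked loop used here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureLoRASketch(nn.Module):
    """Frozen base Linear + trainable router + E LoRA experts, top-k routing."""

    def __init__(self, in_f, out_f, r=48, alpha=96, num_experts=4, top_k=2):
        super().__init__()
        self.base = nn.Linear(in_f, out_f)
        self.base.weight.requires_grad_(False)        # base stays frozen
        self.router = nn.Linear(in_f, num_experts)    # trainable router
        self.lora_A = nn.ModuleList(nn.Linear(in_f, r, bias=False) for _ in range(num_experts))
        self.lora_B = nn.ModuleList(nn.Linear(r, out_f, bias=False) for _ in range(num_experts))
        for b in self.lora_B:                         # small random init: symmetry breaking
            nn.init.normal_(b.weight, std=0.01)
        self.scaling = alpha / r
        self.top_k = top_k

    def forward(self, x):                                  # x: (tokens, in_f)
        base_out = self.base(x)
        gate = F.softmax(self.router(x.float()), dim=-1)   # router in float32
        w, idx = gate.topk(self.top_k, dim=-1)             # (tokens, K) each
        delta = torch.zeros_like(base_out)
        for k in range(self.top_k):
            for e in range(len(self.lora_A)):
                mask = idx[:, k] == e                      # tokens routed to expert e in slot k
                if mask.any():
                    out_e = self.lora_B[e](self.lora_A[e](x[mask])) * self.scaling
                    delta[mask] += w[mask, k:k + 1] * out_e
        return base_out + delta
```

Because unselected experts contribute nothing to delta, only K of the E LoRA branches affect each token's output, exactly as in the diagram above.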

injection:
  strategy: mixture_lora
  lora_r: 48                       # LoRA rank per expert
  lora_alpha: 96                   # Scaling factor
  lora_dropout: 0.05               # LoRA dropout
  num_experts: 4                   # Number of LoRA experts (E)
  top_k: 2                         # Number of experts selected per token (K)

Parameter Guide:

  - num_experts: Number of experts. More experts can learn more diverse patterns but increase memory cost. Typically 4-8.
  - top_k: Number of active experts per token. Must satisfy top_k <= num_experts. Usually 1-2.


3. When: When Are Which Parameters Trained?

mixture_lora uses a 2-phase schedule. The router is warmed up first, then LoRA is trained.

Phase Configuration

training:
  phases:
    - step: 0                       # Phase 0: Router warmup
      trainable: ["router"]
    - step: 2000                    # Phase 1: LoRA training
      trainable: ["lora", "attn_lora"]

Timeline

Step 0 ----------> Step 2000 --------------------------> Step 10000
  |                   |
  | Phase 0           | Phase 1
  | [Router warmup]   | [lora + attn_lora training]
  | router: trainable | router: frozen
  | lora: frozen      | lora: trainable
  | attn_lora: frozen | attn_lora: trainable
  |                   |
  |                   +-- Optimizer auto-rebuilt

Why 2 Phases?

  1. Phase 0 (Router Warmup): The router first learns stable expert-selection patterns. Each expert's lora_B is initialized with small random values (std=0.01) rather than zeros, so the experts produce slightly different outputs from the start (symmetry breaking). This lets the router observe differences between experts and receive meaningful gradients for assigning inputs to them.

  2. Phase 1 (LoRA Training): After the router has stabilized, each LoRA expert trains its parameters according to its assigned role. The router is frozen.

Phase Transition Behavior

When maybe_step() returns True at step 2000:

  1. All parameters are set to requires_grad=False (replace mode)
  2. Only the lora and attn_lora groups are switched back to requires_grad=True
  3. The optimizer is rebuilt with only the new trainable parameters
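The three steps can be sketched as follows. apply_phase and build_optimizer are hypothetical helper names; in the real code this logic sits behind maybe_step():

```python
import torch

def apply_phase(model, trainable_groups, build_optimizer):
    """Sketch of the 'replace mode' phase switch: freeze everything, re-enable
    only the named parameter groups, then rebuild the optimizer."""
    # 1. Freeze every parameter
    for p in model.parameters():
        p.requires_grad_(False)
    # 2. Re-enable only parameters whose name contains a trainable group
    for name, p in model.named_parameters():
        if any(group in name for group in trainable_groups):
            p.requires_grad_(True)
    # 3. Rebuild the optimizer over the new trainable set
    trainable = [p for p in model.parameters() if p.requires_grad]
    return build_optimizer(trainable)
```

Rebuilding the optimizer matters: optimizer state (e.g. Adam moments) for the router is discarded, and the new optimizer tracks only the LoRA parameters.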


4. MoE Stability Settings

The moe section is required for the mixture_lora strategy.

moe:
  router_z_loss_coef: 0.001        # Router z-loss coefficient
  load_balance:
    type: aux_loss                  # Load balancing method
    aux_loss_coef: 0.01             # Auxiliary loss coefficient
  router_dtype: float32             # Router computation precision

Role of Each Parameter

  - router_z_loss_coef (recommended 0.001): Suppresses router logit magnitude to prevent softmax overflow (ST-MoE paper).
  - load_balance.type (recommended aux_loss): Load-balancing method across experts; aux_loss adds an auxiliary loss that prevents tokens from concentrating on a few experts.
  - load_balance.aux_loss_coef (recommended 0.01): Weight of the auxiliary loss. Too large degrades main-task performance; too small allows load imbalance.
  - router_dtype (recommended float32): Precision of the router softmax computation; float16/bfloat16 risk numerical instability.
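The two losses can be sketched as below. The z-loss follows the ST-MoE form and the auxiliary loss the Switch-Transformer form; EulerForge's exact formulation may differ in details:

```python
import torch
import torch.nn.functional as F

def router_losses(logits, top_k=2, z_coef=0.001, aux_coef=0.01):
    """Sketch of the MoE stability losses.
    logits: (tokens, num_experts), computed in float32."""
    num_experts = logits.shape[-1]
    # z-loss: penalize large logits so the softmax cannot overflow
    z_loss = z_coef * torch.logsumexp(logits, dim=-1).square().mean()
    # aux loss: fraction of tokens per expert times mean gate probability
    gate = F.softmax(logits, dim=-1)                       # (tokens, E)
    _, idx = gate.topk(top_k, dim=-1)                      # (tokens, K)
    counts = F.one_hot(idx, num_experts).sum(dim=(0, 1))   # assignments per expert
    frac = counts.float() / idx.numel()                    # f_e, sums to 1
    prob = gate.mean(dim=0)                                # P_e
    aux_loss = aux_coef * num_experts * (frac * prob).sum()
    return z_loss, aux_loss
```

The aux loss is minimized when routing is uniform (every f_e = P_e = 1/E) and grows as tokens concentrate on a few experts, which is what pushes the router away from collapse.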

5. Full Configuration File Walkthrough

Full contents of configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml:

# -- Model Info --
device: cuda:0                              # GPU device
backbone: qwen3                             # [Where] Uses Qwen3Adapter
model_name: Qwen/Qwen3.5-0.8B-Base           # HuggingFace model ID

# -- Injection Settings --
injection:
  strategy: mixture_lora                    # [What] Mixture-of-LoRAs strategy
  lora_r: 48                                # [What] LoRA rank per expert
  lora_alpha: 96                            # [What] Scaling (96/48 = 2.0)
  lora_dropout: 0.05                        # [What] LoRA dropout
  num_experts: 4                            # [What] Number of LoRA experts
  top_k: 2                                  # [What] Active experts per token
  target_keywords: [gate_proj, up_proj, down_proj]  # [Where] FFN targets
  start_layer: 0                            # [Where] Starting layer
  num_layers: 0                             # [Where] 0 = all
  attn_lora:                                # [Where] Attention LoRA (single)
    enabled: true
    keywords: [q_proj, v_proj]

# -- MoE Stability Settings --
moe:
  router_z_loss_coef: 0.001                 # z-loss: prevents logit overflow
  load_balance:
    type: aux_loss                          # Auxiliary loss-based load balancing
    aux_loss_coef: 0.01                     # Auxiliary loss weight
  router_dtype: float32                     # Router precision

# -- Training Settings --
training:
  type: sft                                 # SFT training
  phases:                                   # [When] 2-phase schedule
    - step: 0                               # Phase 0: Router warmup
      trainable: ["router"]
    - step: 2000                            # Phase 1: LoRA training
      trainable: ["lora", "attn_lora"]
  lr: 1.0e-5
  weight_decay: 0.01
  warmup_steps: 200
  max_train_steps: 10000                    # Longer than dense_lora (5000) due to 2 phases
  batch_size: 4
  grad_accum_steps: 4
  max_grad_norm: 1.0
  log_steps: 50
  save_steps: 1000
  val_steps: 500

6. Checkpoint Structure

When training completes, checkpoints contain base weight (1 copy) + router + N LoRA experts per Linear. The FFN structure itself (gate_proj, up_proj, down_proj) remains unchanged.

Checkpoint structure:
+-- layer.N.mlp.gate_proj.base_layer.weight   <- original weight, 1 copy (frozen, shared)
+-- layer.N.mlp.gate_proj.router.weight       <- per-Linear router (trained)
+-- layer.N.mlp.gate_proj.experts.0.lora_A    <- Expert 0 LoRA
+-- layer.N.mlp.gate_proj.experts.0.lora_B
+-- layer.N.mlp.gate_proj.experts.1.lora_A    <- Expert 1 LoRA
+-- layer.N.mlp.gate_proj.experts.1.lora_B
+-- layer.N.mlp.gate_proj.experts.2.lora_A    <- Expert 2 LoRA
+-- layer.N.mlp.gate_proj.experts.2.lora_B
+-- layer.N.mlp.gate_proj.experts.3.lora_A    <- Expert 3 LoRA
+-- layer.N.mlp.gate_proj.experts.3.lora_B
+-- (up_proj, down_proj follow the same pattern)
+-- (attn_lora is single LoRA: base_layer + lora_A + lora_B)

Key Differences from moe_expert_lora

Item             mixture_lora                            moe_expert_lora
base weight      1 copy (shared by all experts)          N copies (one independent copy per expert)
expert unit      LoRA branch (lora_A + lora_B)           entire FFN (gate_proj + up_proj + down_proj)
router location  per-Linear (one per gate_proj, etc.)    per-MoE-FFN (one at the MLP level)
memory           small (only LoRA parameters x N)        large (entire FFN x N)
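The memory gap is easy to quantify with a bit of arithmetic. The dimensions below are illustrative assumptions, not the actual Qwen3.5-0.8B sizes:

```python
def mixture_lora_extra_params(in_f, out_f, r=48, num_experts=4):
    """Extra trainable parameters mixture_lora adds per FFN Linear:
    one router plus E LoRA branches; the base weight is shared, counted once."""
    router = in_f * num_experts
    experts = num_experts * r * (in_f + out_f)
    return router + experts

def full_copy_extra_params(in_f, out_f, num_experts=4):
    """For comparison: parameters needed if each expert carried its own full
    weight copy, as in moe_expert_lora (router omitted for simplicity)."""
    return num_experts * in_f * out_f

# Illustrative sizes (assumed):
lora_side = mixture_lora_extra_params(1024, 2816)   # 741,376 params
copy_side = full_copy_extra_params(1024, 2816)      # 11,534,336 params
```

Even with rank 48, the LoRA-branch experts cost roughly an order of magnitude less than full per-expert weight copies.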

Bench Loading: Preserving MixtureLoRA Structure

When loading a mixture_lora checkpoint with eulerforge bench, the MixtureLoRA structure (base + router + N LoRA experts) is reconstructed as-is to preserve routing diversity.

  1. Reads injection parameters (num_experts, top_k, lora_r, etc.) from resolved_config.json
  2. Injects MixtureLoRA structure into the base model with build_mixture_lora_for_ffn_layers()
  3. Merges only attention LoRA (_merge_attention_lora_only()) -- FFN MixtureLoRA keys are preserved
  4. Loads state_dict into MixtureLoRA model -> inference with router + N LoRA expert structure

State                           Bench Behavior
resolved_config.json present    MixtureLoRA structure reconstructed -> structure-preserving inference
resolved_config.json absent     Fallback: expert averaging -> dense model (warning printed)

Note: Existing checkpoints that lack resolved_config.json fall back to averaging the expert deltas and merging them into a dense model. Recent training runs always save resolved_config.json, so new checkpoints load with the structure preserved.
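The fallback path can be sketched as follows. This is a hypothetical helper operating on a single layer's tensors; the real bench code walks the entire state_dict:

```python
import torch

def average_experts_to_dense(base_weight, lora_As, lora_Bs, scaling):
    """Sketch of the fallback: average every expert's delta (B @ A) and merge
    it into the base weight, producing a plain dense Linear. Routing is lost,
    which is why structure-preserving loading is preferred."""
    deltas = [B @ A for A, B in zip(lora_As, lora_Bs)]  # each (out, in)
    mean_delta = torch.stack(deltas).mean(dim=0)
    return base_weight + scaling * mean_delta
```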


7. Running

Basic Execution

eulerforge train --preset configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml \
    --set data.format=raw \
    --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl \
    --set data.max_length=512

Configuration Overrides

# Change number of experts and top-k
eulerforge train --preset configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml \
    --set data.format=raw \
    --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl \
    --set data.max_length=512 \
    --set injection.num_experts=8 \
    --set injection.top_k=2

Validate Configuration Only

eulerforge train --preset configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml \
    --validate-only

Preflight Check

eulerforge train --preset configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml \
    --preflight

8. Debugging and Troubleshooting

  - "router_z_loss_coef is required" -- the moe section is missing or incomplete. Add the full moe section.
  - "load_balance is required" -- moe.load_balance is missing. Add load_balance.type and aux_loss_coef.
  - "top_k cannot be larger than num_experts" -- top_k > num_experts. Set top_k <= num_experts.
  - Only specific experts are ever selected (routing collapse) -- aux_loss_coef is too small. Increase it toward 0.01-0.1.
  - "router never in any phase" warning -- no phase includes the router group. Add router to Phase 0.
  - Minimal val_loss change in Phase 0 -- LoRA expert outputs are still small (normal). Phase 0 is the router warmup; the main loss decrease comes in Phase 1.
  - OOM -- too many experts for available memory. Reduce num_experts or lora_r, or use model.load_precision.mode: int4.
  - "N missing keys" / "unexpected keys" in bench -- an MoE checkpoint is being merged as dense LoRA. Check the strategy field in lora_info.json; recent versions reconstruct the MixtureLoRA structure automatically.
