2. Mixture-of-LoRAs

Overview

mixture_lora is a strategy that replaces each FFN Linear layer with multiple LoRA experts plus a router. For each token, the router selects the top-k LoRA experts and adds their weighted sum to the frozen base output. If dense_lora is "one LoRA," then mixture_lora is "E LoRAs plus a selection mechanism."


1. Where: Where Is It Injected?

Injected at the same locations as dense_lora. The difference is what replaces them.

Discovery Process

  1. BackboneAdapter.find_transformer_layers(model) -- discovers transformer blocks
  2. Within each block, searches for FFN nn.Linear modules matching target_keywords
  3. The injection strategy calls build_mixture_lora_for_ffn_layers() to perform the replacement
  4. (Optional) Attention projections get regular LoRALinear (single LoRA, not MoE)
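The keyword-matching step above can be sketched in a few lines of PyTorch. This is a hypothetical helper, not the real BackboneAdapter API; the actual discovery also restricts the search to the transformer blocks found by find_transformer_layers() and honors start_layer / num_layers:

```python
import torch.nn as nn

def find_ffn_targets(model, target_keywords=("gate_proj", "up_proj", "down_proj")):
    """Collect the names of nn.Linear modules whose name matches a target
    keyword. Sketch only: the real discovery walks per-block, not globally."""
    return [
        name
        for name, mod in model.named_modules()
        if isinstance(mod, nn.Linear) and any(k in name for k in target_keywords)
    ]
```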

Target Modules

Area        Target Keywords                  Transformation Result
FFN         gate_proj, up_proj, down_proj    MixtureLoRALinear (router + E LoRAs)
Attention   q_proj, v_proj                   LoRALinear (single LoRA, not MoE)

Example configuration:

backbone: qwen3

injection:
  strategy: mixture_lora
  target_keywords: [gate_proj, up_proj, down_proj]
  start_layer: 0
  num_layers: 0
  attn_lora:
    enabled: true
    keywords: [q_proj, v_proj]

2. What: What Is Injected?

Each target nn.Linear in the FFN is replaced with MixtureLoRALinear.

Transformation Process

Before: nn.Linear(in_features, out_features)

After:  MixtureLoRALinear
        +-- base_layer: nn.Linear (frozen, original weights)
        +-- router: nn.Linear(in_features -> num_experts)   <- trainable
        +-- experts: [LoRABranch x num_experts]              <- trainable
        |    +-- expert[0]: lora_A(r, in) + lora_B(out, r)
        |    +-- expert[1]: lora_A(r, in) + lora_B(out, r)
        |    +-- expert[2]: ...
        |    +-- expert[3]: ...
        +-- scaling: alpha / r

Forward Operation

[Input x] ----------+-- base_layer(x) ------------------- base_out
                     |
                     +-- router(x) -> logits (batch, E)
                     |   +-- softmax -> gate_prob (batch, E)
                     |       +-- top-k selection -> weights w_k, indices idx_k
                     |
                     +-- Execute only selected experts:
                         delta = sum(w_k * expert[idx_k](x))

Final output = base_out + delta

Example (num_experts=4, top_k=2): For each token, 2 out of 4 experts are selected and their weighted sum is computed. Unselected experts are not executed.
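The forward operation above can be sketched as a minimal PyTorch module. This is a simplified, hypothetical version of MixtureLoRALinear: the real implementation also handles dropout, dtype casting, and batched per-expert dispatch rather than the masked loop used here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureLoRASketch(nn.Module):
    """Frozen base Linear + trainable router + E LoRA experts, top-k routing."""

    def __init__(self, in_f, out_f, r=48, alpha=96, num_experts=4, top_k=2):
        super().__init__()
        self.base = nn.Linear(in_f, out_f)
        self.base.weight.requires_grad_(False)        # base stays frozen
        self.router = nn.Linear(in_f, num_experts)    # trainable router
        self.lora_A = nn.ModuleList(nn.Linear(in_f, r, bias=False) for _ in range(num_experts))
        self.lora_B = nn.ModuleList(nn.Linear(r, out_f, bias=False) for _ in range(num_experts))
        for b in self.lora_B:                         # small random init: symmetry breaking
            nn.init.normal_(b.weight, std=0.01)
        self.scaling = alpha / r
        self.top_k = top_k

    def forward(self, x):                                  # x: (tokens, in_f)
        base_out = self.base(x)
        gate = F.softmax(self.router(x.float()), dim=-1)   # router in float32
        w, idx = gate.topk(self.top_k, dim=-1)             # (tokens, K) each
        delta = torch.zeros_like(base_out)
        for k in range(self.top_k):
            for e in range(len(self.lora_A)):
                mask = idx[:, k] == e                      # tokens routed to expert e in slot k
                if mask.any():
                    out_e = self.lora_B[e](self.lora_A[e](x[mask])) * self.scaling
                    delta[mask] += w[mask, k:k + 1] * out_e
        return base_out + delta
```

Because unselected experts contribute nothing to delta, only K of the E LoRA branches affect each token's output, exactly as in the diagram above.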

injection:
  strategy: mixture_lora
  lora_r: 48                       # LoRA rank per expert
  lora_alpha: 96                   # Scaling factor
  lora_dropout: 0.05               # LoRA dropout
  num_experts: 4                   # Number of LoRA experts (E)
  top_k: 2                         # Number of experts selected per token (K)

Parameter Guide:

  - num_experts: Number of experts. More experts can learn more diverse patterns but increase memory cost. Typically 4-8.
  - top_k: Number of active experts per token. Must satisfy top_k <= num_experts. Usually 1-2.


3. When: When Are Which Parameters Trained?

mixture_lora uses a 2-phase schedule. The router is warmed up first, then LoRA is trained.

Phase Configuration

training:
  phases:
    - step: 0                       # Phase 0: Router warmup
      trainable: ["router"]
    - step: 2000                    # Phase 1: LoRA training
      trainable: ["lora", "attn_lora"]

Timeline

Step 0 ----------> Step 2000 --------------------------> Step 10000
  |                   |
  | Phase 0           | Phase 1
  | [Router warmup]   | [lora + attn_lora training]
  | router: trainable | router: frozen
  | lora: frozen      | lora: trainable
  | attn_lora: frozen | attn_lora: trainable
  |                   |
  |                   +-- Optimizer auto-rebuilt

Why 2 Phases?

  1. Phase 0 (Router Warmup): The router first learns stable expert-selection patterns. Each expert's lora_B is initialized with small random values (std=0.01) rather than zeros, so the experts produce slightly different outputs from the start (symmetry breaking). This lets the router observe differences between experts and receive meaningful gradients for assigning inputs to them.

  2. Phase 1 (LoRA Training): After the router has stabilized, each LoRA expert trains its parameters according to its assigned role. The router is frozen.

Phase Transition Behavior

When maybe_step() returns True at step 2000:

  1. All parameters are set to requires_grad=False (replace mode)
  2. Only the lora and attn_lora groups are switched back to requires_grad=True
  3. The optimizer is rebuilt with only the new trainable parameters
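The three steps can be sketched as follows. apply_phase and build_optimizer are hypothetical helper names; in the real code this logic sits behind maybe_step():

```python
import torch

def apply_phase(model, trainable_groups, build_optimizer):
    """Sketch of the 'replace mode' phase switch: freeze everything, re-enable
    only the named parameter groups, then rebuild the optimizer."""
    # 1. Freeze every parameter
    for p in model.parameters():
        p.requires_grad_(False)
    # 2. Re-enable only parameters whose name contains a trainable group
    for name, p in model.named_parameters():
        if any(group in name for group in trainable_groups):
            p.requires_grad_(True)
    # 3. Rebuild the optimizer over the new trainable set
    trainable = [p for p in model.parameters() if p.requires_grad]
    return build_optimizer(trainable)
```

Rebuilding the optimizer matters: optimizer state (e.g. Adam moments) for the router is discarded, and the new optimizer tracks only the LoRA parameters.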


4. MoE Stability Settings

The moe section is required for the mixture_lora strategy.

moe:
  router_z_loss_coef: 0.001        # Router z-loss coefficient
  load_balance:
    type: aux_loss                  # Load balancing method
    aux_loss_coef: 0.01             # Auxiliary loss coefficient
  router_dtype: float32             # Router computation precision

Role of Each Parameter

  - router_z_loss_coef (recommended 0.001): Suppresses router logit magnitude to prevent softmax overflow (ST-MoE paper).
  - load_balance.type (recommended aux_loss): Load-balancing method across experts; aux_loss adds an auxiliary loss that prevents tokens from concentrating on a few experts.
  - load_balance.aux_loss_coef (recommended 0.01): Weight of the auxiliary loss. Too large degrades main-task performance; too small allows load imbalance.
  - router_dtype (recommended float32): Precision of the router softmax computation; float16/bfloat16 risk numerical instability.
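The two losses can be sketched as below. The z-loss follows the ST-MoE form and the auxiliary loss the Switch-Transformer form; EulerForge's exact formulation may differ in details:

```python
import torch
import torch.nn.functional as F

def router_losses(logits, top_k=2, z_coef=0.001, aux_coef=0.01):
    """Sketch of the MoE stability losses.
    logits: (tokens, num_experts), computed in float32."""
    num_experts = logits.shape[-1]
    # z-loss: penalize large logits so the softmax cannot overflow
    z_loss = z_coef * torch.logsumexp(logits, dim=-1).square().mean()
    # aux loss: fraction of tokens per expert times mean gate probability
    gate = F.softmax(logits, dim=-1)                       # (tokens, E)
    _, idx = gate.topk(top_k, dim=-1)                      # (tokens, K)
    counts = F.one_hot(idx, num_experts).sum(dim=(0, 1))   # assignments per expert
    frac = counts.float() / idx.numel()                    # f_e, sums to 1
    prob = gate.mean(dim=0)                                # P_e
    aux_loss = aux_coef * num_experts * (frac * prob).sum()
    return z_loss, aux_loss
```

The aux loss is minimized when routing is uniform (every f_e = P_e = 1/E) and grows as tokens concentrate on a few experts, which is what pushes the router away from collapse.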

5. Full Configuration File Walkthrough

Full contents of configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml:

# -- Model Info --
device: cuda:0                              # GPU device
backbone: qwen3                             # [Where] Uses Qwen3Adapter
model_name: Qwen/Qwen3.5-0.8B-Base           # HuggingFace model ID

# -- Injection Settings --
injection:
  strategy: mixture_lora                    # [What] Mixture-of-LoRAs strategy
  lora_r: 48                                # [What] LoRA rank per expert
  lora_alpha: 96                            # [What] Scaling (96/48 = 2.0)
  lora_dropout: 0.05                        # [What] LoRA dropout
  num_experts: 4                            # [What] Number of LoRA experts
  top_k: 2                                  # [What] Active experts per token
  target_keywords: [gate_proj, up_proj, down_proj]  # [Where] FFN targets
  start_layer: 0                            # [Where] Starting layer
  num_layers: 0                             # [Where] 0 = all
  attn_lora:                                # [Where] Attention LoRA (single)
    enabled: true
    keywords: [q_proj, v_proj]

# -- MoE Stability Settings --
moe:
  router_z_loss_coef: 0.001                 # z-loss: prevents logit overflow
  load_balance:
    type: aux_loss                          # Auxiliary loss-based load balancing
    aux_loss_coef: 0.01                     # Auxiliary loss weight
  router_dtype: float32                     # Router precision

# -- Training Settings --
training:
  type: sft                                 # SFT training
  phases:                                   # [When] 2-phase schedule
    - step: 0                               # Phase 0: Router warmup
      trainable: ["router"]
    - step: 2000                            # Phase 1: LoRA training
      trainable: ["lora", "attn_lora"]
  lr: 1.0e-5
  weight_decay: 0.01
  warmup_steps: 200
  max_train_steps: 10000                    # Longer than dense_lora (5000) due to 2 phases
  batch_size: 4
  grad_accum_steps: 4
  max_grad_norm: 1.0
  log_steps: 50
  save_steps: 1000
  val_steps: 500

6. Checkpoint Structure

When training completes, checkpoints contain base weight (1 copy) + router + N LoRA experts per Linear. The FFN structure itself (gate_proj, up_proj, down_proj) remains unchanged.

Checkpoint structure:
+-- layer.N.mlp.gate_proj.base_layer.weight   <- original weight, 1 copy (frozen, shared)
+-- layer.N.mlp.gate_proj.router.weight       <- per-Linear router (trained)
+-- layer.N.mlp.gate_proj.experts.0.lora_A    <- Expert 0 LoRA
+-- layer.N.mlp.gate_proj.experts.0.lora_B
+-- layer.N.mlp.gate_proj.experts.1.lora_A    <- Expert 1 LoRA
+-- layer.N.mlp.gate_proj.experts.1.lora_B
+-- layer.N.mlp.gate_proj.experts.2.lora_A    <- Expert 2 LoRA
+-- layer.N.mlp.gate_proj.experts.2.lora_B
+-- layer.N.mlp.gate_proj.experts.3.lora_A    <- Expert 3 LoRA
+-- layer.N.mlp.gate_proj.experts.3.lora_B
+-- (up_proj, down_proj follow the same pattern)
+-- (attn_lora is single LoRA: base_layer + lora_A + lora_B)

Key Differences from moe_expert_lora

Item             mixture_lora                            moe_expert_lora
base weight      1 copy (shared by all experts)          N copies (one independent copy per expert)
expert unit      LoRA branch (lora_A + lora_B)           entire FFN (gate_proj + up_proj + down_proj)
router location  per-Linear (one per gate_proj, etc.)    per-MoE-FFN (one at the MLP level)
memory           small (only LoRA parameters x N)        large (entire FFN x N)
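The memory gap is easy to quantify with a bit of arithmetic. The dimensions below are illustrative assumptions, not the actual Qwen3.5-0.8B sizes:

```python
def mixture_lora_extra_params(in_f, out_f, r=48, num_experts=4):
    """Extra trainable parameters mixture_lora adds per FFN Linear:
    one router plus E LoRA branches; the base weight is shared, counted once."""
    router = in_f * num_experts
    experts = num_experts * r * (in_f + out_f)
    return router + experts

def full_copy_extra_params(in_f, out_f, num_experts=4):
    """For comparison: parameters needed if each expert carried its own full
    weight copy, as in moe_expert_lora (router omitted for simplicity)."""
    return num_experts * in_f * out_f

# Illustrative sizes (assumed):
lora_side = mixture_lora_extra_params(1024, 2816)   # 741,376 params
copy_side = full_copy_extra_params(1024, 2816)      # 11,534,336 params
```

Even with rank 48, the LoRA-branch experts cost roughly an order of magnitude less than full per-expert weight copies.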

Bench Loading: Preserving MixtureLoRA Structure

When loading a mixture_lora checkpoint with eulerforge bench, the MixtureLoRA structure (base + router + N LoRA experts) is reconstructed as-is to preserve routing diversity.

  1. Reads injection parameters (num_experts, top_k, lora_r, etc.) from resolved_config.json
  2. Injects MixtureLoRA structure into the base model with build_mixture_lora_for_ffn_layers()
  3. Merges only attention LoRA (_merge_attention_lora_only()) -- FFN MixtureLoRA keys are preserved
  4. Loads state_dict into MixtureLoRA model -> inference with router + N LoRA expert structure

State                           Bench Behavior
resolved_config.json present    MixtureLoRA structure reconstructed -> structure-preserving inference
resolved_config.json absent     Fallback: expert averaging -> dense model (warning printed)

Note: Existing checkpoints that lack resolved_config.json fall back to averaging the expert deltas and merging them into a dense model. Recent training runs always save resolved_config.json, so new checkpoints load with the structure preserved.
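The fallback path can be sketched as follows. This is a hypothetical helper operating on a single layer's tensors; the real bench code walks the entire state_dict:

```python
import torch

def average_experts_to_dense(base_weight, lora_As, lora_Bs, scaling):
    """Sketch of the fallback: average every expert's delta (B @ A) and merge
    it into the base weight, producing a plain dense Linear. Routing is lost,
    which is why structure-preserving loading is preferred."""
    deltas = [B @ A for A, B in zip(lora_As, lora_Bs)]  # each (out, in)
    mean_delta = torch.stack(deltas).mean(dim=0)
    return base_weight + scaling * mean_delta
```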


7. Running

Basic Execution

eulerforge train --preset configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml \
    --set data.format=raw \
    --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl \
    --set data.max_length=512

Configuration Overrides

# Change number of experts and top-k
eulerforge train --preset configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml \
    --set data.format=raw \
    --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl \
    --set data.max_length=512 \
    --set injection.num_experts=8 \
    --set injection.top_k=2

Validate Configuration Only

eulerforge train --preset configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml \
    --validate-only

Preflight Check

eulerforge train --preset configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml \
    --preflight

8. Debugging and Troubleshooting

  - "router_z_loss_coef is required" -- the moe section is missing or incomplete. Add the full moe section.
  - "load_balance is required" -- moe.load_balance is missing. Add load_balance.type and aux_loss_coef.
  - "top_k cannot be larger than num_experts" -- top_k > num_experts. Set top_k <= num_experts.
  - Only specific experts are ever selected (routing collapse) -- aux_loss_coef is too small. Increase it toward 0.01-0.1.
  - "router never in any phase" warning -- no phase includes the router group. Add router to Phase 0.
  - Minimal val_loss change in Phase 0 -- LoRA expert outputs are still small (normal). Phase 0 is the router warmup; the main loss decrease comes in Phase 1.
  - OOM -- too many experts for available memory. Reduce num_experts or lora_r, or use model.load_precision.mode: int4.
  - "N missing keys" / "unexpected keys" in bench -- an MoE checkpoint is being merged as dense LoRA. Check the strategy field in lora_info.json; recent versions reconstruct the MixtureLoRA structure automatically.
