2. Mixture-of-LoRAs
Overview
mixture_lora is a strategy that replaces each FFN Linear layer with multiple LoRA experts + a router. For each token, the router selects top-k LoRA experts and computes a weighted sum. If dense_lora is "one LoRA," then mixture_lora is "E LoRAs + a selection mechanism."
- Suitable for: Multi-task learning, adaptive LoRA, handling diverse input patterns
- Compatible models: Qwen, LLaMA, Mixtral (all backbones)
- Reference preset: configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml
- Key difference from dense_lora: Linear -> MixtureLoRALinear (router + E LoRA branches)
Prerequisites
- EulerForge installation complete (see Getting Started)
- Data preprocessing complete (data/sft_10k_raw.jsonl generated)
- Reading the dense_lora tutorial first will help with understanding.
1. Where: Where Is It Injected?
Injected at the same locations as dense_lora. The difference is what replaces them.
Discovery Process
1. BackboneAdapter.find_transformer_layers(model) discovers transformer blocks
2. Within each block, searches for FFN nn.Linear modules matching target_keywords
3. The DenseLoRAInjection class calls build_mixture_lora_for_ffn_layers()
4. (Optional) Attention projections get a regular LoRALinear (single LoRA, not MoE)
Target Modules
| Area | Target Keywords | Transformation Result |
|---|---|---|
| FFN | gate_proj, up_proj, down_proj | -> MixtureLoRALinear (router + E LoRAs) |
| Attention | q_proj, v_proj | -> LoRALinear (single LoRA, not MoE) |
Related Configuration
backbone: qwen3
injection:
  strategy: mixture_lora
  target_keywords: [gate_proj, up_proj, down_proj]
  start_layer: 0
  num_layers: 0
  attn_lora:
    enabled: true
    keywords: [q_proj, v_proj]
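The selection rule described above can be sketched as a simple predicate. This is an illustration only: `is_target` and its signature are hypothetical, not EulerForge API.

```python
# Hypothetical sketch of target-module selection (is_target is an illustration,
# not an EulerForge function): a module qualifies when its name ends with one
# of target_keywords and its layer index falls inside the configured window
# (num_layers = 0 means "all layers from start_layer on").
def is_target(layer_idx, module_name, target_keywords, start_layer=0, num_layers=0):
    if layer_idx < start_layer:
        return False
    if num_layers > 0 and layer_idx >= start_layer + num_layers:
        return False
    return any(module_name.endswith(kw) for kw in target_keywords)

ffn_keywords = ["gate_proj", "up_proj", "down_proj"]
```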
2. What: What Is Injected?
Each target nn.Linear in the FFN is replaced with MixtureLoRALinear.
Transformation Process
Before: nn.Linear(in_features, out_features)
After: MixtureLoRALinear
+-- base_layer: nn.Linear (frozen, original weights)
+-- router: nn.Linear(in_features -> num_experts) <- trainable
+-- experts: [LoRABranch x num_experts] <- trainable
| +-- expert[0]: lora_A(r, in) + lora_B(out, r)
| +-- expert[1]: lora_A(r, in) + lora_B(out, r)
| +-- expert[2]: ...
| +-- expert[3]: ...
+-- scaling: alpha / r
Forward Operation
[Input x] ----------+-- base_layer(x) ------------------- base_out
|
+-- router(x) -> logits (batch, E)
| +-- softmax -> gate_prob (batch, E)
| +-- top-k selection -> weights w_k, indices idx_k
|
+-- Execute only selected experts:
delta = sum(w_k * expert[idx_k](x))
Final output = base_out + delta
Example (num_experts=4, top_k=2): For each token, 2 out of 4 experts are selected and their weighted sum is computed. Unselected experts are not executed.
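The routing math above can be sketched in pure Python for a single token. The function name is hypothetical, the real MixtureLoRALinear operates on batched tensors, and renormalizing the selected top-k weights to sum to 1 is a common convention that the framework may or may not follow.

```python
import math

# Illustrative sketch of the per-token routing math (hypothetical helper, not
# EulerForge code): softmax over the E router logits, top-k selection, and
# renormalized gate weights for the selected experts.
def route(logits, top_k):
    # softmax -> gate probabilities over all E experts
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # top-k selection: keep the k largest gate probabilities
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # renormalize the selected weights so they sum to 1
    denom = sum(probs[i] for i in idx)
    weights = [probs[i] / denom for i in idx]
    return idx, weights

# num_experts=4, top_k=2: only experts idx[0] and idx[1] would be executed;
# delta = weights[0]*expert[idx[0]](x) + weights[1]*expert[idx[1]](x)
idx, weights = route([2.0, 0.5, 1.0, -1.0], top_k=2)
```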
Related Configuration
injection:
  strategy: mixture_lora
  lora_r: 48          # LoRA rank per expert
  lora_alpha: 96      # Scaling factor
  lora_dropout: 0.05  # LoRA dropout
  num_experts: 4      # Number of LoRA experts (E)
  top_k: 2            # Number of experts selected per token (K)
Parameter Guide:
- num_experts: Number of experts. More experts enable learning diverse patterns but increase memory cost. Typically 4-8.
- top_k: Number of active experts per token. Must satisfy top_k <= num_experts. Usually 1-2.
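To see the memory cost of num_experts concretely, here is back-of-the-envelope arithmetic for the trainable parameters of one MixtureLoRALinear. The dimensions are hypothetical round numbers, not Qwen3.5-0.8B's actual sizes.

```python
# Trainable parameters of one MixtureLoRALinear (illustrative arithmetic;
# d_in/d_out below are hypothetical, not the real model's dimensions).
def mixture_lora_params(d_in, d_out, r, num_experts):
    router = d_in * num_experts        # router: Linear(d_in -> E)
    per_expert = r * d_in + d_out * r  # lora_A (r x d_in) + lora_B (d_out x r)
    return router + num_experts * per_expert

# Example: d_in = d_out = 1024, r = 48, E = 4
n4 = mixture_lora_params(1024, 1024, 48, 4)  # 4096 + 4 * 98304 = 397312
n8 = mixture_lora_params(1024, 1024, 48, 8)  # roughly double
```

Note that only the LoRA branches and the small router scale with E; the frozen base weight (d_out x d_in) is stored once regardless of num_experts.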
3. When: When Are Which Parameters Trained?
mixture_lora uses a 2-phase schedule. The router is warmed up first, then LoRA is trained.
Phase Configuration
training:
  phases:
    - step: 0        # Phase 0: Router warmup
      trainable: ["router"]
    - step: 2000     # Phase 1: LoRA training
      trainable: ["lora", "attn_lora"]
Timeline
Step 0 ----------> Step 2000 --------------------------> Step 10000
| |
| Phase 0 | Phase 1
| [Router warmup] | [lora + attn_lora training]
| router: trainable | router: frozen
| lora: frozen | lora: trainable
| attn_lora: frozen | attn_lora: trainable
| |
| +-- Optimizer auto-rebuilt
Why 2 Phases?
- Phase 0 (Router Warmup): The router first learns stable expert selection patterns. Each LoRA expert's lora_B is initialized with a small random init (std=0.01), so the experts produce different outputs from the start (symmetry breaking). This lets the router detect differences between experts and receive meaningful gradients for assigning experts to inputs.
- Phase 1 (LoRA Training): After the router has stabilized, each LoRA expert trains its parameters according to its assigned role. The router is frozen.
Phase Transition Behavior
When maybe_step() returns True at step 2000:
1. All parameters are set to requires_grad=False (replace mode)
2. Only lora and attn_lora groups are switched to requires_grad=True
3. The optimizer is rebuilt with only the new trainable parameters
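The three steps above can be sketched without any framework code. The class and function names here are hypothetical illustrations of the "replace mode" logic; the real trainer operates on torch parameters and rebuilds a torch optimizer.

```python
# Minimal sketch of the "replace mode" phase transition (hypothetical
# structure; the real trainer works with torch parameters/optimizers).
class Param:
    def __init__(self):
        self.requires_grad = False

def apply_phase(params_by_group, trainable_groups):
    # 1) freeze everything (replace mode)
    for group in params_by_group.values():
        for p in group:
            p.requires_grad = False
    # 2) unfreeze only the requested groups
    for name in trainable_groups:
        for p in params_by_group.get(name, []):
            p.requires_grad = True
    # 3) collect the parameters a rebuilt optimizer would receive
    return [p for g in params_by_group.values() for p in g if p.requires_grad]

params = {"router": [Param()], "lora": [Param(), Param()], "attn_lora": [Param()]}
apply_phase(params, ["router"])                          # Phase 0 (step 0)
trainable = apply_phase(params, ["lora", "attn_lora"])   # transition at step 2000
```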
4. MoE Stability Settings
The moe section is required for the mixture_lora strategy.
moe:
  router_z_loss_coef: 0.001   # Router z-loss coefficient
  load_balance:
    type: aux_loss            # Load balancing method
    aux_loss_coef: 0.01       # Auxiliary loss coefficient
  router_dtype: float32       # Router computation precision
Role of Each Parameter
| Parameter | Role | Recommended Value |
|---|---|---|
| router_z_loss_coef | Suppresses router logit magnitude to prevent softmax overflow (ST-MoE paper) | 0.001 |
| load_balance.type | Load balancing method across experts; aux_loss adds an auxiliary loss to prevent tokens from being concentrated on specific experts | aux_loss |
| load_balance.aux_loss_coef | Weight of the auxiliary loss; too large degrades main-task performance, too small causes load imbalance | 0.01 |
| router_dtype | Router softmax computation precision; float16/bfloat16 risk numerical instability | float32 |
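For intuition, here is a sketch of the two losses as they are commonly defined in the literature (z-loss from ST-MoE, auxiliary load-balancing loss from Switch Transformers). EulerForge's exact formulation may differ; the function names are illustrative.

```python
import math

# Common literature definitions of the two stability losses (illustrative;
# the framework's exact formulation may differ).
def router_z_loss(logits_per_token):
    # mean over tokens of (log-sum-exp of the router logits)^2
    zs = [math.log(sum(math.exp(z) for z in logits)) for logits in logits_per_token]
    return sum(z * z for z in zs) / len(zs)

def load_balance_aux_loss(gate_probs, top1_idx, num_experts):
    # f_i: fraction of tokens whose top-1 expert is i
    # P_i: mean gate probability assigned to expert i
    n = len(gate_probs)
    f = [sum(1 for t in top1_idx if t == i) / n for i in range(num_experts)]
    P = [sum(p[i] for p in gate_probs) / n for i in range(num_experts)]
    return num_experts * sum(fi * Pi for fi, Pi in zip(f, P))
```

Under this definition a perfectly balanced router gives an auxiliary loss of 1.0, and a collapsed router (all tokens to one expert) gives num_experts, which is why a small positive aux_loss_coef pushes routing toward balance.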
5. Full Configuration File Walkthrough
Full contents of configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml:
# -- Model Info --
device: cuda:0                      # GPU device
backbone: qwen3                     # [Where] Uses Qwen3Adapter
model_name: Qwen/Qwen3.5-0.8B-Base  # HuggingFace model ID

# -- Injection Settings --
injection:
  strategy: mixture_lora            # [What] Mixture-of-LoRAs strategy
  lora_r: 48                        # [What] LoRA rank per expert
  lora_alpha: 96                    # [What] Scaling (96/48 = 2.0)
  lora_dropout: 0.05                # [What] LoRA dropout
  num_experts: 4                    # [What] Number of LoRA experts
  top_k: 2                          # [What] Active experts per token
  target_keywords: [gate_proj, up_proj, down_proj]  # [Where] FFN targets
  start_layer: 0                    # [Where] Starting layer
  num_layers: 0                     # [Where] 0 = all
  attn_lora:                        # [Where] Attention LoRA (single)
    enabled: true
    keywords: [q_proj, v_proj]

# -- MoE Stability Settings --
moe:
  router_z_loss_coef: 0.001         # z-loss: prevents logit overflow
  load_balance:
    type: aux_loss                  # Auxiliary loss-based load balancing
    aux_loss_coef: 0.01             # Auxiliary loss weight
  router_dtype: float32             # Router precision

# -- Training Settings --
training:
  type: sft                         # SFT training
  phases:                           # [When] 2-phase schedule
    - step: 0                       # Phase 0: Router warmup
      trainable: ["router"]
    - step: 2000                    # Phase 1: LoRA training
      trainable: ["lora", "attn_lora"]
  lr: 1.0e-5
  weight_decay: 0.01
  warmup_steps: 200
  max_train_steps: 10000            # Longer than dense_lora (5000) due to 2 phases
  batch_size: 4
  grad_accum_steps: 4
  max_grad_norm: 1.0
  log_steps: 50
  save_steps: 1000
  val_steps: 500
6. Checkpoint Structure
When training completes, checkpoints contain base weight (1 copy) + router + N LoRA experts per Linear. The FFN structure itself (gate_proj, up_proj, down_proj) remains unchanged.
Checkpoint structure:
+-- layer.N.mlp.gate_proj.base_layer.weight <- original weight, 1 copy (frozen, shared)
+-- layer.N.mlp.gate_proj.router.weight <- per-Linear router (trained)
+-- layer.N.mlp.gate_proj.experts.0.lora_A <- Expert 0 LoRA
+-- layer.N.mlp.gate_proj.experts.0.lora_B
+-- layer.N.mlp.gate_proj.experts.1.lora_A <- Expert 1 LoRA
+-- layer.N.mlp.gate_proj.experts.1.lora_B
+-- layer.N.mlp.gate_proj.experts.2.lora_A <- Expert 2 LoRA
+-- layer.N.mlp.gate_proj.experts.2.lora_B
+-- layer.N.mlp.gate_proj.experts.3.lora_A <- Expert 3 LoRA
+-- layer.N.mlp.gate_proj.experts.3.lora_B
+-- (up_proj, down_proj follow the same pattern)
+-- (attn_lora is single LoRA: base_layer + lora_A + lora_B)
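Given this key layout, the number of experts stored for a Linear can be recovered directly from the state_dict keys. This is a hypothetical inspection helper, not an EulerForge function; the key names follow the structure shown above.

```python
# Hypothetical helper for inspecting a mixture_lora checkpoint: count the
# distinct expert indices stored under one Linear's "experts." namespace.
def count_experts(state_dict_keys, linear_prefix):
    prefix = linear_prefix + ".experts."
    ids = {int(k[len(prefix):].split(".")[0])
           for k in state_dict_keys if k.startswith(prefix)}
    return len(ids)

keys = [
    "layer.0.mlp.gate_proj.base_layer.weight",
    "layer.0.mlp.gate_proj.router.weight",
    "layer.0.mlp.gate_proj.experts.0.lora_A",
    "layer.0.mlp.gate_proj.experts.0.lora_B",
    "layer.0.mlp.gate_proj.experts.1.lora_A",
    "layer.0.mlp.gate_proj.experts.1.lora_B",
]
n = count_experts(keys, "layer.0.mlp.gate_proj")
```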
Key Differences from moe_expert_lora
| Item | mixture_lora | moe_expert_lora |
|---|---|---|
| base weight | 1 copy (shared by all experts) | N copies (independent copy per expert) |
| expert unit | LoRA branch (lora_A + lora_B) | Entire FFN (gate_proj + up_proj + down_proj) |
| router location | per-Linear (one per gate_proj) | per-MoEFFN (one at MLP level) |
| Memory | Small (only LoRA parameters multiplied by N) | Large (entire FFN multiplied by N) |
Bench Loading: Preserving MixtureLoRA Structure
When loading a mixture_lora checkpoint with eulerforge bench, the MixtureLoRA structure (base + router + N LoRA experts) is reconstructed as-is to preserve routing diversity.
1. Reads injection parameters (num_experts, top_k, lora_r, etc.) from resolved_config.json
2. Injects the MixtureLoRA structure into the base model with build_mixture_lora_for_ffn_layers()
3. Merges only the attention LoRA (_merge_attention_lora_only()) -- FFN MixtureLoRA keys are preserved
4. Loads the state_dict into the MixtureLoRA model -> inference with the router + N LoRA experts structure
| State | Bench Behavior |
|---|---|
| resolved_config.json present | MixtureLoRA structure reconstruction -> structure-preserving inference |
| resolved_config.json absent | Fallback: expert average -> dense model (warning printed) |
Note: Existing checkpoints without resolved_config.json fall back to expert delta averaging -> conversion to a dense model. In recent training runs, resolved_config.json is always saved.
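One plausible reading of the expert-averaging fallback, sketched with plain-Python matrices: the E expert deltas B_e @ A_e are averaged and folded into the frozen base weight, W_merged = W + (alpha/r) * (1/E) * sum_e(B_e @ A_e). The helper names and the exact formula are assumptions, not EulerForge's confirmed implementation.

```python
# Sketch of an "expert average" merge for one Linear (hypothetical helpers;
# plain-Python matrices for illustration only).
def matmul(B, A):
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def merge_expert_average(W, experts, alpha, r):
    # W_merged = W + (alpha / r) * (1/E) * sum_e(B_e @ A_e)
    E = len(experts)
    scale = (alpha / r) / E
    merged = [row[:] for row in W]
    for B, A in experts:  # each expert contributes its delta B @ A
        delta = matmul(B, A)
        for i in range(len(W)):
            for j in range(len(W[0])):
                merged[i][j] += scale * delta[i][j]
    return merged

W = [[1.0, 0.0], [0.0, 1.0]]
experts = [([[1.0], [0.0]], [[1.0, 0.0]]),   # expert 0: B (2x1), A (1x2)
           ([[0.0], [1.0]], [[0.0, 1.0]])]   # expert 1
merged = merge_expert_average(W, experts, alpha=2, r=1)
```

The resulting dense weight discards the router, which is why this path loses routing diversity compared with full structure reconstruction.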
7. Running
Basic Execution
eulerforge train --preset configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml \
--set data.format=raw \
--set data.task=sft \
--set data.path=data/sft_10k_raw.jsonl \
--set data.max_length=512
Configuration Overrides
# Change number of experts and top-k
eulerforge train --preset configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml \
--set data.format=raw \
--set data.task=sft \
--set data.path=data/sft_10k_raw.jsonl \
--set data.max_length=512 \
--set injection.num_experts=8 \
--set injection.top_k=2
Validate Configuration Only
eulerforge train --preset configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml \
--validate-only
Preflight Check
eulerforge train --preset configs/presets/qwen3.5_0.8b_mixture_lora_sft.yml \
--preflight
8. Debugging and Troubleshooting
| Symptom | Cause | Solution |
|---|---|---|
| "router_z_loss_coef is required" | moe section missing or incomplete |
Add the full moe section |
| "load_balance is required" | moe.load_balance missing |
Add load_balance.type and aux_loss_coef |
| "top_k cannot be larger than num_experts" | top_k > num_experts |
Set top_k <= num_experts |
| Only specific experts are selected (routing collapse) | aux_loss_coef too small |
Increase aux_loss_coef to 0.01-0.1 |
| "router never in any phase" warning | No phase includes the router group |
Add router to Phase 0 |
| Minimal val_loss change in Phase 0 | LoRA expert output is small (normal) | Phase 0 is the router warmup period. Significant loss decrease occurs in Phase 1 |
| OOM | Too many experts causing memory shortage | Reduce num_experts, reduce lora_r, use model.load_precision.mode: int4 |
| "N missing keys" / "unexpected keys" in bench | Attempting to merge MoE checkpoint as dense LoRA | Check strategy field in lora_info.json. Recent versions automatically reconstruct MixtureLoRA structure |
Next Steps
- To convert the entire FFN to MoE -> MoE Expert LoRA Tutorial
- For native MoE fine-tuning like Mixtral -> Native MoE Expert LoRA Tutorial
- For DPO training -> DPO Training Guide