14. LoRA Handoff Scheduling
Prerequisites: Completed Tutorial 03 (FFN MoE Expert LoRA)
Research / Advanced Feature (Experimental): LoRA Handoff is an advanced research feature for LoRA-to-base FFN knowledge transfer. In most cases, standard 3-Phase training (without Handoff) is sufficient. Details: checkpoint_lora_lifecycle.md section 7
Strategy restriction: Only available with the moe_expert_lora strategy.
Training type restriction: DPO + Handoff is structurally incompatible: when LoRA is frozen, policy = reference, making learning impossible. Use Handoff only with SFT or ORPO.
With dense_lora / mixture_lora, base weights are frozen, so "knowledge transfer" does not apply.
Goal
Learn how to gradually fade LoRA and transfer knowledge to base FFN
in the moe_expert_lora strategy.
- First half of training: Converge quickly using LoRA delta
- Second half of training: Reduce LoRA scale from 1.0 to 0.0 while strengthening base FFN learning
- Final checkpoint: Can save a pure base model without any LoRA parameters
1. Why LoRA Handoff?
The moe_expert_lora strategy extends FFN into MoE and wraps each expert with LoRA.
While this approach is efficient, the final model retains LoRA adapters, which means:
- Additional computation during inference (base + LoRA delta)
- Increased checkpoint size
- Increased deployment complexity
LoRA Handoff solves these issues:
- Start with fast convergence using LoRA in the early stages
- After unfreezing base_ffn, gradually reduce LoRA influence (lora_scale: 1.0 -> 0.0)
- Simultaneously increase the base_ffn LR so the base weights absorb the role of LoRA
- Ultimately obtain a pure model without LoRA
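The mechanism behind the fade can be sketched as a scaled residual: each expert's output is its base projection plus the LoRA delta multiplied by the scheduler-controlled `lora_scale`. The sketch below uses toy plain-Python matrices and an assumed `expert_forward` helper (not the framework's actual code):

```python
# Hypothetical sketch: y = base(x) + lora_scale * (alpha / r) * B(A(x)).
# As the scheduler decays lora_scale from 1.0 to 0.0, the LoRA delta's
# contribution vanishes and only the base projection remains.
def expert_forward(x, base_weight, lora_A, lora_B, alpha, r, lora_scale):
    def matvec(W, v):
        return [sum(w * vi for w, vi in zip(row, v)) for row in W]
    base_out = matvec(base_weight, x)          # base FFN projection
    lora_out = matvec(lora_B, matvec(lora_A, x))  # low-rank delta
    return [b + lora_scale * (alpha / r) * d for b, d in zip(base_out, lora_out)]
```

At `lora_scale=0.0` the output is exactly the base projection, which is why a fully faded adapter can be dropped from the checkpoint.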
Why Only moe_expert_lora?
| Strategy | Can base_ffn be trained? | Handoff meaning |
|---|---|---|
| dense_lora | frozen (default) | scale=0 -> reverts to original base. No transfer |
| mixture_lora | frozen (default) | Same. Single base, not trained |
| moe_expert_lora | Each expert copy can be unfrozen | scale down + base LR up -> knowledge transfer is viable |
With dense_lora/mixture_lora, setting lora_scale=0 + remove does not "transfer knowledge" --
the changes simply disappear and the model reverts to the original base. Therefore, training.lora_handoff
is only allowed when injection.strategy=moe_expert_lora.
2. Basic Configuration Example
# Based on configs/presets/qwen3.5_0.8b_moe_expert_lora_sft_handoff.yml
device: cuda:0
backbone: qwen3
model_name: Qwen/Qwen3.5-0.8B-Base
injection:
strategy: moe_expert_lora # <- Required: handoff only works with this strategy
lora_r: 48
lora_alpha: 96
lora_dropout: 0.05
num_experts: 4
top_k: 2
target_keywords: [gate_proj, up_proj, down_proj]
start_layer: 0
num_layers: 0
attn_lora:
enabled: true
keywords: [q_proj, v_proj]
moe:
router_z_loss_coef: 0.001
load_balance:
type: aux_loss
aux_loss_coef: 0.01
router_dtype: float32
training:
type: sft
phases:
- step: 0
trainable: ["router"] # Phase 0: Router warmup
- step: 500
trainable: ["lora", "attn_lora"] # Phase 1: LoRA training
- step: 4000
trainable: ["lora", "attn_lora", "router", "base_ffn"] # Phase 2: base_ffn unfreeze
base_ffn_keywords: ["gate_proj", "up_proj", "down_proj"]
# target_layers omitted -> automatically extracted from injection.start_layer/num_layers
lr: 1.0e-5
weight_decay: 0.01
warmup_steps: 200
max_train_steps: 10000
batch_size: 4
grad_accum_steps: 4
max_grad_norm: 1.0
log_steps: 50
save_steps: 2000
val_steps: 1000
# -- LoRA Handoff --
lora_handoff:
expert_lora:
start_step: 4000 # Start fade simultaneously with base_ffn unfreeze
duration_steps: 4000 # Gradually decrease over 4000 steps
end_scale: 0.0 # Complete removal
curve: cosine # Cosine decay (gradual early, steep later)
end_action: freeze # Freeze LoRA parameters after fade completes
attn_lora:
start_step: 5000
duration_steps: 3000
end_scale: 0.0
curve: linear
end_action: freeze
base_ffn_ramp:
start_step: 4000
end_step: 8000
start_multiplier: 1.0
end_multiplier: 3.0
export:
remove_adapters_if_zero_scale: true
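The schedule above can be sanity-checked by hand: the expert fade runs from `start_step` to `start_step + duration_steps`, and the ramp should cover the same interval. A small sketch (dict keys mirror the YAML fields; the `fade_interval` helper is hypothetical):

```python
# Derive the fade/ramp intervals from the YAML values above and confirm
# the alignment the tutorial recommends (ramp interval == expert fade interval).
handoff = {
    "expert_lora":   {"start_step": 4000, "duration_steps": 4000},
    "attn_lora":     {"start_step": 5000, "duration_steps": 3000},
    "base_ffn_ramp": {"start_step": 4000, "end_step": 8000},
}

def fade_interval(sched):
    return (sched["start_step"], sched["start_step"] + sched["duration_steps"])

expert_fade = fade_interval(handoff["expert_lora"])  # (4000, 8000)
ramp = (handoff["base_ffn_ramp"]["start_step"],
        handoff["base_ffn_ramp"]["end_step"])        # (4000, 8000)
assert expert_fade == ramp  # LR compensation covers the entire fade window
```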
Preconditions
The validator checks these automatically:
| Condition | On violation |
|---|---|
| P1: injection.strategy = moe_expert_lora | ConfigValidationError |
| P2: training.phases must contain at least one phase with base_ffn | ConfigValidationError |
| P3: lora_handoff.attn_lora requires injection.attn_lora.enabled: true | ConfigValidationError |
3. Verifying Handoff in Training Logs
When Handoff is active, the following events are automatically printed in the training logs:
[Train] LoRA Handoff Scheduler enabled: ['expert_lora', 'attn_lora', 'base_ffn_ramp', 'export']
[Handoff] Schedule: expert_lora(step 4000→8000, curve=cosine, end_scale=0.0, end_action=freeze), attn_lora(step 5000→8000, curve=linear, ...), base_ffn_ramp(step 4000→8000, LR ×1.0→×3.0)
...
[Handoff] expert_lora fade started at step 4000 (curve=cosine, end_scale=0.00, duration=4000 steps)
[Handoff] base_ffn_ramp started at step 4000 (LR ×1.0→×3.0)
[Handoff] attn_lora fade started at step 5000 (curve=linear, end_scale=0.00, duration=3000 steps)
...
[Phase2] Step 6000/10000 ... | Handoff[expert_lora=0.71, attn_lora=0.67, ffn_lr_mult=2.00]
...
[Handoff] base_ffn_ramp completed at step 8000 (LR ×3.0)
[Handoff] expert_lora fade completed at step 8000 (scale=0.0000)
[Handoff] expert_lora frozen at step 8000 (scale=0.0000)
[Handoff] attn_lora fade completed at step 8000 (scale=0.0000)
[Handoff] attn_lora frozen at step 8000 (scale=0.0000)
- Schedule summary: The full schedule is printed in a single line at the start of training
- Milestone events: Printed once at fade start/completion, freeze, and ramp start/completion
- Periodic status: Current scale and LR multiplier are appended to the log line every log_steps
4. Understanding Fade Curves
Linear Fade
lora_scale
1.0 ┤████████████
│ ╲
0.5 ┤ ╲
│ ╲
0.0 ┤ ████████
└──┬──────┬──────┬──────→ step
start mid end
Decreases at a uniform rate. Simple and predictable.
Cosine Fade
lora_scale
1.0 ┤████████████╲
│ ╲
0.5 ┤ ╲
│ ╲
0.0 ┤ ╲██████
└──┬──────┬──────┬──────→ step
start mid end
Gradual at first, steep later. Recommended when you want to smoothly reduce LoRA dependency.
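The two curves can be expressed in a few lines. This is a sketch, not the framework's actual implementation; the cosine variant below uses a quarter-cosine (`cos(pi/2 * t)`) to match the "gradual early, steep later" shape drawn above, and the exact formula in the code may differ:

```python
import math

def fade_scale(step, start_step, duration_steps, end_scale=0.0, curve="linear"):
    """Sketch of the scheduled lora_scale at a given training step."""
    t = (step - start_step) / duration_steps
    t = min(max(t, 0.0), 1.0)          # hold 1.0 before start, end_scale after end
    if curve == "linear":
        frac = 1.0 - t                 # uniform decrease
    else:  # "cosine": gradual at first, steep toward the end
        frac = math.cos(math.pi / 2 * t)
    return end_scale + (1.0 - end_scale) * frac  # interpolate 1.0 -> end_scale
```

At the midpoint of the fade, linear gives 0.5 while the cosine curve is still at about 0.71, which is why cosine is recommended when you want to reduce LoRA dependency gently at first.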
5. base_ffn_ramp (LR Compensation)
As LoRA diminishes, base FFN must take over its role.
If base FFN cannot absorb the contribution quickly enough while LoRA's contribution decreases,
a loss spike occurs. base_ffn_ramp gradually increases the base_ffn LR during this interval
to accelerate knowledge transfer.
Interval Configuration Principles
Key: ramp interval = expert_lora fade interval
expert_lora: start_step=4000, duration_steps=4000 → fade interval 4000~8000
base_ffn_ramp: start_step=4000, end_step=8000 → aligned to the same interval
- ramp.start_step = expert_lora.start_step -- Start LR compensation when LoRA begins to fade
- ramp.end_step = expert_lora.start_step + duration_steps -- End the ramp simultaneously with fade completion
- After fade completes, LoRA contribution is 0, so additional LR compensation is unnecessary; excessive LR can actually cause divergence
Timing visualization (total 10k steps):
Step 0───500───4000───────────8000──────10000
│ │ │ │ │
│ │ ├─ expert_lora fade (1.0→0.0, cosine)
│ │ ├─ base_ffn_ramp (LR ×1.0→×3.0)
│ │ │ │
Phase0 Phase1 Phase2 ├─ Handoff complete
router LoRA base_ffn └─ Remaining 2k steps: stabilization with base only
training unfreeze
Anti-Patterns
| Configuration | Problem |
|---|---|
| ramp ends earlier than fade (e.g., 4k~6k) | No LR compensation in the second half of fade while LoRA keeps decreasing -> loss spike |
| ramp ends later than fade (e.g., 4k~10k) | LR keeps increasing after fade completes -> risk of divergence |
| end_multiplier x5 or higher | Risk of gradient explosion (recommended: x2~x3) |
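The ramp itself is a simple linear interpolation of the LR multiplier over the configured window, clamped outside it. A standalone sketch (field names mirror the YAML; the framework's actual code may differ):

```python
def base_ffn_lr_multiplier(step, start_step, end_step,
                           start_multiplier=1.0, end_multiplier=3.0):
    """Sketch of the LR multiplier applied to the base_ffn param group."""
    t = (step - start_step) / (end_step - start_step)
    t = min(max(t, 0.0), 1.0)  # hold start value before, end value after
    return start_multiplier + (end_multiplier - start_multiplier) * t
```

With the recommended alignment (ramp 4000~8000), the multiplier reaches x2.0 at the fade midpoint and holds at x3.0 once the fade completes.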
YAML Example
training:
phases:
- step: 0
trainable: ["router"]
- step: 500
trainable: ["lora", "attn_lora"]
- step: 4000
trainable: ["lora", "attn_lora", "router", "base_ffn"] # <- base_ffn required
base_ffn_keywords: ["gate_proj", "up_proj", "down_proj"]
lora_handoff:
expert_lora:
start_step: 4000
duration_steps: 4000 # fade interval: 4000~8000
end_scale: 0.0
curve: cosine
end_action: freeze
base_ffn_ramp:
start_step: 4000 # Start simultaneously with expert_lora fade
end_step: 8000 # End simultaneously with expert_lora fade completion
start_multiplier: 1.0 # Maintain existing LR
end_multiplier: 3.0 # Gradually increase up to 3x during fade interval
Note: To use base_ffn_ramp, the base_ffn group must be active in training.phases (P2).
Automatic Validator Checks
validate_config() automatically checks the following:
| Check | Level | Description |
|---|---|---|
| fade start < base_ffn phase start | Error | Fade would start while base_ffn is still frozen |
| fade start >= max_train_steps | Error | Fade never executes |
| ramp ends earlier than fade | Warning | No LR compensation in later fade stages -> risk of loss spike |
| ramp ends later than fade | Warning | Unnecessary LR boost -> risk of divergence |
| fade/ramp exceeds max_train_steps | Warning | Does not complete within training |
Incorrect configurations trigger an immediate error or warning before training starts.
6. Export Options
remove_adapters_if_zero_scale: true
At training completion, replaces LoRA modules whose lora_scale is 0.0 with plain nn.Linear modules.
Result: A pure model saved without LoRA parameters.
merge_adapters_on_export: true
Merges LoRA delta into base weights and then replaces with plain nn.Linear.
Useful when end_scale > 0 (absorbing partial LoRA into base).
export:
merge_adapters_on_export: true # Merge LoRA into base
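Merging folds the (possibly partially faded) delta into the base weight: W' = W + lora_scale * (alpha / r) * (B @ A). A toy plain-Python sketch of that arithmetic (the `merge_lora` helper is hypothetical, not the framework's export code):

```python
# Sketch: fold the scaled LoRA delta into the base weight matrix, after
# which the adapter matrices A and B can be dropped from the checkpoint.
def merge_lora(W, A, B, alpha, r, lora_scale):
    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]
    delta = matmul(B, A)          # (out, r) @ (r, in) -> (out, in)
    s = lora_scale * alpha / r    # effective scaling of the delta
    return [[w + s * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

With `end_scale=0` the delta vanishes and the merge is a no-op (removal suffices); with `end_scale > 0` the retained fraction is absorbed into the base weight, which is the partial-merge case in section 8.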
7. Checkpoint Structure and Bench Compatibility
Checkpoint Comparison Before/After Handoff
| Item | Without handoff (default) | With handoff (end_scale=0) |
|---|---|---|
| Expert FFN | N copies (each with base_layer + lora_A/B) | N copies (weight only after LoRA merge) |
| Router | present | present |
| LoRA parameters | present | absent (removed after merge) |
| Checkpoint size | large (includes LoRA) | small (LoRA removed) |
Before handoff:
├── experts.0.gate_proj.base_layer.weight + lora_A + lora_B
├── experts.1.gate_proj.base_layer.weight + lora_A + lora_B
├── ...
└── router.weight
After handoff (end_scale=0, remove_adapters_if_zero_scale=true):
├── experts.0.gate_proj.weight <- LoRA merged into base
├── experts.1.gate_proj.weight
├── ...
└── router.weight <- Router preserved
Key point: In both cases, the N-expert + router structure is fully preserved. Handoff only removes LoRA; it does not reduce the MoE structure.
Bench Loading
LoRA Handoff checkpoints are automatically recognized by bench:
- LocalHFClient reads injection settings from resolved_config.json
- Reconstructs the MoE structure on the base HF model with replace_ffn_with_moe_inplace()
- Handoff checkpoint: no LoRA keys -> loads directly into the MoE model
- Non-Handoff checkpoint: merges LoRA within experts before loading
- Runs inference with MoE structure (preserving learned expert routing patterns)
No additional configuration needed:
eulerforge bench \
--preset configs/bench/sft_target_only.yml \
--target-output-dir outputs/run_handoff/
8. Partial Fade (end_scale > 0)
When you want to retain some LoRA influence instead of complete removal:
lora_handoff:
expert_lora:
start_step: 4000
duration_steps: 4000
end_scale: 0.3 # Retain 30%
end_action: keep # Keep trainable
In this case, partial merge is possible with export.merge_adapters_on_export: true.
9. Validation
Configuration validation is performed automatically by validate_config() (rules H1~H7).
Key validation items:
- P1: Error if injection.strategy is not moe_expert_lora
- P2: Error if training.phases does not contain base_ffn
- start_step >= 0, duration_steps > 0
- end_scale must be in the range [0, 1]
- curve must be linear or cosine
- end_action must be keep or freeze
- base_ffn_ramp.end_step > start_step
- attn_lora schedule requires injection.attn_lora.enabled
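These checks are straightforward to express in code. A standalone sketch covering a few of the per-schedule rules (the real `validate_config()` implements rules H1~H7; this helper name and return shape are assumptions):

```python
# Hedged sketch of per-schedule validation: returns (errors, warnings)
# rather than raising, so the behavior is easy to inspect.
def check_handoff_schedule(sched, max_train_steps):
    errors, warnings = [], []
    if sched["start_step"] < 0 or sched["duration_steps"] <= 0:
        errors.append("start_step must be >= 0 and duration_steps > 0")
    if not 0.0 <= sched["end_scale"] <= 1.0:
        errors.append("end_scale must be in [0, 1]")
    if sched["curve"] not in ("linear", "cosine"):
        errors.append("curve must be linear or cosine")
    if sched["end_action"] not in ("keep", "freeze"):
        errors.append("end_action must be keep or freeze")
    if sched["start_step"] >= max_train_steps:
        errors.append("fade never executes")
    elif sched["start_step"] + sched["duration_steps"] > max_train_steps:
        warnings.append("fade does not complete within training")
    return errors, warnings
```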
# Validation only
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_sft_handoff.yml --validate-only
Details: lora_handoff_spec.md
10. Presets
| Preset | Model | Description |
|---|---|---|
| qwen3.5_0.8b_moe_expert_lora_sft_handoff.yml | Qwen3.5-0.8B | 3-phase + handoff |
| llama3_1b_moe_expert_lora_sft_handoff.yml | Llama-3.2-1B | 3-phase + handoff |