14. LoRA Handoff Scheduling

Prerequisites: Completed Tutorial 03 (FFN MoE Expert LoRA)

Research / Advanced Feature (Experimental): LoRA Handoff is an advanced research feature for LoRA-to-base FFN knowledge transfer. In most cases, standard 3-Phase training (without Handoff) is sufficient. Details: checkpoint_lora_lifecycle.md section 7

Strategy restriction: Only available with the moe_expert_lora strategy.

Training type restriction: DPO + Handoff is structurally incompatible. When LoRA is frozen, policy = reference, making learning impossible. Use Handoff only with SFT or ORPO. With dense_lora/mixture_lora, base weights are frozen, so "knowledge transfer" does not apply.


Goal

Learn how to gradually fade LoRA and transfer knowledge to base FFN in the moe_expert_lora strategy.


1. Why LoRA Handoff?

The moe_expert_lora strategy extends FFN into MoE and wraps each expert with LoRA. While this approach is efficient, the final model retains LoRA adapters, which means:

  - Every expert keeps its lora_A/lora_B parameters, inflating checkpoint size
  - Inference requires LoRA-aware module wrappers instead of plain nn.Linear
  - The learned knowledge lives in the adapters, not in the base FFN weights

LoRA Handoff solves these issues:

  1. Start with fast convergence using LoRA in the early stages
  2. After unfreezing base_ffn, gradually reduce LoRA influence (lora_scale: 1.0 -> 0.0)
  3. Simultaneously increase base_ffn LR so base weights absorb the role of LoRA
  4. Ultimately obtain a pure model without LoRA

Why Only moe_expert_lora?

| Strategy | Can base_ffn be trained? | Handoff meaning |
|---|---|---|
| dense_lora | frozen (default) | scale=0 -> reverts to original base. No transfer |
| mixture_lora | frozen (default) | Same. Single base, not trained |
| moe_expert_lora | each expert copy can be unfrozen | scale down + base LR up -> knowledge transfer is viable |

With dense_lora/mixture_lora, setting lora_scale=0 + remove does not "transfer knowledge" -- the changes simply disappear and the model reverts to the original base. Therefore, training.lora_handoff is only allowed when injection.strategy=moe_expert_lora.


2. Basic Configuration Example

# Based on configs/presets/qwen3.5_0.8b_moe_expert_lora_sft_handoff.yml
device: cuda:0
backbone: qwen3
model_name: Qwen/Qwen3.5-0.8B-Base

injection:
  strategy: moe_expert_lora      # <- Required: handoff only works with this strategy
  lora_r: 48
  lora_alpha: 96
  lora_dropout: 0.05
  num_experts: 4
  top_k: 2
  target_keywords: [gate_proj, up_proj, down_proj]
  start_layer: 0
  num_layers: 0
  attn_lora:
    enabled: true
    keywords: [q_proj, v_proj]

moe:
  router_z_loss_coef: 0.001
  load_balance:
    type: aux_loss
    aux_loss_coef: 0.01
  router_dtype: float32

training:
  type: sft
  phases:
    - step: 0
      trainable: ["router"]              # Phase 0: Router warmup
    - step: 500
      trainable: ["lora", "attn_lora"]   # Phase 1: LoRA training
    - step: 4000
      trainable: ["lora", "attn_lora", "router", "base_ffn"]  # Phase 2: base_ffn unfreeze
      base_ffn_keywords: ["gate_proj", "up_proj", "down_proj"]
      # target_layers omitted -> automatically extracted from injection.start_layer/num_layers
  lr: 1.0e-5
  weight_decay: 0.01
  warmup_steps: 200
  max_train_steps: 10000
  batch_size: 4
  grad_accum_steps: 4
  max_grad_norm: 1.0
  log_steps: 50
  save_steps: 2000
  val_steps: 1000

  # -- LoRA Handoff --
  lora_handoff:
    expert_lora:
      start_step: 4000        # Start fade simultaneously with base_ffn unfreeze
      duration_steps: 4000    # Gradually decrease over 4000 steps
      end_scale: 0.0          # Complete removal
      curve: cosine           # Cosine decay (gradual early, steep later)
      end_action: freeze      # Freeze LoRA parameters after fade completes
    attn_lora:
      start_step: 5000
      duration_steps: 3000
      end_scale: 0.0
      curve: linear
      end_action: freeze
    base_ffn_ramp:
      start_step: 4000
      end_step: 8000
      start_multiplier: 1.0
      end_multiplier: 3.0
    export:
      remove_adapters_if_zero_scale: true

Preconditions

The validator checks these automatically:

| Condition | On violation |
|---|---|
| P1: injection.strategy = moe_expert_lora | ConfigValidationError |
| P2: training.phases must contain at least one phase with base_ffn trainable | ConfigValidationError |
| P3: lora_handoff.attn_lora requires injection.attn_lora.enabled: true | ConfigValidationError |

3. Verifying Handoff in Training Logs

When Handoff is active, the following events are automatically printed in the training logs:

[Train] LoRA Handoff Scheduler enabled: ['expert_lora', 'attn_lora', 'base_ffn_ramp', 'export']
[Handoff] Schedule: expert_lora(step 4000→8000, curve=cosine, end_scale=0.0, end_action=freeze), attn_lora(step 5000→8000, curve=linear, ...), base_ffn_ramp(step 4000→8000, LR ×1.0→×3.0)
...
[Handoff] base_ffn_ramp started at step 4000 (LR ×1.0→×3.0)
[Handoff] expert_lora fade started at step 4000 (curve=cosine, end_scale=0.00, duration=4000 steps)
...
[Handoff] attn_lora fade started at step 5000 (curve=linear, end_scale=0.00, duration=3000 steps)
...
[Phase2] Step 6000/10000 ... | Handoff[expert_lora=0.71, attn_lora=0.67, ffn_lr_mult=2.00]
...
[Handoff] base_ffn_ramp completed at step 8000 (LR ×3.0)
[Handoff] expert_lora fade completed at step 8000 (scale=0.0000)
[Handoff] expert_lora frozen at step 8000 (scale=0.0000)
[Handoff] attn_lora fade completed at step 8000 (scale=0.0000)
[Handoff] attn_lora frozen at step 8000 (scale=0.0000)

4. Understanding Fade Curves

Linear Fade

lora_scale
1.0 ┤████████████
    │            ╲
0.5 ┤             ╲
    │              ╲
0.0 ┤               ████████
    └──┬──────┬──────┬──────→ step
       start  mid    end

Decreases at a uniform rate. Simple and predictable.

Cosine Fade

lora_scale
1.0 ┤████████████╲
    │             ╲
0.5 ┤              ╲
    │               ╲
0.0 ┤                ╲██████
    └──┬──────┬──────┬──────→ step
       start  mid    end

Gradual at first, steep later. Recommended when you want to smoothly reduce LoRA dependency.
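The two curves can be sketched in a few lines. fade_scale below is a hypothetical helper, not the framework's actual scheduler code; it uses a quarter-cosine for the "gradual early, steep later" shape described above, and interpolates down to end_scale so that partial fades (section 8) fall out of the same formula:

```python
import math

def fade_scale(step: int, start_step: int, duration_steps: int,
               end_scale: float = 0.0, curve: str = "linear") -> float:
    """Hypothetical sketch: LoRA scale at a given step during a fade."""
    if step <= start_step:
        return 1.0
    t = min((step - start_step) / duration_steps, 1.0)  # fade progress in [0, 1]
    if curve == "linear":
        remaining = 1.0 - t                    # uniform rate
    elif curve == "cosine":
        remaining = math.cos(math.pi / 2 * t)  # gradual early, steep later
    else:
        raise ValueError(f"unknown curve: {curve}")
    # interpolate from 1.0 down to end_scale
    return end_scale + (1.0 - end_scale) * remaining

# With the section-2 config (start_step=4000, duration_steps=4000), at step 6000:
print(round(fade_scale(6000, 4000, 4000, curve="linear"), 2))   # → 0.5
print(round(fade_scale(6000, 4000, 4000, curve="cosine"), 2))   # → 0.71
```

At the fade midpoint the cosine curve still retains ~71% of the LoRA contribution, which is exactly the "smoothly reduce LoRA dependency" behavior recommended above.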


5. base_ffn_ramp (LR Compensation)

As LoRA diminishes, base FFN must take over its role. If base FFN cannot absorb the contribution quickly enough while LoRA's contribution decreases, a loss spike occurs. base_ffn_ramp gradually increases the base_ffn LR during this interval to accelerate knowledge transfer.
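The ramp itself is a plain linear interpolation of an LR multiplier over the configured interval, applied to the base_ffn parameter group. ramp_multiplier below is a hypothetical sketch, not the framework's actual implementation:

```python
def ramp_multiplier(step: int, start_step: int, end_step: int,
                    start_multiplier: float = 1.0,
                    end_multiplier: float = 3.0) -> float:
    """Hypothetical sketch: LR multiplier for the base_ffn param group."""
    if step <= start_step:
        return start_multiplier
    if step >= end_step:
        return end_multiplier
    t = (step - start_step) / (end_step - start_step)  # ramp progress in [0, 1]
    return start_multiplier + (end_multiplier - start_multiplier) * t

# With start_step=4000, end_step=8000, ×1.0→×3.0:
for s in (4000, 6000, 8000):
    print(s, ramp_multiplier(s, 4000, 8000))  # → 1.0, 2.0, 3.0
```

The effective base_ffn LR is then training.lr × ramp_multiplier(step); before the ramp it stays at the configured LR, and after end_step it holds at end_multiplier.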

Interval Configuration Principles

Key: ramp interval = expert_lora fade interval

expert_lora:   start_step=4000, duration_steps=4000  → fade interval 4000~8000
base_ffn_ramp: start_step=4000, end_step=8000        → aligned to the same interval

Timing visualization (total 10k steps):

Step 0────500────4000──────────8000──────10000
     │     │       │             │          │
     │     │       ├─ expert_lora fade (1.0→0.0, cosine)
     │     │       ├─ base_ffn_ramp (LR ×1.0→×3.0)
     │     │       │             │          │
   Phase0 Phase1  Phase2         ├─ Handoff complete
   router  LoRA   base_ffn       └─ Remaining 2k steps: stabilization with base only
   warmup training unfreeze

Anti-Patterns

| Configuration | Problem |
|---|---|
| ramp ends earlier than fade (e.g., 4k~6k) | No LR compensation in the second half of the fade while LoRA keeps decreasing -> loss spike |
| ramp ends later than fade (e.g., 4k~10k) | LR keeps increasing after the fade completes -> risk of divergence |
| end_multiplier ×5 or higher | Risk of gradient explosion (recommended: ×2~×3) |

YAML Example

training:
  phases:
    - step: 0
      trainable: ["router"]
    - step: 500
      trainable: ["lora", "attn_lora"]
    - step: 4000
      trainable: ["lora", "attn_lora", "router", "base_ffn"]  # <- base_ffn required
      base_ffn_keywords: ["gate_proj", "up_proj", "down_proj"]

  lora_handoff:
    expert_lora:
      start_step: 4000
      duration_steps: 4000       # fade interval: 4000~8000
      end_scale: 0.0
      curve: cosine
      end_action: freeze
    base_ffn_ramp:
      start_step: 4000           # Start simultaneously with expert_lora fade
      end_step: 8000             # End simultaneously with expert_lora fade completion
      start_multiplier: 1.0      # Maintain existing LR
      end_multiplier: 3.0        # Gradually increase up to 3x during fade interval

Note: To use base_ffn_ramp, the base_ffn group must be active in training.phases (P2).

Automatic Validator Checks

validate_config() automatically checks the following:

| Check | Level | Description |
|---|---|---|
| fade start < base_ffn phase step | Error | Fade would start while base_ffn is still frozen |
| fade start >= max_train_steps | Error | Fade never executes |
| ramp ends earlier than fade | Warning | No LR compensation in later fade stages -> risk of loss spike |
| ramp ends later than fade | Warning | Unnecessary LR boost -> risk of divergence |
| fade/ramp exceeds max_train_steps | Warning | Does not complete within training |

Incorrect configurations trigger an immediate error or warning before training starts.
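The interval checks in the table above reduce to a few comparisons between the fade interval, the ramp interval, the base_ffn phase step, and max_train_steps. check_handoff_intervals is a hypothetical sketch; the real validate_config() covers more rules (H1~H7) and raises ConfigValidationError instead of returning lists:

```python
def check_handoff_intervals(fade_start: int, fade_end: int,
                            ramp_start: int, ramp_end: int,
                            base_ffn_phase_step: int,
                            max_train_steps: int):
    """Hypothetical sketch of the fade/ramp interval checks."""
    errors, warnings = [], []
    if fade_start < base_ffn_phase_step:
        errors.append("fade starts while base_ffn is still frozen")
    if fade_start >= max_train_steps:
        errors.append("fade never executes")
    if ramp_end < fade_end:
        warnings.append("ramp ends earlier than fade -> risk of loss spike")
    elif ramp_end > fade_end:
        warnings.append("ramp ends later than fade -> risk of divergence")
    if max(fade_end, ramp_end) > max_train_steps:
        warnings.append("fade/ramp does not complete within training")
    return errors, warnings

# Aligned intervals from section 5 pass cleanly:
print(check_handoff_intervals(4000, 8000, 4000, 8000, 4000, 10000))  # → ([], [])
```

Both anti-patterns from the previous table surface here as warnings rather than errors, since misaligned intervals can still train, just badly.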


6. Export Options

remove_adapters_if_zero_scale: true

At training completion, replaces LoRA modules whose lora_scale is 0.0 with plain nn.Linear. Result: a pure model saved without LoRA parameters.

merge_adapters_on_export: true

Merges LoRA delta into base weights and then replaces with plain nn.Linear. Useful when end_scale > 0 (absorbing partial LoRA into base).

export:
  merge_adapters_on_export: true   # Merge LoRA into base
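Conceptually, merging folds the scaled low-rank delta into the base weight: W_merged = W + lora_scale · (alpha/r) · B·A. A numpy sketch under assumed conventions (lora_A has shape (r, in_features), lora_B has shape (out_features, r)); merge_lora is hypothetical, not the framework's export code:

```python
import numpy as np

def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               lora_alpha: float, lora_r: int, lora_scale: float) -> np.ndarray:
    """Fold the LoRA delta into the base weight (sketch of a merge-on-export)."""
    return W + lora_scale * (lora_alpha / lora_r) * (B @ A)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))   # base weight: (out_features, in_features)
A = rng.normal(size=(4, 16))   # lora_A: (r, in_features)
B = rng.normal(size=(8, 4))    # lora_B: (out_features, r)

# Partial fade (end_scale=0.3): 30% of the delta is absorbed into the base.
merged = merge_lora(W, A, B, lora_alpha=96, lora_r=48, lora_scale=0.3)

# With lora_scale=0.0 the merge is a no-op, which is why zero-scale
# adapters can simply be removed (remove_adapters_if_zero_scale) instead.
assert np.allclose(merge_lora(W, A, B, 96, 48, 0.0), W)
```

After merging, the wrapped module can be swapped for a plain nn.Linear holding W_merged, leaving no LoRA parameters in the checkpoint.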

7. Checkpoint Structure and Bench Compatibility

Checkpoint Comparison Before/After Handoff

| Item | Without handoff (default) | With handoff (end_scale=0) |
|---|---|---|
| Expert FFN | N copies (each with base_layer + lora_A/B) | N copies (weight only after LoRA merge) |
| Router | present | present |
| LoRA parameters | present | absent (removed after merge) |
| Checkpoint size | larger (includes LoRA) | smaller (LoRA removed) |

Before handoff:
├── experts.0.gate_proj.base_layer.weight + lora_A + lora_B
├── experts.1.gate_proj.base_layer.weight + lora_A + lora_B
├── ...
└── router.weight

After handoff (end_scale=0, remove_adapters_if_zero_scale=true):
├── experts.0.gate_proj.weight    <- LoRA merged into base
├── experts.1.gate_proj.weight
├── ...
└── router.weight                 <- Router preserved

Key point: In both cases, the N-expert + router structure is fully preserved. Handoff only removes LoRA; it does not reduce the MoE structure.

Bench Loading

LoRA Handoff checkpoints are automatically recognized by bench:

  1. LocalHFClient reads injection settings from resolved_config.json
  2. Reconstructs MoE structure on the base HF model with replace_ffn_with_moe_inplace()
  3. Handoff checkpoint: no LoRA keys -> loads directly into MoE model
  4. Non-Handoff checkpoint: merges LoRA within experts before loading
  5. Runs inference with MoE structure (preserving learned expert routing patterns)
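The difference between steps 3 and 4 boils down to whether the state dict still contains LoRA keys. A simplified sketch (is_handoff_checkpoint is a hypothetical helper, not the actual LocalHFClient API):

```python
def is_handoff_checkpoint(state_dict_keys) -> bool:
    """True if no LoRA keys remain, i.e. adapters were removed on export."""
    return not any(".lora_A" in k or ".lora_B" in k for k in state_dict_keys)

before = ["experts.0.gate_proj.base_layer.weight",
          "experts.0.gate_proj.lora_A.weight",
          "experts.0.gate_proj.lora_B.weight",
          "router.weight"]
after = ["experts.0.gate_proj.weight", "router.weight"]

print(is_handoff_checkpoint(before))  # → False: merge LoRA before loading
print(is_handoff_checkpoint(after))   # → True: load directly into the MoE model
```

Either way the router keys survive, matching the key point above: handoff removes LoRA but leaves the N-expert + router structure intact.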

No additional configuration needed:

eulerforge bench \
  --preset configs/bench/sft_target_only.yml \
  --target-output-dir outputs/run_handoff/

8. Partial Fade (end_scale > 0)

When you want to retain some LoRA influence instead of complete removal:

lora_handoff:
  expert_lora:
    start_step: 4000
    duration_steps: 4000
    end_scale: 0.3         # Retain 30%
    end_action: keep       # Keep trainable

In this case, partial merge is possible with export.merge_adapters_on_export: true.


9. Validation

Configuration validation is performed automatically by validate_config() (rules H1~H7). To run the checks without starting training:

# Validation only
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_sft_handoff.yml --validate-only

Details: lora_handoff_spec.md


10. Presets

| Preset | Model | Description |
|---|---|---|
| qwen3.5_0.8b_moe_expert_lora_sft_handoff.yml | Qwen3.5-0.8B | 3-phase + handoff |
| llama3_1b_moe_expert_lora_sft_handoff.yml | Llama-3.2-1B | 3-phase + handoff |