14. LoRA Handoff Scheduling
Prerequisites: Completed Tutorial 03 (FFN MoE Expert LoRA)
Research / Advanced Feature (Experimental): LoRA Handoff is an advanced research feature for LoRA-to-base FFN knowledge transfer. In most cases, standard 3-Phase training (without Handoff) is sufficient. Details: checkpoint_lora_lifecycle.md section 7
Strategy restriction: Only available with the moe_expert_lora strategy.
Training type restriction: DPO + Handoff is structurally incompatible: when LoRA is frozen, policy = reference, making learning impossible. Use Handoff only with SFT or ORPO.
With dense_lora / mixture_lora, base weights are frozen, so "knowledge transfer" does not apply.
Goal
Learn how to gradually fade LoRA and transfer knowledge to base FFN
in the moe_expert_lora strategy.
- First half of training: Converge quickly using LoRA delta
- Second half of training: Reduce LoRA scale from 1.0 to 0.0 while strengthening base FFN learning
- Final checkpoint: Can save a pure base model without any LoRA parameters
1. Why LoRA Handoff?
The moe_expert_lora strategy extends FFN into MoE and wraps each expert with LoRA.
While this approach is efficient, the final model retains LoRA adapters, which means:
- Additional computation during inference (base + LoRA delta)
- Increased checkpoint size
- Increased deployment complexity
LoRA Handoff solves these issues:
- Start with fast convergence using LoRA in the early stages
- After unfreezing base_ffn, gradually reduce LoRA influence (lora_scale: 1.0 -> 0.0)
- Simultaneously increase the base_ffn LR so the base weights absorb the role of LoRA
- Ultimately obtain a pure model without LoRA
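The mechanism behind the fade can be sketched as a scaled residual: each expert's output is its base projection plus the LoRA delta multiplied by the scheduler-controlled `lora_scale`. The sketch below uses toy plain-Python matrices and an assumed `expert_forward` helper (not the framework's actual code):

```python
# Hypothetical sketch: y = base(x) + lora_scale * (alpha / r) * B(A(x)).
# As the scheduler decays lora_scale from 1.0 to 0.0, the LoRA delta's
# contribution vanishes and only the base projection remains.
def expert_forward(x, base_weight, lora_A, lora_B, alpha, r, lora_scale):
    def matvec(W, v):
        return [sum(w * vi for w, vi in zip(row, v)) for row in W]
    base_out = matvec(base_weight, x)          # base FFN projection
    lora_out = matvec(lora_B, matvec(lora_A, x))  # low-rank delta
    return [b + lora_scale * (alpha / r) * d for b, d in zip(base_out, lora_out)]
```

At `lora_scale=0.0` the output is exactly the base projection, which is why a fully faded adapter can be dropped from the checkpoint.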
Why Only moe_expert_lora?
| Strategy | Can base_ffn be trained? | Handoff meaning |
|---|---|---|
| dense_lora | frozen (default) | scale=0 -> reverts to original base. No transfer |
| mixture_lora | frozen (default) | Same. Single base, not trained |
| moe_expert_lora | Each expert copy can be unfrozen | scale down + base LR up -> knowledge transfer is viable |
With dense_lora/mixture_lora, setting lora_scale=0 + remove does not "transfer knowledge" --
the changes simply disappear and the model reverts to the original base. Therefore, training.lora_handoff
is only allowed when injection.strategy=moe_expert_lora.
2. Basic Configuration Example
# Based on configs/presets/qwen3.5_0.8b_moe_expert_lora_sft_handoff.yml
device: cuda:0
backbone: qwen3
model_name: Qwen/Qwen3.5-0.8B-Base
injection:
strategy: moe_expert_lora # <- Required: handoff only works with this strategy
lora_r: 48
lora_alpha: 96
lora_dropout: 0.05
num_experts: 4
top_k: 2
target_keywords: [gate_proj, up_proj, down_proj]
start_layer: 0
num_layers: 0
attn_lora:
enabled: true
keywords: [q_proj, v_proj]
moe:
router_z_loss_coef: 0.001
load_balance:
type: aux_loss
aux_loss_coef: 0.01
router_dtype: float32
training:
type: sft
phases:
- step: 0
trainable: ["router"] # Phase 0: Router warmup
- step: 500
trainable: ["lora", "attn_lora"] # Phase 1: LoRA training
- step: 4000
trainable: ["lora", "attn_lora", "router", "base_ffn"] # Phase 2: base_ffn unfreeze
base_ffn_keywords: ["gate_proj", "up_proj", "down_proj"]
# target_layers omitted -> automatically extracted from injection.start_layer/num_layers
lr: 1.0e-5
weight_decay: 0.01
warmup_steps: 200
max_train_steps: 10000
batch_size: 4
grad_accum_steps: 4
max_grad_norm: 1.0
log_steps: 50
save_steps: 2000
val_steps: 1000
# -- LoRA Handoff --
lora_handoff:
expert_lora:
start_step: 4000 # Start fade simultaneously with base_ffn unfreeze
duration_steps: 4000 # Gradually decrease over 4000 steps
end_scale: 0.0 # Complete removal
curve: cosine # Cosine decay (gradual early, steep later)
end_action: freeze # Freeze LoRA parameters after fade completes
attn_lora:
start_step: 5000
duration_steps: 3000
end_scale: 0.0
curve: linear
end_action: freeze
base_ffn_ramp:
start_step: 4000
end_step: 8000
start_multiplier: 1.0
end_multiplier: 3.0
export:
remove_adapters_if_zero_scale: true
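The schedule above can be sanity-checked by hand: the expert fade runs from `start_step` to `start_step + duration_steps`, and the ramp should cover the same interval. A small sketch (dict keys mirror the YAML fields; the `fade_interval` helper is hypothetical):

```python
# Derive the fade/ramp intervals from the YAML values above and confirm
# the alignment the tutorial recommends (ramp interval == expert fade interval).
handoff = {
    "expert_lora":   {"start_step": 4000, "duration_steps": 4000},
    "attn_lora":     {"start_step": 5000, "duration_steps": 3000},
    "base_ffn_ramp": {"start_step": 4000, "end_step": 8000},
}

def fade_interval(sched):
    return (sched["start_step"], sched["start_step"] + sched["duration_steps"])

expert_fade = fade_interval(handoff["expert_lora"])  # (4000, 8000)
ramp = (handoff["base_ffn_ramp"]["start_step"],
        handoff["base_ffn_ramp"]["end_step"])        # (4000, 8000)
assert expert_fade == ramp  # LR compensation covers the entire fade window
```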
Preconditions
The validator checks these automatically:
| Condition | On violation |
|---|---|
| P1: injection.strategy = moe_expert_lora | ConfigValidationError |
| P2: training.phases must contain at least one phase with base_ffn | ConfigValidationError |
| P3: lora_handoff.attn_lora requires injection.attn_lora.enabled: true | ConfigValidationError |
3. Verifying Handoff in Training Logs
When Handoff is active, the following events are automatically printed in the training logs:
[Train] LoRA Handoff Scheduler enabled: ['expert_lora', 'attn_lora', 'base_ffn_ramp', 'export']
[Handoff] Schedule: expert_lora(step 4000→8000, curve=cosine, end_scale=0.0, end_action=freeze), attn_lora(step 5000→8000, curve=linear, ...), base_ffn_ramp(step 4000→8000, LR ×1.0→×3.0)
...
[Handoff] expert_lora fade started at step 4000 (curve=cosine, end_scale=0.00, duration=4000 steps)
[Handoff] base_ffn_ramp started at step 4000 (LR ×1.0→×3.0)
[Handoff] attn_lora fade started at step 5000 (curve=linear, end_scale=0.00, duration=3000 steps)
...
[Phase2] Step 6000/10000 ... | Handoff[expert_lora=0.71, attn_lora=0.67, ffn_lr_mult=2.00]
...
[Handoff] base_ffn_ramp completed at step 8000 (LR ×3.0)
[Handoff] expert_lora fade completed at step 8000 (scale=0.0000)
[Handoff] expert_lora frozen at step 8000 (scale=0.0000)
[Handoff] attn_lora fade completed at step 8000 (scale=0.0000)
[Handoff] attn_lora frozen at step 8000 (scale=0.0000)
- Schedule summary: The full schedule is printed in a single line at the start of training
- Milestone events: Printed once at fade start/completion, freeze, and ramp start/completion
- Periodic status: Current scale and LR multiplier are appended to the log line every log_steps
4. Understanding Fade Curves
Linear Fade
lora_scale
1.0 ┤████████████
│ ╲
0.5 ┤ ╲
│ ╲
0.0 ┤ ████████
└──┬──────┬──────┬──────→ step
start mid end
Decreases at a uniform rate. Simple and predictable.
Cosine Fade
lora_scale
1.0 ┤████████████╲
│ ╲
0.5 ┤ ╲
│ ╲
0.0 ┤ ╲██████
└──┬──────┬──────┬──────→ step
start mid end
Gradual at first, steep later. Recommended when you want to smoothly reduce LoRA dependency.
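The two curves can be expressed in a few lines. This is a sketch, not the framework's actual implementation; the cosine variant below uses a quarter-cosine (`cos(pi/2 * t)`) to match the "gradual early, steep later" shape drawn above, and the exact formula in the code may differ:

```python
import math

def fade_scale(step, start_step, duration_steps, end_scale=0.0, curve="linear"):
    """Sketch of the scheduled lora_scale at a given training step."""
    t = (step - start_step) / duration_steps
    t = min(max(t, 0.0), 1.0)          # hold 1.0 before start, end_scale after end
    if curve == "linear":
        frac = 1.0 - t                 # uniform decrease
    else:  # "cosine": gradual at first, steep toward the end
        frac = math.cos(math.pi / 2 * t)
    return end_scale + (1.0 - end_scale) * frac  # interpolate 1.0 -> end_scale
```

At the midpoint of the fade, linear gives 0.5 while the cosine curve is still at about 0.71, which is why cosine is recommended when you want to reduce LoRA dependency gently at first.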
5. base_ffn_ramp (LR Compensation)
As LoRA diminishes, base FFN must take over its role.
If base FFN cannot absorb the contribution quickly enough while LoRA's contribution decreases,
a loss spike occurs. base_ffn_ramp gradually increases the base_ffn LR during this interval
to accelerate knowledge transfer.
Interval Configuration Principles
Key: ramp interval = expert_lora fade interval
expert_lora: start_step=4000, duration_steps=4000 → fade interval 4000~8000
base_ffn_ramp: start_step=4000, end_step=8000 → aligned to the same interval
- ramp.start_step = expert_lora.start_step -- Start LR compensation when LoRA begins to fade
- ramp.end_step = expert_lora.start_step + duration_steps -- End the ramp simultaneously with fade completion
- After fade completes, LoRA contribution is 0, so additional LR compensation is unnecessary; excessive LR can actually cause divergence
Timing visualization (total 10k steps):
Step 0───500───4000───────────8000──────10000
│ │ │ │ │
│ │ ├─ expert_lora fade (1.0→0.0, cosine)
│ │ ├─ base_ffn_ramp (LR ×1.0→×3.0)
│ │ │ │
Phase0 Phase1 Phase2 ├─ Handoff complete
router LoRA base_ffn └─ Remaining 2k steps: stabilization with base only
training unfreeze
Anti-Patterns
| Configuration | Problem |
|---|---|
| ramp ends earlier than fade (e.g., 4k~6k) | No LR compensation in the second half of fade while LoRA keeps decreasing -> loss spike |
| ramp ends later than fade (e.g., 4k~10k) | LR keeps increasing after fade completes -> risk of divergence |
| end_multiplier x5 or higher | Risk of gradient explosion (recommended: x2~x3) |
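The ramp itself is a simple linear interpolation of the LR multiplier over the configured window, clamped outside it. A standalone sketch (field names mirror the YAML; the framework's actual code may differ):

```python
def base_ffn_lr_multiplier(step, start_step, end_step,
                           start_multiplier=1.0, end_multiplier=3.0):
    """Sketch of the LR multiplier applied to the base_ffn param group."""
    t = (step - start_step) / (end_step - start_step)
    t = min(max(t, 0.0), 1.0)  # hold start value before, end value after
    return start_multiplier + (end_multiplier - start_multiplier) * t
```

With the recommended alignment (ramp 4000~8000), the multiplier reaches x2.0 at the fade midpoint and holds at x3.0 once the fade completes.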
YAML Example
training:
phases:
- step: 0
trainable: ["router"]
- step: 500
trainable: ["lora", "attn_lora"]
- step: 4000
trainable: ["lora", "attn_lora", "router", "base_ffn"] # <- base_ffn required
base_ffn_keywords: ["gate_proj", "up_proj", "down_proj"]
lora_handoff:
expert_lora:
start_step: 4000
duration_steps: 4000 # fade interval: 4000~8000
end_scale: 0.0
curve: cosine
end_action: freeze
base_ffn_ramp:
start_step: 4000 # Start simultaneously with expert_lora fade
end_step: 8000 # End simultaneously with expert_lora fade completion
start_multiplier: 1.0 # Maintain existing LR
end_multiplier: 3.0 # Gradually increase up to 3x during fade interval
Note: To use base_ffn_ramp, the base_ffn group must be active in training.phases (P2).
Automatic Validator Checks
validate_config() automatically checks the following:
| Check | Level | Description |
|---|---|---|
| fade start < base_ffn phase start | Error | Fade would start while base_ffn is still frozen |
| fade start >= max_train_steps | Error | Fade never executes |
| ramp ends earlier than fade | Warning | No LR compensation in later fade stages -> risk of loss spike |
| ramp ends later than fade | Warning | Unnecessary LR boost -> risk of divergence |
| fade/ramp exceeds max_train_steps | Warning | Does not complete within training |
Incorrect configurations trigger an immediate error or warning before training starts.
6. Export Options
remove_adapters_if_zero_scale: true
At training completion, replaces LoRA modules whose lora_scale is 0.0 with plain nn.Linear modules.
Result: A pure model saved without LoRA parameters.
merge_adapters_on_export: true
Merges LoRA delta into base weights and then replaces with plain nn.Linear.
Useful when end_scale > 0 (absorbing partial LoRA into base).
export:
merge_adapters_on_export: true # Merge LoRA into base
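Merging folds the (possibly partially faded) delta into the base weight: W' = W + lora_scale * (alpha / r) * (B @ A). A toy plain-Python sketch of that arithmetic (the `merge_lora` helper is hypothetical, not the framework's export code):

```python
# Sketch: fold the scaled LoRA delta into the base weight matrix, after
# which the adapter matrices A and B can be dropped from the checkpoint.
def merge_lora(W, A, B, alpha, r, lora_scale):
    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]
    delta = matmul(B, A)          # (out, r) @ (r, in) -> (out, in)
    s = lora_scale * alpha / r    # effective scaling of the delta
    return [[w + s * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

With `end_scale=0` the delta vanishes and the merge is a no-op (removal suffices); with `end_scale > 0` the retained fraction is absorbed into the base weight, which is the partial-merge case in section 8.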
7. Checkpoint Structure and Bench Compatibility
Checkpoint Comparison Before/After Handoff
| Item | Without handoff (default) | With handoff (end_scale=0) |
|---|---|---|
| Expert FFN | N copies (each with base_layer + lora_A/B) | N copies (weight only after LoRA merge) |
| Router | present | present |
| LoRA parameters | present | absent (removed after merge) |
| Checkpoint size | large (includes LoRA) | small (LoRA removed) |
Before handoff:
├── experts.0.gate_proj.base_layer.weight + lora_A + lora_B
├── experts.1.gate_proj.base_layer.weight + lora_A + lora_B
├── ...
└── router.weight
After handoff (end_scale=0, remove_adapters_if_zero_scale=true):
├── experts.0.gate_proj.weight <- LoRA merged into base
├── experts.1.gate_proj.weight
├── ...
└── router.weight <- Router preserved
Key point: In both cases, the N-expert + router structure is fully preserved. Handoff only removes LoRA; it does not reduce the MoE structure.
Bench Loading
LoRA Handoff checkpoints are automatically recognized by bench:
- LocalHFClient reads injection settings from resolved_config.json
- Reconstructs the MoE structure on the base HF model with replace_ffn_with_moe_inplace()
- Handoff checkpoint: no LoRA keys -> loads directly into the MoE model
- Non-Handoff checkpoint: merges LoRA within experts before loading
- Runs inference with MoE structure (preserving learned expert routing patterns)
No additional configuration needed:
eulerforge bench \
--preset configs/bench/sft_target_only.yml \
--target-output-dir outputs/run_handoff/
8. Partial Fade (end_scale > 0)
When you want to retain some LoRA influence instead of complete removal:
lora_handoff:
expert_lora:
start_step: 4000
duration_steps: 4000
end_scale: 0.3 # Retain 30%
end_action: keep # Keep trainable
In this case, partial merge is possible with export.merge_adapters_on_export: true.
9. Validation
Configuration validation is performed automatically by validate_config() (rules H1~H7).
Key validation items:
- P1: Error if injection.strategy is not moe_expert_lora
- P2: Error if training.phases does not contain base_ffn
- start_step >= 0, duration_steps > 0
- end_scale must be in the range [0, 1]
- curve must be linear or cosine
- end_action must be keep or freeze
- base_ffn_ramp.end_step > start_step
- attn_lora schedule requires injection.attn_lora.enabled
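These checks are straightforward to express in code. A standalone sketch covering a few of the per-schedule rules (the real `validate_config()` implements rules H1~H7; this helper name and return shape are assumptions):

```python
# Hedged sketch of per-schedule validation: returns (errors, warnings)
# rather than raising, so the behavior is easy to inspect.
def check_handoff_schedule(sched, max_train_steps):
    errors, warnings = [], []
    if sched["start_step"] < 0 or sched["duration_steps"] <= 0:
        errors.append("start_step must be >= 0 and duration_steps > 0")
    if not 0.0 <= sched["end_scale"] <= 1.0:
        errors.append("end_scale must be in [0, 1]")
    if sched["curve"] not in ("linear", "cosine"):
        errors.append("curve must be linear or cosine")
    if sched["end_action"] not in ("keep", "freeze"):
        errors.append("end_action must be keep or freeze")
    if sched["start_step"] >= max_train_steps:
        errors.append("fade never executes")
    elif sched["start_step"] + sched["duration_steps"] > max_train_steps:
        warnings.append("fade does not complete within training")
    return errors, warnings
```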
# Validation only
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_sft_handoff.yml --validate-only
Details: lora_handoff_spec.md
10. Presets
| Preset | Model | Description |
|---|---|---|
| qwen3.5_0.8b_moe_expert_lora_sft_handoff.yml | Qwen3.5-0.8B | 3-phase + handoff |
| llama3_1b_moe_expert_lora_sft_handoff.yml | Llama-3.2-1B | 3-phase + handoff |