5. DPO Training
Overview
DPO (Direct Preference Optimization) is a training method that aligns models using preferred/rejected response pairs. In EulerForge, DPO operates independently of the injection strategy and can be combined with all strategies (dense_lora, mixture_lora, moe_expert_lora, native_moe_expert_lora).
- Suitable for: RLHF alternative, model alignment, preference-based fine-tuning
- Key difference from SFT: SFT trains on a single response per example; DPO trains by comparing preferred/rejected response pairs
- Reference preset: `configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml`
- Prerequisite: Always apply DPO to a model that has completed SFT training first.
Why SFT Should Come First
DPO/ORPO and other preference training methods are effective only when applied to models that already have instruction-following ability. Applying DPO directly to a base model means learning "which answer is better" while not knowing "how to answer," which can actually decrease benchmark scores.
Correct order: SFT (instruction learning) -> DPO (preference alignment)
Incorrect order: Base model -> DPO (learning preferences without basic instruction-following ability)

Run SFT first, then specify its checkpoint (`final/`) as the `model_name` for DPO.
SFT vs DPO Comparison
| Item | SFT | DPO |
|---|---|---|
| Data format | Single response (`input_ids`, `labels`) | Preferred/rejected pairs (`chosen_*`, `rejected_*`) |
| Loss function | Cross-entropy loss | DPO loss (log probability ratio) |
| Reference model | Not required | Required (substituted by disabling adapters) |
| Configuration key | `training.type: sft` | `training.type: dpo`, `training.dpo_beta` |
| Effective batch size | Batch size as-is | Batch size x 2 (chosen + rejected) |
| Typical learning rate | `1.0e-5` | `5.0e-6` (smaller) |
Prerequisites
- EulerForge installation complete (see Getting Started)
- Data preprocessing complete (`data/dpo_10k_raw.jsonl` generated)
- Understanding of SFT fine-tuning concepts (see injection strategy tutorials)
1. DPO Data Format
DPO training requires preferred (chosen) / rejected response pairs.
Raw Data (Recommended)
Using data.format=raw automatically tokenizes text JSONL at training time:
{"prompt": "Question content", "chosen": "Preferred response", "rejected": "Rejected response"}
- `data/dpo_10k_raw.jsonl`: Standard `prompted_preference` format (converted in Data Preprocessing)
- Prompt tokens are automatically masked with `-100`.
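As an illustration (the example text and output file name are made up; only the required field names `prompt`, `chosen`, `rejected` come from the format above), a raw DPO JSONL file could be written and validated like this:

```python
import json

# Hypothetical example record in the raw prompted_preference format.
records = [
    {"prompt": "What is 2 + 2?", "chosen": "2 + 2 equals 4.", "rejected": "I don't know."},
]

REQUIRED_FIELDS = {"prompt", "chosen", "rejected"}

def validate(record: dict) -> None:
    # Every record must carry all three fields as non-empty strings.
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for field in REQUIRED_FIELDS:
        if not isinstance(record[field], str) or not record[field].strip():
            raise ValueError(f"field {field!r} must be a non-empty string")

with open("dpo_sample_raw.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        validate(rec)
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```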
Processed Data
Pre-tokenized JSONL is also supported:
| Field | Type | Description |
|---|---|---|
| `chosen_input_ids` | `List[int]` | Token IDs of the preferred response |
| `chosen_labels` | `List[int]` | Labels of the preferred response (prompt masked with `-100`) |
| `rejected_input_ids` | `List[int]` | Token IDs of the rejected response |
| `rejected_labels` | `List[int]` | Labels of the rejected response (prompt masked with `-100`) |
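The prompt masking described in the table can be sketched with plain Python lists (the token IDs here are made up; a real tokenizer would produce them):

```python
# Hypothetical token IDs: the prompt occupies the first 3 positions.
prompt_ids = [101, 2054, 2003]
chosen_response_ids = [2204, 3437, 102]
rejected_response_ids = [2919, 3437, 102]

def build_pair(prompt, chosen, rejected):
    # Labels mirror the input IDs, but prompt positions are masked with -100
    # so the loss is computed only over response tokens.
    mask = [-100] * len(prompt)
    return {
        "chosen_input_ids": prompt + chosen,
        "chosen_labels": mask + chosen,
        "rejected_input_ids": prompt + rejected,
        "rejected_labels": mask + rejected,
    }

example = build_pair(prompt_ids, chosen_response_ids, rejected_response_ids)
```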
2. How DPO Works
Core Idea
DPO compares the log probability ratio between a policy model and a reference model to increase the probability of preferred responses and decrease the probability of rejected responses.
EulerForge's Memory-Efficient Approach
Standard DPO requires loading two models (policy and reference) into memory. EulerForge uses AdapterLayerMixin to have a single model serve both roles.
[One model]
|
+-- Policy mode (default): base_layer + LoRA delta -> policy log probabilities
|
+-- Reference mode (adapter disabled): base_layer only -> reference log probabilities
Forward Process
[Batch: chosen_1, rejected_1, chosen_2, rejected_2, ...]
|
+-- 1) Policy Forward (adapters enabled)
| model(x) = base + LoRA/MoE delta
| -> policy_chosen_logps (even indices)
| -> policy_rejected_logps (odd indices)
|
+-- 2) Reference Forward (no_grad)
[Pipeline DPO] -> restore to initial LoRA state (SFT)
[Fresh DPO] -> disable adapters (base only)
-> ref_chosen_logps
-> ref_rejected_logps
DPO Loss Function
pi_logratios = policy_chosen_logps - policy_rejected_logps
ref_logratios = ref_chosen_logps - ref_rejected_logps
logits = pi_logratios - ref_logratios
loss = -log(sigmoid(beta * logits))
- `beta` (`dpo_beta`): Controls preference strength. Larger values more strongly enforce the preferred/rejected gap.
- `sigmoid`: The sigmoid function, which maps the log-probability-ratio difference into the 0-1 range.
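The loss above can be checked numerically with a minimal pure-Python sketch (the per-sequence log probabilities are made-up scalars; a real implementation sums token log-probs over the response):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Log-probability ratios of policy vs. reference, as in the pseudocode above.
    pi_logratios = policy_chosen - policy_rejected
    ref_logratios = ref_chosen - ref_rejected
    logits = pi_logratios - ref_logratios
    # -log(sigmoid(x)) written in a numerically stable form: log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * logits))

# When policy and reference agree exactly, logits = 0 and loss = ln(2) ~ 0.6931.
assert abs(dpo_loss(-10.0, -12.0, -10.0, -12.0) - math.log(2)) < 1e-9

# When the policy favors the chosen response more than the reference does,
# logits > 0 and the loss drops below ln(2).
loss = dpo_loss(-9.0, -13.0, -10.0, -12.0, beta=0.1)
```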
3. AdapterLayerMixin Mechanism
All adapter modules (LoRALinear, MixtureLoRALinear) inherit from AdapterLayerMixin.
How It Works
class LoRALinear(nn.Module, AdapterLayerMixin):
def forward(self, x):
if self.is_adapter_disabled(): # Reference mode
return self.base_layer(x) # Return base only
base_out = self.base_layer(x)
return base_out + self._lora_forward(x) # Policy mode
Disable Behavior by Strategy
| Adapter Module | Behavior When Disabled |
|---|---|
| `LoRALinear` | Returns `base_layer(x)` (skips LoRA delta) |
| `MixtureLoRALinear` | Returns `base_layer(x)` (skips router + experts) |
Reference Forward Code
# Pipeline DPO (SFT->DPO): uses initial LoRA state (SFT) as reference
ref_ctx = (_use_reference_lora(model, ref_lora_sd)
if ref_lora_sd is not None
else disable_adapter_layers(model))
with torch.no_grad():
with ref_ctx:
ref_outputs = model(input_ids=input_ids, attention_mask=attention_mask)
- Pipeline DPO: `_use_reference_lora(model, ref_sd)` temporarily restores the SFT weights
- Fresh DPO: `disable_adapter_layers(model)` uses the base model as the reference
- `torch.no_grad()`: No gradients are needed for the reference forward (saves memory)
- The current LoRA state is automatically restored when the context manager exits.
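A minimal sketch of how such an adapter-disabling context manager could work (EulerForge's actual `disable_adapter_layers` may differ; the toy layer and flag names here are illustrative):

```python
from contextlib import contextmanager

class AdapterLayerMixin:
    _adapter_disabled = False  # toggled by the context manager below

    def is_adapter_disabled(self) -> bool:
        return self._adapter_disabled

@contextmanager
def disable_adapter_layers(layers):
    # Flip every adapter layer into reference mode, then restore on exit,
    # even if the forward pass raises.
    for layer in layers:
        layer._adapter_disabled = True
    try:
        yield
    finally:
        for layer in layers:
            layer._adapter_disabled = False

# Toy adapter layer: base output is x, adapter delta is +1.
class ToyLayer(AdapterLayerMixin):
    def forward(self, x):
        if self.is_adapter_disabled():
            return x       # reference mode: base output only
        return x + 1       # policy mode: base + adapter delta

layer = ToyLayer()
with disable_adapter_layers([layer]):
    ref_out = layer.forward(10)    # adapter skipped -> 10
policy_out = layer.forward(10)     # adapter applied -> 11
```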
Note: The same reference mechanism is applied in PPO's KL penalty calculation.
4. Switching from SFT to DPO
When converting an SFT preset to DPO, the changes required are minimal. The injection and moe sections remain identical.
Summary of Changes
training:
- type: sft
+ type: dpo
+ dpo_beta: 0.1
- lr: 1.0e-5
+ lr: 5.0e-6
- batch_size: 4
+ batch_size: 2
- grad_accum_steps: 4
+ grad_accum_steps: 8
- warmup_steps: 200
+ warmup_steps: 100
Why These Changes?
| Change | Reason |
|---|---|
type: dpo |
Activates DPO loss function and reference model logic |
dpo_beta: 0.1 |
Preference strength parameter (DPO-specific) |
Lower lr |
DPO fine-tunes an already-trained model, so a smaller learning rate is needed |
Lower batch_size |
DPO processes 2x tokens per batch (chosen + rejected), so this saves VRAM |
Higher grad_accum_steps |
Maintains effective batch size (2 x 8 = 16 ~ 4 x 4) |
Lower warmup_steps |
DPO starts from an already-SFT'd model, so less warmup is needed |
5. DPO-Specific Settings
dpo_beta Parameter
training:
type: dpo
dpo_beta: 0.1 # Range: 0.05 - 0.5 (typically 0.1)
| Value | Effect |
|---|---|
| `0.05` | Weak preference enforcement. Stays close to the reference model. Low divergence risk. |
| `0.1` | Standard value. An appropriate balance for most cases. |
| `0.5` | Strong preference enforcement. Widens the preferred/rejected gap. Overfitting risk. |

- If `dpo_beta` is too small: the model barely changes (stays similar to the reference model)
- If `dpo_beta` is too large: the model overfits to the preference data and loses diversity
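The effect of beta can be seen numerically: for the same policy-vs-reference log-ratio gap, a larger beta turns that gap into a stronger training signal (a pure-Python sketch of the loss formula from earlier; the gap value is made up):

```python
import math

def dpo_loss(logits, beta):
    # -log(sigmoid(beta * logits)) in a numerically stable form.
    return math.log1p(math.exp(-beta * logits))

# Fixed gap: (policy chosen/rejected log-ratio) minus (reference log-ratio).
gap = 2.0
losses = {beta: dpo_loss(gap, beta) for beta in (0.05, 0.1, 0.5)}
# Larger beta -> the same gap moves the loss further from ln(2),
# so the optimizer enforces the preference more aggressively.
```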
6. Full Configuration File Walkthrough
Full contents of configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml:
# -- Model Info --
device: cuda:0 # GPU device
backbone: qwen3 # Backbone adapter: Qwen3Adapter
model_name: Qwen/Qwen3.5-0.8B-Base # HuggingFace model ID
# -- Injection Settings (same as SFT) --
injection:
strategy: moe_expert_lora # Injection strategy (same as SFT)
lora_r: 48 # LoRA rank
lora_alpha: 96 # Scaling factor (96/48 = 2.0)
lora_dropout: 0.05 # LoRA dropout
num_experts: 4 # MoE expert count
top_k: 2 # Active experts per token
target_keywords: [gate_proj, up_proj, down_proj] # FFN targets
start_layer: 0 # Starting layer
num_layers: 0 # 0 = all
attn_lora: # Attention LoRA
enabled: true
keywords: [q_proj, v_proj]
# -- MoE Stability Settings (same as SFT) --
moe:
router_z_loss_coef: 0.001 # z-loss: prevents logit overflow
load_balance:
type: aux_loss # Auxiliary loss-based load balancing
aux_loss_coef: 0.01 # Auxiliary loss weight
router_dtype: float32 # Router precision
# -- Training Settings (DPO-specific changes) --
training:
type: dpo # [DPO] Training type
dpo_beta: 0.1 # [DPO] Preference strength parameter
phases: # 3-phase (same structure as SFT)
- step: 0 # Phase 0: Router warmup
trainable: ["router"]
- step: 2000 # Phase 1: LoRA training
trainable: ["lora", "attn_lora"]
- step: 8000 # Phase 2: Full unfreeze
trainable: ["lora", "attn_lora", "router", "base_ffn"]
base_ffn_keywords: ["gate_proj", "up_proj", "down_proj"]
lr: 5.0e-6 # [DPO] Lower learning rate than SFT (1e-5)
weight_decay: 0.01 # Weight decay
warmup_steps: 100 # [DPO] Shorter warmup than SFT (200)
max_train_steps: 15000 # Maximum training steps
batch_size: 2 # [DPO] Smaller batch than SFT (4) due to chosen+rejected
grad_accum_steps: 8 # [DPO] Larger accumulation than SFT (4) to maintain effective batch
max_grad_norm: 1.0 # Gradient clipping
log_steps: 50 # Logging interval
save_steps: 1000 # Checkpoint save interval
val_steps: 500 # Validation interval
7. Running
Basic Execution
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
--set data.format=raw \
--set data.task=prompted_preference \
--set data.path=data/dpo_10k_raw.jsonl \
--set data.max_length=512
Starting DPO from an SFT Checkpoint (Recommended)
# Step 1: SFT training
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_sft.yml \
--set data.format=raw \
--set data.task=sft \
--set data.path=data/sft_10k_raw.jsonl \
--set data.max_length=512
# Step 2: DPO training from SFT checkpoint
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
--set data.format=raw \
--set data.task=prompted_preference \
--set data.path=data/dpo_10k_raw.jsonl \
--set data.max_length=512 \
--set model_name=/path/to/sft_checkpoint
Automatic Reference Model Detection: When starting DPO from an SFT checkpoint, the initial LoRA state (the SFT model) is automatically used as the reference. An initial loss of ln(2) ~ 0.693 is normal, since policy and reference start out identical. If the initial loss deviates substantially from 0.693, check the reference setup.
Adjusting dpo_beta
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
--set data.format=raw \
--set data.task=prompted_preference \
--set data.path=data/dpo_10k_raw.jsonl \
--set data.max_length=512 \
--set training.dpo_beta=0.05 # Conservative alignment
Preflight Check
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
--preflight
8. Interpreting DPO Metrics
The following metrics are printed in logs during DPO training.
| Metric | Meaning | Desired Trend |
|---|---|---|
| `dpo_loss` | DPO loss value | Decreasing |
| `reward_chosen` | Reward for the preferred response | Increasing |
| `reward_rejected` | Reward for the rejected response | Decreasing or stable |
| `reward_margin` | `reward_chosen - reward_rejected` | Increasing toward positive |
| `accuracy` | Fraction of pairs where preferred reward > rejected reward | Increasing (good at 0.7-0.8) |
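These metrics follow directly from the per-pair log probabilities. The sketch below uses the reward definition given later in this document (policy logprob minus reference logprob; the standard DPO formulation additionally scales this by beta, which changes magnitudes but not signs or accuracy). All logprob values are made up:

```python
def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
    # Reward per sequence: policy logprob minus reference logprob.
    reward_chosen = [p - r for p, r in zip(policy_chosen, ref_chosen)]
    reward_rejected = [p - r for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(reward_chosen, reward_rejected)]
    # accuracy: fraction of pairs where the chosen reward beats the rejected one.
    accuracy = sum(c > r for c, r in zip(reward_chosen, reward_rejected)) / len(margins)
    return {"reward_margin": sum(margins) / len(margins), "accuracy": accuracy}

# Two hypothetical pairs: the policy favors chosen in the first, rejected in the second.
m = dpo_metrics(
    policy_chosen=[-9.0, -11.0], policy_rejected=[-13.0, -10.0],
    ref_chosen=[-10.0, -10.0],   ref_rejected=[-12.0, -11.0],
)
```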
Metric Interpretation Guide
Good training:
dpo_loss: 0.69 -> 0.45 (decreasing)
reward_margin: 0.0 -> 1.5 (increasing toward positive)
accuracy: 0.5 -> 0.75 (increasing)
Warning signs:
accuracy > 0.95 -> possible overfitting (reduce dpo_beta)
reward_margin < 0 -> model prefers rejected responses (check data or beta)
reward_margin > 5 -> excessive deviation from reference (overfitting, reduce lr/steps)
dpo_loss diverging -> learning rate too high
9. Combining with Other Injection Strategies
DPO can be combined with all injection strategies. Only the training section needs to change.
Plain LoRA + DPO
injection:
strategy: dense_lora # Injection strategy
# ... (Plain LoRA settings)
training:
type: dpo # Change to DPO
dpo_beta: 0.1
phases:
- step: 0
trainable: ["lora", "attn_lora"] # Single phase
lr: 5.0e-6
batch_size: 2
grad_accum_steps: 8
LoRA MoE + DPO
injection:
strategy: mixture_lora # Injection strategy
# ... (LoRA MoE settings)
training:
type: dpo # Change to DPO
dpo_beta: 0.1
phases:
- step: 0
trainable: ["router"] # 2-phase
- step: 2000
trainable: ["lora", "attn_lora"]
lr: 5.0e-6
batch_size: 2
grad_accum_steps: 8
Key point: The phase schedule depends on the injection strategy, and the same phase structure is used whether or not DPO is applied. What changes are the training parameters: `type`, `dpo_beta`, `lr`, `batch_size`, `grad_accum_steps`, etc.
10. Phase 0 Router-Only and DPO Metrics
When Phase 0 trains only ["router"] in MoE strategies, DPO metrics appear as follows:
[Phase0] reward_chosen: 0.0000 | reward_rejected: 0.0000 | reward_margin: 0.0000 | accuracy: 0.0000 | dpo_loss: 0.6931
This is normal. DPO computes reward as policy_logprob - reference_logprob. In Phase 0, since LoRA is frozen, disable_adapter() and enable_adapter() produce identical outputs. Therefore reward = 0, accuracy = 0, loss = ln(2) ~ 0.6931.
Normal DPO training begins in Phase 1 when LoRA is activated.
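The Phase 0 numbers can be reproduced from the definitions above: identical policy and reference log probabilities give reward 0 on both sides, accuracy 0 (no pair satisfies a strict chosen > rejected comparison), and a loss of exactly ln(2):

```python
import math

policy_logp = ref_logp = -10.0            # frozen LoRA: identical outputs
reward = policy_logp - ref_logp            # reward as defined above -> 0.0
beta = 0.1
# With zero reward on both chosen and rejected sides, logits = 0, so:
loss = math.log1p(math.exp(-beta * 0.0))   # -log(sigmoid(0)) = ln(2) ~ 0.6931
```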
11. max_train_steps and grad_accum_steps
max_train_steps is based on micro-steps (forward/backward count). Optimizer steps (weight updates) equal max_train_steps / grad_accum_steps.
training:
max_train_steps: 1500 # micro-steps = 1500 forward/backward passes
grad_accum_steps: 8 # 1 update after 8 accumulations
batch_size: 2 # effective batch = 2 x 8 = 16
# -> optimizer_steps = 1500 // 8 = 187 (rounded down)
# -> total training data = 1500 x 2 = 3000 samples
In logs, Step 6/187 (micro 50/1500):
- Step 6/187 = optimizer step 6 / total 187
- micro 50/1500 = micro step 50 / total 1500
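The log line above can be reproduced with a one-line conversion (assuming, as the example suggests, that the displayed optimizer step counts completed accumulation windows and both totals round down):

```python
def describe_step(micro_step, max_train_steps=1500, grad_accum_steps=8):
    # Completed optimizer steps after `micro_step` forward/backward passes.
    optimizer_step = micro_step // grad_accum_steps
    total_optimizer_steps = max_train_steps // grad_accum_steps
    return f"Step {optimizer_step}/{total_optimizer_steps} (micro {micro_step}/{max_train_steps})"

line = describe_step(50)  # -> "Step 6/187 (micro 50/1500)"
```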
12. Debugging and Troubleshooting
| Symptom | Cause | Solution |
|---|---|---|
| `reward=0`, `loss=0.6931` in Phase 0 | LoRA frozen -> policy = reference | Normal; resolves after LoRA activation in Phase 1 |
| `accuracy` stuck at 0.5 | `dpo_beta` too small or data quality issue | Increase `dpo_beta` or check the data |
| `accuracy` rapidly converges to 1.0 | Overfitting | Reduce `dpo_beta`, learning rate, or epochs |
| `reward_margin` is negative | chosen/rejected swapped in the data, or a label error | Check the data; verify `-100` masking in labels |
| OOM (out of memory) | DPO performs 2 forwards (policy + reference) | Reduce `batch_size`, add `model.load_precision.mode: int4` |
| `dpo_loss` is NaN | Learning rate too high or log-probability numerical instability | Reduce `lr`, verify `max_grad_norm: 1.0` |
| Data loading error | Required fields missing in the JSONL | For raw: check `prompt`, `chosen`, `rejected`. For processed: check `chosen_input_ids`, `rejected_input_ids`, `chosen_labels`, `rejected_labels` |
Next Steps
- Training Pipeline Guide: For the SFT -> DPO -> ORPO -> RM -> PPO sequence and combination strategies, see 18_training_pipeline.md
- For detailed explanations of each injection strategy, refer to the strategy-specific tutorials:
- Plain LoRA Tutorial
- LoRA MoE Tutorial
- FFN MoE Expert LoRA Tutorial
- Native MoE Expert LoRA Tutorial