
5. DPO Training

Overview

DPO (Direct Preference Optimization) is a training method that aligns models using preferred/rejected response pairs. In EulerForge, DPO operates independently of the injection strategy and can be combined with all strategies (dense_lora, mixture_lora, moe_expert_lora, native_moe_expert_lora).

Why SFT Should Come First

DPO/ORPO and other preference training methods are effective only when applied to models that already have instruction-following ability. Applying DPO directly to a base model means learning "which answer is better" while not knowing "how to answer," which can actually decrease benchmark scores.

Correct order:   SFT (instruction learning) -> DPO (preference alignment)
Incorrect order: Base model -> DPO (learning preferences without basic ability)

Run SFT first, then specify its checkpoint (final/) as the model_name for DPO.

SFT vs DPO Comparison

| Item | SFT | DPO |
|---|---|---|
| Data format | Single response (input_ids, labels) | Preferred/rejected pairs (chosen_*, rejected_*) |
| Loss function | Cross-entropy loss | DPO loss (log probability ratio) |
| Reference model | Not required | Required (substituted by disabling adapters) |
| Configuration key | training.type: sft | training.type: dpo, training.dpo_beta |
| Effective batch size | Batch size as-is | Batch size x 2 (chosen + rejected) |
| Typical learning rate | 1.0e-5 | 5.0e-6 (smaller) |

Prerequisites


1. DPO Data Format

DPO training requires preferred (chosen) / rejected response pairs.

Using data.format=raw automatically tokenizes text JSONL at training time:

{"prompt": "Question content", "chosen": "Preferred response", "rejected": "Rejected response"}
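As a minimal sketch, raw records in this shape can be checked before training with a few lines of plain Python (the `validate_dpo_record` helper is illustrative, not part of EulerForge):

```python
import json

# Required fields for a raw-format DPO record, as shown above.
REQUIRED = ("prompt", "chosen", "rejected")

def validate_dpo_record(line: str) -> dict:
    """Parse one JSONL line and check that the raw DPO fields exist and are non-empty."""
    record = json.loads(line)
    for field in REQUIRED:
        if not isinstance(record.get(field), str) or not record[field].strip():
            raise ValueError(f"missing or empty field: {field}")
    return record

line = '{"prompt": "Question content", "chosen": "Preferred response", "rejected": "Rejected response"}'
record = validate_dpo_record(line)
print(record["chosen"])  # -> Preferred response
```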

Processed Data

Pre-tokenized JSONL is also supported:

| Field | Type | Description |
|---|---|---|
| chosen_input_ids | List[int] | Token IDs of the preferred response |
| chosen_labels | List[int] | Labels of the preferred response (prompt masked with -100) |
| rejected_input_ids | List[int] | Token IDs of the rejected response |
| rejected_labels | List[int] | Labels of the rejected response (prompt masked with -100) |
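A hedged sketch of how these processed fields relate to each other, using a toy stand-in tokenizer (`toy_tokenize`, `build_pair`, and `build_record` are hypothetical helpers, not EulerForge APIs):

```python
IGNORE_INDEX = -100  # label value the loss ignores, i.e. the prompt mask

def toy_tokenize(text: str) -> list[int]:
    # Stand-in for a real tokenizer: one fake token ID per character.
    return [ord(c) % 1000 for c in text]

def build_pair(prompt: str, response: str) -> tuple[list[int], list[int]]:
    prompt_ids = toy_tokenize(prompt)
    response_ids = toy_tokenize(response)
    input_ids = prompt_ids + response_ids
    # Mask the prompt so only response tokens contribute to log probabilities.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

def build_record(prompt: str, chosen: str, rejected: str) -> dict:
    c_ids, c_labels = build_pair(prompt, chosen)
    r_ids, r_labels = build_pair(prompt, rejected)
    return {
        "chosen_input_ids": c_ids, "chosen_labels": c_labels,
        "rejected_input_ids": r_ids, "rejected_labels": r_labels,
    }

rec = build_record("Q?", "good", "bad")
print(rec["chosen_labels"][:2])  # prompt positions are masked with -100
```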

2. How DPO Works

Core Idea

DPO compares the log probability ratio between a policy model and a reference model to increase the probability of preferred responses and decrease the probability of rejected responses.

EulerForge's Memory-Efficient Approach

Standard DPO requires loading two models (policy and reference) into memory. EulerForge uses AdapterLayerMixin to have a single model serve both roles.

[One model]
  |
  +-- Policy mode (default): base_layer + LoRA delta -> policy log probabilities
  |
  +-- Reference mode (adapter disabled): base_layer only -> reference log probabilities

Forward Process

[Batch: chosen_1, rejected_1, chosen_2, rejected_2, ...]
       |
       +-- 1) Policy Forward (adapters enabled)
       |     model(x) = base + LoRA/MoE delta
       |     -> policy_chosen_logps (even indices)
       |     -> policy_rejected_logps (odd indices)
       |
       +-- 2) Reference Forward (no_grad)
             [Pipeline DPO] -> restore to initial LoRA state (SFT)
             [Fresh DPO]    -> disable adapters (base only)
             -> ref_chosen_logps
             -> ref_rejected_logps

DPO Loss Function

pi_logratios = policy_chosen_logps - policy_rejected_logps
ref_logratios = ref_chosen_logps - ref_rejected_logps
logits = pi_logratios - ref_logratios

loss = -log(sigmoid(beta * logits))
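The loss above can be written directly in a few lines of plain Python. This is a per-pair sketch operating on scalar summed log probabilities, not EulerForge's batched implementation:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed response log probabilities."""
    pi_logratios = policy_chosen_logp - policy_rejected_logp
    ref_logratios = ref_chosen_logp - ref_rejected_logp
    logits = pi_logratios - ref_logratios
    # -log(sigmoid(x)) rewritten as log(1 + exp(-x)) for numerical stability
    return math.log1p(math.exp(-beta * logits))

# Policy identical to reference -> logits = 0 -> loss = ln(2)
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # -> 0.6931
```

This also shows why a fresh DPO run starts at ln(2) ≈ 0.693: before any updates, policy and reference agree, so the logits are zero.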

3. AdapterLayerMixin Mechanism

All adapter modules (LoRALinear, MixtureLoRALinear) inherit from AdapterLayerMixin.

How It Works

class LoRALinear(nn.Module, AdapterLayerMixin):
    def forward(self, x):
        if self.is_adapter_disabled():    # Reference mode
            return self.base_layer(x)     # Return base only

        base_out = self.base_layer(x)
        return base_out + self._lora_forward(x)  # Policy mode

Disable Behavior by Strategy

| Adapter Module | Behavior When Disabled |
|---|---|
| LoRALinear | Returns base_layer(x) (skips the LoRA delta) |
| MixtureLoRALinear | Returns base_layer(x) (skips router + experts) |
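A purely illustrative sketch of the toggle described above. `AdapterLayerMixin` and `disable_adapter_layers` here are toy reimplementations of the names used in this tutorial, and `ToyLoRALinear` stands in for the real adapter modules:

```python
from contextlib import contextmanager

class AdapterLayerMixin:
    """Toy version of the disable flag checked in each adapter's forward."""
    _adapter_disabled = False

    def is_adapter_disabled(self) -> bool:
        return self._adapter_disabled

@contextmanager
def disable_adapter_layers(layers):
    """Temporarily put every adapter layer into reference mode."""
    for layer in layers:
        layer._adapter_disabled = True
    try:
        yield
    finally:
        for layer in layers:
            layer._adapter_disabled = False

class ToyLoRALinear(AdapterLayerMixin):
    def forward(self, x):
        base = x * 2.0               # stand-in for base_layer(x)
        if self.is_adapter_disabled():
            return base              # reference mode: base only
        return base + 0.5            # policy mode: base + adapter delta

layer = ToyLoRALinear()
with disable_adapter_layers([layer]):
    ref = layer.forward(1.0)    # 2.0 (base only)
policy = layer.forward(1.0)     # 2.5 (base + delta)
print(ref, policy)
```

The design point is that the same weights serve both roles: the context manager flips a flag instead of loading a second model.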

Reference Forward Code

# Pipeline DPO (SFT->DPO): uses initial LoRA state (SFT) as reference
ref_ctx = (_use_reference_lora(model, ref_lora_sd)
           if ref_lora_sd is not None
           else disable_adapter_layers(model))

with torch.no_grad():
    with ref_ctx:
        ref_outputs = model(input_ids=input_ids, attention_mask=attention_mask)

Note: The same reference mechanism is applied in PPO's KL penalty calculation.


4. Switching from SFT to DPO

When converting an SFT preset to DPO, the changes required are minimal. The injection and moe sections remain identical.

Summary of Changes

 training:
-  type: sft
+  type: dpo
+  dpo_beta: 0.1
-  lr: 1.0e-5
+  lr: 5.0e-6
-  batch_size: 4
+  batch_size: 2
-  grad_accum_steps: 4
+  grad_accum_steps: 8
-  warmup_steps: 200
+  warmup_steps: 100

Why These Changes?

| Change | Reason |
|---|---|
| type: dpo | Activates the DPO loss function and reference-model logic |
| dpo_beta: 0.1 | Preference strength parameter (DPO-specific) |
| Lower lr | DPO fine-tunes an already-trained model, so a smaller learning rate is needed |
| Lower batch_size | DPO processes 2x tokens per batch (chosen + rejected), so this saves VRAM |
| Higher grad_accum_steps | Maintains the effective batch size (2 x 8 = 16, matching SFT's 4 x 4) |
| Lower warmup_steps | DPO starts from an already-SFT'd model, so less warmup is needed |

5. DPO-Specific Settings

dpo_beta Parameter

training:
  type: dpo
  dpo_beta: 0.1    # Range: 0.05 - 0.5 (typically 0.1)

| Value | Effect |
|---|---|
| 0.05 | Weak preference enforcement. Stays close to the reference model. Low divergence risk. |
| 0.1 | Standard value. An appropriate balance for most cases. |
| 0.5 | Strong preference enforcement. Widens the preferred/rejected gap. Overfitting risk. |
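To see what this table means numerically, the loss formula from section 2 can be evaluated at each beta for a fixed, chosen-favoring margin (a standalone sketch, not EulerForge code):

```python
import math

def dpo_loss_from_logits(logits: float, beta: float) -> float:
    # loss = -log(sigmoid(beta * logits)), per the formula in section 2
    return math.log1p(math.exp(-beta * logits))

margin = 2.0  # fixed (policy - reference) log-ratio margin favoring chosen
for beta in (0.05, 0.1, 0.5):
    print(f"beta={beta}: loss={dpo_loss_from_logits(margin, beta):.4f}")
# A larger beta drives the loss lower for the same margin: the gradient
# pushes harder on each preference, which is why high beta risks overfitting.
```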

6. Full Configuration File Walkthrough

Full contents of configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml:

# -- Model Info --
device: cuda:0                              # GPU device
backbone: qwen3                             # Backbone adapter: Qwen3Adapter
model_name: Qwen/Qwen3.5-0.8B-Base           # HuggingFace model ID

# -- Injection Settings (same as SFT) --
injection:
  strategy: moe_expert_lora             # Injection strategy (same as SFT)
  lora_r: 48                                # LoRA rank
  lora_alpha: 96                            # Scaling factor (96/48 = 2.0)
  lora_dropout: 0.05                        # LoRA dropout
  num_experts: 4                            # MoE expert count
  top_k: 2                                  # Active experts per token
  target_keywords: [gate_proj, up_proj, down_proj]  # FFN targets
  start_layer: 0                            # Starting layer
  num_layers: 0                             # 0 = all
  attn_lora:                                # Attention LoRA
    enabled: true
    keywords: [q_proj, v_proj]

# -- MoE Stability Settings (same as SFT) --
moe:
  router_z_loss_coef: 0.001                 # z-loss: prevents logit overflow
  load_balance:
    type: aux_loss                          # Auxiliary loss-based load balancing
    aux_loss_coef: 0.01                     # Auxiliary loss weight
  router_dtype: float32                     # Router precision

# -- Training Settings (DPO-specific changes) --
training:
  type: dpo                                 # [DPO] Training type
  dpo_beta: 0.1                             # [DPO] Preference strength parameter
  phases:                                   # 3-phase (same structure as SFT)
    - step: 0                               # Phase 0: Router warmup
      trainable: ["router"]
    - step: 2000                            # Phase 1: LoRA training
      trainable: ["lora", "attn_lora"]
    - step: 8000                            # Phase 2: Full unfreeze
      trainable: ["lora", "attn_lora", "router", "base_ffn"]
      base_ffn_keywords: ["gate_proj", "up_proj", "down_proj"]
  lr: 5.0e-6                               # [DPO] Lower learning rate than SFT (1e-5)
  weight_decay: 0.01                        # Weight decay
  warmup_steps: 100                         # [DPO] Shorter warmup than SFT (200)
  max_train_steps: 15000                    # Maximum training steps
  batch_size: 2                             # [DPO] Smaller batch than SFT (4) due to chosen+rejected
  grad_accum_steps: 8                       # [DPO] Larger accumulation than SFT (4) to maintain effective batch
  max_grad_norm: 1.0                        # Gradient clipping
  log_steps: 50                             # Logging interval
  save_steps: 1000                          # Checkpoint save interval
  val_steps: 500                            # Validation interval

7. Running

Basic Execution

eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
    --set data.format=raw \
    --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl \
    --set data.max_length=512

Pipeline Execution (SFT -> DPO)

# Step 1: SFT training
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_sft.yml \
    --set data.format=raw \
    --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl \
    --set data.max_length=512

# Step 2: DPO training from SFT checkpoint
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
    --set data.format=raw \
    --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl \
    --set data.max_length=512 \
    --set model_name=/path/to/sft_checkpoint

Automatic Reference Model Detection: When starting DPO from an SFT checkpoint, the initial LoRA state (the SFT model) is automatically used as the reference. An initial loss of ln(2) ≈ 0.693 is normal, since the policy and reference start out identical. If the initial loss deviates significantly from 0.693, check the reference setup.

Adjusting dpo_beta

eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
    --set data.format=raw \
    --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl \
    --set data.max_length=512 \
    --set training.dpo_beta=0.05    # Conservative alignment

Preflight Check

eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
    --preflight

8. Interpreting DPO Metrics

The following metrics are printed in logs during DPO training.

| Metric | Meaning | Desired Trend |
|---|---|---|
| dpo_loss | DPO loss value | Decreasing |
| reward_chosen | Reward for the preferred response | Increasing |
| reward_rejected | Reward for the rejected response | Decreasing or stable |
| reward_margin | reward_chosen - reward_rejected | Increasing toward positive |
| accuracy | Fraction of pairs where the preferred reward exceeds the rejected reward | Increasing (0.7-0.8 is good) |
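A sketch of how these metrics can be computed from per-sample log probabilities, taking reward = policy_logprob - reference_logprob as stated in section 10 (the `dpo_metrics` helper is illustrative, not EulerForge code):

```python
import math

def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Batch metrics as logged above; inputs are lists of summed log probs."""
    n = len(policy_chosen)
    reward_c = [p - r for p, r in zip(policy_chosen, ref_chosen)]
    reward_r = [p - r for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(reward_c, reward_r)]  # = the DPO logits
    return {
        "reward_chosen": sum(reward_c) / n,
        "reward_rejected": sum(reward_r) / n,
        "reward_margin": sum(margins) / n,
        "accuracy": sum(m > 0 for m in margins) / n,
        # per-pair loss -log(sigmoid(beta * margin)), averaged over the batch
        "dpo_loss": sum(math.log1p(math.exp(-beta * m)) for m in margins) / n,
    }

m = dpo_metrics([-9.0, -8.5], [-13.0, -12.0], [-10.0, -9.0], [-12.0, -11.0])
print(m["accuracy"], round(m["reward_margin"], 2))  # -> 1.0 1.75
```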

Metric Interpretation Guide

Good training:
  dpo_loss: 0.69 -> 0.45 (decreasing)
  reward_margin: 0.0 -> 1.5 (increasing toward positive)
  accuracy: 0.5 -> 0.75 (increasing)

Warning signs:
  accuracy > 0.95         -> possible overfitting (reduce dpo_beta)
  reward_margin < 0       -> model prefers rejected responses (check data or beta)
  reward_margin > 5       -> excessive deviation from reference (overfitting, reduce lr/steps)
  dpo_loss diverging      -> learning rate too high

9. Combining with Other Injection Strategies

DPO can be combined with all injection strategies. Only the training section needs to change.

Plain LoRA + DPO

injection:
  strategy: dense_lora         # Injection strategy
  # ... (Plain LoRA settings)

training:
  type: dpo                    # Change to DPO
  dpo_beta: 0.1
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]    # Single phase
  lr: 5.0e-6
  batch_size: 2
  grad_accum_steps: 8

LoRA MoE + DPO

injection:
  strategy: mixture_lora           # Injection strategy
  # ... (LoRA MoE settings)

training:
  type: dpo                    # Change to DPO
  dpo_beta: 0.1
  phases:
    - step: 0
      trainable: ["router"]               # 2-phase
    - step: 2000
      trainable: ["lora", "attn_lora"]
  lr: 5.0e-6
  batch_size: 2
  grad_accum_steps: 8

Key point: The phase schedule depends on the injection strategy. The same phase structure is used regardless of whether DPO is used. What changes are the training parameters: type, dpo_beta, lr, batch_size, grad_accum_steps, etc.


10. Phase 0 Router-Only and DPO Metrics

When Phase 0 trains only ["router"] in MoE strategies, DPO metrics appear as follows:

[Phase0] reward_chosen: 0.0000 | reward_rejected: 0.0000 | reward_margin: 0.0000 | accuracy: 0.0000 | dpo_loss: 0.6931

This is normal. DPO computes reward as policy_logprob - reference_logprob. In Phase 0, LoRA is frozen, so disabling and enabling the adapters produce identical outputs: the policy equals the reference. Therefore reward = 0, accuracy = 0, and loss = ln(2) ≈ 0.6931.

Normal DPO training begins in Phase 1 when LoRA is activated.


11. max_train_steps and grad_accum_steps

max_train_steps is based on micro-steps (forward/backward count). Optimizer steps (weight updates) equal max_train_steps / grad_accum_steps.

training:
  max_train_steps: 1500    # micro-steps = 1500 forward/backward passes
  grad_accum_steps: 8      # 1 update after 8 accumulations
  batch_size: 2            # effective batch = 2 x 8 = 16
  # -> optimizer_steps = 1500 // 8 = 187 (integer division)
  # -> total training data = 1500 x 2 = 3000 samples

In logs, a line like Step 6/187 (micro 50/1500) reads as:
- Step 6/187: optimizer step 6 of 187 total
- micro 50/1500: micro step 50 of 1500 total
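The arithmetic above can be captured in a small helper (illustrative only; `training_schedule` is not an EulerForge function):

```python
def training_schedule(max_train_steps: int, grad_accum_steps: int, batch_size: int) -> dict:
    """Derive optimizer steps, effective batch, and sample count from the config."""
    return {
        "optimizer_steps": max_train_steps // grad_accum_steps,
        "effective_batch": batch_size * grad_accum_steps,
        "total_samples": max_train_steps * batch_size,
    }

print(training_schedule(1500, 8, 2))
# -> {'optimizer_steps': 187, 'effective_batch': 16, 'total_samples': 3000}
```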


12. Debugging and Troubleshooting

| Symptom | Cause | Solution |
|---|---|---|
| reward=0, loss=0.6931 in Phase 0 | LoRA frozen -> policy = reference | Normal; resolves once LoRA activates in Phase 1 |
| accuracy stuck at 0.5 | dpo_beta too small or data quality issue | Increase dpo_beta or check the data |
| accuracy rapidly converges to 1.0 | Overfitting | Reduce dpo_beta, reduce the learning rate, reduce epochs |
| reward_margin is negative | chosen/rejected swapped in the data, or a label error | Check the data; verify the -100 masking in labels |
| OOM (out of memory) | DPO performs 2 forwards (policy + reference) | Reduce batch_size; add model.load_precision.mode: int4 |
| dpo_loss is NaN | Learning rate too high, or numerical instability in log probabilities | Reduce lr; verify max_grad_norm: 1.0 |
| Data loading error | Required fields missing in the JSONL | For raw: check prompt, chosen, rejected. For processed: check chosen_input_ids, rejected_input_ids, chosen_labels, rejected_labels |

Next Steps