
5. DPO Training

Overview

DPO (Direct Preference Optimization) is a training method that aligns models using preferred/rejected response pairs. In EulerForge, DPO operates independently of the injection strategy and can be combined with all strategies (dense_lora, mixture_lora, moe_expert_lora, native_moe_expert_lora).

Why SFT Should Come First

DPO/ORPO and other preference training methods are effective only when applied to models that already have instruction-following ability. Applying DPO directly to a base model means learning "which answer is better" while not knowing "how to answer," which can actually decrease benchmark scores.

Correct order:   SFT (instruction learning) -> DPO (preference alignment)
Incorrect order: Base model -> DPO (learning preferences without basic ability)

Run SFT first, then specify its checkpoint (final/) as the model_name for DPO.

SFT vs DPO Comparison

| Item | SFT | DPO |
|---|---|---|
| Data format | Single response (input_ids, labels) | Preferred/rejected pairs (chosen_*, rejected_*) |
| Loss function | Cross-entropy loss | DPO loss (log probability ratio) |
| Reference model | Not required | Required (substituted by disabling adapters) |
| Configuration key | training.type: sft | training.type: dpo, training.dpo_beta |
| Effective batch size | Batch size as-is | Batch size x 2 (chosen + rejected) |
| Typical learning rate | 1.0e-5 | 5.0e-6 (smaller) |

Prerequisites


1. DPO Data Format

DPO training requires preferred (chosen) / rejected response pairs.

Using data.format=raw automatically tokenizes text JSONL at training time:

{"prompt": "Question content", "chosen": "Preferred response", "rejected": "Rejected response"}
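As a minimal sketch, raw records in this shape can be checked before training with a few lines of plain Python (the `validate_dpo_record` helper is illustrative, not part of EulerForge):

```python
import json

# Required fields for a raw-format DPO record, as shown above.
REQUIRED = ("prompt", "chosen", "rejected")

def validate_dpo_record(line: str) -> dict:
    """Parse one JSONL line and check that the raw DPO fields exist and are non-empty."""
    record = json.loads(line)
    for field in REQUIRED:
        if not isinstance(record.get(field), str) or not record[field].strip():
            raise ValueError(f"missing or empty field: {field}")
    return record

line = '{"prompt": "Question content", "chosen": "Preferred response", "rejected": "Rejected response"}'
record = validate_dpo_record(line)
print(record["chosen"])  # -> Preferred response
```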

Processed Data

Pre-tokenized JSONL is also supported:

| Field | Type | Description |
|---|---|---|
| chosen_input_ids | List[int] | Token IDs of the preferred response |
| chosen_labels | List[int] | Labels of the preferred response (prompt masked with -100) |
| rejected_input_ids | List[int] | Token IDs of the rejected response |
| rejected_labels | List[int] | Labels of the rejected response (prompt masked with -100) |
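A hedged sketch of how these processed fields relate to each other, using a toy stand-in tokenizer (`toy_tokenize`, `build_pair`, and `build_record` are hypothetical helpers, not EulerForge APIs):

```python
IGNORE_INDEX = -100  # label value the loss ignores, i.e. the prompt mask

def toy_tokenize(text: str) -> list[int]:
    # Stand-in for a real tokenizer: one fake token ID per character.
    return [ord(c) % 1000 for c in text]

def build_pair(prompt: str, response: str) -> tuple[list[int], list[int]]:
    prompt_ids = toy_tokenize(prompt)
    response_ids = toy_tokenize(response)
    input_ids = prompt_ids + response_ids
    # Mask the prompt so only response tokens contribute to log probabilities.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

def build_record(prompt: str, chosen: str, rejected: str) -> dict:
    c_ids, c_labels = build_pair(prompt, chosen)
    r_ids, r_labels = build_pair(prompt, rejected)
    return {
        "chosen_input_ids": c_ids, "chosen_labels": c_labels,
        "rejected_input_ids": r_ids, "rejected_labels": r_labels,
    }

rec = build_record("Q?", "good", "bad")
print(rec["chosen_labels"][:2])  # prompt positions are masked with -100
```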

2. How DPO Works

Core Idea

DPO compares the log probability ratio between a policy model and a reference model to increase the probability of preferred responses and decrease the probability of rejected responses.

EulerForge's Memory-Efficient Approach

Standard DPO requires loading two models (policy and reference) into memory. EulerForge uses AdapterLayerMixin to have a single model serve both roles.

[One model]
  |
  +-- Policy mode (default): base_layer + LoRA delta -> policy log probabilities
  |
  +-- Reference mode (adapter disabled): base_layer only -> reference log probabilities

Forward Process

[Batch: chosen_1, rejected_1, chosen_2, rejected_2, ...]
       |
       +-- 1) Policy Forward (adapters enabled)
       |     model(x) = base + LoRA/MoE delta
       |     -> policy_chosen_logps (even indices)
       |     -> policy_rejected_logps (odd indices)
       |
       +-- 2) Reference Forward (no_grad)
             [Pipeline DPO] -> restore to initial LoRA state (SFT)
             [Fresh DPO]    -> disable adapters (base only)
             -> ref_chosen_logps
             -> ref_rejected_logps

DPO Loss Function

pi_logratios = policy_chosen_logps - policy_rejected_logps
ref_logratios = ref_chosen_logps - ref_rejected_logps
logits = pi_logratios - ref_logratios

loss = -log(sigmoid(beta * logits))
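The loss above can be written directly in a few lines of plain Python. This is a per-pair sketch operating on scalar summed log probabilities, not EulerForge's batched implementation:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed response log probabilities."""
    pi_logratios = policy_chosen_logp - policy_rejected_logp
    ref_logratios = ref_chosen_logp - ref_rejected_logp
    logits = pi_logratios - ref_logratios
    # -log(sigmoid(x)) rewritten as log(1 + exp(-x)) for numerical stability
    return math.log1p(math.exp(-beta * logits))

# Policy identical to reference -> logits = 0 -> loss = ln(2)
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # -> 0.6931
```

This also shows why a fresh DPO run starts at ln(2) ≈ 0.693: before any updates, policy and reference agree, so the logits are zero.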

3. AdapterLayerMixin Mechanism

All adapter modules (LoRALinear, MixtureLoRALinear) inherit from AdapterLayerMixin.

How It Works

class LoRALinear(nn.Module, AdapterLayerMixin):
    def forward(self, x):
        if self.is_adapter_disabled():    # Reference mode
            return self.base_layer(x)     # Return base only

        base_out = self.base_layer(x)
        return base_out + self._lora_forward(x)  # Policy mode

Disable Behavior by Strategy

| Adapter Module | Behavior When Disabled |
|---|---|
| LoRALinear | Returns base_layer(x) (skips the LoRA delta) |
| MixtureLoRALinear | Returns base_layer(x) (skips router + experts) |
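A purely illustrative sketch of the toggle described above. `AdapterLayerMixin` and `disable_adapter_layers` here are toy reimplementations of the names used in this tutorial, and `ToyLoRALinear` stands in for the real adapter modules:

```python
from contextlib import contextmanager

class AdapterLayerMixin:
    """Toy version of the disable flag checked in each adapter's forward."""
    _adapter_disabled = False

    def is_adapter_disabled(self) -> bool:
        return self._adapter_disabled

@contextmanager
def disable_adapter_layers(layers):
    """Temporarily put every adapter layer into reference mode."""
    for layer in layers:
        layer._adapter_disabled = True
    try:
        yield
    finally:
        for layer in layers:
            layer._adapter_disabled = False

class ToyLoRALinear(AdapterLayerMixin):
    def forward(self, x):
        base = x * 2.0               # stand-in for base_layer(x)
        if self.is_adapter_disabled():
            return base              # reference mode: base only
        return base + 0.5            # policy mode: base + adapter delta

layer = ToyLoRALinear()
with disable_adapter_layers([layer]):
    ref = layer.forward(1.0)    # 2.0 (base only)
policy = layer.forward(1.0)     # 2.5 (base + delta)
print(ref, policy)
```

The design point is that the same weights serve both roles: the context manager flips a flag instead of loading a second model.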

Reference Forward Code

# Pipeline DPO (SFT->DPO): uses initial LoRA state (SFT) as reference
ref_ctx = (_use_reference_lora(model, ref_lora_sd)
           if ref_lora_sd is not None
           else disable_adapter_layers(model))

with torch.no_grad():
    with ref_ctx:
        ref_outputs = model(input_ids=input_ids, attention_mask=attention_mask)

Note: The same reference mechanism is applied in PPO's KL penalty calculation.


4. Switching from SFT to DPO

When converting an SFT preset to DPO, the changes required are minimal. The injection and moe sections remain identical.

Summary of Changes

 training:
-  type: sft
+  type: dpo
+  dpo_beta: 0.1
-  lr: 1.0e-5
+  lr: 5.0e-6
-  batch_size: 4
+  batch_size: 2
-  grad_accum_steps: 4
+  grad_accum_steps: 8
-  warmup_steps: 200
+  warmup_steps: 100

Why These Changes?

| Change | Reason |
|---|---|
| type: dpo | Activates the DPO loss function and reference-model logic |
| dpo_beta: 0.1 | Preference strength parameter (DPO-specific) |
| Lower lr | DPO fine-tunes an already-trained model, so a smaller learning rate is needed |
| Lower batch_size | DPO processes 2x tokens per batch (chosen + rejected), so this saves VRAM |
| Higher grad_accum_steps | Maintains the effective batch size (2 x 8 = 16, matching SFT's 4 x 4) |
| Lower warmup_steps | DPO starts from an already-SFT'd model, so less warmup is needed |

5. DPO-Specific Settings

dpo_beta Parameter

training:
  type: dpo
  dpo_beta: 0.1    # Range: 0.05 - 0.5 (typically 0.1)

| Value | Effect |
|---|---|
| 0.05 | Weak preference enforcement. Stays close to the reference model. Low divergence risk. |
| 0.1 | Standard value. An appropriate balance for most cases. |
| 0.5 | Strong preference enforcement. Widens the preferred/rejected gap. Overfitting risk. |
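To see what this table means numerically, the loss formula from section 2 can be evaluated at each beta for a fixed, chosen-favoring margin (a standalone sketch, not EulerForge code):

```python
import math

def dpo_loss_from_logits(logits: float, beta: float) -> float:
    # loss = -log(sigmoid(beta * logits)), per the formula in section 2
    return math.log1p(math.exp(-beta * logits))

margin = 2.0  # fixed (policy - reference) log-ratio margin favoring chosen
for beta in (0.05, 0.1, 0.5):
    print(f"beta={beta}: loss={dpo_loss_from_logits(margin, beta):.4f}")
# A larger beta drives the loss lower for the same margin: the gradient
# pushes harder on each preference, which is why high beta risks overfitting.
```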

6. Full Configuration File Walkthrough

Full contents of configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml:

# -- Model Info --
device: cuda:0                              # GPU device
backbone: qwen3                             # Backbone adapter: Qwen3Adapter
model_name: Qwen/Qwen3.5-0.8B-Base           # HuggingFace model ID

# -- Injection Settings (same as SFT) --
injection:
  strategy: moe_expert_lora             # Injection strategy (same as SFT)
  lora_r: 48                                # LoRA rank
  lora_alpha: 96                            # Scaling factor (96/48 = 2.0)
  lora_dropout: 0.05                        # LoRA dropout
  num_experts: 4                            # MoE expert count
  top_k: 2                                  # Active experts per token
  target_keywords: [gate_proj, up_proj, down_proj]  # FFN targets
  start_layer: 0                            # Starting layer
  num_layers: 0                             # 0 = all
  attn_lora:                                # Attention LoRA
    enabled: true
    keywords: [q_proj, v_proj]

# -- MoE Stability Settings (same as SFT) --
moe:
  router_z_loss_coef: 0.001                 # z-loss: prevents logit overflow
  load_balance:
    type: aux_loss                          # Auxiliary loss-based load balancing
    aux_loss_coef: 0.01                     # Auxiliary loss weight
  router_dtype: float32                     # Router precision

# -- Training Settings (DPO-specific changes) --
training:
  type: dpo                                 # [DPO] Training type
  dpo_beta: 0.1                             # [DPO] Preference strength parameter
  phases:                                   # 3-phase (same structure as SFT)
    - step: 0                               # Phase 0: Router warmup
      trainable: ["router"]
    - step: 2000                            # Phase 1: LoRA training
      trainable: ["lora", "attn_lora"]
    - step: 8000                            # Phase 2: Full unfreeze
      trainable: ["lora", "attn_lora", "router", "base_ffn"]
      base_ffn_keywords: ["gate_proj", "up_proj", "down_proj"]
  lr: 5.0e-6                               # [DPO] Lower learning rate than SFT (1e-5)
  weight_decay: 0.01                        # Weight decay
  warmup_steps: 100                         # [DPO] Shorter warmup than SFT (200)
  max_train_steps: 15000                    # Maximum training steps
  batch_size: 2                             # [DPO] Smaller batch than SFT (4) due to chosen+rejected
  grad_accum_steps: 8                       # [DPO] Larger accumulation than SFT (4) to maintain effective batch
  max_grad_norm: 1.0                        # Gradient clipping
  log_steps: 50                             # Logging interval
  save_steps: 1000                          # Checkpoint save interval
  val_steps: 500                            # Validation interval

7. Running

Basic Execution

eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
    --set data.format=raw \
    --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl \
    --set data.max_length=512

Pipeline Execution (SFT -> DPO)

# Step 1: SFT training
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_sft.yml \
    --set data.format=raw \
    --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl \
    --set data.max_length=512

# Step 2: DPO training from SFT checkpoint
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
    --set data.format=raw \
    --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl \
    --set data.max_length=512 \
    --set model_name=/path/to/sft_checkpoint

Automatic Reference Model Detection: When starting DPO from an SFT checkpoint, the initial LoRA state (the SFT model) is automatically used as the reference. An initial loss of ln(2) ≈ 0.693 is normal, since the policy and reference start out identical. If the initial loss deviates significantly from 0.693, check the reference setup.

Adjusting dpo_beta

eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
    --set data.format=raw \
    --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl \
    --set data.max_length=512 \
    --set training.dpo_beta=0.05    # Conservative alignment

Preflight Check

eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
    --preflight

8. Interpreting DPO Metrics

The following metrics are printed in logs during DPO training.

| Metric | Meaning | Desired Trend |
|---|---|---|
| dpo_loss | DPO loss value | Decreasing |
| reward_chosen | Reward for the preferred response | Increasing |
| reward_rejected | Reward for the rejected response | Decreasing or stable |
| reward_margin | reward_chosen - reward_rejected | Increasing toward positive |
| accuracy | Fraction of pairs where the preferred reward exceeds the rejected reward | Increasing (0.7-0.8 is good) |
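A sketch of how these metrics can be computed from per-sample log probabilities, taking reward = policy_logprob - reference_logprob as stated in section 10 (the `dpo_metrics` helper is illustrative, not EulerForge code):

```python
import math

def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Batch metrics as logged above; inputs are lists of summed log probs."""
    n = len(policy_chosen)
    reward_c = [p - r for p, r in zip(policy_chosen, ref_chosen)]
    reward_r = [p - r for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(reward_c, reward_r)]  # = the DPO logits
    return {
        "reward_chosen": sum(reward_c) / n,
        "reward_rejected": sum(reward_r) / n,
        "reward_margin": sum(margins) / n,
        "accuracy": sum(m > 0 for m in margins) / n,
        # per-pair loss -log(sigmoid(beta * margin)), averaged over the batch
        "dpo_loss": sum(math.log1p(math.exp(-beta * m)) for m in margins) / n,
    }

m = dpo_metrics([-9.0, -8.5], [-13.0, -12.0], [-10.0, -9.0], [-12.0, -11.0])
print(m["accuracy"], round(m["reward_margin"], 2))  # -> 1.0 1.75
```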

Metric Interpretation Guide

Good training:
  dpo_loss: 0.69 -> 0.45 (decreasing)
  reward_margin: 0.0 -> 1.5 (increasing toward positive)
  accuracy: 0.5 -> 0.75 (increasing)

Warning signs:
  accuracy > 0.95         -> possible overfitting (reduce dpo_beta)
  reward_margin < 0       -> model prefers rejected responses (check data or beta)
  reward_margin > 5       -> excessive deviation from reference (overfitting, reduce lr/steps)
  dpo_loss diverging      -> learning rate too high

9. Combining with Other Injection Strategies

DPO can be combined with all injection strategies. Only the training section needs to change.

Plain LoRA + DPO

injection:
  strategy: dense_lora         # Injection strategy
  # ... (Plain LoRA settings)

training:
  type: dpo                    # Change to DPO
  dpo_beta: 0.1
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]    # Single phase
  lr: 5.0e-6
  batch_size: 2
  grad_accum_steps: 8

LoRA MoE + DPO

injection:
  strategy: mixture_lora           # Injection strategy
  # ... (LoRA MoE settings)

training:
  type: dpo                    # Change to DPO
  dpo_beta: 0.1
  phases:
    - step: 0
      trainable: ["router"]               # 2-phase
    - step: 2000
      trainable: ["lora", "attn_lora"]
  lr: 5.0e-6
  batch_size: 2
  grad_accum_steps: 8

Key point: The phase schedule depends on the injection strategy. The same phase structure is used regardless of whether DPO is used. What changes are the training parameters: type, dpo_beta, lr, batch_size, grad_accum_steps, etc.


10. Phase 0 Router-Only and DPO Metrics

When Phase 0 trains only ["router"] in MoE strategies, DPO metrics appear as follows:

[Phase0] reward_chosen: 0.0000 | reward_rejected: 0.0000 | reward_margin: 0.0000 | accuracy: 0.0000 | dpo_loss: 0.6931

This is normal. DPO computes reward as policy_logprob - reference_logprob. In Phase 0, LoRA is frozen, so disabling and enabling the adapters produce identical outputs: the policy equals the reference. Therefore reward = 0, accuracy = 0, and loss = ln(2) ≈ 0.6931.

Normal DPO training begins in Phase 1 when LoRA is activated.


11. max_train_steps and grad_accum_steps

max_train_steps is based on micro-steps (forward/backward count). Optimizer steps (weight updates) equal max_train_steps / grad_accum_steps.

training:
  max_train_steps: 1500    # micro-steps = 1500 forward/backward passes
  grad_accum_steps: 8      # 1 update after 8 accumulations
  batch_size: 2            # effective batch = 2 x 8 = 16
  # -> optimizer_steps = 1500 // 8 = 187 (integer division)
  # -> total training data = 1500 x 2 = 3000 samples

In logs, a line like Step 6/187 (micro 50/1500) reads as:
- Step 6/187: optimizer step 6 of 187 total
- micro 50/1500: micro step 50 of 1500 total
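The arithmetic above can be captured in a small helper (illustrative only; `training_schedule` is not an EulerForge function):

```python
def training_schedule(max_train_steps: int, grad_accum_steps: int, batch_size: int) -> dict:
    """Derive optimizer steps, effective batch, and sample count from the config."""
    return {
        "optimizer_steps": max_train_steps // grad_accum_steps,
        "effective_batch": batch_size * grad_accum_steps,
        "total_samples": max_train_steps * batch_size,
    }

print(training_schedule(1500, 8, 2))
# -> {'optimizer_steps': 187, 'effective_batch': 16, 'total_samples': 3000}
```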


12. Debugging and Troubleshooting

| Symptom | Cause | Solution |
|---|---|---|
| reward=0, loss=0.6931 in Phase 0 | LoRA frozen -> policy = reference | Normal; resolves once LoRA activates in Phase 1 |
| accuracy stuck at 0.5 | dpo_beta too small or data quality issue | Increase dpo_beta or check the data |
| accuracy rapidly converges to 1.0 | Overfitting | Reduce dpo_beta, reduce the learning rate, reduce epochs |
| reward_margin is negative | chosen/rejected swapped in the data, or a label error | Check the data; verify the -100 masking in labels |
| OOM (out of memory) | DPO performs 2 forwards (policy + reference) | Reduce batch_size; add model.load_precision.mode: int4 |
| dpo_loss is NaN | Learning rate too high, or numerical instability in log probabilities | Reduce lr; verify max_grad_norm: 1.0 |
| Data loading error | Required fields missing in the JSONL | For raw: check prompt, chosen, rejected. For processed: check chosen_input_ids, rejected_input_ids, chosen_labels, rejected_labels |

Next Steps