6. ORPO Training
Overview
ORPO is an alternative to DPO that performs preference learning without a reference model. It combines SFT loss and ORPO loss to learn chosen/rejected preferences in a single forward pass.
Advantages: More memory-efficient than DPO since there is no reference model forward pass
Loss Function: Loss = SFT_loss + λ * ORPO_loss
Important: SFT Loss Uses Only Chosen
ORPO's SFT loss computes only the NLL (Negative Log-Likelihood) for chosen responses. Rejected responses are not included in the SFT objective. This follows the design of the original ORPO paper, since training SFT on rejected responses would teach the model to generate bad responses as well.
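The combined objective can be sketched per example as follows. This is a minimal illustration of the loss structure described above, not EulerForge's actual implementation; `logp_chosen` and `logp_rejected` are assumed to be per-token average log-probabilities of each response.

```python
import math

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """Sketch of the ORPO objective for one example.

    logp_chosen / logp_rejected: per-token *average* log-probability of the
    response (must be < 0, i.e. probability < 1).
    """
    # SFT term: NLL of the chosen response only (rejected is excluded)
    sft = -logp_chosen

    # odds(y) = p / (1 - p), with p = exp(avg log-prob)
    def log_odds(lp):
        return lp - math.log1p(-math.exp(lp))

    # Odds-ratio term: -log sigmoid(log-odds ratio of chosen vs rejected)
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    or_term = math.log1p(math.exp(-ratio))

    return sft + lam * or_term
```

When the chosen response is more likely than the rejected one, the odds-ratio term is small and the SFT term dominates; reversing the preference increases the total loss.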
Why You Should Run SFT First
Important: Applying ORPO directly to a base model may actually degrade benchmark scores.
ORPO learns SFT and preference simultaneously, but base models (only pretrained) lack instruction-following ability. With around 8-10K data points, the SFT component of ORPO alone is insufficient to build foundational capabilities, causing the model to learn "which answer is better" without knowing "how to answer" in the first place.
Correct order: SFT (instruction learning, 5000+ steps) → ORPO (preference alignment, 2000+ steps)
Risky order: Base Model → ORPO (small-scale data)
Consider standalone ORPO only when you have 50K+ high-quality preference data (original paper: 60K+). Run SFT first, then specify that checkpoint (`final/`) as the `model_name` for ORPO.

```bash
# Step 1: SFT first
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
  --set data.format=raw --set data.task=sft \
  --set data.path=data/sft_10k_raw.jsonl

# Step 2: ORPO (based on the SFT-completed model)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml \
  --set model_name=outputs/sft_run/final \
  --set data.format=raw --set data.task=prompted_preference \
  --set data.path=data/dpo_10k_raw.jsonl
```
orpo_lambda Selection Guide
| Scenario | Recommended orpo_lambda | Reason |
|---|---|---|
| ORPO after SFT + small-scale data | 0.1~0.3 | Prioritize SFT retention, preference as auxiliary |
| ORPO after SFT + medium-scale data | 0.3~0.5 | Balance between SFT and preference |
| Standalone ORPO + large-scale data (50K+) | 0.5~1.0 | Combined SFT learning |
Prerequisites
- EulerForge installation complete (see Getting Started)
- Data preprocessing complete
Data Format
Raw Data (Recommended)
data/dpo_10k_raw.jsonl is data converted to the standard prompted_preference format:
```json
{"prompt": "Question content", "chosen": "Preferred response", "rejected": "Non-preferred response"}
```
With `data.format=raw`, the data is tokenized automatically during training.
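Before launching a run, it can be worth confirming that every row of the JSONL file carries the three required keys. The helper below is a hypothetical pre-flight check, not part of EulerForge:

```python
import json

def validate_preference_jsonl(path):
    """Check that each JSONL row has the prompt/chosen/rejected keys
    expected by the prompted_preference task. Hypothetical helper."""
    required = {"prompt", "chosen", "rejected"}
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            row = json.loads(line)
            missing = required - row.keys()
            if missing:
                raise ValueError(f"line {i}: missing keys {sorted(missing)}")
    return True
```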
Preset
```yaml
# configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml
training:
  type: orpo
  orpo_lambda: 1.0  # ORPO term weight (required, positive)
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]
```
Running
```bash
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml \
  --set data.format=raw \
  --set data.task=prompted_preference \
  --set data.path=data/dpo_10k_raw.jsonl \
  --set data.max_length=256
```
Key Parameters
| Parameter | Description | Default |
|---|---|---|
| orpo_lambda | ORPO term weight | Required (recommended: 1.0) |
A larger orpo_lambda strengthens preference learning, while a smaller value makes it closer to SFT.
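As a rough illustration (made-up loss values, not measured numbers), the weighting works like this:

```python
# orpo_lambda scales how much the preference term moves the total loss.
# Values below are illustrative, not from a real run.
sft_loss, orpo_term = 2.0, 0.8

totals = {lam: sft_loss + lam * orpo_term for lam in (0.1, 0.5, 1.0)}
# lambda=0.1 stays close to pure SFT; lambda=1.0 weights the ORPO term fully
```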
Metrics
| Metric | Description | Normal Range |
|---|---|---|
| train/sft_loss | Cross-entropy SFT loss | 1.5~4.0 |
| train/orpo_loss | ORPO odds-ratio loss | 0~3.0 |
| train/total_loss | sft_loss + λ * orpo_loss | 2~7.0 |
Note: ORPO loss uses per-token average log-probability. Using sequence-sum log-prob can cause the odds ratio to explode numerically.
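The note above can be made concrete with made-up numbers: the log-odds ratio computed from sequence-sum log-probabilities scales with sequence length, while the per-token average keeps it in a sane range.

```python
import math

def log_odds(logp):
    # log( p / (1 - p) ) where p = exp(logp); assumes logp < 0
    return logp - math.log1p(-math.exp(logp))

# Illustrative per-token average log-probs for a 200-token response
avg_chosen, avg_rejected, seq_len = -1.0, -1.5, 200

# Per-token average: log-odds ratio stays small (~0.7 here)
ratio_avg = log_odds(avg_chosen) - log_odds(avg_rejected)

# Sequence sum: the same responses yield a ratio of ~100, which
# saturates the sigmoid and destabilizes gradients
ratio_sum = log_odds(avg_chosen * seq_len) - log_odds(avg_rejected * seq_len)
```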
Known Issues
NaN loss (all steps)
During prompted_preference preprocessing, max_length truncation can produce rows with 0 response tokens.
These rows cause HuggingFace SFT loss to return NaN, and NaN gradients corrupt model weights.
EulerForge automatically filters out such rows when loading ProcessedDPODataset:
```
[DPODataset] Filtered N rows with empty labels (response truncated to 0 tokens).
```
If this warning appears, it is normal behavior. Training continues without needing to regenerate data.
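The filtering idea can be sketched as below. This is a hypothetical illustration of the predicate, not EulerForge's actual code; it assumes labels use the common `-100` ignore index for prompt and padding positions.

```python
def filter_empty_labels(rows, ignore_index=-100):
    """Drop rows whose response was truncated to 0 tokens: an all-masked
    label row contributes NaN to the cross-entropy loss. Hypothetical sketch."""
    kept, dropped = [], 0
    for row in rows:
        # A row with at least one supervised (non-ignored) label token is kept
        if any(tok != ignore_index for tok in row["labels"]):
            kept.append(row)
        else:
            dropped += 1
    if dropped:
        print(f"[DPODataset] Filtered {dropped} rows with empty labels "
              "(response truncated to 0 tokens).")
    return kept
```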
→ Details: 2026-03-orpo-nan-loss-empty-labels.md