
6. ORPO Training

Overview

ORPO (Odds Ratio Preference Optimization) is an alternative to DPO that performs preference learning without a reference model. It combines an SFT loss and an odds-ratio loss to learn chosen/rejected preferences in a single forward pass.

Advantages: More memory-efficient than DPO, since there is no reference-model forward pass.

Loss function: Loss = SFT_loss + λ * ORPO_loss

Important: SFT Loss Uses Only Chosen

ORPO's SFT loss computes only the NLL (Negative Log-Likelihood) for chosen responses. Rejected responses are not included in the SFT objective. This follows the design of the original ORPO paper, since training SFT on rejected responses would teach the model to generate bad responses as well.
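The combined objective can be sketched in plain Python. This is a minimal illustration under stated assumptions, not EulerForge's implementation: `log_odds` and `orpo_loss` are hypothetical helper names, and the inputs are per-token average log-probabilities of each response under the policy.

```python
import math

def log_odds(avg_logp: float) -> float:
    # log odds of a response whose per-token average probability is
    # p = exp(avg_logp): log(p / (1 - p)). Requires avg_logp < 0.
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
              lam: float = 1.0) -> float:
    # SFT term: NLL on the chosen response ONLY (rejected is excluded).
    sft = -avg_logp_chosen
    # Odds-ratio term: -log sigmoid(log_odds(chosen) - log_odds(rejected)),
    # written as log1p(exp(-ratio)) for numerical stability.
    ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    or_term = math.log1p(math.exp(-ratio))
    return sft + lam * or_term
```

Note that the odds-ratio term shrinks as the chosen response becomes more probable than the rejected one, while the SFT term keeps pulling the chosen response's likelihood up.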

Why You Should Run SFT First

Important: Applying ORPO directly to a base model may actually degrade benchmark scores.

ORPO learns SFT and preference simultaneously, but base models (only pretrained) lack instruction-following ability. With around 8-10K data points, the SFT component of ORPO alone is insufficient to build foundational capabilities, causing the model to learn "which answer is better" without knowing "how to answer" in the first place.

Correct order: SFT (instruction learning, 5000+ steps) → ORPO (preference alignment, 2000+ steps)
Risky order: Base Model → ORPO (small-scale data)

Consider standalone ORPO only when you have 50K+ high-quality preference data (original paper: 60K+). Run SFT first, then specify that checkpoint (final/) as the model_name for ORPO.

```bash
# Step 1: SFT first
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
    --set data.format=raw --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl

# Step 2: ORPO (based on the SFT-completed model)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml \
    --set model_name=outputs/sft_run/final \
    --set data.format=raw --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl
```

orpo_lambda Selection Guide

| Scenario | Recommended orpo_lambda | Reason |
|---|---|---|
| ORPO after SFT + small-scale data | 0.1~0.3 | Prioritize SFT retention; preference as auxiliary |
| ORPO after SFT + medium-scale data | 0.3~0.5 | Balance between SFT and preference |
| Standalone ORPO + large-scale data (50K+) | 0.5~1.0 | Combined SFT learning |

Prerequisites

Data Format

data/dpo_10k_raw.jsonl is data converted to the standard prompted_preference format:

```json
{"prompt": "Question content", "chosen": "Preferred response", "rejected": "Non-preferred response"}
```

Using data.format=raw will automatically tokenize during training.
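Before launching a run, it can help to sanity-check the JSONL file for the three required fields. The sketch below is generic; the `validate_row` helper is illustrative and not part of EulerForge.

```python
import json

REQUIRED = ("prompt", "chosen", "rejected")

def validate_row(line: str) -> dict:
    """Parse one JSONL line and check the prompted_preference fields."""
    row = json.loads(line)
    for key in REQUIRED:
        if not isinstance(row.get(key), str) or not row[key].strip():
            raise ValueError(f"missing or empty field: {key}")
    return row
```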

Preset

```yaml
# configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml
training:
  type: orpo
  orpo_lambda: 1.0         # ORPO term weight (required, positive)
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]
```

Running

```bash
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml \
    --set data.format=raw \
    --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl \
    --set data.max_length=256
```

Key Parameters

| Parameter | Description | Default |
|---|---|---|
| orpo_lambda | ORPO term weight | Required (recommended: 1.0) |

A larger orpo_lambda strengthens preference learning, while a smaller value makes it closer to SFT.

Metrics

| Metric | Description | Normal Range |
|---|---|---|
| train/sft_loss | Cross-entropy SFT loss | 1.5~4.0 |
| train/orpo_loss | ORPO odds-ratio loss | 0~3.0 |
| train/total_loss | sft_loss + λ * orpo_loss | 2.0~7.0 |

Note: ORPO loss uses per-token average log-probability. Using sequence-sum log-prob can cause the odds ratio to explode numerically.
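The effect is easy to see numerically. In the self-contained sketch below (the `log_odds` helper is illustrative), per-token averages keep the log odds ratio in a moderate range, while sequence sums scale it with sequence length until the sigmoid saturates.

```python
import math

def log_odds(logp: float) -> float:
    # log odds for probability p = exp(logp)
    return logp - math.log1p(-math.exp(logp))

# Per-token average log-probs stay in a narrow range regardless of length.
avg_c, avg_r = -0.8, -1.2
# Sequence-sum log-probs grow linearly with length (here 200 tokens).
n = 200
sum_c, sum_r = avg_c * n, avg_r * n

ratio_avg = log_odds(avg_c) - log_odds(avg_r)   # moderate, ~0.64
ratio_sum = log_odds(sum_c) - log_odds(sum_r)   # huge, ~80: sigmoid saturates
```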

Known Issues

NaN loss (all steps)

During prompted_preference preprocessing, max_length truncation can produce rows with 0 response tokens. These rows make the HuggingFace SFT loss return NaN, and the resulting NaN gradients corrupt the model weights.

EulerForge automatically filters out such rows when loading ProcessedDPODataset:

[DPODataset] Filtered N rows with empty labels (response truncated to 0 tokens).

If this warning appears, it is normal behavior. Training continues without needing to regenerate data.

→ Details: 2026-03-orpo-nan-loss-empty-labels.md
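The filtering behavior can be sketched as follows. This is an illustrative reimplementation under assumptions, not EulerForge's actual code; `tokenize` stands in for whatever tokenizer the pipeline uses, and `filter_empty_label_rows` is a hypothetical name.

```python
def filter_empty_label_rows(rows, max_length, tokenize):
    """Drop rows where truncation leaves 0 response tokens (would yield NaN loss)."""
    kept, dropped = [], 0
    for row in rows:
        prompt_ids = tokenize(row["prompt"])
        # Token budget left for the response once the prompt is placed.
        budget = max_length - len(prompt_ids)
        if budget <= 0:
            dropped += 1  # all response tokens would be truncated away
            continue
        kept.append(row)
    if dropped:
        print(f"[DPODataset] Filtered {dropped} rows with empty labels "
              "(response truncated to 0 tokens).")
    return kept
```

Usage with a toy whitespace tokenizer: `filter_empty_label_rows(rows, max_length=256, tokenize=str.split)`.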