
18. Training Pipeline (SFT → PPO)

This document explains the complete training sequence for LLM fine-tuning. It covers the order and rationale for each training type (SFT -> DPO/ORPO -> RM -> PPO), and provides concrete guidance on the paths you can combine in practice.


1. The Big Picture

              ┌───────────────────────────────────────────────────┐
              │             Foundational Capabilities             │
              │                                                   │
              │   pretrain (optional)   →   SFT (required start)  │
              │   Language modeling         Instruction-following │
              │   with raw text             capability            │
              └─────────────────────────┬─────────────────────────┘
                                        │
                       ┌───────────────┼───────────────┐
                       ▼               ▼               ▼
                  ┌──────────┐   ┌──────────┐   ┌──────────┐
                  │   DPO    │   │   ORPO   │   │    RM    │
                  │ Direct   │   │ SFT+pref │   │ Reward   │
                  │ pref opt │   │ unified  │   │ model    │
                  └────┬─────┘   └────┬─────┘   └────┬─────┘
                       │              │              ▼
                       │              │         ┌──────────┐
                       │              │         │   PPO    │
                       │              │         │  (RLHF)  │
                       │              │         └────┬─────┘
                       ▼              ▼              ▼
                  ┌─────────────────────────────────────────┐
                  │       Final model → Deploy / Bench      │
                  └─────────────────────────────────────────┘

2. Role of Each Training Type

| Training Type | Input Data | Learning Objective | Analogy |
|---|---|---|---|
| SFT | prompt + response pairs | Learn the pattern "answer like this for this kind of question" | Studying from a textbook |
| DPO | prompt + chosen/rejected pairs | Increase preference probability for good responses (relative to a reference model) | Teacher says "this one is better" |
| ORPO | prompt + chosen/rejected pairs | SFT + preference learning in one pass (no reference model needed) | Studying while getting feedback at the same time |
| RM | prompt + chosen/rejected pairs | Predict a scalar quality score for a response | Training a judge |
| PPO | prompts + RM scores | Learn a generation policy that earns high rewards (3 models: policy + RM + reference) | Practicing repeatedly with judge feedback |

DPO vs ORPO: Preference Learning Selection Guide

DPO and ORPO are both preference learning methods but are structurally different:

| Item | DPO | ORPO |
|---|---|---|
| Reference model | Required (implemented by disabling the LoRA adapter) | Not required |
| Forward passes | 2 (policy + reference) | 1 |
| GPU memory | High (2x forward) | Low (1x forward) |
| Loss structure | Preference loss only | SFT + preference combined |
| Handoff compatibility | Incompatible (while LoRA is frozen, the loss cannot move) | Compatible |
| MoE Phase 0 | reward=0 (normal while LoRA is frozen) | sft_loss remains effective |

Recommendation: For single GPU + MoE combinations, ORPO is the practical alternative. Choose DPO when you have sufficient GPU memory and are not using Handoff.


3. Training Paths

Path A: SFT -> DPO (Most Common)

SFT (5,000 steps) → DPO (1,500 steps) → Deploy

| Step | Purpose | Advantage |
|---|---|---|
| SFT | Instruction-following foundation | Model learns how to answer questions |
| DPO | Preference alignment | Prefers good responses relative to the reference (SFT) model; no separate RM needed |

Advantages: The simplest 2-stage pipeline. Direct preference optimization without RM training.

# Step 1: SFT
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
    --set data.format=raw --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl \
    --output-dir ./outputs/pipeline/dense_lora_sft 

# Step 2: DPO (based on SFT checkpoint)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_dpo.yml \
    --set model_name=outputs/run_YYYYMMDD_HHMMSS/final \
    --set data.format=raw --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl

Path B: SFT -> ORPO (Memory-Efficient)

SFT (5,000 steps) → ORPO (2,000 steps) → Deploy

| Step | Purpose | Advantage |
|---|---|---|
| SFT | Instruction-following foundation | Model learns how to answer questions |
| ORPO | Combined SFT + preference | No reference model needed; more memory-efficient than DPO |

Advantages: More memory-efficient than DPO -- 1 forward pass instead of 2 (policy + reference). ORPO's odds-ratio approach enables preference learning without a reference model.

# Step 1: SFT (same)
# Step 2: ORPO
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml \
    --set model_name=outputs/run_YYYYMMDD_HHMMSS/final \
    --set data.format=raw --set data.task=preference \
    --set data.path=data/dpo_10k_raw.jsonl
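As a rough sketch of what the ORPO stage optimizes: the loss below follows the odds-ratio formulation of Hong et al. (2024) over length-normalized mean token log-probs. The `lam` weight and the exact normalization are illustrative assumptions, not EulerForge's actual implementation.

```python
import math

def orpo_loss(logp_chosen, logp_rejected, sft_nll, lam=0.1):
    """ORPO objective sketch: SFT loss plus an odds-ratio preference penalty.

    logp_chosen / logp_rejected: length-normalized mean token log-probs of
    the chosen / rejected responses under the *single* policy model --
    no reference model is involved, hence the single forward pass.
    """
    def log_odds(logp):
        p = math.exp(logp)                  # mean token probability
        return logp - math.log(1.0 - p)     # log( p / (1 - p) )

    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -math.log(1.0 / (1.0 + math.exp(-ratio)))  # -log sigmoid(ratio)
    return sft_nll + lam * l_or

# When the model already prefers the chosen response, the penalty is small:
small = orpo_loss(-0.5, -2.0, sft_nll=1.0)
large = orpo_loss(-2.0, -0.5, sft_nll=1.0)
assert small < large
```

Because the SFT negative log-likelihood term stays in the loss, ORPO keeps reinforcing instruction-following while it learns preferences.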

Warning: Applying ORPO/DPO directly to a base model without SFT can degrade performance. Base models lack instruction-following capability, so preference learning alone cannot establish foundational abilities. For small-scale data (under 10K), always use the SFT -> ORPO/DPO two-stage approach. Consider standalone ORPO only when you have 50K+ high-quality data. Detailed analysis: preference_training_analysis.md


Path C: SFT -> RM -> PPO (Full RLHF)

SFT (5,000 steps) → RM (2,000 steps) → PPO (1,000 steps) → Deploy

| Step | Purpose | Advantage |
|---|---|---|
| SFT | Instruction-following foundation | Model learns how to answer questions |
| RM | Reward model training | A trained judge that learns what makes a good response |
| PPO | Reward-maximizing policy | Learns a generation strategy that earns high scores from the RM |

Advantages: The most sophisticated alignment. When the RM learns complex preference criteria, PPO leverages it to generate responses that are both creative and aligned.

Disadvantages: Complex 3-stage process. RM quality directly impacts PPO results. Can be unstable.

# Step 1: SFT
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
    --set data.format=raw --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl

# Step 2: RM (based on SFT checkpoint)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_rm.yml \
    --set model_name=outputs/sft_run/final \
    --set data.format=raw --set data.task=preference \
    --set data.path=data/dpo_10k_raw.jsonl

# Step 3: PPO (SFT policy + RM reward)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml \
    --set model_name=outputs/sft_run/final \
    --set training.reward_model.checkpoint_path=outputs/rm_run/final

Path D: SFT -> DPO -> RM -> PPO (Initial Alignment with DPO, Fine-Tuning with PPO)

SFT (5,000 steps) → DPO (1,500 steps) → RM (2,000 steps) → PPO (500 steps) → Deploy

Advantages: Rough preference alignment with DPO, then refined fine-tuning with RM+PPO. Combines the strengths of both approaches.


Path E: SFT -> ORPO -> RM -> PPO (Recommended Full RLHF)

SFT (5,000 steps) → ORPO (2,000 steps) → RM (2,000 steps) → PPO (1,000 steps) → Deploy

| Step | Purpose | Advantage over the DPO path (D) |
|---|---|---|
| SFT | Foundation | Same |
| ORPO (replaces DPO) | Combined SFT + preference | 50% memory savings, Handoff compatible, single forward pass |
| RM | Reward model | Same |
| PPO | RLHF policy | Same |

Why this path is recommended:

- DPO's 2x forward pass risks OOM on a single GPU; ORPO solves this with a single forward pass
- ORPO works correctly even with MoE + Handoff combinations (no reference model needed)
- ORPO's SFT loss maintains foundational capabilities while simultaneously learning preferences

# Step 1: SFT
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
    --set data.format=raw --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl

# Step 2: ORPO (replaces DPO)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml \
    --set model_name=outputs/sft_run/final \
    --set data.format=raw --set data.task=preference \
    --set data.path=data/dpo_10k_raw.jsonl

# Step 3: RM
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_rm.yml \
    --set model_name=outputs/sft_run/final \
    --set data.format=raw --set data.task=preference \
    --set data.path=data/dpo_10k_raw.jsonl

# Step 4: PPO
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml \
    --set model_name=outputs/orpo_run/final \
    --set training.reward_model.checkpoint_path=outputs/rm_run/final

4. Path Comparison Table

| Path | Stages | GPU Memory | Complexity | Handoff Compatible | Recommended Scenario |
|---|---|---|---|---|---|
| A: SFT->DPO | 2 | Medium (2x fwd) | Low | X | GPU headroom + no Handoff |
| B: SFT->ORPO | 2 | Low (1x fwd) | Low | O | Single GPU recommended, MoE+Handoff compatible |
| C: SFT->RM->PPO | 3 | High | High | O | When sophisticated alignment is needed |
| D: SFT->DPO->RM->PPO | 4 | High | Highest | X | Research/experimentation |
| E: SFT->ORPO->RM->PPO | 4 | Medium | High | O | Full RLHF + memory efficiency; recommended over D |

5. Combining with Injection Strategies

All training types can be freely combined with all 4 injection strategies:

| Injection Strategy | SFT | DPO | ORPO | RM | PPO |
|---|---|---|---|---|---|
| dense_lora | O | O | O | O | O |
| mixture_lora | O | O | O | O | O |
| moe_expert_lora | O | O | O | O | O |
| native_moe_expert_lora | O | O | O | O | O |

Note on combining MoE strategies (mixture_lora, moe_expert_lora) with DPO:

In Phase 0, when only ["router"] is trained, DPO's reward_chosen/rejected will show as 0. This is normal -- when LoRA is frozen, policy and reference produce identical outputs. Normal DPO training begins in Phase 1 when LoRA is activated.

Details: 05_dpo_training.md section 10
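The mechanism is easy to verify numerically: DPO's implicit reward is β times the log-prob gap between policy and reference, so when the two models are identical the reward margin is necessarily zero. A minimal sketch (the β value is illustrative):

```python
def dpo_reward(beta, logp_policy, logp_ref):
    # DPO's implicit reward: beta * (policy logprob - reference logprob).
    return beta * (logp_policy - logp_ref)

# Phase 0: LoRA adapters are frozen, so the policy's outputs are
# identical to the reference model's -> reward is exactly 0.
logp_ref = -42.7
assert dpo_reward(0.1, logp_ref, logp_ref) == 0.0

# Phase 1: LoRA is active, log-probs diverge, rewards become non-zero.
assert dpo_reward(0.1, -40.1, logp_ref) != 0.0
```

So a flat reward_chosen/rejected of 0 in Phase 0 is a diagnostic that the policy equals the reference, not a training failure.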


6. Data Format Mapping

| Training Type | Raw Data Task | Required Fields |
|---|---|---|
| SFT | sft | text, or prompt + response |
| DPO | prompted_preference | prompt, chosen, rejected |
| ORPO | preference | chosen, rejected |
| RM | preference | chosen, rejected |
| PPO | prompt_only | prompt |
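To make the mapping concrete, here are illustrative JSONL-style records for each raw-data task. The field names come from the table above; everything else (the example values, the one-object-per-line layout) is an assumption to check against your data loader.

```python
import json

# Illustrative records per raw-data task; field names follow the mapping
# table, the values are made up.
records = {
    "sft": {"prompt": "What is LoRA?", "response": "LoRA trains low-rank adapters."},
    "prompted_preference": {            # DPO
        "prompt": "Summarize this ticket.",
        "chosen": "Concise, correct summary.",
        "rejected": "Rambling, partly wrong summary.",
    },
    "preference": {                     # ORPO / RM
        "chosen": "Good response.",
        "rejected": "Bad response.",
    },
    "prompt_only": {"prompt": "Explain PPO in one sentence."},  # PPO
}

# One JSON object per line is the usual .jsonl convention:
jsonl = "\n".join(json.dumps(r) for r in records.values())
assert all(json.loads(line) for line in jsonl.splitlines())
```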

7. Practical Tips

7.1 Give SFT Enough Time

If SFT is insufficient, subsequent preference learning will be less effective. The model must first learn "how to answer questions" before it can learn "which answers are better."

7.2 DPO vs ORPO Selection Criteria

| Criterion | DPO Preferred | ORPO Preferred |
|---|---|---|
| GPU memory | Ample | Limited |
| Reference model | Can manage it separately | Want to reduce management burden |
| Proven methodology | O (more papers/experiments) | Relatively newer method |
| Want to finish in one step | -- | O (unified SFT + preference) |

7.3 RM Quality Determines PPO

PPO uses RM scores as rewards. If the RM learns incorrect preferences, PPO will optimize in the wrong direction. Evaluate the RM with benchmarks first before feeding it to PPO.
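A minimal form of such a pre-check is pairwise accuracy on a held-out preference set: the fraction of pairs where the RM scores chosen above rejected. A sketch (the scoring loop that produces the pairs is assumed to exist elsewhere):

```python
def rm_pairwise_accuracy(scored_pairs):
    """Fraction of pairs where the RM scored chosen above rejected.

    scored_pairs: iterable of (reward_chosen, reward_rejected) floats,
    e.g. produced by running the RM checkpoint over held-out preference
    data. A well-trained RM should land well above 0.5 (chance level).
    """
    pairs = list(scored_pairs)
    hits = sum(1 for c, r in pairs if c > r)
    return hits / len(pairs)

pairs = [(1.2, -0.3), (0.8, 0.9), (2.1, 0.0), (0.5, 0.4)]
assert rm_pairwise_accuracy(pairs) == 0.75
```

If this number is near 0.5, fix the RM (or its data) before spending any compute on PPO.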

7.4 Checkpoint Chain (Auto-Detection)

Each stage's checkpoint becomes the next stage's input:

- SFT final/ → DPO's model_name
- SFT final/ → RM's model_name
- SFT final/ + RM final/ → PPO's model_name + reward_model.checkpoint_path

Auto-detection: When you specify an EulerForge checkpoint path as model_name, the original base model is automatically detected and loaded correctly. The process follows the sequence: base model -> LoRA injection -> adapter weight restoration, based on lora_info.json and resolved_config.json.

Automatic injection override: If the previous checkpoint's LoRA structure (lora_r, lora_alpha, target_keywords, start_layer, num_layers, etc.) differs from the current config, it is automatically overridden with the checkpoint's settings. For example, in an SFT (lora_r=48) -> DPO (lora_r=24) pipeline, the DPO stage will automatically use the SFT's lora_r=48. A warning log is printed.

# Specify SFT checkpoint directly as DPO model_name -- base model is auto-resolved
# Injection settings (lora_r, lora_alpha, etc.) are automatically taken from the SFT checkpoint
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_dpo.yml \
    --set model_name=outputs/sft/final

7.5 Data Quality > Data Quantity

For DPO/ORPO in particular, what matters most is pairs with a clear quality gap between chosen and rejected. If the gap is minimal, the model will struggle to learn the preference direction.

7.6 Practical Constraints on a Single GPU

| Model Size | SFT | DPO | PPO | Recommendation |
|---|---|---|---|---|
| ~0.8B (int4) | Possible | Possible | Possible | batch_size=2~4 |
| ~2B (int4) | Possible | batch_size=1 | Difficult | Increase grad_accum |
| ~4B+ | batch_size=1 | OOM risk | Not feasible | Switch to ORPO |

DPO requires 2 forward passes (policy + reference), so it uses roughly 2x the memory of SFT. For large models on a single GPU, ORPO (1 forward pass) is the practical alternative.


7.7 Recommended Data Volumes

LoRA (Hu et al., 2021) freezes the base model and trains only low-rank adapters, enabling rapid iteration experiments on small models and domain adaptation on large models even with limited GPU. QLoRA (Dettmers et al., 2023) demonstrated that even 65B models can be fine-tuned on a single 48GB GPU through 4-bit quantization.

LIMA (Zhou et al., 2023) showed alignment effectiveness with 1K high-quality examples on a 65B model, but that result is about formally surfacing capabilities a large model already has. Smaller models, or scenarios covering multi-turn/safety, require more data. Adjust data scale to model size and data quality; "more is always better" does not hold.

| Model Size | Recommended SFT Data | Recommended Preference Data (DPO/ORPO/RM) | Recommended PPO Prompt Pool | Recommended Starting Point |
|---|---|---|---|---|
| 0.8B | 3K~10K | 1K~3K | 500~2K | dense_lora + SFT -> ORPO |
| 2B | 8K~30K | 2K~8K | 1K~3K | dense_lora + SFT -> ORPO |
| 3B~4B | 15K~60K | 5K~20K | 2K~5K | dense_lora/mixture_lora + SFT -> ORPO |
| 7B~8B | 10K~80K | 8K~30K | 3K~8K | mixture_lora/moe_expert_lora + SFT -> ORPO -> RM |
| MoE type | 10K~50K per domain | 2K~10K per domain | 1K~3K per domain | moe_expert_lora + phased SFT -> ORPO |

7B~8B note: With high-quality curated data, 10K~50K is sufficient. 80K+ is only justified when including diverse multi-task/multilingual data.

Interpretation Principles

- For smaller models, format consistency, correctness, and refusal quality matter more than data volume.
- Larger models can achieve format alignment with small amounts of high-quality data, but in enterprise settings insufficient coverage is the bigger risk, so also ensure data diversity.
- Preference data can be smaller than SFT data, but the difference between chosen and rejected must be more distinct. Both DPO and ORPO work better when preference differences are clear.

7.8 Practical Guide by Model Size

0.8B

This range is well-suited for building models that follow formats reliably and concisely. Limit to tasks like classification, triage, field extraction, simple QA, and response drafting.

2B

The minimum practical threshold for enterprise assistants. Suitable for customer support, document summarization, classification, rule-based responses, and extraction+explanation combined tasks.

3B~4B

Capable of one level more complex tasks like document analysis, code assistance, and legal/patent/scientific QA. The best balance for most enterprise PoCs.

7B~8B

Suited for high-value PoCs like advanced domain copilots, long document processing, and scientific/patent/legal/code reasoning.

MoE type 4B~8B+

Useful when you want to house multiple specialist behaviors in a single model.

7.9 Injection Strategy Selection Guide

Injection strategy should be chosen based on whether the task structure is uniform or splits into multiple experts, rather than model size alone.

| Strategy | Best-Fit Scenario | Recommended Model Size | Default Recommended Path |
|---|---|---|---|
| dense_lora | Single domain, quick baseline, single task | 0.8B~4B | SFT -> ORPO |
| mixture_lora | 2~4 mixed subtasks, multi-department assistant | 2B~8B | SFT -> ORPO -> RM |
| moe_expert_lora | Multi-domain specialist, clear expert separation | 4B~8B+ | SFT -> ORPO -> RM -> PPO |
| native_moe_expert_lora | Advanced experiments leveraging an MoE backbone | 7B~8B+ | MoE full pipeline |

Recommended Defaults

- Always start first experiments with dense_lora.
- If a single model must handle classification/extraction/summarization/response drafting simultaneously, consider mixture_lora.
- If domain-specific data and evaluation sets are clearly separated, moe_expert_lora is more appropriate.
- Rather than forcing MoE onto a small model, first verify that the dense baseline reaches sufficient performance.

7.10 SFT Stage Guide

LIMA (Zhou et al., 2023) showed alignment effectiveness with 1K high-quality examples, but this reflected the Superficial Alignment Hypothesis: surfacing, in the right format, capabilities the 65B base model already had. Smaller models require more data. For practical enterprise tuning, 3K~10K+ of uniform, well-formatted SFT data is a safe range.

Recommended SFT Data Composition

- 50~70%: Core task-oriented data
- 20~30%: Edge cases / refusal / uncertainty handling
- 10~20%: Canonical examples for output format/tone/length control

SFT Stopping Criteria

- Check format compliance on 100~300 dev prompts alongside validation loss (don't rely on validation loss alone)
- If response length keeps growing but bench scores don't improve, suspect overfitting or style over-injection
- Move to the next stage once format errors have largely disappeared and edge-case handling is stable
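A concrete way to run the format-compliance check is a small scoring script over the dev-prompt generations. The regex and length cap below are placeholders for whatever output contract your own SFT data enforces:

```python
import re

def format_error_rate(responses, pattern=r"^Answer:\s.+", max_len=400):
    """Toy format check over dev-set generations: a response passes if it
    matches the expected output pattern and stays under a length cap.
    Both criteria are illustrative stand-ins for your real contract."""
    def ok(resp):
        return re.match(pattern, resp) is not None and len(resp) <= max_len
    bad = sum(1 for resp in responses if not ok(resp))
    return bad / len(responses)

outs = ["Answer: 42.", "Answer: use LoRA.", "I think maybe..."]
assert round(format_error_rate(outs), 2) == 0.33
```

Tracking this rate across checkpoints gives a stopping signal that validation loss alone does not.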

7.11 ORPO / DPO Stage Guide

DPO (Rafailov et al., 2023) directly optimizes preferences through comparison with a reference model, without a separate RM. ORPO (Hong et al., 2024) simultaneously optimizes SFT loss + odds-ratio-based preference penalty without a reference model, in a single stage.
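For reference, DPO's loss can be sketched in a few lines. Note how it consumes log-probs from both the policy and the frozen reference model, which is exactly where the second forward pass (and the extra memory) comes from. The β value here is illustrative:

```python
import math

def dpo_loss(beta, pol_chosen, pol_rejected, ref_chosen, ref_rejected):
    """DPO objective sketch (Rafailov et al., 2023): -log sigmoid of the
    beta-scaled margin between the policy's and the reference model's
    preference for the chosen response over the rejected one."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# The loss shrinks as the policy moves probability mass toward the chosen
# response relative to the reference:
before = dpo_loss(0.1, -30.0, -30.0, -30.0, -30.0)  # policy == reference
after  = dpo_loss(0.1, -25.0, -35.0, -30.0, -30.0)
assert after < before
assert abs(before - math.log(2.0)) < 1e-9           # sigmoid(0) -> -log 0.5
```

Compare this with ORPO, which drops the two `ref_*` terms (and the reference model with them) and instead keeps an SFT loss term alongside the preference penalty.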

Accurate understanding of ORPO: The key claim of the ORPO paper is performing SFT and preference learning simultaneously rather than in two stages. However, in EulerForge, we use a 2-stage approach where SFT is performed first to establish foundational capabilities, then ORPO adds preference learning. This is a practical choice for stability with small-scale data.

DPO vs ORPO default choice: DPO has undergone more research/validation/community tooling (TRL, NeMo-Aligner, etc.). The reason we recommend ORPO as the default in EulerForge is a practical choice from the perspective of single GPU resource constraints, not a claim of quality superiority. If you have GPU headroom, trying DPO first is also reasonable.

Recommended Selection Criteria

- ORPO preferred: single GPU / large model / MoE family / want to reduce reference-model management burden
- DPO preferred: ample GPU memory / very clear chosen/rejected differences / validation comparison experiments

Conditions for Good Preference Pairs

- Clear difference in correctness
- Different in whether evidence is provided
- Different in length control and uncertainty expression
- Exclude pairs that differ only in typos or punctuation where possible
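The last point can be automated with a cheap similarity filter before training. A sketch using the standard library's difflib (the 0.15 gap threshold is a starting-point assumption to tune on your own data):

```python
from difflib import SequenceMatcher

def keep_pair(chosen, rejected, min_gap=0.15):
    """Drop preference pairs whose texts are nearly identical: a
    chosen/rejected pair that differs only in punctuation or a typo
    carries almost no preference signal for DPO/ORPO."""
    similarity = SequenceMatcher(None, chosen, rejected).ratio()
    return (1.0 - similarity) >= min_gap

assert not keep_pair("The answer is 42.", "The answer is 42")  # punctuation only
assert keep_pair("The answer is 42, because 6 x 7 = 42.",
                 "Probably something around fifty?")
```

For large datasets a token-level or embedding-based distance may separate pairs better, but even this character-level filter removes the most degenerate cases.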

Recommended Practical Volumes

- 0.8B~2B: 1K~8K pairs
- 3B~4B: 5K~20K pairs
- 7B~8B+: 8K~30K pairs
- MoE: 2K~10K pairs per domain

7.12 RM Stage Guide

InstructGPT (Ouyang et al., 2022) established the 3-stage RLHF pipeline of training an RM on human preference data and connecting it to PPO.

When to Use RM

- When you need to optimize length, tone, evidence quality, safety, and format in addition to correctness
- When ORPO/DPO alone provides insufficient style control
- When you actually plan to apply PPO

When Not to Use RM

- SFT quality is still unstable
- Preference data quality is low
- No plans to proceed to PPO

Recommended RM Volume: Minimum 5K~20K pairs, stratified split per task, include hard negatives

RM Checkpoint Evaluation: Check reward score distribution, chosen > rejected accuracy, per-task bias, and whether reward hacking occurs (e.g., rewarding length only)
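One quick probe for the length-only failure mode is the correlation between RM score and response length over an evaluation set. A dependency-free sketch:

```python
def length_reward_correlation(rewards, lengths):
    """Pearson correlation between RM score and response length -- a quick
    probe for the length-only reward hacking mentioned above. Values near
    +1 suggest the RM is mostly scoring verbosity, not quality."""
    n = len(rewards)
    mean_r, mean_l = sum(rewards) / n, sum(lengths) / n
    cov = sum((r - mean_r) * (l - mean_l) for r, l in zip(rewards, lengths))
    std_r = sum((r - mean_r) ** 2 for r in rewards) ** 0.5
    std_l = sum((l - mean_l) ** 2 for l in lengths) ** 0.5
    return cov / (std_r * std_l)

# A suspicious RM: reward tracks length almost perfectly.
assert length_reward_correlation([0.1, 0.5, 0.9], [50, 250, 450]) > 0.99
```

A high correlation is not proof of reward hacking on its own (good answers can legitimately be longer), but it flags checkpoints worth inspecting before PPO.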

7.13 PPO Stage Guide

PPO is the core RLHF component from InstructGPT, but as the DPO paper points out, it carries high implementation and tuning complexity. Approach it as a final fine-tuning stage, not the default path.

Conditions for Applying PPO

- SFT + ORPO/DPO results are already sufficiently stable
- The RM's standalone evaluation is meaningful
- The optimization goal is clear (e.g., short evidence-based answers, reduce over-confidence, strengthen safe refusal)

When to Avoid PPO

- Many format errors remain
- RM quality is unverified
- Insufficient memory/time on a single GPU

Recommended Starting Point: Skip for small models when possible, short experiments only for 3B~4B, limited application for 7B~8B/MoE only after RM verification
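Mechanically, the InstructGPT-style PPO reward is the RM score minus a KL penalty that keeps the policy near the reference (SFT) model. A per-sequence sketch (`kl_coef` is illustrative, not an EulerForge default):

```python
def shaped_reward(rm_score, logp_policy, logp_ref, kl_coef=0.05):
    """InstructGPT-style PPO reward sketch: the RM score for a generated
    response, minus a KL penalty estimated from per-token log-prob gaps
    between the policy and the frozen reference model."""
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return rm_score - kl_coef * kl_estimate

# Drifting far from the reference eats into the reward:
close = shaped_reward(1.0, [-1.0, -2.0], [-1.1, -2.1])
far   = shaped_reward(1.0, [-0.2, -0.5], [-3.0, -4.0])
assert far < close
```

This is why three models appear during PPO (policy, RM, reference): the RM supplies the score, and the reference anchors the KL term that prevents reward hacking from drifting the policy too far from its SFT behavior.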

7.14 Recommended Paths by Scenario

| Scenario | Recommended Path |
|---|---|
| 24GB single GPU, 0.8B~2B | dense_lora + SFT -> ORPO |
| 48GB single GPU, 3B~4B | dense_lora/mixture_lora + SFT -> ORPO |
| 80GB single GPU, 7B~8B | mixture_lora/moe_expert_lora + SFT -> ORPO -> RM |
| 80GB+, research experiments | SFT -> ORPO -> RM -> PPO |
| Multi-domain specialist | MoE full pipeline, per-expert preference alignment |

7.15 Recommended Experiment Order

1. SFT baseline -- check format compliance, verbosity, hallucination rate
2. ORPO initial alignment -- default choice for single GPU/large model/MoE
3. DPO comparison experiment -- only when GPU memory headroom is available
4. RM -- only when multi-criteria quality scoring is needed
5. PPO -- only as the final stage, after the RM is verified

7.16 Additional Experiment Checklist

7.17 References

| # | Paper | Key Contribution |
|---|---|---|
| 1 | Hong et al. (2024), ORPO: Monolithic Preference Optimization without Reference Model | Reference-free, simultaneous SFT + preference optimization |
| 2 | Dettmers et al. (2023), QLoRA: Efficient Finetuning of Quantized LLMs | 65B single-GPU fine-tuning via 4-bit quantization |
| 3 | Hu et al. (2021), LoRA: Low-Rank Adaptation of Large Language Models | Efficient fine-tuning via low-rank adapters |
| 4 | Rafailov et al. (2023), Direct Preference Optimization: Your Language Model is Secretly a Reward Model | Direct preference optimization without an RM, simplifying RLHF |
| 5 | Zhou et al. (2023), LIMA: Less Is More for Alignment | Alignment effectiveness of 1K high-quality examples (Superficial Alignment Hypothesis) |
| 6 | Ouyang et al. (2022), Training language models to follow instructions with human feedback (InstructGPT) | Established the SFT -> RM -> PPO 3-stage RLHF pipeline |

8. Hands-On Exercises

The following exercises let you experience a complete pipeline from start to finish.

| Exercise | Difficulty | Goal | Document |
|---|---|---|---|
| Math/Coding Enhancement | Intermediate | SFT -> DPO with GSM8K/code data | 20_lab_math_coding.md |
| Thinking Model | Advanced | Model that outputs a <think> reasoning process | 21_lab_thinking_model.md |
| Korean Chat Quality | Beginner/Intermediate | SFT -> ORPO -> Bench with your own data | 22_lab_korean_finance_copilot.md |
| Full Pipeline MoE | Expert | 4-domain MoE + SFT -> DPO -> RM -> PPO full pipeline | 23_lab_full_pipeline_moe.md |