
18. Training Pipeline (SFT → PPO)

This document explains the complete training sequence for LLM fine-tuning. It covers the order and rationale for each training type (SFT -> DPO/ORPO -> RM -> PPO), and provides concrete guidance on the paths you can combine in practice.


1. The Big Picture

              ┌───────────────────────────────────────────────────┐
              │             Foundational Capabilities             │
              │                                                   │
              │   pretrain (optional)   →   SFT (required start)  │
              │   Language modeling         Instruction-following │
              │   with raw text             capability            │
              └─────────────────────────┬─────────────────────────┘
                                        │
                       ┌───────────────┼───────────────┐
                       ▼               ▼               ▼
                  ┌──────────┐   ┌──────────┐   ┌──────────┐
                  │   DPO    │   │   ORPO   │   │    RM    │
                  │ Direct   │   │ SFT+pref │   │ Reward   │
                  │ pref opt │   │ unified  │   │ model    │
                  └────┬─────┘   └────┬─────┘   └────┬─────┘
                       │              │              ▼
                       │              │         ┌──────────┐
                       │              │         │   PPO    │
                       │              │         │  (RLHF)  │
                       │              │         └────┬─────┘
                       ▼              ▼              ▼
                  ┌─────────────────────────────────────────┐
                  │       Final model → Deploy / Bench      │
                  └─────────────────────────────────────────┘

2. Role of Each Training Type

| Training Type | Input Data | Learning Objective | Analogy |
|---|---|---|---|
| SFT | prompt + response pairs | Learn the pattern "answer like this for this kind of question" | Studying from a textbook |
| DPO | prompt + chosen/rejected pairs | Increase preference probability for good responses (relative to a reference model) | Teacher says "this one is better" |
| ORPO | prompt + chosen/rejected pairs | SFT + preference learning in one pass (no reference model needed) | Studying while getting feedback at the same time |
| RM | prompt + chosen/rejected pairs | Predict a scalar quality score for a response | Training a judge |
| PPO | prompts + RM scores | Learn a generation policy that earns high rewards (3 models: policy + RM + reference) | Practicing repeatedly with judge feedback |

DPO vs ORPO: Preference Learning Selection Guide

DPO and ORPO are both preference learning methods but are structurally different:

| Item | DPO | ORPO |
|---|---|---|
| Reference model | Required (implemented by disabling the LoRA adapter) | Not required |
| Forward passes | 2 (policy + reference) | 1 |
| GPU memory | High (2x forward) | Low (1x forward) |
| Loss structure | Preference loss only | SFT + preference combined |
| Handoff compatibility | Incompatible (while LoRA is frozen, the loss cannot move) | Compatible |
| MoE Phase 0 | reward=0 (normal while LoRA is frozen) | sft_loss remains effective |

Recommendation: For single GPU + MoE combinations, ORPO is the practical alternative. Choose DPO when you have sufficient GPU memory and are not using Handoff.


3. Training Paths

Path A: SFT -> DPO (Most Common)

SFT (5,000 steps) → DPO (1,500 steps) → Deploy

| Step | Purpose | Advantage |
|---|---|---|
| SFT | Instruction-following foundation | Model learns how to answer questions |
| DPO | Preference alignment | Prefers good responses relative to the reference (SFT) model; no separate RM needed |

Advantages: The simplest 2-stage pipeline. Direct preference optimization without RM training.

# Step 1: SFT
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
    --set data.format=raw --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl \
    --output-dir ./outputs/pipeline/dense_lora_sft 

# Step 2: DPO (based on SFT checkpoint)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_dpo.yml \
    --set model_name=outputs/run_YYYYMMDD_HHMMSS/final \
    --set data.format=raw --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl

Path B: SFT -> ORPO (Memory-Efficient)

SFT (5,000 steps) → ORPO (2,000 steps) → Deploy

| Step | Purpose | Advantage |
|---|---|---|
| SFT | Instruction-following foundation | Model learns how to answer questions |
| ORPO | Combined SFT + preference | No reference model needed; more memory-efficient than DPO |

Advantages: More memory-efficient than DPO -- 1 forward pass instead of 2 (policy + reference). ORPO's odds-ratio approach enables preference learning without a reference model.

# Step 1: SFT (same)
# Step 2: ORPO
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml \
    --set model_name=outputs/run_YYYYMMDD_HHMMSS/final \
    --set data.format=raw --set data.task=preference \
    --set data.path=data/dpo_10k_raw.jsonl
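As a rough sketch of what the ORPO stage optimizes: the loss below follows the odds-ratio formulation of Hong et al. (2024) over length-normalized mean token log-probs. The `lam` weight and the exact normalization are illustrative assumptions, not EulerForge's actual implementation.

```python
import math

def orpo_loss(logp_chosen, logp_rejected, sft_nll, lam=0.1):
    """ORPO objective sketch: SFT loss plus an odds-ratio preference penalty.

    logp_chosen / logp_rejected: length-normalized mean token log-probs of
    the chosen / rejected responses under the *single* policy model --
    no reference model is involved, hence the single forward pass.
    """
    def log_odds(logp):
        p = math.exp(logp)                  # mean token probability
        return logp - math.log(1.0 - p)     # log( p / (1 - p) )

    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -math.log(1.0 / (1.0 + math.exp(-ratio)))  # -log sigmoid(ratio)
    return sft_nll + lam * l_or

# When the model already prefers the chosen response, the penalty is small:
small = orpo_loss(-0.5, -2.0, sft_nll=1.0)
large = orpo_loss(-2.0, -0.5, sft_nll=1.0)
assert small < large
```

Because the SFT negative log-likelihood term stays in the loss, ORPO keeps reinforcing instruction-following while it learns preferences.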

Warning: Applying ORPO/DPO directly to a base model without SFT can degrade performance. Base models lack instruction-following capability, so preference learning alone cannot establish foundational abilities. For small-scale data (under 10K), always use the SFT -> ORPO/DPO two-stage approach. Consider standalone ORPO only when you have 50K+ high-quality data. Detailed analysis: preference_training_analysis.md


Path C: SFT -> RM -> PPO (Full RLHF)

SFT (5,000 steps) → RM (2,000 steps) → PPO (1,000 steps) → Deploy

| Step | Purpose | Advantage |
|---|---|---|
| SFT | Instruction-following foundation | Model learns how to answer questions |
| RM | Reward model training | A trained judge that learns what makes a good response |
| PPO | Reward-maximizing policy | Learns a generation strategy that earns high scores from the RM |

Advantages: The most sophisticated alignment. When the RM learns complex preference criteria, PPO leverages it to generate responses that are both creative and aligned.

Disadvantages: Complex 3-stage process. RM quality directly impacts PPO results. Can be unstable.

# Step 1: SFT
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
    --set data.format=raw --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl

# Step 2: RM (based on SFT checkpoint)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_rm.yml \
    --set model_name=outputs/sft_run/final \
    --set data.format=raw --set data.task=preference \
    --set data.path=data/dpo_10k_raw.jsonl

# Step 3: PPO (SFT policy + RM reward)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml \
    --set model_name=outputs/sft_run/final \
    --set training.reward_model.checkpoint_path=outputs/rm_run/final

Path D: SFT -> DPO -> RM -> PPO (Initial Alignment with DPO, Fine-Tuning with PPO)

SFT (5,000 steps) → DPO (1,500 steps) → RM (2,000 steps) → PPO (500 steps) → Deploy

Advantages: Rough preference alignment with DPO, then refined fine-tuning with RM+PPO. Combines the strengths of both approaches.


Path E: SFT -> ORPO -> RM -> PPO (Recommended Full RLHF)

SFT (5,000 steps) → ORPO (2,000 steps) → RM (2,000 steps) → PPO (1,000 steps) → Deploy

| Step | Purpose | Advantage over the DPO path (D) |
|---|---|---|
| SFT | Foundation | Same |
| ORPO (replaces DPO) | Combined SFT + preference | 50% memory savings, Handoff compatible, single forward pass |
| RM | Reward model | Same |
| PPO | RLHF policy | Same |

Why this path is recommended:

- DPO's 2x forward pass risks OOM on a single GPU; ORPO solves this with a single forward pass
- ORPO works correctly even with MoE + Handoff combinations (no reference model needed)
- ORPO's SFT loss maintains foundational capabilities while simultaneously learning preferences

# Step 1: SFT
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
    --set data.format=raw --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl

# Step 2: ORPO (replaces DPO)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml \
    --set model_name=outputs/sft_run/final \
    --set data.format=raw --set data.task=preference \
    --set data.path=data/dpo_10k_raw.jsonl

# Step 3: RM
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_rm.yml \
    --set model_name=outputs/sft_run/final \
    --set data.format=raw --set data.task=preference \
    --set data.path=data/dpo_10k_raw.jsonl

# Step 4: PPO
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml \
    --set model_name=outputs/orpo_run/final \
    --set training.reward_model.checkpoint_path=outputs/rm_run/final

4. Path Comparison Table

| Path | Stages | GPU Memory | Complexity | Handoff Compatible | Recommended Scenario |
|---|---|---|---|---|---|
| A: SFT->DPO | 2 | Medium (2x fwd) | Low | X | GPU headroom + no Handoff |
| B: SFT->ORPO | 2 | Low (1x fwd) | Low | O | Single GPU recommended, MoE+Handoff compatible |
| C: SFT->RM->PPO | 3 | High | High | O | When sophisticated alignment is needed |
| D: SFT->DPO->RM->PPO | 4 | High | Highest | X | Research/experimentation |
| E: SFT->ORPO->RM->PPO | 4 | Medium | High | O | Full RLHF + memory efficiency; recommended over D |

5. Combining with Injection Strategies

All training types can be freely combined with all 4 injection strategies:

| Injection Strategy | SFT | DPO | ORPO | RM | PPO |
|---|---|---|---|---|---|
| dense_lora | O | O | O | O | O |
| mixture_lora | O | O | O | O | O |
| moe_expert_lora | O | O | O | O | O |
| native_moe_expert_lora | O | O | O | O | O |

Note on combining MoE strategies (mixture_lora, moe_expert_lora) with DPO:

In Phase 0, when only ["router"] is trained, DPO's reward_chosen/rejected will show as 0. This is normal -- when LoRA is frozen, policy and reference produce identical outputs. Normal DPO training begins in Phase 1 when LoRA is activated.

Details: 05_dpo_training.md section 10
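The mechanism is easy to verify numerically: DPO's implicit reward is β times the log-prob gap between policy and reference, so when the two models are identical the reward margin is necessarily zero. A minimal sketch (the β value is illustrative):

```python
def dpo_reward(beta, logp_policy, logp_ref):
    # DPO's implicit reward: beta * (policy logprob - reference logprob).
    return beta * (logp_policy - logp_ref)

# Phase 0: LoRA adapters are frozen, so the policy's outputs are
# identical to the reference model's -> reward is exactly 0.
logp_ref = -42.7
assert dpo_reward(0.1, logp_ref, logp_ref) == 0.0

# Phase 1: LoRA is active, log-probs diverge, rewards become non-zero.
assert dpo_reward(0.1, -40.1, logp_ref) != 0.0
```

So a flat reward_chosen/rejected of 0 in Phase 0 is a diagnostic that the policy equals the reference, not a training failure.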


6. Data Format Mapping

| Training Type | Raw Data Task | Required Fields |
|---|---|---|
| SFT | sft | text, or prompt + response |
| DPO | prompted_preference | prompt, chosen, rejected |
| ORPO | preference | chosen, rejected |
| RM | preference | chosen, rejected |
| PPO | prompt_only | prompt |
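To make the mapping concrete, here are illustrative JSONL-style records for each raw-data task. The field names come from the table above; everything else (the example values, the one-object-per-line layout) is an assumption to check against your data loader.

```python
import json

# Illustrative records per raw-data task; field names follow the mapping
# table, the values are made up.
records = {
    "sft": {"prompt": "What is LoRA?", "response": "LoRA trains low-rank adapters."},
    "prompted_preference": {            # DPO
        "prompt": "Summarize this ticket.",
        "chosen": "Concise, correct summary.",
        "rejected": "Rambling, partly wrong summary.",
    },
    "preference": {                     # ORPO / RM
        "chosen": "Good response.",
        "rejected": "Bad response.",
    },
    "prompt_only": {"prompt": "Explain PPO in one sentence."},  # PPO
}

# One JSON object per line is the usual .jsonl convention:
jsonl = "\n".join(json.dumps(r) for r in records.values())
assert all(json.loads(line) for line in jsonl.splitlines())
```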

7. Practical Tips

7.1 Give SFT Enough Time

If SFT is insufficient, subsequent preference learning will be less effective. The model must first learn "how to answer questions" before it can learn "which answers are better."

7.2 DPO vs ORPO Selection Criteria

| Criterion | DPO Preferred | ORPO Preferred |
|---|---|---|
| GPU memory | Ample | Limited |
| Reference model | Can manage it separately | Want to reduce management burden |
| Proven methodology | O (more papers/experiments) | Relatively newer method |
| Want to finish in one step | -- | O (unified SFT + preference) |

7.3 RM Quality Determines PPO

PPO uses RM scores as rewards. If the RM learns incorrect preferences, PPO will optimize in the wrong direction. Evaluate the RM with benchmarks first before feeding it to PPO.
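A minimal form of such a pre-check is pairwise accuracy on a held-out preference set: the fraction of pairs where the RM scores chosen above rejected. A sketch (the scoring loop that produces the pairs is assumed to exist elsewhere):

```python
def rm_pairwise_accuracy(scored_pairs):
    """Fraction of pairs where the RM scored chosen above rejected.

    scored_pairs: iterable of (reward_chosen, reward_rejected) floats,
    e.g. produced by running the RM checkpoint over held-out preference
    data. A well-trained RM should land well above 0.5 (chance level).
    """
    pairs = list(scored_pairs)
    hits = sum(1 for c, r in pairs if c > r)
    return hits / len(pairs)

pairs = [(1.2, -0.3), (0.8, 0.9), (2.1, 0.0), (0.5, 0.4)]
assert rm_pairwise_accuracy(pairs) == 0.75
```

If this number is near 0.5, fix the RM (or its data) before spending any compute on PPO.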

7.4 Checkpoint Chain (Auto-Detection)

Each stage's checkpoint becomes the next stage's input:

- SFT final/ → DPO's model_name
- SFT final/ → RM's model_name
- SFT final/ + RM final/ → PPO's model_name + reward_model.checkpoint_path

Auto-detection: When you specify an EulerForge checkpoint path as model_name, the original base model is automatically detected and loaded correctly. The process follows the sequence: base model -> LoRA injection -> adapter weight restoration, based on lora_info.json and resolved_config.json.

Automatic injection override: If the previous checkpoint's LoRA structure (lora_r, lora_alpha, target_keywords, start_layer, num_layers, etc.) differs from the current config, it is automatically overridden with the checkpoint's settings. For example, in an SFT (lora_r=48) -> DPO (lora_r=24) pipeline, the DPO stage will automatically use the SFT's lora_r=48. A warning log is printed.

# Specify SFT checkpoint directly as DPO model_name -- base model is auto-resolved
# Injection settings (lora_r, lora_alpha, etc.) are automatically taken from the SFT checkpoint
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_dpo.yml \
    --set model_name=outputs/sft/final

7.5 Data Quality > Data Quantity

For DPO/ORPO in particular, what matters most is pairs with a clear quality gap between chosen and rejected. If the gap is minimal, the model will struggle to learn the preference direction.

7.6 Practical Constraints on a Single GPU

| Model Size | SFT | DPO | PPO | Recommendation |
|---|---|---|---|---|
| ~0.8B (int4) | Possible | Possible | Possible | batch_size=2~4 |
| ~2B (int4) | Possible | batch_size=1 | Difficult | Increase grad_accum |
| ~4B+ | batch_size=1 | OOM risk | Not feasible | Switch to ORPO |

DPO requires 2 forward passes (policy + reference), so it uses roughly 2x the memory of SFT. For large models on a single GPU, ORPO (1 forward pass) is the practical alternative.


7.7 Recommended Data Volumes

LoRA (Hu et al., 2021) freezes the base model and trains only low-rank adapters, enabling rapid iteration experiments on small models and domain adaptation on large models even with limited GPU. QLoRA (Dettmers et al., 2023) demonstrated that even 65B models can be fine-tuned on a single 48GB GPU through 4-bit quantization.

LIMA (Zhou et al., 2023) showed alignment effectiveness with 1K high-quality examples on a 65B model, but that result is about formally surfacing capabilities a large model already has. Smaller models, or scenarios covering multi-turn/safety, require more data. Adjust data scale to model size and data quality; "more is always better" does not hold.

| Model Size | Recommended SFT Data | Recommended Preference Data (DPO/ORPO/RM) | Recommended PPO Prompt Pool | Recommended Starting Point |
|---|---|---|---|---|
| 0.8B | 3K~10K | 1K~3K | 500~2K | dense_lora + SFT -> ORPO |
| 2B | 8K~30K | 2K~8K | 1K~3K | dense_lora + SFT -> ORPO |
| 3B~4B | 15K~60K | 5K~20K | 2K~5K | dense_lora/mixture_lora + SFT -> ORPO |
| 7B~8B | 10K~80K | 8K~30K | 3K~8K | mixture_lora/moe_expert_lora + SFT -> ORPO -> RM |
| MoE type | 10K~50K per domain | 2K~10K per domain | 1K~3K per domain | moe_expert_lora + phased SFT -> ORPO |

7B~8B note: With high-quality curated data, 10K~50K is sufficient. 80K+ is only justified when including diverse multi-task/multilingual data.

Interpretation Principles

- For smaller models, format consistency, correctness, and refusal quality matter more than data volume.
- Larger models can achieve format alignment with small amounts of high-quality data, but in enterprise settings insufficient coverage is the bigger risk, so also ensure data diversity.
- Preference data can be smaller than SFT data, but the difference between chosen and rejected must be more distinct. Both DPO and ORPO work better when preference differences are clear.

7.8 Practical Guide by Model Size

0.8B

This range is well-suited for building models that follow formats reliably and concisely. Limit to tasks like classification, triage, field extraction, simple QA, and response drafting.

2B

The minimum practical threshold for enterprise assistants. Suitable for customer support, document summarization, classification, rule-based responses, and extraction+explanation combined tasks.

3B~4B

Capable of one level more complex tasks like document analysis, code assistance, and legal/patent/scientific QA. The best balance for most enterprise PoCs.

7B~8B

Suited for high-value PoCs like advanced domain copilots, long document processing, and scientific/patent/legal/code reasoning.

MoE type 4B~8B+

Useful when you want to house multiple specialist behaviors in a single model.

7.9 Injection Strategy Selection Guide

Injection strategy should be chosen based on whether the task structure is uniform or splits into multiple experts, rather than model size alone.

| Strategy | Best-Fit Scenario | Recommended Model Size | Default Recommended Path |
|---|---|---|---|
| dense_lora | Single domain, quick baseline, single task | 0.8B~4B | SFT -> ORPO |
| mixture_lora | 2~4 mixed subtasks, multi-department assistant | 2B~8B | SFT -> ORPO -> RM |
| moe_expert_lora | Multi-domain specialist, clear expert separation | 4B~8B+ | SFT -> ORPO -> RM -> PPO |
| native_moe_expert_lora | Advanced experiments leveraging an MoE backbone | 7B~8B+ | MoE full pipeline |

Recommended Defaults

- Always start first experiments with dense_lora.
- If a single model must handle classification/extraction/summarization/response drafting simultaneously, consider mixture_lora.
- If domain-specific data and evaluation sets are clearly separated, moe_expert_lora is more appropriate.
- Rather than forcing MoE onto a small model, first verify that the dense baseline reaches sufficient performance.

7.10 SFT Stage Guide

LIMA (Zhou et al., 2023) showed alignment effectiveness with 1K high-quality examples, but this reflected the Superficial Alignment Hypothesis: surfacing, in the right format, capabilities the 65B base model already had. Smaller models require more data. For practical enterprise tuning, 3K~10K+ of uniform, well-formatted SFT data is a safe range.

Recommended SFT Data Composition

- 50~70%: Core task-oriented data
- 20~30%: Edge cases / refusal / uncertainty handling
- 10~20%: Canonical examples for output format/tone/length control

SFT Stopping Criteria

- Check format compliance on 100~300 dev prompts alongside validation loss (don't rely on validation loss alone)
- If response length keeps growing but bench scores don't improve, suspect overfitting or style over-injection
- Move to the next stage once format errors have largely disappeared and edge-case handling is stable
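A concrete way to run the format-compliance check is a small scoring script over the dev-prompt generations. The regex and length cap below are placeholders for whatever output contract your own SFT data enforces:

```python
import re

def format_error_rate(responses, pattern=r"^Answer:\s.+", max_len=400):
    """Toy format check over dev-set generations: a response passes if it
    matches the expected output pattern and stays under a length cap.
    Both criteria are illustrative stand-ins for your real contract."""
    def ok(resp):
        return re.match(pattern, resp) is not None and len(resp) <= max_len
    bad = sum(1 for resp in responses if not ok(resp))
    return bad / len(responses)

outs = ["Answer: 42.", "Answer: use LoRA.", "I think maybe..."]
assert round(format_error_rate(outs), 2) == 0.33
```

Tracking this rate across checkpoints gives a stopping signal that validation loss alone does not.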

7.11 ORPO / DPO Stage Guide

DPO (Rafailov et al., 2023) directly optimizes preferences through comparison with a reference model, without a separate RM. ORPO (Hong et al., 2024) simultaneously optimizes SFT loss + odds-ratio-based preference penalty without a reference model, in a single stage.
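For reference, DPO's loss can be sketched in a few lines. Note how it consumes log-probs from both the policy and the frozen reference model, which is exactly where the second forward pass (and the extra memory) comes from. The β value here is illustrative:

```python
import math

def dpo_loss(beta, pol_chosen, pol_rejected, ref_chosen, ref_rejected):
    """DPO objective sketch (Rafailov et al., 2023): -log sigmoid of the
    beta-scaled margin between the policy's and the reference model's
    preference for the chosen response over the rejected one."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# The loss shrinks as the policy moves probability mass toward the chosen
# response relative to the reference:
before = dpo_loss(0.1, -30.0, -30.0, -30.0, -30.0)  # policy == reference
after  = dpo_loss(0.1, -25.0, -35.0, -30.0, -30.0)
assert after < before
assert abs(before - math.log(2.0)) < 1e-9           # sigmoid(0) -> -log 0.5
```

Compare this with ORPO, which drops the two `ref_*` terms (and the reference model with them) and instead keeps an SFT loss term alongside the preference penalty.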

Accurate understanding of ORPO: The key claim of the ORPO paper is performing SFT and preference learning simultaneously rather than in two stages. However, in EulerForge, we use a 2-stage approach where SFT is performed first to establish foundational capabilities, then ORPO adds preference learning. This is a practical choice for stability with small-scale data.

DPO vs ORPO default choice: DPO has undergone more research/validation/community tooling (TRL, NeMo-Aligner, etc.). The reason we recommend ORPO as the default in EulerForge is a practical choice from the perspective of single GPU resource constraints, not a claim of quality superiority. If you have GPU headroom, trying DPO first is also reasonable.

Recommended Selection Criteria

- ORPO preferred: single GPU / large model / MoE family / want to reduce reference-model management burden
- DPO preferred: ample GPU memory / very clear chosen/rejected differences / validation comparison experiments

Conditions for Good Preference Pairs

- Clear difference in correctness
- Different in whether evidence is provided
- Different in length control and uncertainty expression
- Exclude pairs that differ only in typos or punctuation where possible
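The last point can be automated with a cheap similarity filter before training. A sketch using the standard library's difflib (the 0.15 gap threshold is a starting-point assumption to tune on your own data):

```python
from difflib import SequenceMatcher

def keep_pair(chosen, rejected, min_gap=0.15):
    """Drop preference pairs whose texts are nearly identical: a
    chosen/rejected pair that differs only in punctuation or a typo
    carries almost no preference signal for DPO/ORPO."""
    similarity = SequenceMatcher(None, chosen, rejected).ratio()
    return (1.0 - similarity) >= min_gap

assert not keep_pair("The answer is 42.", "The answer is 42")  # punctuation only
assert keep_pair("The answer is 42, because 6 x 7 = 42.",
                 "Probably something around fifty?")
```

For large datasets a token-level or embedding-based distance may separate pairs better, but even this character-level filter removes the most degenerate cases.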

Recommended Practical Volumes

- 0.8B~2B: 1K~8K pairs
- 3B~4B: 5K~20K pairs
- 7B~8B+: 8K~30K pairs
- MoE: 2K~10K pairs per domain

7.12 RM Stage Guide

InstructGPT (Ouyang et al., 2022) established the 3-stage RLHF pipeline of training an RM on human preference data and connecting it to PPO.

When to Use RM

- When you need to optimize length, tone, evidence quality, safety, and format in addition to correctness
- When ORPO/DPO alone provides insufficient style control
- When you actually plan to apply PPO

When Not to Use RM

- SFT quality is still unstable
- Preference data quality is low
- No plans to proceed to PPO

Recommended RM Volume: Minimum 5K~20K pairs, stratified split per task, include hard negatives

RM Checkpoint Evaluation: Check reward score distribution, chosen > rejected accuracy, per-task bias, and whether reward hacking occurs (e.g., rewarding length only)
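One quick probe for the length-only failure mode is the correlation between RM score and response length over an evaluation set. A dependency-free sketch:

```python
def length_reward_correlation(rewards, lengths):
    """Pearson correlation between RM score and response length -- a quick
    probe for the length-only reward hacking mentioned above. Values near
    +1 suggest the RM is mostly scoring verbosity, not quality."""
    n = len(rewards)
    mean_r, mean_l = sum(rewards) / n, sum(lengths) / n
    cov = sum((r - mean_r) * (l - mean_l) for r, l in zip(rewards, lengths))
    std_r = sum((r - mean_r) ** 2 for r in rewards) ** 0.5
    std_l = sum((l - mean_l) ** 2 for l in lengths) ** 0.5
    return cov / (std_r * std_l)

# A suspicious RM: reward tracks length almost perfectly.
assert length_reward_correlation([0.1, 0.5, 0.9], [50, 250, 450]) > 0.99
```

A high correlation is not proof of reward hacking on its own (good answers can legitimately be longer), but it flags checkpoints worth inspecting before PPO.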

7.13 PPO Stage Guide

PPO is the core RLHF component from InstructGPT, but as the DPO paper points out, it carries high implementation and tuning complexity. Approach it as a final fine-tuning stage, not the default path.

Conditions for Applying PPO

- SFT + ORPO/DPO results are already sufficiently stable
- The RM's standalone evaluation is meaningful
- The optimization goal is clear (e.g., short evidence-based answers, reduce over-confidence, strengthen safe refusal)

When to Avoid PPO

- Many format errors remain
- RM quality is unverified
- Insufficient memory/time on a single GPU

Recommended Starting Point: Skip for small models when possible, short experiments only for 3B~4B, limited application for 7B~8B/MoE only after RM verification
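Mechanically, the InstructGPT-style PPO reward is the RM score minus a KL penalty that keeps the policy near the reference (SFT) model. A per-sequence sketch (`kl_coef` is illustrative, not an EulerForge default):

```python
def shaped_reward(rm_score, logp_policy, logp_ref, kl_coef=0.05):
    """InstructGPT-style PPO reward sketch: the RM score for a generated
    response, minus a KL penalty estimated from per-token log-prob gaps
    between the policy and the frozen reference model."""
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return rm_score - kl_coef * kl_estimate

# Drifting far from the reference eats into the reward:
close = shaped_reward(1.0, [-1.0, -2.0], [-1.1, -2.1])
far   = shaped_reward(1.0, [-0.2, -0.5], [-3.0, -4.0])
assert far < close
```

This is why three models appear during PPO (policy, RM, reference): the RM supplies the score, and the reference anchors the KL term that prevents reward hacking from drifting the policy too far from its SFT behavior.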

7.14 Recommended Paths by Scenario

| Scenario | Recommended Path |
|---|---|
| 24GB single GPU, 0.8B~2B | dense_lora + SFT -> ORPO |
| 48GB single GPU, 3B~4B | dense_lora/mixture_lora + SFT -> ORPO |
| 80GB single GPU, 7B~8B | mixture_lora/moe_expert_lora + SFT -> ORPO -> RM |
| 80GB+, research experiments | SFT -> ORPO -> RM -> PPO |
| Multi-domain specialist | MoE full pipeline, per-expert preference alignment |

7.15 Recommended Experiment Order

1. SFT baseline -- check format compliance, verbosity, hallucination rate
2. ORPO initial alignment -- default choice for single GPU/large model/MoE
3. DPO comparison experiment -- only when GPU memory headroom is available
4. RM -- only when multi-criteria quality scoring is needed
5. PPO -- only as the final stage, after the RM is verified

7.16 Additional Experiment Checklist

7.17 References

| # | Paper | Key Contribution |
|---|---|---|
| 1 | Hong et al. (2024), ORPO: Monolithic Preference Optimization without Reference Model | Reference-free, simultaneous SFT + preference optimization |
| 2 | Dettmers et al. (2023), QLoRA: Efficient Finetuning of Quantized LLMs | 65B single-GPU fine-tuning via 4-bit quantization |
| 3 | Hu et al. (2021), LoRA: Low-Rank Adaptation of Large Language Models | Efficient fine-tuning via low-rank adapters |
| 4 | Rafailov et al. (2023), Direct Preference Optimization: Your Language Model is Secretly a Reward Model | Direct preference optimization without an RM, simplifying RLHF |
| 5 | Zhou et al. (2023), LIMA: Less Is More for Alignment | Alignment effectiveness of 1K high-quality examples (Superficial Alignment Hypothesis) |
| 6 | Ouyang et al. (2022), Training language models to follow instructions with human feedback (InstructGPT) | Established the SFT -> RM -> PPO 3-stage RLHF pipeline |

8. Hands-On Exercises

The following exercises let you experience a complete pipeline from start to finish.

| Exercise | Difficulty | Goal | Document |
|---|---|---|---|
| Math/Coding Enhancement | Intermediate | SFT -> DPO with GSM8K/code data | 20_lab_math_coding.md |
| Thinking Model | Advanced | Model that outputs a <think> reasoning process | 21_lab_thinking_model.md |
| Korean Chat Quality | Beginner/Intermediate | SFT -> ORPO -> Bench with your own data | 22_lab_korean_finance_copilot.md |
| Full Pipeline MoE | Expert | 4-domain MoE + SFT -> DPO -> RM -> PPO full pipeline | 23_lab_full_pipeline_moe.md |