18. Training Pipeline (SFT → PPO)
This document explains the complete training sequence for LLM fine-tuning. It covers the order and rationale for each training type (SFT -> DPO/ORPO -> RM -> PPO), and provides concrete guidance on the paths you can combine in practice.
1. The Big Picture
┌───────────────────────────────────────────────────┐
│ Foundational Capabilities │
│ │
│ pretrain (optional) → SFT (required start) │
│ ──────────────── ───────────────── │
│ Language modeling Instruction-following │
│ with raw text capability │
└─────────────────────┬─────────────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ DPO │ │ ORPO │ │ RM │
│ │ │ │ │ │
│ Direct │ │ SFT+pref │ │ Reward │
│ pref opt │ │ unified │ │ model │
└────┬─────┘ └──────────┘ └────┬─────┘
│ │
│ Final model │
│◄─────────────────────────────┘
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│Deploy/ │ │ PPO │
│Bench │ │ (RLHF) │
└──────────┘ └────┬─────┘
│
┌──────────┐
│Deploy/ │
│Bench │
└──────────┘
2. Role of Each Training Type
| Training Type | Input Data | Learning Objective | Analogy |
|---|---|---|---|
| SFT | prompt + response pairs | Learn the pattern "answer like this for this kind of question" | Studying from a textbook |
| DPO | prompt + chosen/rejected pairs | Increase preference probability for good responses (relative to reference model) | Teacher says "this one is better" |
| ORPO | prompt + chosen/rejected pairs | SFT + preference learning in one pass (no reference needed) | Studying while getting feedback simultaneously |
| RM | prompt + chosen/rejected pairs | Predict response quality score (scalar reward) | Training a judge |
| PPO | prompt + RM scores | Learn a generation strategy that earns high rewards (3 models: policy + RM + reference) | Practicing repeatedly with judge feedback |
DPO vs ORPO: Preference Learning Selection Guide
DPO and ORPO are both preference learning methods but are structurally different:
| Item | DPO | ORPO |
|---|---|---|
| Reference model | Required (reference obtained by disabling the LoRA adapter) | Not required |
| Forward passes | 2 (policy + reference) | 1 |
| GPU memory | High (2x forward) | Low (1x forward) |
| Loss structure | Preference loss only | SFT + preference combined |
| Handoff compatibility | Incompatible (LoRA frozen -> loss fixed) | Compatible |
| MoE Phase 0 | reward=0 (normal, when LoRA frozen) | sft_loss remains effective |
Recommendation: For single GPU + MoE combinations, ORPO is the practical alternative. Choose DPO when you have sufficient GPU memory and are not using Handoff.
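The structural difference is easiest to see in the loss computations. Below is a minimal sketch, not EulerForge's implementation: the per-sequence log-probabilities and the `beta`/`lam` hyperparameters are illustrative, and ORPO's odds-ratio term is shown in simplified per-sequence form.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: needs sequence log-probs from BOTH the policy and a frozen reference."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log(1 + math.exp(-margin))  # equals -log(sigmoid(margin))

def orpo_loss(pi_chosen, pi_rejected, sft_nll_chosen, lam=0.1, length=1):
    """ORPO: one policy forward pass; SFT NLL plus an odds-ratio preference penalty."""
    def log_odds(logp):
        p = math.exp(logp / length)  # mean per-token probability of the sequence
        return math.log(p) - math.log(1 - p)
    ratio = log_odds(pi_chosen) - log_odds(pi_rejected)
    return sft_nll_chosen + lam * math.log(1 + math.exp(-ratio))
```

Note that `dpo_loss` takes four log-probabilities (two forward passes), while `orpo_loss` needs only the policy's outputs plus the SFT loss it is already computing; that is the source of the memory difference in the table above.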
3. Recommended Training Order
Path A: SFT -> DPO (Most Common)
SFT (5,000 steps) → DPO (1,500 steps) → Deploy
| Step | Purpose | Advantage |
|---|---|---|
| SFT | Instruction-following foundation | Model learns how to answer questions |
| DPO | Preference alignment | Prefers good responses relative to reference model (SFT). No separate RM needed |
Advantages: The simplest 2-stage pipeline. Direct preference optimization without RM training.
# Step 1: SFT
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
--set data.format=raw --set data.task=sft \
--set data.path=data/sft_10k_raw.jsonl \
--output-dir ./outputs/pipeline/dense_lora_sft
# Step 2: DPO (based on SFT checkpoint)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_dpo.yml \
--set model_name=outputs/run_YYYYMMDD_HHMMSS/final \
--set data.format=raw --set data.task=prompted_preference \
--set data.path=data/dpo_10k_raw.jsonl
Path B: SFT -> ORPO (Practical Alternative to DPO -- Recommended)
SFT (5,000 steps) → ORPO (2,000 steps) → Deploy
| Step | Purpose | Advantage |
|---|---|---|
| SFT | Instruction-following foundation | |
| ORPO | Combined SFT + preference | No reference model needed. More memory-efficient than DPO |
Advantages: More memory-efficient than DPO -- 1 forward pass instead of 2 (policy + reference). ORPO's odds-ratio approach enables preference learning without a reference model.
# Step 1: SFT (same)
# Step 2: ORPO
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml \
--set model_name=outputs/run_YYYYMMDD_HHMMSS/final \
--set data.format=raw --set data.task=preference \
--set data.path=data/dpo_10k_raw.jsonl
Warning: Applying ORPO/DPO directly to a base model without SFT can degrade performance. Base models lack instruction-following capability, so preference learning alone cannot establish foundational abilities. For small-scale data (under 10K), always use the SFT -> ORPO/DPO two-stage approach. Consider standalone ORPO only when you have 50K+ high-quality data. Detailed analysis: preference_training_analysis.md
Path C: SFT -> RM -> PPO (Full RLHF)
SFT (5,000 steps) → RM (2,000 steps) → PPO (1,000 steps) → Deploy
| Step | Purpose | Advantage |
|---|---|---|
| SFT | Instruction-following foundation | |
| RM | Reward model training | A trained judge that learns "what makes a good response" |
| PPO | Reward-maximizing policy | Learns a response generation strategy that earns high scores from the RM |
Advantages: The most sophisticated alignment. When the RM learns complex preference criteria, PPO leverages it to generate responses that are both creative and aligned.
Disadvantages: Complex 3-stage process. RM quality directly impacts PPO results. Can be unstable.
# Step 1: SFT
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
--set data.format=raw --set data.task=sft \
--set data.path=data/sft_10k_raw.jsonl
# Step 2: RM (based on SFT checkpoint)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_rm.yml \
--set model_name=outputs/sft_run/final \
--set data.format=raw --set data.task=preference \
--set data.path=data/dpo_10k_raw.jsonl
# Step 3: PPO (SFT policy + RM reward)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml \
--set model_name=outputs/sft_run/final \
--set training.reward_model.checkpoint_path=outputs/rm_run/final
Path D: SFT -> DPO -> RM -> PPO (Initial Alignment with DPO, Fine-Tuning with RM+PPO)
SFT (5,000 steps) → DPO (1,500 steps) → RM (2,000 steps) → PPO (500 steps) → Deploy
Advantages: Rough preference alignment with DPO, then refined fine-tuning with RM+PPO. Combines the strengths of both approaches.
Path E: SFT -> ORPO -> RM -> PPO (ORPO-Based Full RLHF -- Recommended over DPO)
SFT (5,000 steps) → ORPO (2,000 steps) → RM (2,000 steps) → PPO (1,000 steps) → Deploy
| Step | Purpose | Advantage over DPO Path |
|---|---|---|
| SFT | Foundation | Same |
| ORPO (replaces DPO) | Combined SFT + preference | 50% memory savings, Handoff compatible, single forward pass |
| RM | Reward model | Same |
| PPO | RLHF policy | Same |
Why this path is recommended:
- DPO's 2x forward pass risks OOM on a single GPU; ORPO solves this with a single forward pass
- ORPO works correctly even with MoE + Handoff combinations (no reference model needed)
- ORPO's SFT loss maintains foundational capabilities while simultaneously learning preferences
# Step 1: SFT
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
--set data.format=raw --set data.task=sft \
--set data.path=data/sft_10k_raw.jsonl
# Step 2: ORPO (replaces DPO)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml \
--set model_name=outputs/sft_run/final \
--set data.format=raw --set data.task=preference \
--set data.path=data/dpo_10k_raw.jsonl
# Step 3: RM
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_rm.yml \
--set model_name=outputs/sft_run/final \
--set data.format=raw --set data.task=preference \
--set data.path=data/dpo_10k_raw.jsonl
# Step 4: PPO
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml \
--set model_name=outputs/orpo_run/final \
--set training.reward_model.checkpoint_path=outputs/rm_run/final
4. Path Comparison Table
| Path | Stages | GPU Memory | Complexity | Handoff Compatible | Recommended Scenario |
|---|---|---|---|---|---|
| A: SFT->DPO | 2 | Medium (2x fwd) | Low | X | GPU headroom + no Handoff |
| B: SFT->ORPO | 2 | Low (1x fwd) | Low | O | Single GPU recommended, MoE+Handoff compatible |
| C: SFT->RM->PPO | 3 | High | High | O | When sophisticated alignment is needed |
| D: SFT->DPO->RM->PPO | 4 | High | Highest | X | Research/experimentation |
| E: SFT->ORPO->RM->PPO | 4 | Medium | High | O | Full RLHF + memory efficiency -- recommended over DPO |
5. Combining with Injection Strategies
All training types can be freely combined with all 4 injection strategies:
| Injection Strategy | SFT | DPO | ORPO | RM | PPO |
|---|---|---|---|---|---|
| dense_lora | O | O | O | O | O |
| mixture_lora | O | O | O | O | O |
| moe_expert_lora | O | O | O | O | O |
| native_moe_expert_lora | O | O | O | O | O |
Note on combining MoE strategies (mixture_lora, moe_expert_lora) with DPO:
In Phase 0, when only ["router"] is trained, DPO's reward_chosen/rejected will show as 0. This is normal -- when LoRA is frozen, policy and reference produce identical outputs. Normal DPO training begins in Phase 1 when LoRA is activated.
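This behavior follows directly from DPO's implicit reward, `beta * (log p_policy - log p_reference)`. A tiny illustration with hypothetical numbers (not EulerForge internals):

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    # DPO's implicit reward for a response under the policy vs. the reference
    return beta * (policy_logp - ref_logp)

# Phase 0: LoRA frozen -> policy and reference produce identical log-probs
print(implicit_reward(-12.3, -12.3))   # 0.0, so reward_chosen/rejected log as 0

# Phase 1: LoRA active -> the policy drifts from the reference, rewards become nonzero
print(implicit_reward(-11.8, -12.3))   # small positive value
```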
Details: 05_dpo_training.md section 10
6. Data Format Mapping
| Training Type | Raw Data Task | Required Fields |
|---|---|---|
| SFT | sft | text or prompt+response |
| DPO | prompted_preference | prompt, chosen, rejected |
| ORPO | preference | chosen, rejected |
| RM | preference | chosen, rejected |
| PPO | prompt_only | prompt |
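For concreteness, here is one hypothetical record per raw-data task. The field names come from the table above; the content is made up for illustration, and each record would be one line of a `.jsonl` file.

```python
import json

# One hypothetical record per raw-data task (content is invented for illustration)
records = {
    "sft":                 {"prompt": "What is LoRA?", "response": "LoRA adds low-rank adapters to frozen weights."},
    "prompted_preference": {"prompt": "Summarize this ticket.", "chosen": "Concise, factual summary.", "rejected": "Rambling, off-topic reply."},
    "preference":          {"chosen": "Grounded answer with evidence.", "rejected": "Confident but unsupported answer."},
    "prompt_only":         {"prompt": "Draft a polite refusal for an out-of-scope request."},
}

for task, rec in records.items():
    print(task, json.dumps(rec, ensure_ascii=False))  # one JSON object per line -> .jsonl
```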
7. Practical Tips
7.1 Give SFT Enough Time
If SFT is insufficient, subsequent preference learning will be less effective. The model must first learn "how to answer questions" before it can learn "which answers are better."
7.2 DPO vs ORPO Selection Criteria
| Criterion | DPO Preferred | ORPO Preferred |
|---|---|---|
| GPU memory | Ample | Limited |
| Reference model | Can manage separately | Want to reduce management burden |
| Proven methodology | O (more papers/experiments) | Relatively newer method |
| Want to finish in one step | -- | O (unified SFT + preference) |
7.3 RM Quality Determines PPO
PPO uses RM scores as rewards. If the RM learns incorrect preferences, PPO will optimize in the wrong direction. Evaluate the RM with benchmarks first before feeding it to PPO.
7.4 Checkpoint Chain (Auto-Detection)
Each stage's checkpoint becomes the next stage's input:
SFT final/ → DPO's model_name
SFT final/ → RM's model_name
SFT final/ + RM final/ → PPO's model_name + reward_model.checkpoint_path
Auto-detection: When you specify an EulerForge checkpoint path as model_name, the original base model is automatically detected and loaded correctly. The process follows the sequence: base model -> LoRA injection -> adapter weight restoration, based on lora_info.json and resolved_config.json.
Automatic injection override: If the previous checkpoint's LoRA structure (lora_r, lora_alpha, target_keywords, start_layer, num_layers, etc.) differs from the current config, it is automatically overridden with the checkpoint's settings. For example, in an SFT (lora_r=48) -> DPO (lora_r=24) pipeline, the DPO stage will automatically use the SFT's lora_r=48. A warning log is printed.
# Specify SFT checkpoint directly as DPO model_name -- base model is auto-resolved
# Injection settings (lora_r, lora_alpha, etc.) are automatically taken from the SFT checkpoint
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_dpo.yml \
--set model_name=outputs/sft/final
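A conceptual sketch of the override logic described above; the real implementation lives in EulerForge, and the `lora_info.json` schema here is assumed from the field names mentioned, not taken from the source.

```python
import json
import pathlib

def resolve_injection(checkpoint_dir, current_cfg):
    """If the checkpoint carries LoRA settings, they win over the current config."""
    info = json.loads((pathlib.Path(checkpoint_dir) / "lora_info.json").read_text())
    merged = dict(current_cfg)
    for key in ("lora_r", "lora_alpha", "target_keywords", "start_layer", "num_layers"):
        if key in info and info[key] != current_cfg.get(key):
            # Mirror the warning-log behavior: config value is overridden by checkpoint
            print(f"warning: {key} overridden by checkpoint: "
                  f"{current_cfg.get(key)} -> {info[key]}")
            merged[key] = info[key]
    return merged
```

For example, with an SFT checkpoint saved at `lora_r=48` and a DPO config requesting `lora_r=24`, this logic would keep 48 and print a warning, matching the pipeline behavior described above.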
7.5 Data Quality > Data Quantity
Especially for DPO/ORPO, what matters most is preference pairs with a clear quality gap between chosen and rejected. If the gap is minimal, the model will struggle to learn the preference direction.
7.6 Practical Constraints on a Single GPU
| Model Size | SFT | DPO | PPO | Recommendation |
|---|---|---|---|---|
| ~0.8B (int4) | Possible | Possible | Possible | batch_size=2~4 |
| ~2B (int4) | Possible | batch_size=1 | Difficult | Increase grad_accum |
| ~4B+ | batch_size=1 | OOM risk | Not feasible | Recommend switching to ORPO |
DPO requires 2 forward passes (policy + reference), so it uses roughly 2x the memory of SFT. For large models on a single GPU, ORPO (1 forward pass) is the practical alternative.
7.7 Recommended Data Scale by Model Size
LoRA (Hu et al., 2021) freezes the base model and trains only low-rank adapters, enabling rapid iteration experiments on small models and domain adaptation on large models even with limited GPU. QLoRA (Dettmers et al., 2023) demonstrated that even 65B models can be fine-tuned on a single 48GB GPU through 4-bit quantization.
LIMA (Zhou et al., 2023) showed alignment effectiveness with 1K high-quality data on a 65B model, but this was about "surfacing existing capabilities formally" in a large model. Smaller models or scenarios covering multi-turn/safety require more data. Data scale should be adjusted to match model size and data quality, not simply "more is always better."
| Model Size | Recommended SFT Data | Recommended Preference Data (DPO/ORPO/RM) | Recommended PPO Prompt Pool | Recommended Starting Point |
|---|---|---|---|---|
| 0.8B | 3K~10K | 1K~3K | 500~2K | dense_lora + SFT -> ORPO |
| 2B | 8K~30K | 2K~8K | 1K~3K | dense_lora + SFT -> ORPO |
| 3B~4B | 15K~60K | 5K~20K | 2K~5K | dense_lora/mixture_lora + SFT -> ORPO |
| 7B~8B | 10K~80K | 8K~30K | 3K~8K | mixture_lora/moe_expert_lora + SFT -> ORPO -> RM |
| MoE type | 10K~50K per domain | 2K~10K per domain | 1K~3K per domain | moe_expert_lora + phased SFT -> ORPO |
7B~8B note: With high-quality curated data, 10K~50K is sufficient. 80K+ is only justified when including diverse multi-task/multilingual data.
Interpretation Principles
- For smaller models, format consistency, correctness, and refusal quality matter more than data volume.
- Larger models can achieve format alignment with small amounts of high-quality data, but in enterprise settings, insufficient coverage is a bigger risk, so data diversity should also be ensured.
- Preference data can be less than SFT data, but the difference between chosen and rejected must be more distinct. Both DPO and ORPO work better when preference differences are clear.
7.8 Practical Guide by Model Size
0.8B
This range is well-suited for building models that follow formats reliably and concisely. Limit to tasks like classification, triage, field extraction, simple QA, and response drafting.
- Injection: dense_lora
- SFT data: 3K~10K / Preference data: 1K~3K
- Recommended path: SFT -> ORPO (skip RM/PPO unless there is a specific reason)
- Smaller models are more vulnerable to noise and style conflicts than they benefit from more data.
- Start your first experiment with around 3K high-quality SFT data, and expand to 6K~10K only if format compliance is low.
2B
The minimum practical threshold for enterprise assistants. Suitable for customer support, document summarization, classification, rule-based responses, and extraction+explanation combined tasks.
- Injection: dense_lora (mixture_lora when needed)
- SFT data: 8K~30K / Preference data: 2K~8K
- Recommended path: SFT -> ORPO (DPO for comparison experiments when GPU headroom is available)
- Consider mixture_lora when subtasks split into 2~3 categories (e.g., classification/summarization/recommendation drafts)
3B~4B
Capable of one level more complex tasks like document analysis, code assistance, and legal/patent/scientific QA. The best balance for most enterprise PoCs.
- Injection: dense_lora or mixture_lora
- SFT data: 15K~60K / Preference data: 5K~20K
- Recommended path: SFT -> ORPO, adding -> RM when needed
- Start first experiments with 15K~25K SFT and 5K preference data, then expand after reviewing bench failure types
- DPO is possible, but ORPO is more realistic on a single GPU because of DPO's additional reference forward pass
7B~8B
Suited for high-value PoCs like advanced domain copilots, long document processing, and scientific/patent/legal/code reasoning.
- Injection: mixture_lora or moe_expert_lora
- SFT data: 10K~80K / Preference data: 8K~30K
- Recommended path: SFT -> ORPO -> RM, applying PPO only after RM performance is verified
- Better to split experts by task/domain than to pack everything into a single LoRA
- Use PPO only for the final 10~20% of tuning
MoE type 4B~8B+
Useful when you want to house multiple specialist behaviors in a single model.
- Injection: moe_expert_lora or native_moe_expert_lora
- SFT data: 10K~50K per domain / Preference data: 2K~10K per domain
- Recommended path: SFT -> ORPO -> RM, adding -> PPO when needed
- MoE benefits diminish when domain boundaries are blurry
- If router training is unstable, check routing quality before attempting PPO
7.9 Injection Strategy Selection Guide
Injection strategy should be chosen based on whether the task structure is uniform or splits into multiple experts, rather than model size alone.
| Strategy | Best-Fit Scenario | Recommended Model Size | Default Recommended Path |
|---|---|---|---|
| dense_lora | Single domain, quick baseline, single task | 0.8B~4B | SFT -> ORPO |
| mixture_lora | 2~4 mixed subtasks, multi-department assistant | 2B~8B | SFT -> ORPO -> RM |
| moe_expert_lora | Multi-domain specialist, clear expert separation | 4B~8B+ | SFT -> ORPO -> RM -> PPO |
| native_moe_expert_lora | Advanced experiments leveraging MoE backbone | 7B~8B+ | MoE full pipeline |
Recommended Defaults
- Always start first experiments with dense_lora.
- If a single model must handle classification/extraction/summarization/response drafting simultaneously, consider mixture_lora.
- If domain-specific data and evaluation sets are clearly separated, moe_expert_lora is more appropriate.
- Rather than forcing MoE onto a small model, first verify that the dense baseline reaches sufficient performance.
7.10 SFT Stage Guide
LIMA (Zhou et al., 2023) showed alignment effectiveness with 1K high-quality data, but this was "superficial formal surfacing of existing capabilities (Superficial Alignment Hypothesis)" on a 65B model; smaller models require more data. For practical enterprise tuning, 3K~10K+ of uniform and well-formatted SFT data is safe.
Recommended SFT Data Composition
- 50~70%: Core task-oriented data
- 20~30%: Edge cases / refusal / uncertainty handling
- 10~20%: Canonical examples for output format/tone/length control
SFT Stopping Criteria
- Check format compliance on 100~300 dev prompts alongside validation loss (don't rely on validation loss alone)
- If response length keeps growing but bench scores don't improve, suspect overfitting/style over-injection
- Move to the next stage once format errors have largely disappeared and edge case handling is stable
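The format-compliance check can be as simple as a regex pass over dev-set generations. A minimal sketch; the `<answer>` tag format here is a made-up example, not an EulerForge convention.

```python
import re

def format_compliance(responses, pattern=r"^<answer>.*</answer>$"):
    """Fraction of responses matching the required output format."""
    ok = sum(1 for r in responses if re.match(pattern, r.strip(), re.DOTALL))
    return ok / len(responses)

responses = [
    "<answer>42</answer>",
    "Sure! The answer is 42.",            # format violation
    "<answer>I am not certain.</answer>",
]
print(format_compliance(responses))  # 2 of 3 responses comply
```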
7.11 ORPO / DPO Stage Guide
DPO (Rafailov et al., 2023) directly optimizes preferences through comparison with a reference model, without a separate RM. ORPO (Hong et al., 2024) simultaneously optimizes SFT loss + odds-ratio-based preference penalty without a reference model, in a single stage.
Accurate understanding of ORPO: The key claim of the ORPO paper is performing SFT and preference learning simultaneously rather than in two stages. However, in EulerForge, we use a 2-stage approach where SFT is performed first to establish foundational capabilities, then ORPO adds preference learning. This is a practical choice for stability with small-scale data.
DPO vs ORPO default choice: DPO has undergone more research/validation/community tooling (TRL, NeMo-Aligner, etc.). The reason we recommend ORPO as the default in EulerForge is a practical choice from the perspective of single GPU resource constraints, not a claim of quality superiority. If you have GPU headroom, trying DPO first is also reasonable.
Recommended Selection Criteria
- ORPO preferred: Single GPU / large model / MoE family / want to reduce reference management burden
- DPO preferred: Ample GPU memory / very clear chosen/rejected differences / validation comparison experiments
Conditions for Good Preference Pairs
- Clear difference in correctness
- Difference in whether evidence is provided
- Difference in length control and uncertainty expression
- Exclude pairs that differ only in typos or punctuation where possible
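The last condition, excluding pairs that differ only in typos or punctuation, can be enforced mechanically before training. A small heuristic sketch (illustrative, not an EulerForge utility):

```python
import re

def normalize(text):
    # Strip punctuation/whitespace and lowercase for a superficial-difference check
    return re.sub(r"[^\w]+", "", text).lower()

def filter_weak_pairs(pairs):
    """Keep only pairs where chosen and rejected differ beyond typos/punctuation."""
    return [p for p in pairs if normalize(p["chosen"]) != normalize(p["rejected"])]

pairs = [
    {"chosen": "The capital is Paris.", "rejected": "the capital is paris"},  # weak pair
    {"chosen": "Paris, per the 2024 atlas.", "rejected": "I think maybe London?"},
]
print(len(filter_weak_pairs(pairs)))  # 1 pair survives the filter
```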
Recommended Practical Volumes
- 0.8B~2B: 1K~8K pairs
- 3B~4B: 5K~20K pairs
- 7B~8B+: 8K~30K pairs
- MoE: 2K~10K pairs per domain
7.12 RM Stage Guide
InstructGPT (Ouyang et al., 2022) established the 3-stage RLHF pipeline of training an RM on human preference data and connecting it to PPO.
When to Use RM
- When you need to optimize length, tone, evidence quality, safety, and format in addition to correctness
- When ORPO/DPO alone provides insufficient style control
- When you actually plan to apply PPO

When Not to Use RM
- SFT quality is still unstable
- Preference data quality is low
- No plans to proceed to PPO
Recommended RM Volume: 5K~20K pairs at minimum, with a stratified split per task and hard negatives included
RM Checkpoint Evaluation: Check reward score distribution, chosen > rejected accuracy, per-task bias, and whether reward hacking occurs (e.g., rewarding length only)
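The chosen > rejected accuracy and the length-bias check can both be computed from a held-out preference set. A sketch with hypothetical scores; it assumes you can obtain a scalar reward per response from the RM checkpoint.

```python
def rm_eval(scored_pairs):
    """scored_pairs: (reward_chosen, reward_rejected, len_chosen, len_rejected) tuples."""
    n = len(scored_pairs)
    acc = sum(rc > rr for rc, rr, _, _ in scored_pairs) / n
    # Reward-hacking smell: does the higher reward simply track the longer response?
    length_agreement = sum((rc > rr) == (lc > lr) for rc, rr, lc, lr in scored_pairs) / n
    return acc, length_agreement

pairs = [
    (0.8, 0.2, 120, 300),  # chosen wins despite being shorter -> healthy signal
    (0.7, 0.4, 400, 100),
    (0.3, 0.6, 100, 250),  # RM ranks the rejected response higher -> an error
]
acc, length_agreement = rm_eval(pairs)
print(acc, length_agreement)  # accuracy and how often reward order matches length order
```

A length_agreement near 1.0 with modest accuracy would suggest the RM is rewarding length rather than quality, the failure mode flagged above.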
7.13 PPO Stage Guide
PPO is the core RLHF component from InstructGPT, but as DPO points out, it has high implementation/tuning complexity. Approach it as a final fine-tuning stage, not the default path.
Conditions for Applying PPO
- SFT + ORPO/DPO results are already sufficiently stable
- RM standalone evaluation is meaningful
- The optimization goal is clear (e.g., short evidence-based answers, reduce over-confidence, strengthen safe refusal)
When to Avoid PPO: many format errors remain / RM quality is unverified / insufficient memory or time on a single GPU
Recommended Starting Point: skip for small models when possible; run short experiments only for 3B~4B; apply to 7B~8B/MoE in a limited way, and only after RM verification
7.14 Practical Recommended Path Summary
| Scenario | Recommended Path |
|---|---|
| 24GB single GPU, 0.8B~2B | dense_lora + SFT -> ORPO |
| 48GB single GPU, 3B~4B | dense_lora/mixture_lora + SFT -> ORPO |
| 80GB single GPU, 7B~8B | mixture_lora/moe_expert_lora + SFT -> ORPO -> RM |
| 80GB+, research experiments | SFT -> ORPO -> RM -> PPO |
| Multi-domain specialist | MoE full pipeline, per-expert preference alignment |
7.15 Recommended Experiment Order
- SFT baseline -- Check format compliance, verbosity, hallucination rate
- ORPO initial alignment -- Default choice for single GPU/large model/MoE
- DPO comparison experiment -- Only when GPU memory headroom is available
- RM -- Only when multi-criteria quality scoring is needed
- PPO -- Only as the final stage after RM is verified
7.16 Additional Experiment Checklist
- Compare ORPO vs DPO from the same SFT checkpoint
- Compare dense_lora vs mixture_lora with the same data
- SFT data 25% / 50% / 100% ablation
- Before/after removing weak negatives from preference data
- Halt PPO if RM standalone quality is low
- For MoE families, separately measure router confusion and per-expert bench scores
7.17 References
| # | Paper | Key Contribution |
|---|---|---|
| 1 | Hong et al. (2024), ORPO: Monolithic Preference Optimization without Reference Model | Reference-free, simultaneous SFT + preference optimization |
| 2 | Dettmers et al. (2023), QLoRA: Efficient Finetuning of Quantized Language Models | 65B single-GPU fine-tuning via 4-bit quantization |
| 3 | Hu et al. (2021), LoRA: Low-Rank Adaptation of Large Language Models | Efficient fine-tuning via low-rank adapters |
| 4 | Rafailov et al. (2023), Direct Preference Optimization | Direct preference optimization without RM, simplifying RLHF |
| 5 | Zhou et al. (2023), LIMA: Less Is More for Alignment | Alignment effectiveness of 1K high-quality data (Superficial Alignment Hypothesis) |
| 6 | Ouyang et al. (2022), Training language models to follow instructions with human feedback (InstructGPT) | Established the SFT->RM->PPO 3-stage RLHF pipeline |
8. Hands-On Exercises
The following exercises let you experience a complete pipeline from start to finish.
| Exercise | Difficulty | Goal | Document |
|---|---|---|---|
| Math/Coding Enhancement | Intermediate | SFT->DPO with GSM8K/code data | 20_lab_math_coding.md |
| Thinking Model | Advanced | Model that outputs <think> reasoning process | 21_lab_thinking_model.md |
| Korean Chat Quality | Beginner/Intermediate | SFT->ORPO->Bench with your own data | 22_lab_korean_finance_copilot.md |
| Full Pipeline MoE | Expert | 4-domain MoE + SFT->DPO->RM->PPO full pipeline | 23_lab_full_pipeline_moe.md |
Related Documents
- 01_dense_lora.md -- Plain LoRA SFT
- 05_dpo_training.md -- DPO Details
- 06_orpo_training.md -- ORPO Details
- 07_rm_training.md -- Reward Model Details
- 08_ppo_training.md -- PPO (RLHF) Details
- CLI Reference -- CLI options by training type