8. PPO (RLHF) Training
Overview
PPO (Proximal Policy Optimization) is reinforcement-learning-based fine-tuning (RLHF). The policy model generates responses to prompts, a reward model (RM) scores those responses, and the policy learns to generate responses that receive higher scores.
Core Structure of PPO — Who Does What
┌─────────────────────────────────────────────────────────┐
│ 1. Policy Model — the model we are training │
│ = SFT-completed model + LoRA │
│ Role: Receives prompts and generates responses │
│ │
│ 2. Reward Model (RewardHead) │
│ = A "scorer" created through separate RM training │
│ Role: Assigns scores (scalar rewards) to responses │
│ generated by the policy model │
│ │
│ 3. Reference Model │
│ = Policy model with adapters disabled │
│ Role: KL divergence calculation (prevents policy │
│ from diverging too far from original) │
└─────────────────────────────────────────────────────────┘
What Gets Trained?
| Component | Trained? | Description |
|---|---|---|
| Policy Model (LoRA) | Yes (trained) | Learns to generate responses that receive high rewards |
| Reward Model (RewardHead) | No (frozen) | Used in its already-trained state |
| Reference Model | No (frozen) | adapter disable = base model in SFT state |
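The division of labor above can be sketched with a toy example: the reference model is not a second copy of the weights, it is the same policy with its LoRA contribution switched off (class and field names here are illustrative, not EulerForge APIs).

```python
# Toy single-weight "model" with a LoRA delta (illustrative only).
class ToyLoraLinear:
    def __init__(self, base_weight, lora_delta):
        self.base_weight = base_weight  # frozen SFT-state weight
        self.lora_delta = lora_delta    # the only trainable part (policy LoRA)
        self.adapter_enabled = True

    def forward(self, x):
        w = self.base_weight + (self.lora_delta if self.adapter_enabled else 0.0)
        return w * x

layer = ToyLoraLinear(base_weight=2.0, lora_delta=0.5)
policy_out = layer.forward(3.0)      # policy = base + LoRA -> 7.5
layer.adapter_enabled = False        # "disable adapter" = reference model
reference_out = layer.forward(3.0)   # reference = frozen SFT base -> 6.0
layer.adapter_enabled = True         # re-enable for the next policy step
```

Because the reference is just the adapter-disabled policy, it costs no extra memory beyond the base weights already loaded.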
Flow of a Single Step
Prompt → [Policy Model] → Response Generation
↓
[Reward Model] → Score (reward)
↓
[PPO Algorithm] → Policy LoRA Update
│
└─ KL penalty: Keeps [Policy] vs [Reference] difference from growing too large
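The update in the last box can be sketched numerically (pure Python; a minimal sketch of the standard clipped surrogate objective, not EulerForge internals):

```python
import math

def ppo_clipped_loss(logp_new, logp_old, advantage, clip_range=0.2):
    """Standard PPO clipped surrogate loss for one token (to be minimized)."""
    ratio = math.exp(logp_new - logp_old)            # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_range), 1.0 - clip_range) * advantage
    return -min(unclipped, clipped)                  # maximize the surrogate

# With a positive advantage, raising the token's probability helps, but the
# gain is capped once the ratio exceeds 1 + clip_range:
loss_at_ratio_1 = ppo_clipped_loss(logp_new=-1.0, logp_old=-1.0, advantage=2.0)
loss_far_off = ppo_clipped_loss(logp_new=0.0, logp_old=-1.0, advantage=2.0)
```

The clipping is what makes several update epochs per generated batch safe: once the policy has moved past the clip boundary, the gradient for that token vanishes.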
Are SFT and RM Required?
SFT: Required
| Without SFT | After SFT |
|---|---|
| Policy model generates meaningless responses | Policy model generates structured responses |
| Reward model scores meaningless responses → training is pointless | Reward model can meaningfully compare responses |
| KL divergence diverges rapidly | Stable training |
RM: Effectively Required (random init is meaningless)
In the code, if checkpoint_path is left empty (""), a randomly initialized RewardHead is used. This means:
- Random scores are used as rewards → training moves in a meaningless direction
- The empty-path option exists only for research/debugging purposes
In practice, you must always specify an RM checkpoint after RM training.
Correct Pipeline
SFT (required) → RM (effectively required) → PPO
Running PPO without RM will work technically, but training with random rewards produces meaningless results.
RM Size and Relationship to Policy
Same-Model RM (Default)
# PPO preset
training:
  reward_model:
    model_name: Qwen/Qwen3.5-0.8B-Base     # Same base as policy
    checkpoint_path: outputs/rm_run/final  # RM training checkpoint
Behavior: Attaches a RewardHead to the policy model's hidden state for reward computation. No separate model loading required.
Larger Model RM (Advanced — hidden_size mismatch warning)
In principle, a larger RM can provide better rewards than the policy. However, the current EulerForge PPO implementation uses the policy model's hidden state, so:
- The RM's base model must be the same as the policy's base model
- If hidden_size differs, RewardHead loading will result in a size mismatch (warning followed by random init fallback)
Current limitation: RM and policy must use the same base model.
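The fallback described above can be sketched as a shape check (hypothetical helper, not the actual EulerForge loader; a RewardHead is assumed to be a linear map from hidden_size to a scalar):

```python
import random
import warnings

def load_reward_head_weight(ckpt_weight, policy_hidden_size):
    """Return the checkpoint weight if its shape matches, else random init."""
    if len(ckpt_weight) == policy_hidden_size:
        return ckpt_weight
    warnings.warn(
        f"RewardHead size mismatch: checkpoint {len(ckpt_weight)} vs policy "
        f"hidden_size {policy_hidden_size}; falling back to random init"
    )
    return [random.gauss(0.0, 0.02) for _ in range(policy_hidden_size)]

ok = load_reward_head_weight([0.1, 0.2], policy_hidden_size=2)        # loaded as-is
bad = load_reward_head_weight([0.1, 0.2, 0.3], policy_hidden_size=2)  # random fallback
```

The random fallback is exactly the "RM random init" failure mode in the troubleshooting table: training runs, but rewards carry no signal.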
Full Pipeline Order
Step 1: SFT (required)
eulerforge train --preset qwen3.5_0.8b_dense_lora_sft.yml
→ outputs/sft_run/final
Step 2: RM (effectively required)
eulerforge train --preset qwen3.5_0.8b_dense_lora_rm.yml \
  --set model_name=outputs/sft_run/final
→ outputs/rm_run/final (includes reward_head.pt)
Step 3: PPO
eulerforge train --preset qwen3.5_0.8b_dense_lora_ppo.yml \
  --set model_name=outputs/sft_run/final \                         ← policy = SFT model
  --set training.reward_model.checkpoint_path=outputs/rm_run/final ← RM checkpoint
Role Assignment in Presets
# configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml
# ── Policy Model ──
# This is the model being trained. Specify the SFT-completed checkpoint.
model_name: Qwen/Qwen3.5-0.8B-Base # or --set model_name=outputs/sft_run/final
# ── Reward Model ──
training:
  reward_model:
    model_name: Qwen/Qwen3.5-0.8B-Base  # RM's base model (must match policy)
    checkpoint_path: ""                 # ← RM checkpoint path (empty = random init)
    # In practice, always specify:
    # checkpoint_path: outputs/rm_run/final
# ── Reference Model ──
# No separate specification needed. Automatically created by disabling the policy model's adapter.
Prerequisites
- SFT training complete — used as the PPO policy model
- RM training complete — used as the PPO reward function (includes reward_head.pt)
- EulerForge installation complete (see Getting Started)
Running PPO without an RM will use a randomly initialized RewardHead. In this case, training with random rewards is meaningless in practice. Use this only for research/debugging purposes.
Data Format
PPO-Specific Data (prompt_only)
PPO requires only prompts. Responses are generated directly by the policy model.
{"prompt": "Please explain the history of artificial intelligence."}
{"prompt": "Tell me how to sort a list in Python."}
data:
  format: raw
  path: data/ppo_1k_raw.jsonl
  task: prompt_only
  max_length: 256
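A quick sanity check for prompt_only data can look like this (illustrative helper, not part of EulerForge):

```python
import json

def validate_prompt_only(lines):
    """Ensure every JSONL line is an object with a non-empty 'prompt' string."""
    for i, line in enumerate(lines, start=1):
        row = json.loads(line)
        if not isinstance(row.get("prompt"), str) or not row["prompt"]:
            raise ValueError(f"line {i}: missing or empty 'prompt'")
    return True

sample = [
    '{"prompt": "Please explain the history of artificial intelligence."}',
    '{"prompt": "Tell me how to sort a list in Python."}',
]
ok = validate_prompt_only(sample)
```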
Reusing SFT Data
You can also use the prompt column from existing SFT raw data (data/sft_10k_raw.jsonl):
# Extract only prompts from SFT data
python -c "
import json
with open('data/sft_10k_raw.jsonl') as f, open('data/ppo_prompts.jsonl', 'w') as out:
    for line in f:
        row = json.loads(line)
        out.write(json.dumps({'prompt': row['prompt']}) + '\n')
"
Preset
training:
  type: ppo
  lr: 1.0e-6            # Lower lr than SFT/DPO recommended
  ppo:
    clip_range: 0.2     # PPO clipping ε
    kl_coef: 0.1        # KL penalty coefficient (too large = no learning, too small = divergence)
    epochs: 4           # PPO update epochs per batch
    max_gen_len: 64     # Maximum generation token count
    temperature: 1.0    # Sampling temperature
  reward_model:
    checkpoint_path: outputs/rm_run/final  # RM checkpoint (recommended to always specify)
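The interaction between the RM score and kl_coef can be sketched numerically (pure Python; a common per-token KL-penalty formulation, assumed for illustration rather than taken from EulerForge source):

```python
def shaped_reward(rm_score, logp_policy, logp_ref, kl_coef=0.1):
    """RM score minus a KL penalty that grows as the policy drifts."""
    kl_estimate = logp_policy - logp_ref   # per-token KL estimate
    return rm_score - kl_coef * kl_estimate

# Same RM score (1.0), increasing drift from the reference model:
mild_drift = shaped_reward(1.0, logp_policy=-1.0, logp_ref=-1.5)   # small penalty
large_drift = shaped_reward(1.0, logp_policy=-1.0, logp_ref=-6.0)  # large penalty
```

This is why kl_coef is the main knob for both failure modes in the table below the metrics: too small and the policy drifts freely (KL divergence explodes, reward hacking), too large and the penalty drowns out the RM signal.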
Running
Full Pipeline (SFT → RM → PPO)
# Step 1: SFT
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
--set data.format=raw --set data.task=sft \
--set data.path=data/sft_10k_raw.jsonl \
--output-dir outputs/ppo_pipeline/01_sft
# Step 2: RM (based on SFT model)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_rm.yml \
--set model_name=outputs/ppo_pipeline/01_sft/final \
--set data.format=raw --set data.task=preference \
--set data.path=data/dpo_10k_raw.jsonl \
--output-dir outputs/ppo_pipeline/02_rm
# Step 3: PPO (SFT policy + RM reward)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml \
--set model_name=outputs/ppo_pipeline/01_sft/final \
--set training.reward_model.checkpoint_path=outputs/ppo_pipeline/02_rm/final \
--set data.path=data/ppo_1k_raw.jsonl \
--output-dir outputs/ppo_pipeline/03_ppo
Running with SFT Data
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml \
--set model_name=outputs/sft_run/final \
--set training.reward_model.checkpoint_path=outputs/rm_run/final \
--set data.path=data/ppo_prompts.jsonl
Key Metrics
| Metric | Meaning | Expected |
|---|---|---|
| reward_mean | Average reward | Gradual increase |
| kl_divergence | Policy-reference KL | Stable (increase kl_coef if exploding) |
| ppo_loss | PPO surrogate loss | Decreasing |
| entropy | Generation diversity | Overfitting if too low |
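The entropy metric can be illustrated directly (pure Python; standard Shannon entropy over a token's sampling distribution, not EulerForge code):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's sampling distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

diverse = token_entropy([0.25, 0.25, 0.25, 0.25])  # uniform over 4 tokens: ln(4)
collapsed = token_entropy([0.999, 0.001])          # near-deterministic: near zero
```

A collapse toward near-zero entropy means the policy almost always emits the same token, a typical early sign of reward hacking.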
Debugging and Troubleshooting
| Symptom | Cause | Solution |
|---|---|---|
| reward_mean not changing | RM random init (checkpoint not specified) | Specify reward_model.checkpoint_path |
| reward_mean negative initially | Low RM quality or hidden_size mismatch | Retrain RM, verify base model match |
| kl_divergence explosion | lr too high or kl_coef too low | Reduce lr, increase kl_coef |
| Generation quality degradation | Reward hacking (exploiting RM weaknesses) | Diversify RM data, increase kl_coef |
| OOM | generate + 3x forward (policy, ref, new) | Reduce batch_size, reduce max_gen_len |
| SDPA tensor size mismatch | right-padding or gradient checkpointing conflict in generate() | Handled automatically (left-padding + temporary gc disable) |
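The left-padding fix in the last row can be sketched as follows (pure Python; pad id 0 is an assumption for illustration):

```python
def left_pad(batch, pad_id=0):
    """Pad token-id sequences on the left so all prompts end at the same
    position, letting generation append new tokens contiguously on the right."""
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + seq for seq in batch]

padded = left_pad([[5, 6], [7, 8, 9]])  # [[0, 5, 6], [7, 8, 9]]
```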
Related Documents
- 18_training_pipeline.md — Full pipeline order
- 07_rm_training.md — RM training (required before PPO)
- 05_dpo_training.md — DPO (alternative to PPO)