8. PPO (RLHF) Training

Overview

PPO (Proximal Policy Optimization) is a reinforcement learning algorithm used for RLHF fine-tuning. The policy model generates responses to prompts, a reward model (RM) scores those responses, and the policy is updated to generate responses that receive higher scores.


Core Structure of PPO — Who Does What

┌─────────────────────────────────────────────────────────┐
│ 1. Policy Model — the model we are training             │
│    = SFT-completed model + LoRA                         │
│    Role: Receives prompts and generates responses        │
│                                                         │
│ 2. Reward Model (RewardHead)                            │
│    = A "scorer" created through separate RM training     │
│    Role: Assigns scores (scalar rewards) to responses    │
│          generated by the policy model                   │
│                                                         │
│ 3. Reference Model                                      │
│    = Policy model with adapters disabled                 │
│    Role: KL divergence calculation (prevents policy      │
│          from diverging too far from original)           │
└─────────────────────────────────────────────────────────┘

What Gets Trained?

Component                   Trained?       Description
Policy Model (LoRA)         Yes (trained)  Learns to generate responses that receive high rewards
Reward Model (RewardHead)   No (frozen)    Used in its already-trained state
Reference Model             No (frozen)    Adapter disabled = base model in SFT state
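
The reference model costs no extra memory or load time because it is just the policy with the LoRA deltas switched off. A toy illustration of the idea (hypothetical class, not the actual EulerForge implementation):

```python
class ToyLoraLayer:
    """A base weight plus an optional low-rank delta, toggled by a flag."""

    def __init__(self, base_w, lora_delta):
        self.base_w = base_w
        self.lora_delta = lora_delta
        self.adapter_enabled = True

    def forward(self, x):
        # With the adapter enabled we act as the policy; disabled, as the reference.
        w = self.base_w + (self.lora_delta if self.adapter_enabled else 0.0)
        return w * x


layer = ToyLoraLayer(base_w=1.0, lora_delta=0.5)
policy_out = layer.forward(2.0)       # adapter on  = policy model
layer.adapter_enabled = False
reference_out = layer.forward(2.0)    # adapter off = reference (SFT-state) model
```

Because both "models" share the same base weights, only one copy of the network needs to live in memory.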

Flow of a Single Step

Prompt → [Policy Model] → Response Generation
                          ↓
                    [Reward Model] → Score (reward)
                          ↓
                [PPO Algorithm] → Policy LoRA Update
                     │
                     └─ KL penalty: Keeps [Policy] vs [Reference] difference from growing too large
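
The KL penalty in the flow above can be sketched in plain Python (hypothetical helper; the actual EulerForge reward shaping may differ). A common scheme subtracts a per-token KL penalty and credits the RM's scalar score to the final token:

```python
def kl_penalized_rewards(rm_score, logp_policy, logp_ref, kl_coef=0.1):
    """Per-token rewards: KL penalty on every token, RM score on the last one.

    rm_score    -- scalar reward from the reward model for the whole response
    logp_policy -- per-token log-probs of the generated tokens under the policy
    logp_ref    -- per-token log-probs of the same tokens under the reference
    """
    kl_per_token = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    rewards = [-kl_coef * kl for kl in kl_per_token]
    rewards[-1] += rm_score  # RM score is credited to the final token
    return rewards


rewards = kl_penalized_rewards(1.0, [-0.5, -0.4], [-0.6, -0.6], kl_coef=0.1)
```

When the policy assigns higher log-probs than the reference, the penalty is negative, pulling the policy back toward the SFT model.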

Are SFT and RM Required?

SFT: Required

Without SFT                                                         After SFT
Policy model generates meaningless responses                        Policy model generates structured responses
Reward model scores meaningless responses → training is pointless   Reward model can meaningfully compare responses
KL divergence diverges rapidly                                      Stable training

RM: Effectively Required (random init is meaningless)

In the code, if checkpoint_path: "" is set, a randomly initialized RewardHead is used. This means:
- Random scores are used as rewards → training moves in a meaningless direction
- This option exists only for research/debugging purposes

In practice, you must always specify an RM checkpoint after RM training.

Correct Pipeline

SFT (required) → RM (effectively required) → PPO

Running PPO without an RM will technically work, but training with random rewards produces meaningless results.


RM Size and Relationship to Policy

Same-Model RM (Default)

# PPO preset
training:
  reward_model:
    model_name: Qwen/Qwen3.5-0.8B-Base   # Same base as policy
    checkpoint_path: outputs/rm_run/final  # RM training checkpoint

Behavior: Attaches a RewardHead to the policy model's hidden state for reward computation. No separate model loading required.
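
Conceptually, a RewardHead is a learned linear projection from a hidden-state vector to a scalar score. A minimal sketch with plain lists (illustrative shapes only; the real head operates on the model's hidden_size-dimensional states):

```python
def reward_head(hidden_last, weight, bias=0.0):
    """Project the final token's hidden state to a scalar reward.

    hidden_last -- hidden-state vector of the response's last token
    weight      -- learned projection vector of the same length
    """
    return sum(h * w for h, w in zip(hidden_last, weight)) + bias


score = reward_head([0.5, -1.0, 2.0], [0.2, 0.1, 0.3])
```

This is also why the base models must match: the projection is tied to the dimensionality (and representation) of the hidden states it was trained on.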

Larger Model RM (Advanced — hidden_size mismatch warning)

In principle, an RM larger than the policy can provide a better reward signal. However, the current EulerForge PPO implementation computes rewards from the policy model's hidden state, so:

Current limitation: RM and policy must use the same base model.


Full Pipeline Order

Step 1: SFT (required)
    eulerforge train --preset qwen3.5_0.8b_dense_lora_sft.yml
    → outputs/sft_run/final

Step 2: RM (effectively required)
    eulerforge train --preset qwen3.5_0.8b_dense_lora_rm.yml \
        --set model_name=outputs/sft_run/final
    → outputs/rm_run/final (includes reward_head.pt)

Step 3: PPO
    eulerforge train --preset qwen3.5_0.8b_dense_lora_ppo.yml \
        --set model_name=outputs/sft_run/final \                            ← policy = SFT model
        --set training.reward_model.checkpoint_path=outputs/rm_run/final    ← RM checkpoint

Role Assignment in Presets

# configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml

# ── Policy Model ──
# This is the model being trained. Specify the SFT-completed checkpoint.
model_name: Qwen/Qwen3.5-0.8B-Base   # or --set model_name=outputs/sft_run/final

# ── Reward Model ──
training:
  reward_model:
    model_name: Qwen/Qwen3.5-0.8B-Base   # RM's base model (must match policy)
    checkpoint_path: ""                    # ← RM checkpoint path (empty = random init)
    # In practice, always specify:
    # checkpoint_path: outputs/rm_run/final

# ── Reference Model ──
# No separate specification needed. Automatically created by disabling the policy model's adapter.

Prerequisites

Running PPO without an RM uses a randomly initialized RewardHead. Training against random rewards is meaningless in practice; use this only for research/debugging purposes.


Data Format

PPO-Specific Data (prompt_only)

PPO requires only prompts. Responses are generated directly by the policy model.

{"prompt": "Please explain the history of artificial intelligence."}
{"prompt": "Tell me how to sort a list in Python."}

The corresponding data configuration in the preset:

data:
  format: raw
  path: data/ppo_1k_raw.jsonl
  task: prompt_only
  max_length: 256
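
Since PPO data carries only a prompt field, malformed rows are easy to catch before training. A small stdlib-only check (hypothetical script, not part of EulerForge):

```python
import json


def validate_prompt_only(lines):
    """Return (ok, errors) for prompt-only JSONL lines.

    Each line must be valid JSON with a non-empty string 'prompt' field.
    """
    errors = []
    for i, line in enumerate(lines, 1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: not valid JSON")
            continue
        if not isinstance(row.get("prompt"), str) or not row["prompt"].strip():
            errors.append(f"line {i}: missing or empty 'prompt'")
    return (not errors, errors)


ok, errors = validate_prompt_only([
    '{"prompt": "Explain the history of AI."}',
    '{"text": "wrong key"}',
])
```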

Reusing SFT Data

You can also use the prompt column from existing SFT raw data (data/sft_10k_raw.jsonl):

# Extract only prompts from SFT data
python -c "
import json
with open('data/sft_10k_raw.jsonl') as f, open('data/ppo_prompts.jsonl', 'w') as out:
    for line in f:
        row = json.loads(line)
        out.write(json.dumps({'prompt': row['prompt']}) + '\n')
"

Preset

training:
  type: ppo
  lr: 1.0e-6                # A lower lr than SFT/DPO is recommended
  ppo:
    clip_range: 0.2         # PPO clipping ε
    kl_coef: 0.1            # KL penalty coefficient (too large = no learning, too small = divergence)
    epochs: 4               # PPO update epochs per batch
    max_gen_len: 64         # Maximum generation token count
    temperature: 1.0        # Sampling temperature

  reward_model:
    checkpoint_path: outputs/rm_run/final  # RM checkpoint (recommended to always specify)
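
clip_range implements PPO's clipped surrogate objective: the probability ratio between the new and old policy is clamped to [1 - ε, 1 + ε] so a single update cannot move the policy too far. A single-sample sketch in plain Python (illustrative only):

```python
import math


def ppo_clip_loss(logp_new, logp_old, advantage, clip_range=0.2):
    """PPO clipped surrogate loss for one token/sample (a value to minimize)."""
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_range), 1.0 - clip_range) * advantage
    # Maximize the pessimistic (min) objective => minimize its negation.
    return -min(unclipped, clipped)


# The new policy raised this token's log-prob by 0.4, but the clipped
# objective only credits it up to a ratio of 1 + clip_range = 1.2.
loss = ppo_clip_loss(logp_new=-0.2, logp_old=-0.6, advantage=1.0)
```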

Running

Full Pipeline (SFT → RM → PPO)

# Step 1: SFT
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
    --set data.format=raw --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl \
    --output-dir outputs/ppo_pipeline/01_sft

# Step 2: RM (based on SFT model)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_rm.yml \
    --set model_name=outputs/ppo_pipeline/01_sft/final \
    --set data.format=raw --set data.task=preference \
    --set data.path=data/dpo_10k_raw.jsonl \
    --output-dir outputs/ppo_pipeline/02_rm

# Step 3: PPO (SFT policy + RM reward)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml \
    --set model_name=outputs/ppo_pipeline/01_sft/final \
    --set training.reward_model.checkpoint_path=outputs/ppo_pipeline/02_rm/final \
    --set data.path=data/ppo_1k_raw.jsonl \
    --output-dir outputs/ppo_pipeline/03_ppo

Running with SFT Data

eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml \
    --set model_name=outputs/sft_run/final \
    --set training.reward_model.checkpoint_path=outputs/rm_run/final \
    --set data.path=data/ppo_prompts.jsonl

Key Metrics

Metric          Meaning                Expected
reward_mean     Average reward         Gradual increase
kl_divergence   Policy-reference KL    Stable (increase kl_coef if it explodes)
ppo_loss        PPO surrogate loss     Decreasing
entropy         Generation diversity   Too low indicates loss of diversity (overfitting)
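
The kl_divergence metric is typically estimated as the mean per-token log-prob gap between the policy and the reference over sampled responses. A minimal estimator sketch (illustrative; the exact estimator EulerForge logs may differ):

```python
def estimate_kl(logp_policy, logp_ref):
    """Naive KL estimate: mean of log pi(x) - log pi_ref(x) over policy samples."""
    gaps = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    return sum(gaps) / len(gaps)


kl = estimate_kl([-0.5, -0.4, -0.3], [-0.7, -0.5, -0.6])
```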

Debugging and Troubleshooting

reward_mean not changing
    Cause: RM randomly initialized (checkpoint_path not specified)
    Fix: Specify reward_model.checkpoint_path

reward_mean negative at the start
    Cause: Low RM quality or hidden_size mismatch
    Fix: Retrain the RM; verify that the base models match

kl_divergence explosion
    Cause: lr too high or kl_coef too low
    Fix: Reduce lr, increase kl_coef

Generation quality degradation
    Cause: Reward hacking (the policy exploits RM weaknesses)
    Fix: Diversify RM training data, increase kl_coef

OOM
    Cause: generate + 3x forward passes (policy, reference, updated policy)
    Fix: Reduce batch_size, reduce max_gen_len

SDPA tensor size mismatch
    Cause: right-padding or gradient-checkpointing conflict in generate()
    Fix: Handled automatically (left-padding + temporarily disabling gradient checkpointing)