8. PPO (RLHF) Training

Overview

PPO (Proximal Policy Optimization) is a reinforcement learning algorithm used for RLHF fine-tuning. The policy model generates responses to prompts, a reward model (RM) scores those responses, and the policy is updated to generate responses that receive higher scores.


Core Structure of PPO — Who Does What

┌─────────────────────────────────────────────────────────┐
│ 1. Policy Model — the model we are training             │
│    = SFT-completed model + LoRA                         │
│    Role: Receives prompts and generates responses        │
│                                                         │
│ 2. Reward Model (RewardHead)                            │
│    = A "scorer" created through separate RM training     │
│    Role: Assigns scores (scalar rewards) to responses    │
│          generated by the policy model                   │
│                                                         │
│ 3. Reference Model                                      │
│    = Policy model with adapters disabled                 │
│    Role: KL divergence calculation (prevents policy      │
│          from diverging too far from original)           │
└─────────────────────────────────────────────────────────┘

What Gets Trained?

Component                   Trained?       Description
Policy Model (LoRA)         Yes (trained)  Learns to generate responses that receive high rewards
Reward Model (RewardHead)   No (frozen)    Used in its already-trained state
Reference Model             No (frozen)    Adapter disabled = base model in SFT state
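
The reference model costs no extra memory or load time because it is just the policy with the LoRA deltas switched off. A toy illustration of the idea (hypothetical class, not the actual EulerForge implementation):

```python
class ToyLoraLayer:
    """A base weight plus an optional low-rank delta, toggled by a flag."""

    def __init__(self, base_w, lora_delta):
        self.base_w = base_w
        self.lora_delta = lora_delta
        self.adapter_enabled = True

    def forward(self, x):
        # With the adapter enabled we act as the policy; disabled, as the reference.
        w = self.base_w + (self.lora_delta if self.adapter_enabled else 0.0)
        return w * x


layer = ToyLoraLayer(base_w=1.0, lora_delta=0.5)
policy_out = layer.forward(2.0)       # adapter on  = policy model
layer.adapter_enabled = False
reference_out = layer.forward(2.0)    # adapter off = reference (SFT-state) model
```

Because both "models" share the same base weights, only one copy of the network needs to live in memory.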

Flow of a Single Step

Prompt → [Policy Model] → Response Generation
                          ↓
                    [Reward Model] → Score (reward)
                          ↓
                [PPO Algorithm] → Policy LoRA Update
                     │
                     └─ KL penalty: Keeps [Policy] vs [Reference] difference from growing too large
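
The KL penalty in the flow above can be sketched in plain Python (hypothetical helper; the actual EulerForge reward shaping may differ). A common scheme subtracts a per-token KL penalty and credits the RM's scalar score to the final token:

```python
def kl_penalized_rewards(rm_score, logp_policy, logp_ref, kl_coef=0.1):
    """Per-token rewards: KL penalty on every token, RM score on the last one.

    rm_score    -- scalar reward from the reward model for the whole response
    logp_policy -- per-token log-probs of the generated tokens under the policy
    logp_ref    -- per-token log-probs of the same tokens under the reference
    """
    kl_per_token = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    rewards = [-kl_coef * kl for kl in kl_per_token]
    rewards[-1] += rm_score  # RM score is credited to the final token
    return rewards


rewards = kl_penalized_rewards(1.0, [-0.5, -0.4], [-0.6, -0.6], kl_coef=0.1)
```

When the policy assigns higher log-probs than the reference, the penalty is negative, pulling the policy back toward the SFT model.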

Are SFT and RM Required?

SFT: Required

Without SFT                                                         After SFT
Policy model generates meaningless responses                        Policy model generates structured responses
Reward model scores meaningless responses → training is pointless   Reward model can meaningfully compare responses
KL divergence diverges rapidly                                      Stable training

RM: Effectively Required (random init is meaningless)

In the code, if checkpoint_path: "" is set, a randomly initialized RewardHead is used. This means:
- Random scores are used as rewards → training moves in a meaningless direction
- This option exists only for research/debugging purposes

In practice, you must always specify an RM checkpoint after RM training.

Correct Pipeline

SFT (required) → RM (effectively required) → PPO

Running PPO without an RM will technically work, but training with random rewards produces meaningless results.


RM Size and Relationship to Policy

Same-Model RM (Default)

# PPO preset
training:
  reward_model:
    model_name: Qwen/Qwen3.5-0.8B-Base   # Same base as policy
    checkpoint_path: outputs/rm_run/final  # RM training checkpoint

Behavior: Attaches a RewardHead to the policy model's hidden state for reward computation. No separate model loading required.
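
Conceptually, a RewardHead is a learned linear projection from a hidden-state vector to a scalar score. A minimal sketch with plain lists (illustrative shapes only; the real head operates on the model's hidden_size-dimensional states):

```python
def reward_head(hidden_last, weight, bias=0.0):
    """Project the final token's hidden state to a scalar reward.

    hidden_last -- hidden-state vector of the response's last token
    weight      -- learned projection vector of the same length
    """
    return sum(h * w for h, w in zip(hidden_last, weight)) + bias


score = reward_head([0.5, -1.0, 2.0], [0.2, 0.1, 0.3])
```

This is also why the base models must match: the projection is tied to the dimensionality (and representation) of the hidden states it was trained on.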

Larger Model RM (Advanced — hidden_size mismatch warning)

In principle, an RM larger than the policy can provide a better reward signal. However, the current EulerForge PPO implementation computes rewards from the policy model's hidden state, so:

Current limitation: RM and policy must use the same base model.


Full Pipeline Order

Step 1: SFT (required)
    eulerforge train --preset qwen3.5_0.8b_dense_lora_sft.yml
    → outputs/sft_run/final

Step 2: RM (effectively required)
    eulerforge train --preset qwen3.5_0.8b_dense_lora_rm.yml \
        --set model_name=outputs/sft_run/final
    → outputs/rm_run/final (includes reward_head.pt)

Step 3: PPO
    eulerforge train --preset qwen3.5_0.8b_dense_lora_ppo.yml \
        --set model_name=outputs/sft_run/final \                            ← policy = SFT model
        --set training.reward_model.checkpoint_path=outputs/rm_run/final    ← RM checkpoint

Role Assignment in Presets

# configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml

# ── Policy Model ──
# This is the model being trained. Specify the SFT-completed checkpoint.
model_name: Qwen/Qwen3.5-0.8B-Base   # or --set model_name=outputs/sft_run/final

# ── Reward Model ──
training:
  reward_model:
    model_name: Qwen/Qwen3.5-0.8B-Base   # RM's base model (must match policy)
    checkpoint_path: ""                    # ← RM checkpoint path (empty = random init)
    # In practice, always specify:
    # checkpoint_path: outputs/rm_run/final

# ── Reference Model ──
# No separate specification needed. Automatically created by disabling the policy model's adapter.

Prerequisites

Running PPO without an RM uses a randomly initialized RewardHead. Training against random rewards is meaningless in practice; use this only for research/debugging purposes.


Data Format

PPO-Specific Data (prompt_only)

PPO requires only prompts. Responses are generated directly by the policy model.

{"prompt": "Please explain the history of artificial intelligence."}
{"prompt": "Tell me how to sort a list in Python."}

The corresponding data configuration in the preset:

data:
  format: raw
  path: data/ppo_1k_raw.jsonl
  task: prompt_only
  max_length: 256
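
Since PPO data carries only a prompt field, malformed rows are easy to catch before training. A small stdlib-only check (hypothetical script, not part of EulerForge):

```python
import json


def validate_prompt_only(lines):
    """Return (ok, errors) for prompt-only JSONL lines.

    Each line must be valid JSON with a non-empty string 'prompt' field.
    """
    errors = []
    for i, line in enumerate(lines, 1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: not valid JSON")
            continue
        if not isinstance(row.get("prompt"), str) or not row["prompt"].strip():
            errors.append(f"line {i}: missing or empty 'prompt'")
    return (not errors, errors)


ok, errors = validate_prompt_only([
    '{"prompt": "Explain the history of AI."}',
    '{"text": "wrong key"}',
])
```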

Reusing SFT Data

You can also use the prompt column from existing SFT raw data (data/sft_10k_raw.jsonl):

# Extract only prompts from SFT data
python -c "
import json
with open('data/sft_10k_raw.jsonl') as f, open('data/ppo_prompts.jsonl', 'w') as out:
    for line in f:
        row = json.loads(line)
        out.write(json.dumps({'prompt': row['prompt']}) + '\n')
"

Preset

training:
  type: ppo
  lr: 1.0e-6                # A lower lr than SFT/DPO is recommended
  ppo:
    clip_range: 0.2         # PPO clipping ε
    kl_coef: 0.1            # KL penalty coefficient (too large = no learning, too small = divergence)
    epochs: 4               # PPO update epochs per batch
    max_gen_len: 64         # Maximum generation token count
    temperature: 1.0        # Sampling temperature

  reward_model:
    checkpoint_path: outputs/rm_run/final  # RM checkpoint (recommended to always specify)
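
clip_range implements PPO's clipped surrogate objective: the probability ratio between the new and old policy is clamped to [1 - ε, 1 + ε] so a single update cannot move the policy too far. A single-sample sketch in plain Python (illustrative only):

```python
import math


def ppo_clip_loss(logp_new, logp_old, advantage, clip_range=0.2):
    """PPO clipped surrogate loss for one token/sample (a value to minimize)."""
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_range), 1.0 - clip_range) * advantage
    # Maximize the pessimistic (min) objective => minimize its negation.
    return -min(unclipped, clipped)


# The new policy raised this token's log-prob by 0.4, but the clipped
# objective only credits it up to a ratio of 1 + clip_range = 1.2.
loss = ppo_clip_loss(logp_new=-0.2, logp_old=-0.6, advantage=1.0)
```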

Running

Full Pipeline (SFT → RM → PPO)

# Step 1: SFT
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
    --set data.format=raw --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl \
    --output-dir outputs/ppo_pipeline/01_sft

# Step 2: RM (based on SFT model)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_rm.yml \
    --set model_name=outputs/ppo_pipeline/01_sft/final \
    --set data.format=raw --set data.task=preference \
    --set data.path=data/dpo_10k_raw.jsonl \
    --output-dir outputs/ppo_pipeline/02_rm

# Step 3: PPO (SFT policy + RM reward)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml \
    --set model_name=outputs/ppo_pipeline/01_sft/final \
    --set training.reward_model.checkpoint_path=outputs/ppo_pipeline/02_rm/final \
    --set data.path=data/ppo_1k_raw.jsonl \
    --output-dir outputs/ppo_pipeline/03_ppo

Running with SFT Data

eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_ppo.yml \
    --set model_name=outputs/sft_run/final \
    --set training.reward_model.checkpoint_path=outputs/rm_run/final \
    --set data.path=data/ppo_prompts.jsonl

Key Metrics

Metric          Meaning                Expected
reward_mean     Average reward         Gradual increase
kl_divergence   Policy-reference KL    Stable (increase kl_coef if it explodes)
ppo_loss        PPO surrogate loss     Decreasing
entropy         Generation diversity   Too low indicates loss of diversity (overfitting)
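
The kl_divergence metric is typically estimated as the mean per-token log-prob gap between the policy and the reference over sampled responses. A minimal estimator sketch (illustrative; the exact estimator EulerForge logs may differ):

```python
def estimate_kl(logp_policy, logp_ref):
    """Naive KL estimate: mean of log pi(x) - log pi_ref(x) over policy samples."""
    gaps = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    return sum(gaps) / len(gaps)


kl = estimate_kl([-0.5, -0.4, -0.3], [-0.7, -0.5, -0.6])
```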

Debugging and Troubleshooting

reward_mean not changing
    Cause: RM randomly initialized (checkpoint_path not specified)
    Fix: Specify reward_model.checkpoint_path

reward_mean negative at the start
    Cause: Low RM quality or hidden_size mismatch
    Fix: Retrain the RM; verify that the base models match

kl_divergence explosion
    Cause: lr too high or kl_coef too low
    Fix: Reduce lr, increase kl_coef

Generation quality degradation
    Cause: Reward hacking (the policy exploits RM weaknesses)
    Fix: Diversify RM training data, increase kl_coef

OOM
    Cause: generate + 3x forward passes (policy, reference, updated policy)
    Fix: Reduce batch_size, reduce max_gen_len

SDPA tensor size mismatch
    Cause: right-padding or gradient-checkpointing conflict in generate()
    Fix: Handled automatically (left-padding + temporarily disabling gradient checkpointing)