# 7. Reward Model (RM) Training

## Overview

Trains a Bradley-Terry reward model. It uses a `RewardHead` that extracts a scalar reward from the model's last hidden state, and trains on chosen/rejected pairs with the loss `L = -log(σ(r_chosen - r_rejected))`.

**Purpose:** used as the reward function in PPO/RLHF pipelines.
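The pairwise loss above can be sketched in plain Python. This is a hypothetical stand-in for illustration, not the EulerForge implementation:

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Mean pairwise Bradley-Terry loss: -log(sigmoid(r_c - r_r)).

    Uses the identity -log(sigmoid(x)) == log1p(exp(-x)).
    The loss shrinks as the chosen reward exceeds the rejected one.
    """
    losses = [math.log1p(math.exp(-(c - r)))
              for c, r in zip(r_chosen, r_rejected)]
    return sum(losses) / len(losses)

# Equal rewards give the maximum-uncertainty loss, log(2) ≈ 0.693.
print(bradley_terry_loss([1.0], [1.0]))
```

Note that the loss depends only on the reward *difference*, so the RM learns relative preferences; absolute reward values are not anchored.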
## Why You Should Run SFT First

The RM should be trained on top of an SFT-completed model. A base model lacks the foundation to understand differences in response quality, so the RM cannot learn meaningful rewards.

Correct order: **SFT → RM → PPO**. Run SFT first, then specify that checkpoint (`final/`) as the RM's `model_name`.
## Prerequisites

- SFT training complete (`01_dense_lora.md`, etc.)
- EulerForge installation complete (see Getting Started)
- Data preprocessing complete
## Data Format

### Raw Data (Recommended)

`data/dpo_10k_raw.jsonl` is data converted to the standard `prompted_preference` format:

```json
{"prompt": "Question content", "chosen": "Preferred response", "rejected": "Non-preferred response"}
```

Setting `data.format=raw` tokenizes the data automatically during training.
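A quick sanity check for a raw JSONL line, assuming the three-field schema shown above (example content only):

```python
import json

# One raw line in the prompted_preference format (illustrative content).
line = '{"prompt": "Question content", "chosen": "Preferred response", "rejected": "Non-preferred response"}'

example = json.loads(line)

# Each record should carry exactly these three string fields;
# with data.format=raw, tokenization happens at training time.
assert set(example) == {"prompt", "chosen", "rejected"}
assert all(isinstance(v, str) for v in example.values())
```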
## Preset

```yaml
# configs/presets/qwen3.5_0.8b_dense_lora_rm.yml
training:
  type: rm
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]
```
## Running

```bash
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_rm.yml \
  --set data.format=raw \
  --set data.task=prompted_preference \
  --set data.path=data/dpo_10k_raw.jsonl \
  --set data.max_length=256
```
## Architecture

```
Input → LM (LoRA) → Last Hidden State → RewardHead(Linear(H, 1)) → Scalar Reward
```

- `RewardHead` is automatically created and added to the optimizer
- The forward pass uses `output_hidden_states=True` to extract the last layer's hidden state
- Only the hidden state of the last real token is used, selected via `attention_mask`
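The last-token selection described above can be illustrated with a pure-Python stand-in for the `Linear(H, 1)` head. All names here are hypothetical, chosen for the sketch:

```python
def last_token_reward(hidden_states, attention_mask, weight, bias=0.0):
    """Apply a Linear(H, 1) reward head to the hidden state of the
    last real (non-padding) token of each sequence.

    hidden_states:  [batch][seq_len][H] nested lists
    attention_mask: [batch][seq_len] of 0/1, with padding at the end
    weight:         list of H weights for the Linear(H, 1) projection
    """
    rewards = []
    for states, mask in zip(hidden_states, attention_mask):
        last = sum(mask) - 1  # index of the last attended token
        h = states[last]
        rewards.append(sum(w * x for w, x in zip(weight, h)) + bias)
    return rewards

# Batch of 1: seq_len=3 with one trailing padding position, H=2.
hidden = [[[1.0, 0.0], [2.0, 3.0], [9.0, 9.0]]]  # padded position is ignored
mask = [[1, 1, 0]]
print(last_token_reward(hidden, mask, weight=[1.0, 1.0]))  # → [5.0]
```

Masked selection matters: without it, the head would read the hidden state at a padding position, which carries no information about the response.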
## Metrics

| Metric | Description |
|---|---|
| `rm_loss` | Bradley-Terry loss |
| `reward_chosen` | Average reward for chosen responses |
| `reward_rejected` | Average reward for rejected responses |
| `reward_margin` | Average difference between chosen and rejected rewards |
| `accuracy` | Proportion of pairs where `reward_chosen > reward_rejected` |
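How these metrics relate to the per-pair rewards can be shown with a small sketch (a hypothetical helper, not the logging code itself):

```python
def rm_metrics(r_chosen, r_rejected):
    """Compute the logged RM metrics from per-pair scalar rewards."""
    n = len(r_chosen)
    margins = [c - r for c, r in zip(r_chosen, r_rejected)]
    return {
        "reward_chosen": sum(r_chosen) / n,
        "reward_rejected": sum(r_rejected) / n,
        "reward_margin": sum(margins) / n,
        "accuracy": sum(m > 0 for m in margins) / n,
    }

# Two pairs: the RM ranks the first correctly, the second incorrectly.
print(rm_metrics([1.0, 0.0], [0.0, 1.0]))
```

During healthy training, `reward_margin` and `accuracy` should rise together while `rm_loss` falls; an accuracy stuck near 0.5 means the model cannot distinguish chosen from rejected.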