# 7. Reward Model (RM) Training

## Overview

Trains a Bradley-Terry reward model. It uses a `RewardHead` that extracts a scalar reward from the model's last hidden state, and trains on chosen/rejected pairs with the loss `L = -log(σ(r_chosen - r_rejected))`.

**Purpose:** used as the reward function in PPO/RLHF pipelines.
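The pairwise loss above can be sketched in plain Python. This is a hypothetical stand-in for illustration, not the EulerForge implementation:

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Mean pairwise Bradley-Terry loss: -log(sigmoid(r_c - r_r)).

    Uses the identity -log(sigmoid(x)) == log1p(exp(-x)).
    The loss shrinks as the chosen reward exceeds the rejected one.
    """
    losses = [math.log1p(math.exp(-(c - r)))
              for c, r in zip(r_chosen, r_rejected)]
    return sum(losses) / len(losses)

# Equal rewards give the maximum-uncertainty loss, log(2) ≈ 0.693.
print(bradley_terry_loss([1.0], [1.0]))
```

Note that the loss depends only on the reward *difference*, so the RM learns relative preferences; absolute reward values are not anchored.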
## Why You Should Run SFT First

The RM should be trained on top of an SFT-completed model. A base model lacks the foundation to understand differences in response quality, so the RM cannot learn meaningful rewards.

Correct order: **SFT → RM → PPO**. Run SFT first, then specify that checkpoint (`final/`) as the RM's `model_name`.
## Prerequisites

- SFT training complete (`01_dense_lora.md`, etc.)
- EulerForge installation complete (see Getting Started)
- Data preprocessing complete
## Data Format

### Raw Data (Recommended)

`data/dpo_10k_raw.jsonl` is data converted to the standard `prompted_preference` format:

```json
{"prompt": "Question content", "chosen": "Preferred response", "rejected": "Non-preferred response"}
```

Setting `data.format=raw` tokenizes the data automatically during training.
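A quick sanity check for a raw JSONL line, assuming the three-field schema shown above (example content only):

```python
import json

# One raw line in the prompted_preference format (illustrative content).
line = '{"prompt": "Question content", "chosen": "Preferred response", "rejected": "Non-preferred response"}'

example = json.loads(line)

# Each record should carry exactly these three string fields;
# with data.format=raw, tokenization happens at training time.
assert set(example) == {"prompt", "chosen", "rejected"}
assert all(isinstance(v, str) for v in example.values())
```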
## Preset

```yaml
# configs/presets/qwen3.5_0.8b_dense_lora_rm.yml
training:
  type: rm
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]
```
## Running

```bash
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_rm.yml \
  --set data.format=raw \
  --set data.task=prompted_preference \
  --set data.path=data/dpo_10k_raw.jsonl \
  --set data.max_length=256
```
## Architecture

```
Input → LM (LoRA) → Last Hidden State → RewardHead(Linear(H, 1)) → Scalar Reward
```

- `RewardHead` is automatically created and added to the optimizer
- The forward pass uses `output_hidden_states=True` to extract the last layer's hidden state
- Only the hidden state of the last real token is used, selected via `attention_mask`
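The last-token selection described above can be illustrated with a pure-Python stand-in for the `Linear(H, 1)` head. All names here are hypothetical, chosen for the sketch:

```python
def last_token_reward(hidden_states, attention_mask, weight, bias=0.0):
    """Apply a Linear(H, 1) reward head to the hidden state of the
    last real (non-padding) token of each sequence.

    hidden_states:  [batch][seq_len][H] nested lists
    attention_mask: [batch][seq_len] of 0/1, with padding at the end
    weight:         list of H weights for the Linear(H, 1) projection
    """
    rewards = []
    for states, mask in zip(hidden_states, attention_mask):
        last = sum(mask) - 1  # index of the last attended token
        h = states[last]
        rewards.append(sum(w * x for w, x in zip(weight, h)) + bias)
    return rewards

# Batch of 1: seq_len=3 with one trailing padding position, H=2.
hidden = [[[1.0, 0.0], [2.0, 3.0], [9.0, 9.0]]]  # padded position is ignored
mask = [[1, 1, 0]]
print(last_token_reward(hidden, mask, weight=[1.0, 1.0]))  # → [5.0]
```

Masked selection matters: without it, the head would read the hidden state at a padding position, which carries no information about the response.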
## Metrics

| Metric | Description |
|---|---|
| `rm_loss` | Bradley-Terry loss |
| `reward_chosen` | Average reward for chosen responses |
| `reward_rejected` | Average reward for rejected responses |
| `reward_margin` | Average difference between chosen and rejected rewards |
| `accuracy` | Proportion of pairs where `reward_chosen > reward_rejected` |
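How these metrics relate to the per-pair rewards can be shown with a small sketch (a hypothetical helper, not the logging code itself):

```python
def rm_metrics(r_chosen, r_rejected):
    """Compute the logged RM metrics from per-pair scalar rewards."""
    n = len(r_chosen)
    margins = [c - r for c, r in zip(r_chosen, r_rejected)]
    return {
        "reward_chosen": sum(r_chosen) / n,
        "reward_rejected": sum(r_rejected) / n,
        "reward_margin": sum(margins) / n,
        "accuracy": sum(m > 0 for m in margins) / n,
    }

# Two pairs: the RM ranks the first correctly, the second incorrectly.
print(rm_metrics([1.0, 0.0], [0.0, 1.0]))
```

During healthy training, `reward_margin` and `accuracy` should rise together while `rm_loss` falls; an accuracy stuck near 0.5 means the model cannot distinguish chosen from rejected.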