
7. Reward Model (RM) Training

Overview

This tutorial trains a Bradley-Terry reward model: a RewardHead extracts a scalar reward from the model's last hidden state, and training on chosen/rejected pairs minimizes Loss = -log(σ(r_chosen - r_rejected)).
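The pairwise loss can be sketched in plain Python (the function names here are illustrative, not EulerForge API; the real trainer computes this over batches of logits):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigma(r_chosen - r_rejected))."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# When the chosen reward is clearly higher the loss is near zero;
# when the two rewards are equal the loss is log(2) ~ 0.693.
```

Minimizing this loss pushes r_chosen above r_rejected, which is exactly the margin behavior the metrics below track.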

Purpose: Used as a reward function in PPO/RLHF pipelines

Why You Should Run SFT First

The RM, too, should be trained from an SFT-finetuned model. A base model lacks the instruction-following foundation needed to distinguish response quality, so the RM cannot learn meaningful rewards.

Correct order: SFT → RM → PPO

Run SFT first, then pass that checkpoint's final/ directory as the RM's model_name.

Prerequisites

Data Format

data/dpo_10k_raw.jsonl contains data converted to the standard prompted_preference format:

{"prompt": "Question content", "chosen": "Preferred response", "rejected": "Non-preferred response"}

With data.format=raw, the raw text is tokenized automatically during training.
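Before launching a run it can help to check that every line of the JSONL file matches the prompted_preference schema. A minimal stdlib sketch (validate_preference_line is a hypothetical helper, not part of EulerForge):

```python
import json

REQUIRED_KEYS = {"prompt", "chosen", "rejected"}

def validate_preference_line(line: str) -> dict:
    """Parse one JSONL line and verify the prompted_preference keys."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return record

sample = '{"prompt": "Question content", "chosen": "Preferred response", "rejected": "Non-preferred response"}'
record = validate_preference_line(sample)
```

Running this over the whole file catches malformed rows early, before tokenization fails mid-training.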

Preset

# configs/presets/qwen3.5_0.8b_dense_lora_rm.yml
training:
  type: rm
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]
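The trainable list selects which parameter groups receive gradients in that phase. A minimal sketch of name-based selection, under the assumption that groups match parameter-name substrings (the parameter names below are invented for illustration; the RewardHead itself is assumed to be handled separately by the trainer):

```python
def select_trainable(param_names: list[str], trainable_groups: list[str]) -> list[str]:
    """Keep only the parameters whose name matches an enabled group."""
    return [n for n in param_names if any(g in n for g in trainable_groups)]

params = [
    "model.layers.0.mlp.lora_A.weight",
    "model.layers.0.attn.attn_lora_B.weight",
    "model.layers.0.mlp.base.weight",   # frozen base weight
    "reward_head.weight",               # assumed trained separately from this list
]
trainable = select_trainable(params, ["lora", "attn_lora"])
# base weights stay frozen; only the LoRA adapters update
```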

Running

eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_rm.yml \
    --set data.format=raw \
    --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl \
    --set data.max_length=256

Architecture

Input → LM (LoRA) → Last Hidden State → RewardHead(Linear(H,1)) → Scalar Reward
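The RewardHead step above is a single Linear(H, 1) projection applied to the final token's hidden vector. A dependency-free sketch (the framework applies this as a learned linear layer over batched tensors):

```python
import random

def reward_head(last_hidden_state: list[list[float]], weight: list[float], bias: float) -> float:
    """Linear(H, 1): project the final token's hidden vector to a scalar reward."""
    final_token = last_hidden_state[-1]  # hidden vector of the last sequence position
    return sum(w * h for w, h in zip(weight, final_token)) + bias

H = 4
random.seed(0)
hidden = [[random.gauss(0.0, 1.0) for _ in range(H)] for _ in range(3)]  # seq_len = 3
w = [0.5] * H
reward = reward_head(hidden, w, 0.0)  # one scalar per sequence
```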

Metrics

Metric           Description
rm_loss          Bradley-Terry pairwise loss
reward_chosen    Mean reward assigned to chosen responses
reward_rejected  Mean reward assigned to rejected responses
reward_margin    Mean of reward_chosen - reward_rejected
accuracy         Fraction of pairs where r_chosen > r_rejected
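These metrics all derive from the per-pair scalar rewards. A stdlib sketch of how they might be computed over a batch (rm_metrics is an illustrative helper, not EulerForge's internal name):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def rm_metrics(rewards_chosen: list[float], rewards_rejected: list[float]) -> dict:
    """Compute the logged RM metrics from per-pair scalar rewards."""
    n = len(rewards_chosen)
    margins = [c - r for c, r in zip(rewards_chosen, rewards_rejected)]
    return {
        "rm_loss": sum(-math.log(sigmoid(m)) for m in margins) / n,
        "reward_chosen": sum(rewards_chosen) / n,
        "reward_rejected": sum(rewards_rejected) / n,
        "reward_margin": sum(margins) / n,
        "accuracy": sum(m > 0 for m in margins) / n,
    }

m = rm_metrics([1.0, 0.5, 2.0], [0.0, 1.0, 0.5])
# two of the three pairs rank chosen above rejected
```

A rising accuracy with a growing reward_margin is the usual sign the RM is learning a useful preference signal for the downstream PPO stage.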