20. Lab: Math / Coding Model
Difficulty: Intermediate | GPU: RTX 3090-5090 | Estimated Time: 2-6 hours
Objective
Build an SFT -> DPO pipeline that specializes Llama-3.2-1B for math problem solving. Run SFT and DPO with two injection strategies (dense_lora, moe_expert_lora), comparing a total of four training runs (2 strategies x 2 stages).
Why Llama-3.2-1B?
- 1B parameters are not sufficient for math reasoning (expressiveness limitation)
- However, experiments complete within 10 minutes on a single RTX 3090-5090 GPU
- The key learning is the pipeline pattern of injecting basic math ability via SFT and then fine-tuning with DPO
- In production, apply the same pipeline to 3B-8B models
Prerequisites
Data Download
Run Section 2 (Math Data) from 19_data_collection.md to prepare the following files:
```
data/math/sft/math_orca_200000.jsonl     # SFT training (200K)
data/math/sft/math_orca_35.jsonl         # SFT bench (35 rows)
data/math/dpo/math_step_dpo_10700.jsonl  # DPO training (10.7K)
```
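Before launching a multi-hour run, it is worth sanity-checking the downloaded JSONL files. A minimal sketch in Python — the field names passed in the commented examples are assumptions, not the dataset's confirmed schema; inspect one line of your actual files first:

```python
import json

def check_jsonl(path, required_fields):
    """Count rows and verify every record carries the expected keys."""
    n = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            missing = [k for k in required_fields if k not in rec]
            if missing:
                raise ValueError(f"line {n + 1}: missing fields {missing}")
            n += 1
    return n

# Hypothetical field names -- adjust to the real schema of your files:
# check_jsonl("data/math/sft/math_orca_200000.jsonl", ["question", "answer"])
# check_jsonl("data/math/dpo/math_step_dpo_10700.jsonl", ["prompt", "chosen", "rejected"])
```

A mismatch here fails in seconds instead of surfacing as a cryptic collate error mid-training.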
Pipeline Script
```bash
# Copy the script to the project root (same level as configs/, data/)
cp examples/run_llama32_pipeline_math.sh ./
chmod +x run_llama32_pipeline_math.sh
```
Architecture
```
          SFT (Orca-Math 200K)
┌──────────────┐     ┌─────────────────────┐
│  dense_lora  │     │   moe_expert_lora   │
│ (plain LoRA) │     │  (FFN->MoE+expert)  │
└──────┬───────┘     └─────────┬───────────┘
       v                       v
     Bench                   Bench
       │                       │
       v                       v
┌──────────────┐     ┌─────────────────────┐
│  dense_lora  │     │   moe_expert_lora   │
│     DPO      │     │ DPO (router frozen) │
└──────┬───────┘     └─────────┬───────────┘
       v                       v
     Bench                   Bench
```
Config Explanation
Dense LoRA SFT (configs/presets/math/llama3.2_dense_lora_sft.yml)
```yaml
injection:
  strategy: dense_lora
  lora_r: 64        # Large r -> higher expressiveness (compensating for 1B model)
  lora_alpha: 64    # alpha/r = 1.0 (standard)
  start_layer: 4    # Skip early layers as they capture general features
  num_layers: 0     # All layers after start_layer
training:
  lr: 1.5e-4        # Relatively high LR for SFT
  max_train_steps: 10000
  batch_size: 10
```
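To see why `lora_r: 64` counts as "large", it helps to count the parameters it adds. A rank-r adapter on a `(d_out x d_in)` linear layer contributes `r * (d_in + d_out)` trainable weights. The sketch below assumes Llama-3.2-1B-style dimensions (hidden size 2048, GQA key/value projections of 512) — verify these against the real model config before relying on the numbers:

```python
def lora_params(d_in, d_out, r):
    """A rank-r LoRA adapter on a (d_out x d_in) linear adds r*(d_in + d_out) params."""
    return r * (d_in + d_out)

# Assumed dims for Llama-3.2-1B attention (check the actual config!):
hidden, kv = 2048, 512
r = 64
per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv, r)      # k_proj
    + lora_params(hidden, kv, r)      # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
print(f"~{per_layer:,} trainable params per attention block at r={r}")
```

Multiplied across the layers from `start_layer: 4` onward, this is a nontrivial adapter budget for a 1B model, which is exactly the point of the large r.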
MoE Expert LoRA SFT (configs/presets/math/llama3.2_moe_expert_lora_sft.yml)
```yaml
injection:
  strategy: moe_expert_lora
  num_experts: 4    # 4 experts
  top_k: 2          # Select 2 per token
training:
  phases:
    - step: 0
      trainable: ["lora", "attn_lora", "router"]              # LoRA + Router simultaneously
    - step: 5000
      trainable: ["lora", "attn_lora", "router", "base_ffn"]  # Open base FFN as well
  lr: 1.0e-4
  max_train_steps: 30000   # MoE requires longer training
```
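Conceptually, the `num_experts: 4, top_k: 2` setting means each token's hidden state is scored by a router, the two best-scoring experts process it, and their outputs are blended by renormalized router weights. A minimal pure-Python sketch of that selection step — illustrative only, not the library's actual routing code:

```python
import math

def route_top_k(logits, k=2):
    """Softmax the router logits, keep the top-k experts,
    and renormalize their weights so they sum to 1."""
    probs = [math.exp(x) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    top = sorted(range(len(logits)), key=lambda i: -probs[i])[:k]
    z = sum(probs[i] for i in top)
    return [(i, probs[i] / z) for i in top]  # (expert index, mixing weight)

# One token's router logits over num_experts=4:
print(route_top_k([2.0, 0.5, 1.5, -1.0], k=2))
```

This is also why the note below about expert diversity matters: on a single-domain corpus the router has little reason to separate tokens, so all four experts can drift toward the same function.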
Note: MoE is somewhat overkill for a single-domain task like math. All 4 experts may specialize in math, leading to reduced diversity. MoE truly shines when training across multiple domains simultaneously (math+finance+CoT+general) -- see 23_lab_full_pipeline_moe.md for this.
Dense LoRA DPO (configs/presets/math/llama3.2_dense_lora_dpo.yml)
```yaml
training:
  type: dpo
  dpo_beta: 0.05           # Small beta -> wider policy-deviation tolerance
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]  # base_ffn closed!
  lr: 2.0e-6               # ~75x lower than SFT!
  max_train_steps: 2500    # ~1 epoch
  grad_accum_steps: 16     # effective batch 64
  max_grad_norm: 0.5       # Stronger gradient clipping
```
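The role of `dpo_beta` falls directly out of the DPO loss: `-log sigmoid(beta * margin)`, where the margin is the chosen log-ratio minus the rejected log-ratio (policy vs. reference). A minimal sketch with sequence-level log-probs and no batching, using made-up numbers purely to show the shape of the computation:

```python
import math

def dpo_loss(beta, pi_chosen, pi_rejected, ref_chosen, ref_rejected):
    """Standard DPO loss for one preference pair.
    pi_* are log-probs under the policy, ref_* under the frozen reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Same preference margin, the two betas used in this lab. A smaller beta
# scales the margin down, giving a flatter loss -- i.e. more tolerance for
# the policy to drift from the SFT reference before the loss saturates.
print(dpo_loss(0.05, -10.0, -14.0, -11.0, -12.0))  # dense_lora setting
print(dpo_loss(0.20, -10.0, -14.0, -11.0, -12.0))  # moe setting
```

Note that at margin 0 (policy identical to reference) the loss is exactly ln(2) ≈ 0.693 regardless of beta, which is why DPO training is expected to start near that value.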
MoE Expert LoRA DPO (configs/presets/math/llama3.2_moe_expert_lora_dpo.yml)
```yaml
training:
  type: dpo
  dpo_beta: 0.20           # More conservative for MoE (higher beta)
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]  # Router FROZEN! (sensitive)
  lr: 1.5e-6
  max_train_steps: 1600
  max_grad_norm: 0.3       # Stronger clipping
```
Router Freezing: The router trained during MoE SFT is frozen during DPO. If the router overfits to preference data, the existing routing scheme can collapse, degrading overall performance.
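One way to picture how a `trainable` phase list gates parameters: each group name maps to a substring of parameter names, and a parameter is trainable only if it matches an open group. The sketch below is an illustration of that idea — the parameter names and the group-to-substring mapping are hypothetical, not eulerforge's actual naming convention:

```python
def select_trainable(param_names, trainable_groups):
    """Return {param_name: bool} given the open groups for this phase.
    The name patterns here are guesses at a plausible naming scheme."""
    group_patterns = {
        "lora": "ffn.lora",
        "attn_lora": "attn.lora",
        "router": "router",
        "base_ffn": "mlp.weight",
    }
    open_patterns = [group_patterns[g] for g in trainable_groups]
    return {n: any(p in n for p in open_patterns) for n in param_names}

names = [
    "layers.4.attn.lora_A",   # attention LoRA
    "layers.4.ffn.lora_A",    # expert FFN LoRA
    "layers.4.router.gate",   # MoE router
    "layers.4.mlp.weight",    # base FFN
]
# MoE DPO phase: router and base FFN both stay frozen
print(select_trainable(names, ["lora", "attn_lora"]))
```

Comparing this call against the SFT phase list (`["lora", "attn_lora", "router"]`) makes the freezing decision concrete: DPO simply drops `router` from the open set.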
DPO Hyperparameter Guide
The fundamental principle for DPO is to configure it very conservatively so as not to disturb SFT results:
| Parameter | SFT Range | Recommended DPO Range | Reason |
|---|---|---|---|
| `lr` | 1e-4 to 1e-3 | 1e-6 to 5e-6 | Preserve SFT gains |
| `dpo_beta` | -- | 0.05 to 0.3 | Higher values stay closer to the existing policy |
| `max_grad_norm` | 1.0 | 0.3 to 0.5 | Prevent abrupt changes |
| `base_ffn` | Open in later phases | Closed throughout | Limit change magnitude |
| `max_train_steps` | 10K-30K | 1K-3K | Prevent overfitting (~1 epoch) |
DPO-specific knobs:
- `dpo_beta`: The key value controlling how far the policy can deviate from the reference. Higher beta reduces deviation from the SFT policy. If you want to "gently refine preferences on top of a well-trained SFT model," try `beta: 0.2-0.3` first.
- `label_smoothing`: Mitigates overfitting to noisy preference data. Robust DPO uses values between 0.0 and 0.5.
- Overfitting symptoms: `reward_margin` exploding above 5, `accuracy` stuck at 1.0, abnormal length/diversity degradation on the rejected side.
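The `label_smoothing` knob corresponds to the conservative-DPO objective: with smoothing `eps`, the loss assumes each preference label might be flipped with probability `eps`, which penalizes the model for becoming overconfident on any single pair. A sketch of the objective (margin here is the same policy-vs-reference log-ratio difference as in standard DPO):

```python
import math

def robust_dpo_loss(beta, margin, label_smoothing=0.0):
    """Label-smoothed ("conservative") DPO loss for one preference pair."""
    def logsig(x):
        return -math.log(1.0 + math.exp(-x))
    eps = label_smoothing
    # Blend the loss for the label as given with the loss for the flipped label.
    return -(1 - eps) * logsig(beta * margin) - eps * logsig(-beta * margin)

# A confidently-huge margin is nearly free without smoothing,
# but gets penalized once smoothing is on:
for eps in (0.0, 0.1, 0.3):
    print(f"eps={eps}: loss={robust_dpo_loss(0.2, 30.0, eps):.4f}")
```

This is exactly the mechanism that counters the "`reward_margin` exploding above 5" symptom: past a point, pushing the margin further increases the smoothed loss instead of decreasing it.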
Execution
Method 1: Pipeline Script (Recommended)
```bash
# Run only dense_lora (SFT -> DPO)
RUN_DENSE_LORA=true RUN_MOE_EXPERT_LORA=false \
  ./run_llama32_pipeline_math.sh

# Run only moe_expert_lora
RUN_DENSE_LORA=false RUN_MOE_EXPERT_LORA=true \
  ./run_llama32_pipeline_math.sh

# Run both
./run_llama32_pipeline_math.sh
```
The script automatically runs benchmarks at each stage:
- SFT bench: compares the SFT model against the base model (Llama-3.2-1B)
- DPO bench: compares the DPO model against the base model
Method 2: Manual Execution
```bash
# 1. Dense LoRA SFT
eulerforge train \
  --preset configs/presets/math/llama3.2_dense_lora_sft.yml \
  --set data.format=raw --set data.task=sft \
  --set data.path=data/math/sft/math_orca_200000.jsonl \
  --set data.max_length=512 \
  --output-dir outputs/math/dense_lora/sft

# 2. Dense LoRA DPO (starts from the SFT checkpoint)
eulerforge train \
  --preset configs/presets/math/llama3.2_dense_lora_dpo.yml \
  --set model_name=outputs/math/dense_lora/sft/final \
  --set data.format=raw --set data.task=prompted_preference \
  --set data.path=data/math/dpo/math_step_dpo_10700.jsonl \
  --set data.max_length=1024 \
  --output-dir outputs/math/dense_lora/dpo
```
Output Structure
```
outputs/math_simple/llama3.2_1b/
├── models/
│   ├── dense_lora/
│   │   ├── sft/final/        # SFT checkpoint
│   │   └── dpo/final/        # DPO checkpoint
│   └── moe_expert_lora/
│       ├── sft/final/
│       └── dpo/final/
└── benchs/
    ├── dense_lora/
    │   ├── sft/              # SFT bench results
    │   └── dpo/              # DPO bench results
    └── moe_expert_lora/
        ├── sft/
        └── dpo/
```
Interpreting Results
Expected Patterns
- SFT: Math score improves over the base model (1B pretrained); expect bench avg_score to move from roughly 1-2 points to 3-5 points
- DPO: Slight improvement or maintenance compared to SFT. DPO is "preference alignment," so expect answer quality/format improvement rather than dramatic gains
- Dense vs MoE: No significant difference for a single domain. MoE advantages emerge in multi-domain settings
Important Notes
- 1B models have fundamental limitations in math reasoning -- low scores are normal
- DPO initial loss starting near ln(2) ≈ 0.693 is normal (the reference model is the SFT model, so the initial margin is zero)
- A gradual increase in `reward_margin` is a sign that training is progressing well
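The health signals above lend themselves to a simple automated check on logged metrics. A minimal sketch — the metric names `reward_margin` and `accuracy` follow this doc's overfitting symptoms, and the thresholds are the ones quoted there; adapt both to whatever your trainer actually logs:

```python
def dpo_health(metrics):
    """Flag the DPO overfitting symptoms described in this lab."""
    warnings = []
    if metrics.get("reward_margin", 0.0) > 5.0:
        warnings.append("reward_margin exploding (>5)")
    if metrics.get("accuracy", 0.0) >= 1.0:
        warnings.append("accuracy saturated at 1.0")
    return warnings

print(dpo_health({"reward_margin": 7.2, "accuracy": 1.0}))  # unhealthy run
print(dpo_health({"reward_margin": 1.1, "accuracy": 0.8}))  # healthy run
```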
Next Steps
- Once you have seen why MoE is overkill for a single domain, see 23_lab_full_pipeline_moe.md to mix math+finance+CoT+general SFT and observe the real effect of MoE
- DPO hyperparameter tuning: use 12_grid_search.md to automatically search `dpo_beta` and `lr`
- Apply the same pipeline to larger models (3B, 7B)