
20. Lab: Math / Coding Model

Difficulty: Intermediate | GPU: RTX 3090-5090 | Estimated Time: 2-6 hours

Objective

Build an SFT -> DPO pipeline that specializes Llama-3.2-1B for math problem solving. Run both SFT and DPO with two injection strategies (dense_lora, moe_expert_lora), comparing four training runs in total.

Why Llama-3.2-1B?

At 1B parameters, the model is small enough to complete all four SFT+DPO training runs on a single consumer GPU (RTX 3090-5090 class) within the 2-6 hour budget, while still being large enough to show a measurable gain from math specialization.

Prerequisites

Data Download

Run Section 2 (Math Data) from 19_data_collection.md to prepare the following files:

data/math/sft/math_orca_200000.jsonl    # SFT training (200K)
data/math/sft/math_orca_35.jsonl        # SFT bench (35 rows)
data/math/dpo/math_step_dpo_10700.jsonl # DPO training (10.7K)
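
Before training, it can help to sanity-check the JSONL files. The sketch below validates row counts and required keys; note that the field names (`question`/`answer` for SFT, `prompt`/`chosen`/`rejected` for DPO) are assumptions about the data format, not something this tutorial specifies — adjust them to match the output of 19_data_collection.md.

```python
import json

# Hypothetical record schemas -- the key names are assumptions, adjust to your data.
REQUIRED_KEYS = {
    "sft": {"question", "answer"},
    "dpo": {"prompt", "chosen", "rejected"},
}

def validate_jsonl(lines, task):
    """Return the number of valid rows; raise on a malformed row."""
    count = 0
    for i, line in enumerate(lines):
        row = json.loads(line)
        missing = REQUIRED_KEYS[task] - row.keys()
        if missing:
            raise ValueError(f"row {i} missing keys: {missing}")
        count += 1
    return count

# Usage with an inline sample row (replace with the lines of the real files):
sample = ['{"question": "2+2?", "answer": "4"}']
print(validate_jsonl(sample, "sft"))  # -> 1
```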

Pipeline Script

# Copy the script to the project root (same level as configs/, data/)
cp examples/run_llama32_pipeline_math.sh ./
chmod +x run_llama32_pipeline_math.sh

Architecture

              SFT (Orca-Math 200K)
  ┌───────────────┐    ┌─────────────────────┐
  │  dense_lora   │    │  moe_expert_lora    │
  │  (plain LoRA) │    │  (FFN->MoE+expert)  │
  └───────┬───────┘    └──────────┬──────────┘
          v                       v
        Bench                   Bench
          │                       │
          v                       v
              DPO (Step-DPO 10.7K)
  ┌───────────────┐    ┌─────────────────────┐
  │  dense_lora   │    │  moe_expert_lora    │
  │  DPO          │    │ DPO (router frozen) │
  └───────┬───────┘    └──────────┬──────────┘
          v                       v
        Bench                   Bench

Config Explanation

Dense LoRA SFT (configs/presets/math/llama3.2_dense_lora_sft.yml)

injection:
  strategy: dense_lora
  lora_r: 64              # Large r -> higher expressiveness (compensating for 1B model)
  lora_alpha: 64           # alpha/r = 1.0 (standard)
  start_layer: 4           # Skip early layers as they capture general features
  num_layers: 0            # All layers after start_layer
training:
  lr: 1.5e-4               # Relatively high LR for SFT
  max_train_steps: 10000
  batch_size: 10
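
The lora_r/lora_alpha pair above determines the adapter's scaling. As a reminder of the mechanics, here is a minimal NumPy sketch of a LoRA forward pass — the alpha/r scaling is the standard LoRA formulation; the hidden size is illustrative, not Llama-3.2-1B's actual dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 256, 64, 64               # hidden size (illustrative), lora_r, lora_alpha

W = rng.standard_normal((d, d))         # frozen base weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-init so the delta starts at 0

def lora_forward(x):
    # y = W x + (alpha/r) * B A x ; with alpha == r the scale is exactly 1.0
    scale = alpha / r
    return W @ x + scale * (B @ (A @ x))

x = rng.standard_normal(d)
# With B zero-initialized, the adapter contributes nothing before training:
assert np.allclose(lora_forward(x), W @ x)
```

With r = alpha = 64 the scale is 1.0, so the adapter's effective magnitude is controlled entirely by what A and B learn.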

MoE Expert LoRA SFT (configs/presets/math/llama3.2_moe_expert_lora_sft.yml)

injection:
  strategy: moe_expert_lora
  num_experts: 4           # 4 experts
  top_k: 2                 # Select 2 per token
training:
  phases:
    - step: 0
      trainable: ["lora", "attn_lora", "router"]    # LoRA + Router simultaneously
    - step: 5000
      trainable: ["lora", "attn_lora", "router", "base_ffn"]  # Open base FFN as well
  lr: 1.0e-4
  max_train_steps: 30000   # MoE requires longer training
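
To make num_experts/top_k concrete, here is a minimal sketch of top-2 routing over 4 experts (pure Python, not EulerForge's implementation): the router scores every expert, keeps the top 2 per token, renormalizes their softmax weights, and mixes only those experts' outputs.

```python
import math

def top_k_route(logits, k=2):
    """Select top-k experts and renormalize their softmax weights."""
    probs = [math.exp(l) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    top = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]   # (expert index, gate weight)

# 4 experts, one token's router logits:
routes = top_k_route([2.0, 0.5, 1.5, -1.0], k=2)
print(routes)  # experts 0 and 2 win; their renormalized weights sum to 1.0
```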

Note: MoE is somewhat overkill for a single-domain task like math. All 4 experts may specialize in math, leading to reduced diversity. MoE truly shines when training across multiple domains simultaneously (math+finance+CoT+general) -- see 23_lab_full_pipeline_moe.md for this.

Dense LoRA DPO (configs/presets/math/llama3.2_dense_lora_dpo.yml)

training:
  type: dpo
  dpo_beta: 0.05           # Small beta -> wider policy deviation tolerance
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]   # base_ffn closed!
  lr: 2.0e-6               # ~75x lower than SFT!
  max_train_steps: 2500    # ~1 epoch
  grad_accum_steps: 16     # effective batch 64
  max_grad_norm: 0.5       # Stronger gradient clipping
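
The role of dpo_beta is easiest to see in the DPO objective itself: -log sigmoid(beta * ((logp_chosen - logp_rejected) - (ref_chosen - ref_rejected))). A small beta flattens the loss around the reference policy, tolerating wider deviation, which is what the comment above means. A minimal sketch with made-up log-probabilities:

```python
import math

def dpo_loss(beta, logp_c, logp_r, ref_c, ref_r):
    """Standard DPO loss for one preference pair (log-probs here are illustrative)."""
    margin = (logp_c - logp_r) - (ref_c - ref_r)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Same policy/reference gap, two betas:
args = dict(logp_c=-10.0, logp_r=-12.0, ref_c=-11.0, ref_r=-11.5)
loose = dpo_loss(0.05, **args)   # beta from this config
tight = dpo_loss(0.20, **args)   # beta from the MoE DPO config below
# For a positive margin, the smaller beta leaves the loss closer to log(2),
# i.e. a weaker learning signal per pair:
assert loose > tight
```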

MoE Expert LoRA DPO (configs/presets/math/llama3.2_moe_expert_lora_dpo.yml)

training:
  type: dpo
  dpo_beta: 0.20            # More conservative for MoE (higher beta)
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]   # Router FROZEN! (sensitive)
  lr: 1.5e-6
  max_train_steps: 1600
  max_grad_norm: 0.3        # Stronger clipping

Router Freezing: The router trained during MoE SFT is frozen during DPO. If the router overfits to preference data, the existing routing scheme can collapse, degrading overall performance.
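
Operationally, "router frozen" just means the router's parameter group never appears in the DPO phase's trainable list. A sketch using a plain dict of parameter names (the names are hypothetical, not EulerForge's actual naming):

```python
# Hypothetical parameter-name -> group mapping, for illustration only.
param_groups = {
    "layers.4.ffn.experts.0.lora_A": "lora",
    "layers.4.ffn.router.gate":      "router",
    "layers.4.attn.q_proj.lora_A":   "attn_lora",
    "layers.4.ffn.up_proj.weight":   "base_ffn",
}

def select_trainable(param_groups, trainable):
    """Mirror the phase config: only listed groups would get requires_grad=True."""
    return {name: group in trainable for name, group in param_groups.items()}

dpo_phase = select_trainable(param_groups, trainable=["lora", "attn_lora"])
# The router (and base FFN) stay frozen during MoE DPO:
assert dpo_phase["layers.4.ffn.router.gate"] is False
```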


DPO Hyperparameter Guide

The fundamental principle for DPO is to configure it very conservatively so as not to disturb SFT results:

| Parameter       | SFT Range            | Recommended DPO Range | Reason                                           |
|-----------------|----------------------|-----------------------|--------------------------------------------------|
| lr              | 1e-4 to 1e-3         | 1e-6 to 5e-6          | Preserve SFT gains                               |
| dpo_beta        | --                   | 0.05 to 0.3           | Higher values stay closer to the existing policy |
| max_grad_norm   | 1.0                  | 0.3 to 0.5            | Prevent abrupt changes                           |
| base_ffn        | Open in later phases | Closed throughout     | Limit change magnitude                           |
| max_train_steps | 10K-30K              | 1K-3K                 | Prevent overfitting (~1 epoch)                   |
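
The lr row can be sanity-checked against the presets in this lab: dense SFT uses 1.5e-4 and dense DPO uses 2.0e-6, a 75x reduction that lands inside the recommended range.

```python
# Learning rates from the dense_lora presets above.
sft_lr, dpo_lr = 1.5e-4, 2.0e-6

ratio = sft_lr / dpo_lr              # ~75x lower for DPO
assert 1e-6 <= dpo_lr <= 5e-6        # within the recommended DPO range
assert 70 < ratio < 80
```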

DPO-specific knobs:


Method 1: Pipeline Script Execution

# Run only dense_lora (SFT -> DPO)
RUN_DENSE_LORA=true RUN_MOE_EXPERT_LORA=false \
  ./run_llama32_pipeline_math.sh

# Run only moe_expert_lora
RUN_DENSE_LORA=false RUN_MOE_EXPERT_LORA=true \
  ./run_llama32_pipeline_math.sh

# Run both
./run_llama32_pipeline_math.sh

The script automatically runs benchmarks at each stage:

- SFT bench: compares the SFT model against the base model (Llama-3.2-1B)
- DPO bench: compares the DPO model against the base model

Method 2: Manual Execution

# 1. Dense LoRA SFT
eulerforge train \
  --preset configs/presets/math/llama3.2_dense_lora_sft.yml \
  --set data.format=raw --set data.task=sft \
  --set data.path=data/math/sft/math_orca_200000.jsonl \
  --set data.max_length=512 \
  --output-dir outputs/math/dense_lora/sft

# 2. Dense LoRA DPO (based on SFT checkpoint)
eulerforge train \
  --preset configs/presets/math/llama3.2_dense_lora_dpo.yml \
  --set model_name=outputs/math/dense_lora/sft/final \
  --set data.format=raw --set data.task=prompted_preference \
  --set data.path=data/math/dpo/math_step_dpo_10700.jsonl \
  --set data.max_length=1024 \
  --output-dir outputs/math/dense_lora/dpo

Output Structure

outputs/math_simple/llama3.2_1b/
├── models/
│   ├── dense_lora/
│   │   ├── sft/final/       # SFT checkpoint
│   │   └── dpo/final/       # DPO checkpoint
│   └── moe_expert_lora/
│       ├── sft/final/
│       └── dpo/final/
└── benchs/
    ├── dense_lora/
    │   ├── sft/             # SFT bench results
    │   └── dpo/             # DPO bench results
    └── moe_expert_lora/
        ├── sft/
        └── dpo/

Interpreting Results

Expected Patterns

  1. SFT: the math score improves over the base model (1B pretrained); bench avg_score typically rises from 1-2 points to 3-5 points
  2. DPO: slight improvement over, or parity with, the SFT model. DPO is preference alignment, so expect gains in answer quality and format rather than dramatic score jumps
  3. Dense vs MoE: no significant difference in a single domain; MoE's advantages emerge in multi-domain settings

Important Notes


Next Steps