20. Lab: Math / Coding Model
Difficulty: Intermediate | GPU: RTX 3090-5090 | Estimated Time: 2-6 hours
Objective
Build an SFT -> DPO pipeline that specializes Llama-3.2-1B for math problem solving. Run SFT and DPO with two injection strategies (dense_lora, moe_expert_lora), comparing a total of four training runs (2 strategies x 2 stages).
Why Llama-3.2-1B?
- 1B parameters are not sufficient for math reasoning (expressiveness limitation)
- However, experiments complete within 10 minutes on a single RTX 3090-5090 GPU
- The key learning is the pipeline pattern of injecting basic math ability via SFT and then fine-tuning with DPO
- In production, apply the same pipeline to 3B-8B models
Prerequisites
Data Download
Run Section 2 (Math Data) from 19_data_collection.md to prepare the following files:
```
data/math/sft/math_orca_200000.jsonl     # SFT training (200K)
data/math/sft/math_orca_35.jsonl         # SFT bench (35 rows)
data/math/dpo/math_step_dpo_10700.jsonl  # DPO training (10.7K)
```
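Before launching a multi-hour run, it is worth sanity-checking the downloaded JSONL files. A minimal sketch in Python — the field names passed in the commented examples are assumptions, not the dataset's confirmed schema; inspect one line of your actual files first:

```python
import json

def check_jsonl(path, required_fields):
    """Count rows and verify every record carries the expected keys."""
    n = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            missing = [k for k in required_fields if k not in rec]
            if missing:
                raise ValueError(f"line {n + 1}: missing fields {missing}")
            n += 1
    return n

# Hypothetical field names -- adjust to the real schema of your files:
# check_jsonl("data/math/sft/math_orca_200000.jsonl", ["question", "answer"])
# check_jsonl("data/math/dpo/math_step_dpo_10700.jsonl", ["prompt", "chosen", "rejected"])
```

A mismatch here fails in seconds instead of surfacing as a cryptic collate error mid-training.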
Pipeline Script
```bash
# Copy the script to the project root (same level as configs/, data/)
cp examples/run_llama32_pipeline_math.sh ./
chmod +x run_llama32_pipeline_math.sh
```
Architecture
```
          SFT (Orca-Math 200K)
┌──────────────┐     ┌─────────────────────┐
│  dense_lora  │     │   moe_expert_lora   │
│ (plain LoRA) │     │  (FFN->MoE+expert)  │
└──────┬───────┘     └─────────┬───────────┘
       v                       v
     Bench                   Bench
       │                       │
       v                       v
┌──────────────┐     ┌─────────────────────┐
│  dense_lora  │     │   moe_expert_lora   │
│     DPO      │     │ DPO (router frozen) │
└──────┬───────┘     └─────────┬───────────┘
       v                       v
     Bench                   Bench
```
Config Explanation
Dense LoRA SFT (configs/presets/math/llama3.2_dense_lora_sft.yml)
```yaml
injection:
  strategy: dense_lora
  lora_r: 64        # Large r -> higher expressiveness (compensating for 1B model)
  lora_alpha: 64    # alpha/r = 1.0 (standard)
  start_layer: 4    # Skip early layers as they capture general features
  num_layers: 0     # All layers after start_layer
training:
  lr: 1.5e-4        # Relatively high LR for SFT
  max_train_steps: 10000
  batch_size: 10
```
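To see why `lora_r: 64` counts as "large", it helps to count the parameters it adds. A rank-r adapter on a `(d_out x d_in)` linear layer contributes `r * (d_in + d_out)` trainable weights. The sketch below assumes Llama-3.2-1B-style dimensions (hidden size 2048, GQA key/value projections of 512) — verify these against the real model config before relying on the numbers:

```python
def lora_params(d_in, d_out, r):
    """A rank-r LoRA adapter on a (d_out x d_in) linear adds r*(d_in + d_out) params."""
    return r * (d_in + d_out)

# Assumed dims for Llama-3.2-1B attention (check the actual config!):
hidden, kv = 2048, 512
r = 64
per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv, r)      # k_proj
    + lora_params(hidden, kv, r)      # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
print(f"~{per_layer:,} trainable params per attention block at r={r}")
```

Multiplied across the layers from `start_layer: 4` onward, this is a nontrivial adapter budget for a 1B model, which is exactly the point of the large r.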
MoE Expert LoRA SFT (configs/presets/math/llama3.2_moe_expert_lora_sft.yml)
```yaml
injection:
  strategy: moe_expert_lora
  num_experts: 4    # 4 experts
  top_k: 2          # Select 2 per token
training:
  phases:
    - step: 0
      trainable: ["lora", "attn_lora", "router"]              # LoRA + Router simultaneously
    - step: 5000
      trainable: ["lora", "attn_lora", "router", "base_ffn"]  # Open base FFN as well
  lr: 1.0e-4
  max_train_steps: 30000   # MoE requires longer training
```
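Conceptually, the `num_experts: 4, top_k: 2` setting means each token's hidden state is scored by a router, the two best-scoring experts process it, and their outputs are blended by renormalized router weights. A minimal pure-Python sketch of that selection step — illustrative only, not the library's actual routing code:

```python
import math

def route_top_k(logits, k=2):
    """Softmax the router logits, keep the top-k experts,
    and renormalize their weights so they sum to 1."""
    probs = [math.exp(x) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    top = sorted(range(len(logits)), key=lambda i: -probs[i])[:k]
    z = sum(probs[i] for i in top)
    return [(i, probs[i] / z) for i in top]  # (expert index, mixing weight)

# One token's router logits over num_experts=4:
print(route_top_k([2.0, 0.5, 1.5, -1.0], k=2))
```

This is also why the note below about expert diversity matters: on a single-domain corpus the router has little reason to separate tokens, so all four experts can drift toward the same function.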
Note: MoE is somewhat overkill for a single-domain task like math. All 4 experts may specialize in math, leading to reduced diversity. MoE truly shines when training across multiple domains simultaneously (math+finance+CoT+general) -- see 23_lab_full_pipeline_moe.md for this.
Dense LoRA DPO (configs/presets/math/llama3.2_dense_lora_dpo.yml)
```yaml
training:
  type: dpo
  dpo_beta: 0.05           # Small beta -> wider policy-deviation tolerance
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]  # base_ffn closed!
  lr: 2.0e-6               # ~75x lower than SFT!
  max_train_steps: 2500    # ~1 epoch
  grad_accum_steps: 16     # effective batch 64
  max_grad_norm: 0.5       # Stronger gradient clipping
```
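The role of `dpo_beta` falls directly out of the DPO loss: `-log sigmoid(beta * margin)`, where the margin is the chosen log-ratio minus the rejected log-ratio (policy vs. reference). A minimal sketch with sequence-level log-probs and no batching, using made-up numbers purely to show the shape of the computation:

```python
import math

def dpo_loss(beta, pi_chosen, pi_rejected, ref_chosen, ref_rejected):
    """Standard DPO loss for one preference pair.
    pi_* are log-probs under the policy, ref_* under the frozen reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Same preference margin, the two betas used in this lab. A smaller beta
# scales the margin down, giving a flatter loss -- i.e. more tolerance for
# the policy to drift from the SFT reference before the loss saturates.
print(dpo_loss(0.05, -10.0, -14.0, -11.0, -12.0))  # dense_lora setting
print(dpo_loss(0.20, -10.0, -14.0, -11.0, -12.0))  # moe setting
```

Note that at margin 0 (policy identical to reference) the loss is exactly ln(2) ≈ 0.693 regardless of beta, which is why DPO training is expected to start near that value.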
MoE Expert LoRA DPO (configs/presets/math/llama3.2_moe_expert_lora_dpo.yml)
```yaml
training:
  type: dpo
  dpo_beta: 0.20           # More conservative for MoE (higher beta)
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]  # Router FROZEN! (sensitive)
  lr: 1.5e-6
  max_train_steps: 1600
  max_grad_norm: 0.3       # Stronger clipping
```
Router Freezing: The router trained during MoE SFT is frozen during DPO. If the router overfits to preference data, the existing routing scheme can collapse, degrading overall performance.
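One way to picture how a `trainable` phase list gates parameters: each group name maps to a substring of parameter names, and a parameter is trainable only if it matches an open group. The sketch below is an illustration of that idea — the parameter names and the group-to-substring mapping are hypothetical, not eulerforge's actual naming convention:

```python
def select_trainable(param_names, trainable_groups):
    """Return {param_name: bool} given the open groups for this phase.
    The name patterns here are guesses at a plausible naming scheme."""
    group_patterns = {
        "lora": "ffn.lora",
        "attn_lora": "attn.lora",
        "router": "router",
        "base_ffn": "mlp.weight",
    }
    open_patterns = [group_patterns[g] for g in trainable_groups]
    return {n: any(p in n for p in open_patterns) for n in param_names}

names = [
    "layers.4.attn.lora_A",   # attention LoRA
    "layers.4.ffn.lora_A",    # expert FFN LoRA
    "layers.4.router.gate",   # MoE router
    "layers.4.mlp.weight",    # base FFN
]
# MoE DPO phase: router and base FFN both stay frozen
print(select_trainable(names, ["lora", "attn_lora"]))
```

Comparing this call against the SFT phase list (`["lora", "attn_lora", "router"]`) makes the freezing decision concrete: DPO simply drops `router` from the open set.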
DPO Hyperparameter Guide
The fundamental principle for DPO is to configure it very conservatively so as not to disturb SFT results:
| Parameter | SFT Range | Recommended DPO Range | Reason |
|---|---|---|---|
| `lr` | 1e-4 to 1e-3 | 1e-6 to 5e-6 | Preserve SFT gains |
| `dpo_beta` | -- | 0.05 to 0.3 | Higher values stay closer to the existing policy |
| `max_grad_norm` | 1.0 | 0.3 to 0.5 | Prevent abrupt changes |
| `base_ffn` | Open in later phases | Closed throughout | Limit change magnitude |
| `max_train_steps` | 10K-30K | 1K-3K | Prevent overfitting (~1 epoch) |
DPO-specific knobs:
- `dpo_beta`: The key value controlling how far the policy can deviate from the reference. Higher beta reduces deviation from the SFT policy. If you want to "gently refine preferences on top of a well-trained SFT model," try `beta: 0.2-0.3` first.
- `label_smoothing`: Mitigates overfitting to noisy preference data. Robust DPO uses values between 0.0 and 0.5.
- Overfitting symptoms: `reward_margin` exploding above 5, `accuracy` stuck at 1.0, abnormal length/diversity degradation on the rejected side.
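The `label_smoothing` knob corresponds to the conservative-DPO objective: with smoothing `eps`, the loss assumes each preference label might be flipped with probability `eps`, which penalizes the model for becoming overconfident on any single pair. A sketch of the objective (margin here is the same policy-vs-reference log-ratio difference as in standard DPO):

```python
import math

def robust_dpo_loss(beta, margin, label_smoothing=0.0):
    """Label-smoothed ("conservative") DPO loss for one preference pair."""
    def logsig(x):
        return -math.log(1.0 + math.exp(-x))
    eps = label_smoothing
    # Blend the loss for the label as given with the loss for the flipped label.
    return -(1 - eps) * logsig(beta * margin) - eps * logsig(-beta * margin)

# A confidently-huge margin is nearly free without smoothing,
# but gets penalized once smoothing is on:
for eps in (0.0, 0.1, 0.3):
    print(f"eps={eps}: loss={robust_dpo_loss(0.2, 30.0, eps):.4f}")
```

This is exactly the mechanism that counters the "`reward_margin` exploding above 5" symptom: past a point, pushing the margin further increases the smoothed loss instead of decreasing it.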
Execution
Method 1: Pipeline Script (Recommended)
```bash
# Run only dense_lora (SFT -> DPO)
RUN_DENSE_LORA=true RUN_MOE_EXPERT_LORA=false \
  ./run_llama32_pipeline_math.sh

# Run only moe_expert_lora
RUN_DENSE_LORA=false RUN_MOE_EXPERT_LORA=true \
  ./run_llama32_pipeline_math.sh

# Run both
./run_llama32_pipeline_math.sh
```
The script automatically runs benchmarks at each stage:
- SFT bench: compares the SFT model against the base model (Llama-3.2-1B)
- DPO bench: compares the DPO model against the base model
Method 2: Manual Execution
```bash
# 1. Dense LoRA SFT
eulerforge train \
  --preset configs/presets/math/llama3.2_dense_lora_sft.yml \
  --set data.format=raw --set data.task=sft \
  --set data.path=data/math/sft/math_orca_200000.jsonl \
  --set data.max_length=512 \
  --output-dir outputs/math/dense_lora/sft

# 2. Dense LoRA DPO (starts from the SFT checkpoint)
eulerforge train \
  --preset configs/presets/math/llama3.2_dense_lora_dpo.yml \
  --set model_name=outputs/math/dense_lora/sft/final \
  --set data.format=raw --set data.task=prompted_preference \
  --set data.path=data/math/dpo/math_step_dpo_10700.jsonl \
  --set data.max_length=1024 \
  --output-dir outputs/math/dense_lora/dpo
```
Output Structure
```
outputs/math_simple/llama3.2_1b/
├── models/
│   ├── dense_lora/
│   │   ├── sft/final/        # SFT checkpoint
│   │   └── dpo/final/        # DPO checkpoint
│   └── moe_expert_lora/
│       ├── sft/final/
│       └── dpo/final/
└── benchs/
    ├── dense_lora/
    │   ├── sft/              # SFT bench results
    │   └── dpo/              # DPO bench results
    └── moe_expert_lora/
        ├── sft/
        └── dpo/
```
Interpreting Results
Expected Patterns
- SFT: Math score improves over the base model (1B pretrained); expect bench avg_score to move from roughly 1-2 points to 3-5 points
- DPO: Slight improvement or maintenance compared to SFT. DPO is "preference alignment," so expect answer quality/format improvement rather than dramatic gains
- Dense vs MoE: No significant difference for a single domain. MoE advantages emerge in multi-domain settings
Important Notes
- 1B models have fundamental limitations in math reasoning -- low scores are normal
- DPO initial loss starting near ln(2) ≈ 0.693 is normal (the reference model is the SFT model, so the initial margin is zero)
- A gradual increase in `reward_margin` is a sign that training is progressing well
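The health signals above lend themselves to a simple automated check on logged metrics. A minimal sketch — the metric names `reward_margin` and `accuracy` follow this doc's overfitting symptoms, and the thresholds are the ones quoted there; adapt both to whatever your trainer actually logs:

```python
def dpo_health(metrics):
    """Flag the DPO overfitting symptoms described in this lab."""
    warnings = []
    if metrics.get("reward_margin", 0.0) > 5.0:
        warnings.append("reward_margin exploding (>5)")
    if metrics.get("accuracy", 0.0) >= 1.0:
        warnings.append("accuracy saturated at 1.0")
    return warnings

print(dpo_health({"reward_margin": 7.2, "accuracy": 1.0}))  # unhealthy run
print(dpo_health({"reward_margin": 1.1, "accuracy": 0.8}))  # healthy run
```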
Next Steps
- Once you have seen why MoE is overkill for a single domain, see 23_lab_full_pipeline_moe.md to mix math+finance+CoT+general SFT and observe the real effect of MoE
- DPO hyperparameter tuning: use 12_grid_search.md to automatically search `dpo_beta` and `lr`
- Apply the same pipeline to larger models (3B, 7B)