21. Lab: Chain-of-Thought Reasoning Model
Difficulty: Advanced | GPU: RTX 3090-5090 (VRAM 24GB+) | Estimated Time: 4-8 hours
Objective
Equip Llama-3.2-3B with both general knowledge and CoT (Chain-of-Thought) reasoning ability using the Mixture-of-LoRAs strategy. Learn how to reinforce math reasoning while preserving existing capabilities through a two-stage training pipeline.
Key Points
- Stage 1: General SFT (50K English-Korean mix) -- foundational instruction-following
- Stage 2: CoT SFT (OpenR1-Math 40K) -- step-by-step math reasoning
- Mixture-of-LoRAs: 4 experts naturally differentiate by domain
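The "experts differentiate by domain" behavior rests on top-k routing: a softmax router scores every expert per token, and only the highest-scoring few receive that token. A minimal pure-Python sketch of the 4-expert, top-2 selection used in this lab (illustrative only, not the framework's implementation):

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the top-k experts for one token and renormalize their
    softmax weights. Sketch of top_k=2 over num_experts=4; the real
    router also handles batching and load balancing."""
    # Numerically stable softmax over expert logits.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k highest-probability experts; renormalize to sum to 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# One token's router logits over 4 experts: experts 2 and 0 win,
# and their LoRA outputs would be mixed with these weights.
weights = top_k_route([1.2, -0.3, 2.0, 0.1], k=2)
```

Each selected expert's LoRA delta is then scaled by its weight and summed, so different domains can end up served by different expert pairs.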
Prerequisites
Data
Run Section 4 (Reasoning/CoT Data) from 19_data_collection.md to prepare the following:
data/reasoning/openr1_math_40k.jsonl # CoT training (40K)
data/reasoning/openr1_math_1k.jsonl # CoT bench (1K)
data/sft_50k_en_ko_raw.jsonl # General SFT (included by default)
data/sft_1k_en_ko_raw.jsonl # General SFT bench (included by default)
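Before launching a multi-hour run, it can save time to sanity-check that each JSONL file parses cleanly. A small sketch (it only checks that every line is a JSON object; adapt the check to your actual schema):

```python
import json

def count_valid_jsonl(path):
    """Count lines that parse as JSON objects; fail fast on the first
    malformed line so a bad file is caught before training starts."""
    ok = 0
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                raise ValueError(f"{path}:{n} is not valid JSON")
            if not isinstance(row, dict):
                raise ValueError(f"{path}:{n} is not a JSON object")
            ok += 1
    return ok

# e.g. count_valid_jsonl("data/reasoning/openr1_math_40k.jsonl")
# should return roughly 40000 if Section 4 ran correctly
```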
Script
cp examples/run_llama32_pipeline_reasoning.sh ./
chmod +x run_llama32_pipeline_reasoning.sh
2-Stage Training Design
Stage 1: General SFT
Config: configs/presets/reasoning/llama3.2_3b_mixture_lora_sft.yml
injection:
strategy: mixture_lora
num_experts: 4
top_k: 2
lora_r: 32
lora_dropout: 0.08 # Higher dropout at the beginning
training:
phases:
- step: 0
trainable: ["router"] # Router-only warmup
- step: 1000
trainable: ["router", "lora", "attn_lora"] # Open all LoRA
lr: 6.0e-5
max_train_steps: 5000 # Overridden by script
- Train the router alone for the first 1000 steps to stabilize expert distribution
- High router_z_loss (0.0003) and aux_loss (0.003) enforce load balancing
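The two router penalties above can be sketched as follows. This uses a common Switch-Transformer-style formulation (top-1 assignments for simplicity); the framework's exact definitions may differ:

```python
import math

def router_z_loss(logits_per_token):
    """Mean squared log-sum-exp of router logits. Penalizes large
    logit magnitudes, which keeps routing numerically stable."""
    total = 0.0
    for logits in logits_per_token:
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse * lse
    return total / len(logits_per_token)

def aux_load_balance_loss(probs_per_token, assignments, num_experts=4):
    """num_experts * sum_i (fraction of tokens routed to i) *
    (mean router probability of i). Equals 1.0 for perfectly uniform
    routing and grows as traffic concentrates on few experts."""
    n = len(assignments)
    loss = 0.0
    for e in range(num_experts):
        f_e = sum(1 for a in assignments if a == e) / n
        p_e = sum(p[e] for p in probs_per_token) / n
        loss += f_e * p_e
    return num_experts * loss

# Stage 1 coefficients from the config above would combine as:
# total_loss = task_loss + 0.0003 * z_loss + 0.003 * aux_loss
```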
Stage 2: CoT SFT (Key Changes)
Config: configs/presets/reasoning/llama3.2_3b_mixture_lora_sft_cot.yml
Key differences from Stage 1:
| Parameter | Stage 1 (General) | Stage 2 (CoT) | Reason |
|---|---|---|---|
| Router warmup | step 0-1000 | None | Protect existing routing rules |
| lr | 6.0e-5 | 1.5e-5 | Prevent destroying existing knowledge |
| lora_dropout | 0.08 | 0.05 | Preserve existing knowledge |
| router_z_loss_coef | 0.0003 | 0.0001 | Relax MoE penalty |
| aux_loss_coef | 0.003 | 0.001 | Allow expert skew |
| max_train_steps | 5000 | 3000 | Prevent overfitting |
training:
phases:
- step: 0
trainable: ["router", "lora", "attn_lora"] # Open all simultaneously
lr: 1.5e-5 # 4x lower LR
Why remove router-only training: The router is already initialized from Stage 1. Training only the router again in Stage 2 would reset existing routing rules with math data, potentially destroying the expert allocation for general knowledge.
Why relax MoE penalties: Since only math data is coming in, it is natural for traffic to concentrate on one or two experts. Strong balancing penalties would actually hinder learning.
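A back-of-the-envelope check makes the coefficient change concrete. Using the Switch-style balancing term N * Σ f_i · P_i (which is 1.0 when routing is uniform), concentrating all math traffic on two of the four experts roughly doubles the aux term, and lowering the coefficient keeps the resulting penalty near its Stage 1 scale instead of fighting the skew:

```python
def aux_loss(fractions, mean_probs, num_experts=4):
    """Switch-style load-balancing term: N * sum_i f_i * P_i."""
    return num_experts * sum(f * p for f, p in zip(fractions, mean_probs))

uniform = aux_loss([0.25] * 4, [0.25] * 4)               # balanced routing
skewed = aux_loss([0.5, 0.5, 0.0, 0.0],
                  [0.5, 0.5, 0.0, 0.0])                  # 2 experts take all traffic

# Penalty actually added to the loss under each stage's coefficient:
stage1_penalty = 0.003 * skewed  # would push hard against the skew
stage2_penalty = 0.001 * skewed  # tolerates the natural math-heavy skew
```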
Execution
./run_llama32_pipeline_reasoning.sh
Or step-by-step manual execution:
# Stage 1: General SFT (5000 steps)
eulerforge train \
--preset configs/presets/reasoning/llama3.2_3b_mixture_lora_sft.yml \
--set model_name=meta-llama/Llama-3.2-3B \
--set data.format=raw --set data.task=sft \
--set data.path=data/sft_50k_en_ko_raw.jsonl \
--set data.max_length=512 \
--set training.max_train_steps=5000 \
--set training.batch_size=10 --set training.grad_accum_steps=2 \
--output-dir outputs/reasoning/stage1_general_sft
# Stage 2: CoT SFT (based on Stage 1 checkpoint, 3000 steps)
eulerforge train \
--preset configs/presets/reasoning/llama3.2_3b_mixture_lora_sft_cot.yml \
--set model_name=outputs/reasoning/stage1_general_sft/final \
--set data.format=raw --set data.task=sft \
--set data.path=data/reasoning/openr1_math_40k.jsonl \
--set data.max_length=1024 \
--set training.max_train_steps=3000 \
--set training.batch_size=8 --set training.grad_accum_steps=2 \
--output-dir outputs/reasoning/stage2_reasoning_sft
Interpreting Results
Expected Patterns
- Stage 1 bench: General instruction-following ability acquired (improvement over base)
- Stage 2 bench: CoT reasoning ability added + general ability maintained (or slight decline)
CoT Overfitting Warning
Overfitting can occur if you train too long on only math data in Stage 2:
- Math bench scores improve but general bench scores drop sharply
- Experiment by reducing Stage 2 steps or lowering the lr further
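One way to catch this regression early is a simple gate between checkpoint evaluations. A hypothetical helper (the function name, score keys, and threshold are all illustrative, not part of the framework):

```python
def cot_overfit_check(prev_scores, curr_scores, max_general_drop=0.03):
    """Flag runs where the math bench improves while the general bench
    regresses past a tolerance. Scores are accuracies in [0, 1];
    the 0.03 threshold is an illustrative starting point."""
    math_gain = curr_scores["math"] - prev_scores["math"]
    general_drop = prev_scores["general"] - curr_scores["general"]
    if math_gain > 0 and general_drop > max_general_drop:
        return "overfit: cut Stage 2 steps or lower lr"
    return "ok"

# Math jumps 0.41 -> 0.55, but general falls 0.62 -> 0.54: flag it.
print(cot_overfit_check({"math": 0.41, "general": 0.62},
                        {"math": 0.55, "general": 0.54}))
```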
Advanced: Single-Stage Combined Training (Exercise)
The current script is split into 2 stages, but since Mixture-of-LoRAs has a 4-expert structure, it is also possible to mix both datasets and train in a single pass:
# Combine general SFT + CoT data into one
cat data/sft_50k_en_ko_raw.jsonl data/reasoning/openr1_math_40k.jsonl \
| shuf --random-source=<(yes 42) > data/reasoning/mixed_90k.jsonl
echo "Mixed: $(wc -l < data/reasoning/mixed_90k.jsonl) rows"
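If `shuf --random-source` is unavailable (e.g. on macOS/BSD without GNU coreutils), the same seeded, reproducible shuffle can be done in Python:

```python
import random

def seeded_mix(paths, out_path, seed=42):
    """Concatenate JSONL files and shuffle them reproducibly:
    the same seed always yields the same line order."""
    lines = []
    for p in paths:
        with open(p, encoding="utf-8") as f:
            lines.extend(l for l in f if l.strip())
    random.Random(seed).shuffle(lines)
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(lines)
    return len(lines)

# seeded_mix(["data/sft_50k_en_ko_raw.jsonl",
#             "data/reasoning/openr1_math_40k.jsonl"],
#            "data/reasoning/mixed_90k.jsonl")
```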
Exercise: Train on mixed_90k.jsonl using a single llama3.2_3b_mixture_lora_sft.yml preset and compare the results with the 2-stage pipeline.
- Advantages: Simplified pipeline; experts naturally differentiate by domain
- Disadvantages: Difficult to control CoT data ratio (the 2-stage approach allows fine-grained control via lr/steps)
Use --metrics-level advanced to check whether the MoE router automatically routes general vs. math tokens to different experts.
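What "routes to different experts" looks like numerically: given per-token expert assignments tagged with their source domain (a hypothetical log shape; the actual `--metrics-level advanced` output format may differ), you can compare per-domain expert usage:

```python
from collections import Counter

def expert_usage_by_domain(records):
    """records: (domain, expert_id) pairs -> per-domain usage fractions.
    Clear separation between the 'general' and 'math' rows suggests
    the experts have specialized by domain."""
    by_domain = {}
    for domain, expert in records:
        by_domain.setdefault(domain, Counter())[expert] += 1
    return {d: {e: c / sum(cnt.values()) for e, c in cnt.items()}
            for d, cnt in by_domain.items()}

# Toy assignments: general text mostly hits experts 0/1, math hits 2/3.
records = ([("general", 0)] * 6 + [("general", 1)] * 3 + [("general", 2)] * 1
           + [("math", 2)] * 5 + [("math", 3)] * 4 + [("math", 0)] * 1)
usage = expert_usage_by_domain(records)
```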
Related Documents
- 02_mixture_lora.md -- Mixture-of-LoRAs basics
- 09_moe_stability_and_validation.md -- MoE stability
- 18_training_pipeline.md -- Pipeline training guide