21. Lab: Chain-of-Thought Reasoning Model
Difficulty: Advanced | GPU: RTX 3090-5090 (VRAM 24GB+) | Estimated Time: 4-8 hours
Objective
Equip Llama-3.2-3B with both general knowledge and CoT (Chain-of-Thought) reasoning ability using the Mixture-of-LoRAs strategy. Learn how to reinforce math reasoning while preserving existing capabilities through a two-stage training pipeline.
Key Points
- Stage 1: General SFT (50K English-Korean mix) -- foundational instruction-following
- Stage 2: CoT SFT (OpenR1-Math 40K) -- step-by-step math reasoning
- Mixture-of-LoRAs: 4 experts naturally differentiate by domain
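The "experts differentiate by domain" behavior rests on top-k routing: a softmax router scores every expert per token, and only the highest-scoring few receive that token. A minimal pure-Python sketch of the 4-expert, top-2 selection used in this lab (illustrative only, not the framework's implementation):

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the top-k experts for one token and renormalize their
    softmax weights. Sketch of top_k=2 over num_experts=4; the real
    router also handles batching and load balancing."""
    # Numerically stable softmax over expert logits.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k highest-probability experts; renormalize to sum to 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# One token's router logits over 4 experts: experts 2 and 0 win,
# and their LoRA outputs would be mixed with these weights.
weights = top_k_route([1.2, -0.3, 2.0, 0.1], k=2)
```

Each selected expert's LoRA delta is then scaled by its weight and summed, so different domains can end up served by different expert pairs.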
Prerequisites
Data
Run Section 4 (Reasoning/CoT Data) from 19_data_collection.md to prepare the following:
data/reasoning/openr1_math_40k.jsonl # CoT training (40K)
data/reasoning/openr1_math_1k.jsonl # CoT bench (1K)
data/sft_50k_en_ko_raw.jsonl # General SFT (included by default)
data/sft_1k_en_ko_raw.jsonl # General SFT bench (included by default)
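Before launching a multi-hour run, it can save time to sanity-check that each JSONL file parses cleanly. A small sketch (it only checks that every line is a JSON object; adapt the check to your actual schema):

```python
import json

def count_valid_jsonl(path):
    """Count lines that parse as JSON objects; fail fast on the first
    malformed line so a bad file is caught before training starts."""
    ok = 0
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                raise ValueError(f"{path}:{n} is not valid JSON")
            if not isinstance(row, dict):
                raise ValueError(f"{path}:{n} is not a JSON object")
            ok += 1
    return ok

# e.g. count_valid_jsonl("data/reasoning/openr1_math_40k.jsonl")
# should return roughly 40000 if Section 4 ran correctly
```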
Script
cp examples/run_llama32_pipeline_reasoning.sh ./
chmod +x run_llama32_pipeline_reasoning.sh
2-Stage Training Design
Stage 1: General SFT
Config: configs/presets/reasoning/llama3.2_3b_mixture_lora_sft.yml
injection:
strategy: mixture_lora
num_experts: 4
top_k: 2
lora_r: 32
lora_dropout: 0.08 # Higher dropout at the beginning
training:
phases:
- step: 0
trainable: ["router"] # Router-only warmup
- step: 1000
trainable: ["router", "lora", "attn_lora"] # Open all LoRA
lr: 6.0e-5
max_train_steps: 5000 # Overridden by script
- Train the router alone for the first 1000 steps to stabilize expert distribution
- High router_z_loss (0.0003) and aux_loss (0.003) enforce load balancing
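The two router penalties above can be sketched as follows. This uses a common Switch-Transformer-style formulation (top-1 assignments for simplicity); the framework's exact definitions may differ:

```python
import math

def router_z_loss(logits_per_token):
    """Mean squared log-sum-exp of router logits. Penalizes large
    logit magnitudes, which keeps routing numerically stable."""
    total = 0.0
    for logits in logits_per_token:
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse * lse
    return total / len(logits_per_token)

def aux_load_balance_loss(probs_per_token, assignments, num_experts=4):
    """num_experts * sum_i (fraction of tokens routed to i) *
    (mean router probability of i). Equals 1.0 for perfectly uniform
    routing and grows as traffic concentrates on few experts."""
    n = len(assignments)
    loss = 0.0
    for e in range(num_experts):
        f_e = sum(1 for a in assignments if a == e) / n
        p_e = sum(p[e] for p in probs_per_token) / n
        loss += f_e * p_e
    return num_experts * loss

# Stage 1 coefficients from the config above would combine as:
# total_loss = task_loss + 0.0003 * z_loss + 0.003 * aux_loss
```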
Stage 2: CoT SFT (Key Changes)
Config: configs/presets/reasoning/llama3.2_3b_mixture_lora_sft_cot.yml
Key differences from Stage 1:
| Parameter | Stage 1 (General) | Stage 2 (CoT) | Reason |
|---|---|---|---|
| Router warmup | step 0-1000 | None | Protect existing routing rules |
| lr | 6.0e-5 | 1.5e-5 | Prevent destroying existing knowledge |
| lora_dropout | 0.08 | 0.05 | Preserve existing knowledge |
| router_z_loss_coef | 0.0003 | 0.0001 | Relax MoE penalty |
| aux_loss_coef | 0.003 | 0.001 | Allow expert skew |
| max_train_steps | 5000 | 3000 | Prevent overfitting |
training:
phases:
- step: 0
trainable: ["router", "lora", "attn_lora"] # Open all simultaneously
lr: 1.5e-5 # 4x lower LR
Why remove router-only training: The router is already initialized from Stage 1. Training only the router again in Stage 2 would reset existing routing rules with math data, potentially destroying the expert allocation for general knowledge.
Why relax MoE penalties: Since only math data is coming in, it is natural for traffic to concentrate on one or two experts. Strong balancing penalties would actually hinder learning.
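A back-of-the-envelope check makes the coefficient change concrete. Using the Switch-style balancing term N * Σ f_i · P_i (which is 1.0 when routing is uniform), concentrating all math traffic on two of the four experts roughly doubles the aux term, and lowering the coefficient keeps the resulting penalty near its Stage 1 scale instead of fighting the skew:

```python
def aux_loss(fractions, mean_probs, num_experts=4):
    """Switch-style load-balancing term: N * sum_i f_i * P_i."""
    return num_experts * sum(f * p for f, p in zip(fractions, mean_probs))

uniform = aux_loss([0.25] * 4, [0.25] * 4)               # balanced routing
skewed = aux_loss([0.5, 0.5, 0.0, 0.0],
                  [0.5, 0.5, 0.0, 0.0])                  # 2 experts take all traffic

# Penalty actually added to the loss under each stage's coefficient:
stage1_penalty = 0.003 * skewed  # would push hard against the skew
stage2_penalty = 0.001 * skewed  # tolerates the natural math-heavy skew
```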
Execution
./run_llama32_pipeline_reasoning.sh
Or step-by-step manual execution:
# Stage 1: General SFT (5000 steps)
eulerforge train \
--preset configs/presets/reasoning/llama3.2_3b_mixture_lora_sft.yml \
--set model_name=meta-llama/Llama-3.2-3B \
--set data.format=raw --set data.task=sft \
--set data.path=data/sft_50k_en_ko_raw.jsonl \
--set data.max_length=512 \
--set training.max_train_steps=5000 \
--set training.batch_size=10 --set training.grad_accum_steps=2 \
--output-dir outputs/reasoning/stage1_general_sft
# Stage 2: CoT SFT (based on Stage 1 checkpoint, 3000 steps)
eulerforge train \
--preset configs/presets/reasoning/llama3.2_3b_mixture_lora_sft_cot.yml \
--set model_name=outputs/reasoning/stage1_general_sft/final \
--set data.format=raw --set data.task=sft \
--set data.path=data/reasoning/openr1_math_40k.jsonl \
--set data.max_length=1024 \
--set training.max_train_steps=3000 \
--set training.batch_size=8 --set training.grad_accum_steps=2 \
--output-dir outputs/reasoning/stage2_reasoning_sft
Interpreting Results
Expected Patterns
- Stage 1 bench: General instruction-following ability acquired (improvement over base)
- Stage 2 bench: CoT reasoning ability added + general ability maintained (or slight decline)
CoT Overfitting Warning
Overfitting can occur if you train too long on only math data in Stage 2:
- Math bench scores improve but general bench scores drop sharply
- Experiment by reducing Stage 2 steps or lowering the lr further
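One way to catch this regression early is a simple gate between checkpoint evaluations. A hypothetical helper (the function name, score keys, and threshold are all illustrative, not part of the framework):

```python
def cot_overfit_check(prev_scores, curr_scores, max_general_drop=0.03):
    """Flag runs where the math bench improves while the general bench
    regresses past a tolerance. Scores are accuracies in [0, 1];
    the 0.03 threshold is an illustrative starting point."""
    math_gain = curr_scores["math"] - prev_scores["math"]
    general_drop = prev_scores["general"] - curr_scores["general"]
    if math_gain > 0 and general_drop > max_general_drop:
        return "overfit: cut Stage 2 steps or lower lr"
    return "ok"

# Math jumps 0.41 -> 0.55, but general falls 0.62 -> 0.54: flag it.
print(cot_overfit_check({"math": 0.41, "general": 0.62},
                        {"math": 0.55, "general": 0.54}))
```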
Advanced: Single-Stage Combined Training (Exercise)
The current script is split into 2 stages, but since Mixture-of-LoRAs has a 4-expert structure, it is also possible to mix both datasets and train in a single pass:
# Combine general SFT + CoT data into one
cat data/sft_50k_en_ko_raw.jsonl data/reasoning/openr1_math_40k.jsonl \
| shuf --random-source=<(yes 42) > data/reasoning/mixed_90k.jsonl
echo "Mixed: $(wc -l < data/reasoning/mixed_90k.jsonl) rows"
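If `shuf --random-source` is unavailable (e.g. on macOS/BSD without GNU coreutils), the same seeded, reproducible shuffle can be done in Python:

```python
import random

def seeded_mix(paths, out_path, seed=42):
    """Concatenate JSONL files and shuffle them reproducibly:
    the same seed always yields the same line order."""
    lines = []
    for p in paths:
        with open(p, encoding="utf-8") as f:
            lines.extend(l for l in f if l.strip())
    random.Random(seed).shuffle(lines)
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(lines)
    return len(lines)

# seeded_mix(["data/sft_50k_en_ko_raw.jsonl",
#             "data/reasoning/openr1_math_40k.jsonl"],
#            "data/reasoning/mixed_90k.jsonl")
```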
Exercise: Train on mixed_90k.jsonl using a single llama3.2_3b_mixture_lora_sft.yml preset and compare the results with the 2-stage pipeline.
- Advantages: Simplified pipeline; experts naturally differentiate by domain
- Disadvantages: Difficult to control CoT data ratio (the 2-stage approach allows fine-grained control via lr/steps)
Use --metrics-level advanced to check whether the MoE router automatically routes general vs. math tokens to different experts.
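What "routes to different experts" looks like numerically: given per-token expert assignments tagged with their source domain (a hypothetical log shape; the actual `--metrics-level advanced` output format may differ), you can compare per-domain expert usage:

```python
from collections import Counter

def expert_usage_by_domain(records):
    """records: (domain, expert_id) pairs -> per-domain usage fractions.
    Clear separation between the 'general' and 'math' rows suggests
    the experts have specialized by domain."""
    by_domain = {}
    for domain, expert in records:
        by_domain.setdefault(domain, Counter())[expert] += 1
    return {d: {e: c / sum(cnt.values()) for e, c in cnt.items()}
            for d, cnt in by_domain.items()}

# Toy assignments: general text mostly hits experts 0/1, math hits 2/3.
records = ([("general", 0)] * 6 + [("general", 1)] * 3 + [("general", 2)] * 1
           + [("math", 2)] * 5 + [("math", 3)] * 4 + [("math", 0)] * 1)
usage = expert_usage_by_domain(records)
```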
Related Documents
- 02_mixture_lora.md -- Mixture-of-LoRAs basics
- 09_moe_stability_and_validation.md -- MoE stability
- 18_training_pipeline.md -- Pipeline training guide