21. Lab: Chain-of-Thought Reasoning Model

Difficulty: Advanced | GPU: RTX 3090-5090 (VRAM 24GB+) | Estimated Time: 4-8 hours

Objective

Equip Llama-3.2-3B with both general knowledge and Chain-of-Thought (CoT) reasoning ability using the Mixture-of-LoRAs strategy. Learn how to reinforce math reasoning while preserving existing capabilities through a two-stage training pipeline.

Prerequisites

Data

Run Section 4 (Reasoning/CoT Data) from 19_data_collection.md to prepare the following:

data/reasoning/openr1_math_40k.jsonl   # CoT training (40K)
data/reasoning/openr1_math_1k.jsonl    # CoT bench (1K)
data/sft_50k_en_ko_raw.jsonl           # General SFT (included by default)
data/sft_1k_en_ko_raw.jsonl            # General SFT bench (included by default)

Script

cp examples/run_llama32_pipeline_reasoning.sh ./
chmod +x run_llama32_pipeline_reasoning.sh

2-Stage Training Design

Stage 1: General SFT

Config: configs/presets/reasoning/llama3.2_3b_mixture_lora_sft.yml

injection:
  strategy: mixture_lora
  num_experts: 4
  top_k: 2
  lora_r: 32
  lora_dropout: 0.08           # Higher dropout at the beginning
training:
  phases:
    - step: 0
      trainable: ["router"]             # Router-only warmup
    - step: 1000
      trainable: ["router", "lora", "attn_lora"]  # Open all LoRA
  lr: 6.0e-5
  max_train_steps: 5000       # Overridden by script
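
The phased `trainable` schedule above can be sketched as a step-gated lookup: the last phase whose `step` threshold has been reached determines which module groups receive gradients. This is a hypothetical illustration of the mechanism, not EulerForge's internal API:

```python
# Step-gated phase schedule, mirroring the YAML above.
PHASES = [
    {"step": 0, "trainable": ["router"]},                          # router-only warmup
    {"step": 1000, "trainable": ["router", "lora", "attn_lora"]},  # open all LoRA
]

def trainable_groups(step, phases=PHASES):
    """Return the trainable groups of the last phase whose start step <= step."""
    active = phases[0]["trainable"]
    for phase in phases:
        if step >= phase["step"]:
            active = phase["trainable"]
    return active
```

During steps 0-999 only the router adapts to the frozen base model; from step 1000 the LoRA adapters join in, so the router has stabilized before the experts start changing underneath it.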

Stage 2: CoT SFT (Key Changes)

Config: configs/presets/reasoning/llama3.2_3b_mixture_lora_sft_cot.yml

Key differences from Stage 1:

| Parameter | Stage 1 (General) | Stage 2 (CoT) | Reason |
| --- | --- | --- | --- |
| Router warmup | step 0-1000 | None | Protect existing routing rules |
| lr | 6.0e-5 | 1.5e-5 | Prevent destroying existing knowledge |
| lora_dropout | 0.08 | 0.05 | Preserve existing knowledge |
| router_z_loss_coef | 0.0003 | 0.0001 | Relax MoE penalty |
| aux_loss_coef | 0.003 | 0.001 | Allow expert skew |
| max_train_steps | 5000 | 3000 | Prevent overfitting |

training:
  phases:
    - step: 0
      trainable: ["router", "lora", "attn_lora"]  # Open all simultaneously
  lr: 1.5e-5                                       # 4x lower LR

Why remove router-only training: The router is already initialized from Stage 1. Training only the router again in Stage 2 would overwrite the existing routing rules with math-driven ones, potentially destroying the expert allocation learned for general knowledge.

Why relax MoE penalties: Since only math data is coming in, it is natural for traffic to concentrate on one or two experts. Strong balancing penalties would actually hinder learning.
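
To see why, here is a minimal sketch of a Switch-Transformer-style load-balancing auxiliary loss. It assumes `aux_loss_coef` scales a penalty of this general form; EulerForge's exact formula may differ:

```python
# Load-balancing loss: L_aux = N * sum_i f_i * p_i
#   f_i: fraction of tokens dispatched to expert i
#   p_i: mean router probability assigned to expert i
# Uniform routing gives the minimum (1.0); any skew inflates it.
def load_balancing_loss(fractions, mean_probs):
    n = len(fractions)
    return n * sum(f * p for f, p in zip(fractions, mean_probs))

balanced = load_balancing_loss([0.25] * 4, [0.25] * 4)  # uniform over 4 experts -> 1.0
skewed = load_balancing_loss([0.7, 0.1, 0.1, 0.1],
                             [0.7, 0.1, 0.1, 0.1])      # concentrated -> 2.08
```

With math-only data the skewed case is expected and desirable, so a large `aux_loss_coef` would keep pushing the router back toward a uniform split that no longer fits the data.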


Execution

./run_llama32_pipeline_reasoning.sh

Or step-by-step manual execution:

# Stage 1: General SFT (5000 steps)
eulerforge train \
  --preset configs/presets/reasoning/llama3.2_3b_mixture_lora_sft.yml \
  --set model_name=meta-llama/Llama-3.2-3B \
  --set data.format=raw --set data.task=sft \
  --set data.path=data/sft_50k_en_ko_raw.jsonl \
  --set data.max_length=512 \
  --set training.max_train_steps=5000 \
  --set training.batch_size=10 --set training.grad_accum_steps=2 \
  --output-dir outputs/reasoning/stage1_general_sft

# Stage 2: CoT SFT (based on Stage 1 checkpoint, 3000 steps)
eulerforge train \
  --preset configs/presets/reasoning/llama3.2_3b_mixture_lora_sft_cot.yml \
  --set model_name=outputs/reasoning/stage1_general_sft/final \
  --set data.format=raw --set data.task=sft \
  --set data.path=data/reasoning/openr1_math_40k.jsonl \
  --set data.max_length=1024 \
  --set training.max_train_steps=3000 \
  --set training.batch_size=8 --set training.grad_accum_steps=2 \
  --output-dir outputs/reasoning/stage2_reasoning_sft
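
As a sanity check on the settings above, each stage's upper-bound token budget is effective batch size x sequence length x steps. Assuming every sequence is padded or truncated to `data.max_length`, the two stages end up roughly balanced:

```python
def token_budget(batch_size, grad_accum_steps, max_length, steps):
    # Upper bound: assumes every sequence fills max_length tokens.
    return batch_size * grad_accum_steps * max_length * steps

stage1 = token_budget(10, 2, 512, 5000)  # 51,200,000 tokens
stage2 = token_budget(8, 2, 1024, 3000)  # 49,152,000 tokens
```

Stage 2 trades a smaller batch and fewer steps for the longer sequences that CoT traces require, while keeping the total token exposure comparable to Stage 1.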

Interpreting Results

CoT Overfitting Warning

Overfitting can occur if you train too long on only math data in Stage 2:

- Math bench scores improve but general bench scores drop sharply.
- Experiment by reducing Stage 2 steps or lowering the lr further.
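
One lightweight way to catch this is to evaluate both benches at each checkpoint and flag any whose general score regresses past a tolerance. This is a hypothetical helper; the 0.02 (2-point) tolerance is an arbitrary example, not an EulerForge default:

```python
def forgetting_alert(general_before, general_after, tol=0.02):
    # True when the general benchmark drops by more than `tol`
    # (absolute score) relative to the Stage 1 baseline,
    # signaling CoT overfitting / catastrophic forgetting.
    return (general_before - general_after) > tol
```

Stop or roll back Stage 2 at the last checkpoint where the alert stays False.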


Advanced: Single-Stage Combined Training (Exercise)

The current script is split into two stages, but because Mixture-of-LoRAs uses a 4-expert structure, you can also mix both datasets and train them in a single pass:

# Combine general SFT + CoT data into one
cat data/sft_50k_en_ko_raw.jsonl data/reasoning/openr1_math_40k.jsonl \
  | shuf --random-source=<(yes 42) > data/reasoning/mixed_90k.jsonl

echo "Mixed: $(wc -l < data/reasoning/mixed_90k.jsonl) rows"

Exercise: Train on mixed_90k.jsonl using a single llama3.2_3b_mixture_lora_sft.yml preset and compare the results with the 2-stage pipeline.

Use --metrics-level advanced to check whether the MoE router automatically routes general vs. math tokens to different experts.
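
One simple way to quantify that separation, assuming you can export per-expert routing fractions from the advanced metrics, is the total variation distance between the two expert-usage histograms (a hypothetical post-processing sketch with illustrative, not measured, values):

```python
def usage_distance(hist_a, hist_b):
    # Total variation distance between two expert-usage distributions:
    # 0.0 = identical routing, 1.0 = fully disjoint experts.
    return 0.5 * sum(abs(a - b) for a, b in zip(hist_a, hist_b))

# Illustrative 4-expert histograms (replace with your exported metrics):
general_usage = [0.40, 0.40, 0.10, 0.10]
math_usage = [0.10, 0.10, 0.40, 0.40]
separation = usage_distance(general_usage, math_usage)  # 0.6
```

A value near 0 means both data types share the same experts; values approaching 1 indicate the router has specialized general and math traffic onto different experts.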