10. Metrics Monitoring
Two-level metrics system and TensorBoard integration guide.
Overview
EulerForge collects training metrics at one of two levels, minimal or advanced, and can write them to TensorBoard for visual monitoring.
Metrics Levels
minimal (default)
Base metrics recorded for all training types:
| Tag | Description |
|---|---|
| train/main_loss | Main loss |
| train/total_loss | Total loss (main + aux) |
| train/aux_loss | MoE auxiliary loss (when using MoE strategy) |
| train/learning_rate | Current learning rate |
| train/grad_norm | Global L2 gradient norm |
| train/tokens_seen | Cumulative training tokens (labels != -100) |
| train/samples_seen | Cumulative sample count (preference: counted per pair) |
| train/optimizer_step | Cumulative optimizer step count |
| train/micro_step | Cumulative micro step count |
| train/effective_batch | Effective batch size (batch_size x grad_accum_steps) |
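As a rough illustration of what three of these tags measure, the sketch below computes a global L2 gradient norm, a token count that skips masked labels, and an effective batch size. The function names are hypothetical and plain Python lists stand in for tensors; this is not EulerForge's implementation, only the definitions stated in the table.

```python
import math

IGNORE_INDEX = -100  # label value excluded from the token count, per the table


def global_grad_norm(grads):
    """Global L2 norm: sqrt of the sum of squared entries over all gradients."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))


def count_tokens(labels):
    """Count label positions that are not masked out (labels != -100)."""
    return sum(1 for row in labels for tok in row if tok != IGNORE_INDEX)


def effective_batch(batch_size, grad_accum_steps):
    """Samples consumed per optimizer step: batch_size x grad_accum_steps."""
    return batch_size * grad_accum_steps


# Two parameter gradients, flattened to plain lists for illustration.
grads = [[3.0, 0.0], [4.0]]
# Two label rows; prompt and padding positions are masked with -100.
labels = [[-100, -100, 17, 42], [-100, 8, 9, -100]]

grad_norm = global_grad_norm(grads)  # sqrt(9 + 16)
tokens = count_tokens(labels)        # four unmasked positions
eff = effective_batch(batch_size=4, grad_accum_steps=8)
```

Because the norm is taken over all parameters at once, a single layer with exploding gradients is enough to move train/grad_norm.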
Additional metrics by training type:
| Type | Additional Tags |
|---|---|
| DPO | train/reward_margin, train/accuracy |
| ORPO | train/sft_loss, train/orpo_loss, train/log_odds_ratio |
| RM | train/reward_margin |
| PPO | train/kl, train/reward_mean, train/advantages_mean |
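For orientation, the sketch below shows how a reward margin and an accuracy are commonly derived in DPO from per-sequence log-probabilities under the policy and a frozen reference model. It follows the standard DPO implicit reward, beta times the policy/reference log-ratio; that EulerForge computes train/reward_margin and train/accuracy exactly this way is an assumption, and the function name is hypothetical.

```python
def dpo_reward_stats(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Return (mean reward margin, accuracy) from per-sequence log-probabilities."""
    margins = []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        chosen_reward = beta * (pc - rc)    # implicit reward of the chosen response
        rejected_reward = beta * (pr - rr)  # implicit reward of the rejected response
        margins.append(chosen_reward - rejected_reward)
    reward_margin = sum(margins) / len(margins)
    # Accuracy: fraction of pairs where the chosen response gets the higher reward.
    accuracy = sum(1 for m in margins if m > 0) / len(margins)
    return reward_margin, accuracy


# Two preference pairs with made-up log-probabilities.
margin, acc = dpo_reward_stats(
    policy_chosen=[-10.0, -12.0],
    policy_rejected=[-14.0, -11.0],
    ref_chosen=[-11.0, -12.0],
    ref_rejected=[-12.0, -12.0],
    beta=0.5,
)
```

Under this definition, a margin that grows while accuracy stays flat would mean the policy is widening the gap on pairs it already ranks correctly rather than fixing the ones it gets wrong.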
advanced (minimal + MoE routing statistics)
Additional metrics recorded for MoE strategies (mixture_lora, moe_expert_lora):
| Tag | Description | Interpretation |
|---|---|---|
| moe/token_frac_std | Token fraction standard deviation | High value indicates expert imbalance |
| moe/entropy_mean | Router entropy | Low value suggests routing collapse |
| moe/importance_cv | Importance coefficient of variation | >1 indicates severe imbalance |
| moe/aux_loss_total | Total auxiliary loss | Spike requires learning rate adjustment |
| moe/router_logit_max | Router logit maximum | >10 suggests numerical blowup |
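To make the interpretation column concrete, here is a minimal sketch that derives four of these statistics from raw router logits. The definitions are assumptions, not taken from EulerForge: top-1 assignment for token fractions, summed softmax probability for importance, and population standard deviation throughout. `routing_stats` is a hypothetical helper.

```python
import math
from statistics import mean, pstdev


def softmax(logits):
    """Numerically stable softmax over one token's expert logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_stats(router_logits):
    """router_logits holds one list of per-expert logits for each token."""
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # Fraction of tokens whose top-1 expert is e.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    token_frac = [c / len(probs) for c in counts]

    # Importance: total routing probability mass each expert receives.
    importance = [sum(p[e] for p in probs) for e in range(num_experts)]

    # Per-token entropy of the routing distribution.
    entropies = [-sum(q * math.log(q) for q in p if q > 0) for p in probs]

    return {
        "moe/token_frac_std": pstdev(token_frac),
        "moe/entropy_mean": mean(entropies),
        "moe/importance_cv": pstdev(importance) / mean(importance),
        "moe/router_logit_max": max(max(row) for row in router_logits),
    }


# Tokens alternate between two experts with moderate logits.
balanced = routing_stats([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0]])
# Every token is sent to expert 0 with a very large logit.
collapsed = routing_stats([[12.0, 0.0]] * 4)
```

In the collapsed case the token fraction standard deviation and importance CV rise, the entropy falls toward zero, and the maximum logit exceeds the threshold of 10 given in the table, which is the pattern the interpretation column describes.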
Configuration
logging:
  metrics_level: minimal        # "minimal" | "advanced"
  tensorboard:
    enabled: true               # TensorBoard logging (default: false)
    log_dir: "outputs/tb"       # TensorBoard log directory
    log_interval: 50            # Log to TensorBoard every N steps
    max_experts_log: 16         # Detailed log for top N experts in advanced mode
CLI override:
# Override config file value to use advanced
eulerforge train --preset PRESET.yml --metrics-level advanced
# Also possible with --set
eulerforge train --preset PRESET.yml --set logging.metrics_level=advanced
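The dotted key passed to --set names a path into the nested config. The sketch below shows one way such an override could be applied to a dict; `apply_override` is a hypothetical helper written for illustration, and how EulerForge actually parses and type-converts --set values is not documented here.

```python
def apply_override(config, assignment):
    """Apply one 'dotted.key=value' override to a nested dict, in place."""
    dotted_key, _, raw_value = assignment.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Walk down the path, creating intermediate sections when missing.
        node = node.setdefault(name, {})
    # The value is stored as a string; a real CLI would likely convert
    # "true" or "50" to bool or int (assumption, not shown here).
    node[leaf] = raw_value
    return config


config = {"logging": {"metrics_level": "minimal", "tensorboard": {"enabled": False}}}
apply_override(config, "logging.metrics_level=advanced")
```

Only the addressed leaf changes; sibling keys such as logging.tensorboard.enabled keep the values from the preset file.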
TensorBoard Installation and Usage
# Installation (quote the extra so shells such as zsh do not glob the brackets)
pip install "eulerforge[tb]"
# Run training
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_sft.yml \
  --set logging.tensorboard.enabled=true \
  --set logging.tensorboard.log_dir=outputs/tb \
  --metrics-level advanced
# Start TensorBoard
tensorboard --logdir outputs/tb
If tensorboard is not installed, only a warning is printed and training proceeds normally.
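A common way to get this warn-and-continue behavior is to import the writer lazily and fall back to a no-op. The sketch below assumes the PyTorch `torch.utils.tensorboard.SummaryWriter` class is the backend, which this guide does not confirm; `make_tb_writer` and `log_scalars` are hypothetical names.

```python
import warnings


def make_tb_writer(enabled, log_dir):
    """Return a SummaryWriter, or None when logging is disabled or unavailable."""
    if not enabled:
        return None
    try:
        # Real PyTorch API; needs both torch and the tensorboard package.
        from torch.utils.tensorboard import SummaryWriter
    except ImportError:
        warnings.warn("tensorboard is not installed; continuing without TensorBoard logging")
        return None
    return SummaryWriter(log_dir=log_dir)


def log_scalars(writer, metrics, step, log_interval=50):
    """Write metrics every log_interval steps; a None writer makes this a no-op."""
    if writer is None or step % log_interval != 0:
        return False
    for tag, value in metrics.items():
        writer.add_scalar(tag, value, global_step=step)
    return True
```

A training loop would call `make_tb_writer` once at startup and `log_scalars(writer, {"train/main_loss": loss}, step)` after each optimizer step, so the loop body stays the same whether or not TensorBoard is available.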
Validation
The logging section is checked during config validation:
Logging Config: Unknown metrics_level 'invalid'. Valid: minimal, advanced
Fix: Set logging.metrics_level to 'minimal' or 'advanced'
See: docs/tutorials/10_metrics_monitoring.md
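A check of this kind can be sketched in a few lines. `validate_logging` below is a hypothetical stand-in, not EulerForge's validator; it reproduces only the first line of the message shown above and treats a missing metrics_level as the documented default, minimal.

```python
VALID_METRICS_LEVELS = ("minimal", "advanced")


def validate_logging(config):
    """Return a list of error strings for the logging section (empty when valid)."""
    logging_cfg = config.get("logging") or {}
    # A missing key falls back to the documented default level.
    level = logging_cfg.get("metrics_level", "minimal")
    errors = []
    if level not in VALID_METRICS_LEVELS:
        errors.append(
            f"Logging Config: Unknown metrics_level '{level}'. "
            f"Valid: {', '.join(VALID_METRICS_LEVELS)}"
        )
    return errors


errors = validate_logging({"logging": {"metrics_level": "invalid"}})
```

Returning a list instead of raising on the first problem lets a validator report every config error in one pass.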
Related Documents
- CLI Reference — Full CLI options
- MoE Stability Guide — MoE parameter tuning