10. Metrics Monitoring

A guide to the two-level metrics system and TensorBoard integration.


Overview

EulerForge collects metrics during training at one of two levels, minimal or advanced. The advanced level records everything in minimal plus MoE routing statistics. You can monitor the metrics visually through TensorBoard.


Metrics Levels

minimal (default)

Base metrics recorded for all training types:

Tag                      Description
train/main_loss          Main loss
train/total_loss         Total loss (main + aux)
train/aux_loss           MoE auxiliary loss (when using MoE strategy)
train/learning_rate      Current learning rate
train/grad_norm          Global L2 gradient norm
train/tokens_seen        Cumulative training tokens (labels != -100)
train/samples_seen       Cumulative sample count (preference: counted per pair)
train/optimizer_step     Cumulative optimizer step count
train/micro_step         Cumulative micro step count
train/effective_batch    Effective batch size (batch_size x grad_accum_steps)
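
As an illustration of how three of these values can be derived, here is a minimal Python sketch. It is not EulerForge's implementation: the function names are hypothetical, and the only facts taken from the table are the -100 ignore label, the batch_size x grad_accum_steps product, and the global L2 definition of the gradient norm.

```python
import math

IGNORE_INDEX = -100  # label value excluded from the token count, per the table


def count_trained_tokens(labels):
    """Count label positions that are not masked out (labels != -100)."""
    return sum(1 for row in labels for tok in row if tok != IGNORE_INDEX)


def effective_batch(batch_size, grad_accum_steps):
    """Samples contributing to one optimizer step."""
    return batch_size * grad_accum_steps


def global_grad_norm(grads):
    """L2 norm over all gradient values of all parameters taken together."""
    return math.sqrt(sum(g * g for param in grads for g in param))


# Two sequences; prompt and padding positions are masked with -100.
labels = [
    [-100, -100, 42, 7, 9],
    [-100, 15, 3, -100, -100],
]
tokens_in_batch = count_trained_tokens(labels)   # 3 + 2 unmasked labels
samples_per_step = effective_batch(4, 8)         # 4 samples x 8 accumulation steps
norm = global_grad_norm([[3.0], [4.0]])          # sqrt(3^2 + 4^2)
```

train/tokens_seen would then be the running sum of the per-batch count across micro steps.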

Additional metrics by training type:

Type  Additional Tags
DPO   train/reward_margin, train/accuracy
ORPO  train/sft_loss, train/orpo_loss, train/log_odds_ratio
RM    train/reward_margin
PPO   train/kl, train/reward_mean, train/advantages_mean
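
This page does not define the preference metrics. The sketch below assumes the definitions commonly used for DPO-style training: reward margin as the mean of (chosen reward minus rejected reward), and accuracy as the fraction of pairs where the chosen response scores higher. The function name and the inputs are hypothetical.

```python
def preference_metrics(chosen_rewards, rejected_rewards):
    """Return (reward_margin, accuracy) for a batch of preference pairs.

    Assumed definitions, not confirmed by the EulerForge docs:
      reward_margin = mean(chosen - rejected)
      accuracy      = fraction of pairs with chosen > rejected
    """
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    reward_margin = sum(margins) / len(margins)
    accuracy = sum(1 for m in margins if m > 0) / len(margins)
    return reward_margin, accuracy


# Four pairs; the second pair is ranked the wrong way round.
margin, acc = preference_metrics(
    chosen_rewards=[1.0, 0.5, -0.25, 2.0],
    rejected_rewards=[0.5, 1.0, -0.75, 0.0],
)
```

Under these assumptions, a reward margin that rises while accuracy stays near 0.5 would suggest that a few pairs dominate the average.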

advanced (minimal + MoE routing statistics)

Additional metrics recorded for MoE strategies (mixture_lora, moe_expert_lora):

Tag                    Description                           Interpretation
moe/token_frac_std     Token fraction standard deviation     High value indicates expert imbalance
moe/entropy_mean       Router entropy                        Low value suggests routing collapse
moe/importance_cv      Importance coefficient of variation   >1 indicates severe imbalance
moe/aux_loss_total     Total auxiliary loss                  Spike requires learning rate adjustment
moe/router_logit_max   Router logit maximum                  >10 suggests numerical blowup
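
To make the interpretations above concrete, the following sketch computes four of these statistics from raw router logits using only the standard library. The exact formulas EulerForge uses are not given on this page, so the definitions here are assumptions: top-1 routing for token fractions, per-token softmax entropy in nats, and importance as the summed routing probability per expert.

```python
import math
from statistics import mean, pstdev


def softmax(logits):
    """Numerically stable softmax over one token's expert logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_stats(router_logits):
    """router_logits: one list of per-expert logits for each token."""
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # Fraction of tokens whose highest-probability (top-1) expert is e.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    token_frac = [c / len(probs) for c in counts]

    # Entropy of each token's routing distribution, averaged over tokens.
    entropies = [-sum(q * math.log(q) for q in p if q > 0.0) for p in probs]

    # Importance: total routing probability mass assigned to each expert.
    importance = [sum(p[e] for p in probs) for e in range(num_experts)]

    return {
        "moe/token_frac_std": pstdev(token_frac),
        "moe/entropy_mean": mean(entropies),
        "moe/importance_cv": pstdev(importance) / mean(importance),
        "moe/router_logit_max": max(max(row) for row in router_logits),
    }


# Every token strongly prefers expert 0: a collapsed router.
collapsed = routing_stats([[20.0, 0.0, 0.0, 0.0]] * 8)

# Each token prefers a different expert: a balanced router.
balanced = routing_stats([
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
])
```

In the collapsed case the entropy is close to zero, the importance coefficient of variation exceeds 1, and the maximum logit exceeds 10, which matches the warning signs listed in the table. In the balanced case both dispersion measures are essentially zero.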

Configuration

logging:
  metrics_level: minimal       # "minimal" | "advanced"
  tensorboard:
    enabled: true              # TensorBoard logging (default: false)
    log_dir: "outputs/tb"      # TensorBoard log directory
  log_interval: 50             # Log to TensorBoard every N steps
  max_experts_log: 16          # Detailed log for top N experts in advanced mode

CLI override:

# Override config file value to use advanced
eulerforge train --preset PRESET.yml --metrics-level advanced

# Also possible with --set
eulerforge train --preset PRESET.yml --set logging.metrics_level=advanced
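
The dotted key passed to --set addresses a nested entry in the config. How EulerForge parses these assignments internally is not documented here, so the following is only a sketch of the idea, using a hypothetical apply_override helper that leaves values as strings and skips type coercion.

```python
def apply_override(config, assignment):
    """Apply one 'dotted.key=value' assignment to a nested dict in place."""
    dotted_key, _, value = assignment.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Walk down the tree, creating intermediate sections if missing.
        node = node.setdefault(name, {})
    node[leaf] = value  # kept as a string; type coercion is omitted here
    return config


config = {"logging": {"metrics_level": "minimal", "tensorboard": {"enabled": False}}}
apply_override(config, "logging.metrics_level=advanced")
apply_override(config, "logging.tensorboard.log_dir=outputs/tb")
```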

TensorBoard Installation and Usage

# Installation
pip install eulerforge[tb]

# Run training
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_sft.yml \
    --set logging.tensorboard.enabled=true \
    --set logging.tensorboard.log_dir=outputs/tb \
    --metrics-level advanced

# Start TensorBoard
tensorboard --logdir outputs/tb

If the tensorboard package is not installed, EulerForge prints a warning, skips TensorBoard logging, and training proceeds normally.
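
The warn-and-continue behavior and the log_interval setting can be pictured with the sketch below. It assumes the writer is PyTorch's torch.utils.tensorboard.SummaryWriter, which this page does not confirm. The helper names make_writer and log_scalars are hypothetical.

```python
import warnings


def make_writer(log_dir):
    """Return a SummaryWriter, or None when TensorBoard support is missing."""
    try:
        # Assumption: the PyTorch writer is used; EulerForge may differ.
        from torch.utils.tensorboard import SummaryWriter
    except ImportError:
        warnings.warn("tensorboard is not installed; TensorBoard logging is disabled")
        return None
    return SummaryWriter(log_dir=log_dir)


def log_scalars(writer, metrics, step, log_interval=50):
    """Write metrics every log_interval steps; return True if written."""
    if writer is None or step % log_interval != 0:
        return False
    for tag, value in metrics.items():
        writer.add_scalar(tag, value, global_step=step)
    return True
```

A training loop would call make_writer("outputs/tb") once and then log_scalars(writer, {"train/main_loss": loss}, step) after each step. When the writer is None, every call is a no-op.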


Validation

The logging section is checked during config validation. An unrecognized metrics_level produces an error like this:

Logging Config: Unknown metrics_level 'invalid'. Valid: minimal, advanced
Fix: Set logging.metrics_level to 'minimal' or 'advanced'
See: docs/tutorials/10_metrics_monitoring.md
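
A check of this kind can be sketched as follows. The function validate_logging_config is hypothetical and only mirrors the first line of the message shown above. It treats a missing metrics_level as minimal, which is the documented default.

```python
VALID_METRICS_LEVELS = ("minimal", "advanced")


def validate_logging_config(logging_cfg):
    """Return a list of error messages for the logging section (empty if valid)."""
    errors = []
    level = logging_cfg.get("metrics_level", "minimal")  # documented default
    if level not in VALID_METRICS_LEVELS:
        errors.append(
            f"Unknown metrics_level '{level}'. "
            f"Valid: {', '.join(VALID_METRICS_LEVELS)}"
        )
    return errors


errors = validate_logging_config({"metrics_level": "invalid"})
```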