
11. Inference Benchmark

Overview

eulerforge bench evaluates the inference quality of fine-tuned models. With a single YAML spec you configure three model roles (target, baseline, judge), and the tool automatically runs response generation → comparison → evaluation.

Target model specification methods:

| Method | Use Case |
|---|---|
| target.model (API model) | Use an Ollama/OpenAI/Gemini server model |
| target.output_dir (training output) | Auto-load a checkpoint from an eulerforge train output directory |
| target.model_dir (HF directory) | Point directly at a save_pretrained() result |

Preparation

1. Install Ollama and Pull Models

# Install Ollama (https://ollama.ai)
curl -fsSL https://ollama.ai/install.sh | sh

# Download models
ollama pull qwen3:0.6b   # target
ollama pull qwen3:4b     # baseline (optional)
ollama pull gemma3:27b   # judge (optional)

2. Bench Data

EulerForge uses pre-generated bench data in the data/ directory:

| File | Format | Task |
|---|---|---|
| data/sft_1k_bench_raw.jsonl | {prompt, response} | sft |
| data/dpo_1k_bench_raw.jsonl | {prompt, chosen, rejected} | preference |
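The bench data files hold one JSON object per line. A minimal sketch of reading them (the filenames come from the table above; the helper itself is illustrative, not EulerForge's loader):

```python
import json

def load_bench_records(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# An sft record carries {prompt, response};
# a preference record carries {prompt, chosen, rejected}.
sft_record = {"prompt": "What is 2+2?", "response": "4"}
print(json.dumps(sft_record))
```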

Quick Start

Evaluate Target Model Only

eulerforge bench --preset configs/bench/sft_target_only.yml

Configuration Override

# Change sample count
eulerforge bench --preset configs/bench/sft_target_only.yml --set bench.sample.k=20

# Change model
eulerforge bench --preset configs/bench/sft_target_only.yml --set bench.models.target.model=qwen3:4b

Dry Run (Check samples without model calls)

eulerforge bench --preset configs/bench/sft_target_only.yml --dry-run

Execution Modes

1. Target Only

Generates and outputs only the target model's responses.

bench:
  models:
    target:
      model: "qwen3:0.6b"
    baseline:
      enabled: false
    judge:
      enabled: false

2. Target + Baseline Comparison

Outputs responses from both models side by side.

# Method A: Use an Ollama model as baseline
bench:
  models:
    target:
      model: "qwen3:0.6b"
    baseline:
      enabled: true
      provider: ollama
      model: "qwen3:4b"
    judge:
      enabled: false
# Method B: Load HF model directly as baseline (inference with transformers on GPU)
# Useful when you want to compare against the same base model used for fine-tuning
bench:
  models:
    target:
      model_dir: "outputs/run_20260330_204433/final"
      device: "cuda:0"
      dtype: "float16"
    baseline:
      enabled: true
      model_dir: "Qwen/Qwen2.5-0.5B"   # HF Hub name or local path
      device: "cuda:0"
      dtype: "float16"
    judge:
      enabled: false

Note: When model_dir is specified for baseline, it loads directly via HF transformers without a provider. This allows comparison against the exact same base model instead of Ollama's instruction-tuned model. To prevent OOM, target and baseline are loaded sequentially.

3. Pointwise Judge

The judge model evaluates target responses on a 1-10 scale. If baseline is enabled, both target and baseline receive separate pointwise evaluations.

bench:
  models:
    target:
      model: "qwen3:0.6b"
    baseline:
      enabled: true        # Baseline can also be enabled in pointwise mode
      model: "qwen3.5:0.8b"
    judge:
      enabled: true
      model: "gpt-oss:20b"
      mode: pointwise

In pointwise mode with baseline enabled:
  • Target and baseline responses are each evaluated independently on a 1-10 scale.
  • The summary shows the target average and the baseline average separately.
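The per-model aggregation can be sketched as follows; the scores are made up, and `summarize` is an illustrative helper, not EulerForge code. It produces the same avg/min/max triple that appears in the Bench Summary later in this tutorial:

```python
# Aggregate pointwise judge scores (1-10) per model, as in the summary output.
target_scores = [7, 4, 9, 6, 6]     # illustrative judge scores
baseline_scores = [8, 6, 9, 8, 8]

def summarize(scores):
    return {"avg": sum(scores) / len(scores), "min": min(scores), "max": max(scores)}

print("target  ", summarize(target_scores))
print("baseline", summarize(baseline_scores))
```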

4. Pairwise Judge

The judge model performs comparative evaluation of target vs baseline. With mitigate_position_bias: true, it runs A/B swap evaluation twice to mitigate position bias.

bench:
  models:
    target:
      model: "qwen3:0.6b"
    baseline:
      enabled: true
      model: "qwen3:4b"
    judge:
      enabled: true
      model: "gemma3:27b"
      mode: pairwise
      mitigate_position_bias: true
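The A/B swap logic behind mitigate_position_bias can be sketched like this. The `judge` callable is a hypothetical stand-in for the real judge-model call; only the swap-and-compare structure reflects the behavior described above:

```python
# Pairwise judging with position-bias mitigation: the judge sees (A, B) and
# then (B, A); the verdict counts only if it is stable under the swap.
def pairwise_verdict(judge, target_resp, baseline_resp):
    first = judge(target_resp, baseline_resp)   # target shown in position A
    second = judge(baseline_resp, target_resp)  # positions swapped
    if first == "A" and second == "B":
        return "target"
    if first == "B" and second == "A":
        return "baseline"
    return "tie"  # verdict flipped with position -> treat as a tie

# A toy judge that always prefers the longer response, for illustration:
longer = lambda a, b: "A" if len(a) >= len(b) else "B"
print(pairwise_verdict(longer, "short", "a much longer answer"))  # baseline
```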

Using Local HF Models (Evaluating Training Results)

You can benchmark checkpoints generated after eulerforge train directly.

Automatic LoRA/MoE Checkpoint Handling: If the checkpoint contains LoRA or MoE keys, they are handled automatically. Loading approach by strategy:

| Strategy | Saved Structure | Bench Loading | Inference Model |
|---|---|---|---|
| dense_lora | base + LoRA (per-Linear) | base + (B @ A) * (α/r) | dense model |
| mixture_lora | base(1) + router + N LoRA experts | MixtureLoRA structure reconstruction → Attn LoRA merge | MixtureLoRA model (structure preserved) |
| moe_expert_lora | N expert FFNs + router | MoE structure reconstruction → LoRA merge → load | MoE model (structure preserved) |
| moe_expert_lora + handoff | N expert FFNs (no LoRA) + router | MoE structure reconstruction → direct load | MoE model (structure preserved) |
  • moe_expert_lora: Reads injection settings (num_experts, top_k, etc.) from resolved_config.json to reconstruct the MoE architecture. Infers with all N experts + router structure intact, without averaging experts.
  • mixture_lora: Reads injection settings from resolved_config.json and reconstructs the MixtureLoRA structure using build_mixture_lora_for_ffn_layers(). Preserves the router + N LoRA experts structure to maintain routing diversity. Falls back to expert averaging → dense model conversion if resolved_config.json is missing.

Automatic Quantized Checkpoint Handling: Checkpoints quantized with model.load_precision are automatically dequantized by _dequantize_bnb_state_dict() before LoRA merging:
  • int4/nf4: packed weight (N, 1) → dequantize_4bit() → full precision. bf16-cast packed data from handoff is detected automatically via a numel comparison.
  • int8: .SCB companion key (row-wise scale factor) detected → int8 * (SCB / 127) → bfloat16. The .SCB and .weight_format companion keys are removed automatically.
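The int8 path amounts to rescaling each output row by its .SCB factor. A minimal sketch with made-up values (NumPy has no bfloat16, so float32 stands in for the target dtype):

```python
import numpy as np

# int8 dequantization: w * (SCB / 127), one scale factor per output row.
w_int8 = np.array([[127, -127], [64, 0]], dtype=np.int8)
scb = np.array([0.5, 2.0], dtype=np.float32)  # row-wise scale factors (.SCB)

w_deq = w_int8.astype(np.float32) * scb[:, None] / 127.0
print(w_deq)
```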

Automatic Key Prefix Normalization: Some models like Qwen3.5 have checkpoint keys (model.language_model.*) that differ from keys expected by from_config() (model.*). Prefixes are automatically mapped after merging.
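The prefix mapping is a plain key rename over the state dict. A sketch with an illustrative helper and toy entries (the real implementation may differ):

```python
# Remap checkpoint keys under "model.language_model." to the "model." prefix
# expected by from_config(); all other keys pass through unchanged.
def normalize_prefix(state_dict, old="model.language_model.", new="model."):
    return {
        (new + k[len(old):]) if k.startswith(old) else k: v
        for k, v in state_dict.items()
    }

ckpt = {"model.language_model.layers.0.weight": 1, "lm_head.weight": 2}
print(normalize_prefix(ckpt))
```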

Weight tying (normal behavior): Models with tie_word_embeddings=True (such as Qwen3) do not save lm_head.weight in the checkpoint (shared with embed_tokens.weight). tie_weights() is automatically called during loading to restore it, so no "missing key" warning is displayed.

Automatic Embedding Resize: When the tokenizer has additional special tokens (<|im_start|>, <think>, etc.) and len(tokenizer) > config.vocab_size, resize_token_embeddings() is automatically called. Without this handling, tokens generated by apply_chat_template would exceed the embedding range, causing CUDA error: device-side assert triggered.

Automatic float16 → bfloat16 Conversion: When dtype: "float16" is specified, it is automatically converted to bfloat16. float16 has a limited range of ~65504, and in architectures with linear attention (Mamba) like Qwen3.5, internal states during autoregressive generation exceed this range, causing NaN → CUDA error: device-side assert triggered. bfloat16 uses the same 16-bit memory while offering a safe range of ~3.4e38.
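The overflow hazard is easy to demonstrate: float16 saturates to inf just past ~65504, while a wider format (float32 here, standing in for bfloat16, which NumPy lacks) stays finite:

```python
import numpy as np

# float16 overflows past ~65504; the same product is fine in a wider format.
x = np.float16(60000) * np.float16(2)
print(x)                      # inf: exceeded the float16 range
print(np.float32(60000) * 2)  # 120000.0: fine in float32
```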

Method A: Specify Training Output Directory

bench:
  models:
    target:
      output_dir: "outputs/run_20260301_120000"   # Training output root
      checkpoint: "final"                          # final | latest | best
      device: "auto"                               # Default: auto
      dtype: "auto"                                # Default: auto

Checkpoint auto-resolution order:

| checkpoint | Search Path |
|---|---|
| final (default) | {output_dir}/final/ |
| latest | {output_dir}/checkpoint-latest/ → latest checkpoint-N/ |
| best | {output_dir}/checkpoint-best/ |
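The resolution order above can be sketched as a small path helper. The directory names follow the table; the function itself is illustrative, not EulerForge's resolver:

```python
import os

def resolve_checkpoint(output_dir, checkpoint="final"):
    """Map a checkpoint type to its directory per the table above."""
    if checkpoint == "final":
        return os.path.join(output_dir, "final")
    if checkpoint == "best":
        return os.path.join(output_dir, "checkpoint-best")
    if checkpoint == "latest":
        link = os.path.join(output_dir, "checkpoint-latest")
        if os.path.isdir(link):
            return link
        # Fall back to the highest-numbered checkpoint-N directory.
        steps = sorted(
            int(d.split("-")[1])
            for d in os.listdir(output_dir)
            if d.startswith("checkpoint-") and d.split("-")[1].isdigit()
        )
        return os.path.join(output_dir, f"checkpoint-{steps[-1]}")
    raise ValueError(f"unknown checkpoint type: {checkpoint}")

print(resolve_checkpoint("outputs/run_20260301_120000"))
```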

Method B: Specify Model Directory Directly

bench:
  models:
    target:
      model_dir: "outputs/run_20260301_120000/final"
      device: "cuda:0"
      dtype: "float16"

CLI Flag Overrides

# Benchmark immediately after training completes
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-output-dir outputs/run_20260301_120000

# Use latest checkpoint (mid-training evaluation)
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-output-dir outputs/run_20260301_120000 --checkpoint latest

# Specify path directly
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-model-dir outputs/run_20260301_120000/final

# Dry run to check data
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-output-dir outputs/run_20260301_120000 --dry-run

# Specify GPU (load target on GPU 1 when judge uses GPU 0)
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-output-dir outputs/run_20260301_120000 --target-device cuda:1

Error Messages (3-line format)

When checkpoint is missing:

Bench Config: No checkpoint found in outputs/run_xxx (checkpoint=final): 'final/' directory missing
Fix: Run eulerforge train first, or use --checkpoint latest/best, or --target-model-dir
See: docs/tutorials/11_bench.md

Mutual exclusion error:

Bench Config: bench.models.target: exactly one of 'model', 'output_dir', 'model_dir' is allowed — found multiple
Fix: Remove all but one: model (API), output_dir (run dir), or model_dir (HF dir)
See: docs/tutorials/11_bench.md

External APIs (OpenAI / Gemini)

You can use OpenAI or Gemini as the judge model:

bench:
  models:
    judge:
      enabled: true
      provider: openai           # ollama | openai | gemini
      model: "gpt-4o"
      base_url: "https://api.openai.com/v1"
      api_key_env: "OPENAI_API_KEY"   # Read API key from environment variable
      mode: pointwise

export OPENAI_API_KEY="sk-..."
eulerforge bench --preset configs/bench/sft_with_openai_judge.yml

Output

Terminal Output

============================================================
[Sample 0]
Prompt: Important factors to consider when choosing recording equipment...

Target (qwen3:0.6b):
  Important factors to consider when choosing recording equipment...

Judge (pointwise): Target=7/10  Baseline=8/10
  [Target]   The response covers key factors clearly...
  [Baseline] The baseline provides a more structured answer...
============================================================
[Bench Summary]
  Task: sft
  Samples: 10
  Target: qwen3:0.6b
  Baseline: qwen3.5:0.8b
  Pointwise Target:   avg=6.4 min=4 max=9
  Pointwise Baseline: avg=7.8 min=6 max=9
============================================================

Saved Files

outputs/bench/
├── bench_results.jsonl           # Individual results (prompts, responses, scores)
├── bench_summary.json            # Summary statistics
└── bench_resolved_config.json    # Snapshot of the configuration used

Provided Example YAMLs

| File | Description |
|---|---|
| configs/bench/sft_target_only.yml | SFT target only |
| configs/bench/sft_with_judge.yml | SFT + pointwise judge |
| configs/bench/preference_pairwise.yml | Preference + pairwise judge |

CLI Options

| Option | Description |
|---|---|
| --preset PATH | Bench YAML spec file (required) |
| --set KEY=VALUE | Configuration override (repeatable) |
| --output-dir DIR | Result output directory |
| --validate-only | Perform config validation only |
| --dry-run | Sample extraction only (no model calls) |
| --target-output-dir PATH | Target local model: training output root directory |
| --checkpoint TYPE | Checkpoint type: final (default) \| latest \| best |
| --target-model-dir PATH | Target local model: HF save_pretrained directory, specified directly |
| --target-device DEVICE | Target local model device override (e.g., cuda:0, cuda:1, cpu). Default: auto |

Sequential Model Loading (OOM Prevention)

Bench uses up to 3 models (target, baseline, judge). To prevent GPU memory overflow (OOM), models are loaded one at a time, processing all data before unloading.

Phase 1: Load target → Infer all samples → Unload
Phase 2: Load baseline → Infer all samples → Unload
Phase 3: Load judge → Evaluate all samples → Unload
Phase 4: Aggregate results → Same output as before
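The phased execution can be sketched as a loop that keeps one model resident at a time. The load/infer callables below are hypothetical stand-ins; only the one-model-at-a-time structure mirrors the phases above:

```python
# Run each model phase over all samples before the next model is loaded.
def run_phases(samples, phases):
    results = {name: [] for name, _ in phases}
    for name, infer in phases:          # one model resident at a time
        for s in samples:
            results[name].append(infer(s))
        # Real code would unload the model and free GPU cache here.
    return results

phases = [
    ("target", lambda s: s.upper()),    # stand-in for target inference
    ("baseline", lambda s: s.lower()),  # stand-in for baseline inference
]
print(run_phases(["Hello"], phases))
```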

Details: cli.md


Ollama Thinking Model Support

Thinking models like Qwen 3.5 return responses from Ollama with an empty content field and the thinking process in the reasoning field.

EulerForge bench directly calls the native API (/api/chat) when using the Ollama provider and sends think: false to disable thinking. This ensures that thinking models also return responses normally in the content field.

Even if the config's base_url ends with /v1, the Ollama provider automatically strips /v1 and calls the native API.
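The /v1 stripping amounts to a small URL normalization. An illustrative sketch (the helper name is hypothetical; only the trailing-/v1 behavior follows the description above):

```python
# Normalize a base_url for the native Ollama API: drop a trailing /v1
# (and any trailing slash) so /api/chat can be appended.
def native_base_url(base_url):
    url = base_url.rstrip("/")
    if url.endswith("/v1"):
        url = url[: -len("/v1")]
    return url

print(native_base_url("http://localhost:11434/v1"))
```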