11. Inference Benchmark
Overview
`eulerforge bench` evaluates the inference quality of fine-tuned models. From a single YAML spec you configure up to three model roles (target, baseline, judge), and the tool automatically runs response generation, comparison, and evaluation.
Target model specification methods:
| Method | Use Case |
|---|---|
| `target.model` (API model) | When using an Ollama/OpenAI/Gemini server model |
| `target.output_dir` (training output) | Auto-load checkpoint from an `eulerforge train` output directory |
| `target.model_dir` (HF directory) | When directly specifying a `save_pretrained()` result |
Preparation
1. Install Ollama and Pull Models
```bash
# Install Ollama (https://ollama.ai)
curl -fsSL https://ollama.ai/install.sh | sh

# Download models
ollama pull qwen3:0.6b   # target
ollama pull qwen3:4b     # baseline (optional)
ollama pull gemma3:27b   # judge (optional)
```
2. Bench Data
EulerForge uses pre-generated bench data in the `data/` directory:

| File | Format | Task |
|---|---|---|
| `data/sft_1k_bench_raw.jsonl` | `{prompt, response}` | sft |
| `data/dpo_1k_bench_raw.jsonl` | `{prompt, chosen, rejected}` | preference |
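For reference, each line of these files is a standalone JSON object. A minimal sketch of parsing one line of each format (the sample strings below are made up for illustration):

```python
import json

# One line per sample; field names follow the table above.
sft_line = '{"prompt": "What is 2+2?", "response": "4"}'
dpo_line = '{"prompt": "Pick one", "chosen": "A", "rejected": "B"}'

sft = json.loads(sft_line)
dpo = json.loads(dpo_line)
print(sorted(sft.keys()))  # ['prompt', 'response']
print(sorted(dpo.keys()))  # ['chosen', 'prompt', 'rejected']
```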
Quick Start
Evaluate Target Model Only
```bash
eulerforge bench --preset configs/bench/sft_target_only.yml
```
Configuration Override
```bash
# Change sample count
eulerforge bench --preset configs/bench/sft_target_only.yml --set bench.sample.k=20

# Change model
eulerforge bench --preset configs/bench/sft_target_only.yml --set bench.models.target.model=qwen3:4b
```
Dry Run (Check samples without model calls)
```bash
eulerforge bench --preset configs/bench/sft_target_only.yml --dry-run
```
Execution Modes
1. Target Only
Generates and outputs only the target model's responses.
```yaml
bench:
  models:
    target:
      model: "qwen3:0.6b"
    baseline:
      enabled: false
    judge:
      enabled: false
```
2. Target + Baseline Comparison
Outputs responses from both models side by side.
```yaml
# Method A: Use an Ollama model as baseline
bench:
  models:
    target:
      model: "qwen3:0.6b"
    baseline:
      enabled: true
      provider: ollama
      model: "qwen3:4b"
    judge:
      enabled: false
```
```yaml
# Method B: Load an HF model directly as baseline (inference with transformers on GPU)
# Useful when you want to compare against the same base model used for fine-tuning
bench:
  models:
    target:
      model_dir: "outputs/run_20260330_204433/final"
      device: "cuda:0"
      dtype: "float16"
    baseline:
      enabled: true
      model_dir: "Qwen/Qwen2.5-0.5B"  # HF Hub name or local path
      device: "cuda:0"
      dtype: "float16"
    judge:
      enabled: false
```
Note: When `model_dir` is specified for the baseline, it is loaded directly via HF transformers without a `provider`. This allows comparison against the exact same base model instead of Ollama's instruction-tuned variant. To prevent OOM, target and baseline are loaded sequentially.
3. Pointwise Judge
The judge model evaluates target responses on a 1-10 scale. If baseline is enabled, both target and baseline receive separate pointwise evaluations.
```yaml
bench:
  models:
    target:
      model: "qwen3:0.6b"
    baseline:
      enabled: true   # Baseline can also be enabled in pointwise mode
      model: "qwen3.5:0.8b"
    judge:
      enabled: true
      model: "gpt-oss:20b"
      mode: pointwise
```
In pointwise mode with baseline enabled:
- Target and baseline responses are each evaluated independently on a 1-10 scale
- The summary shows the target average and baseline average separately
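As an illustrative sketch (not EulerForge's actual code) of how the pointwise summary statistics follow from per-sample scores:

```python
# Hypothetical pointwise scores (1-10 scale) for target and baseline.
target_scores = [4, 7, 9, 6, 6]
baseline_scores = [6, 8, 9, 8, 8]

def summarize(scores):
    """Compute the avg/min/max triple reported in the bench summary."""
    return {"avg": sum(scores) / len(scores), "min": min(scores), "max": max(scores)}

print(summarize(target_scores))    # {'avg': 6.4, 'min': 4, 'max': 9}
print(summarize(baseline_scores))  # {'avg': 7.8, 'min': 6, 'max': 9}
```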
4. Pairwise Judge
The judge model performs comparative evaluation of target vs baseline. With mitigate_position_bias: true, it runs A/B swap evaluation twice to mitigate position bias.
```yaml
bench:
  models:
    target:
      model: "qwen3:0.6b"
    baseline:
      enabled: true
      model: "qwen3:4b"
    judge:
      enabled: true
      model: "gemma3:27b"
      mode: pairwise
      mitigate_position_bias: true
```
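The idea behind `mitigate_position_bias` can be sketched as follows. The `judge_pair` stand-in below is a fake judge for illustration (the real judge is an LLM call), and the tie-on-disagreement rule is an assumption of this sketch:

```python
def judge_pair(first, second):
    # Stand-in for a judge call that returns "first" or "second".
    # This fake judge simply prefers the longer response.
    return "first" if len(first) >= len(second) else "second"

def pairwise_with_swap(target, baseline):
    """Judge twice with positions swapped; count a win only if both rounds agree."""
    r1 = judge_pair(target, baseline)   # round 1: target shown first
    r2 = judge_pair(baseline, target)   # round 2: positions swapped
    target_wins_r1 = (r1 == "first")
    target_wins_r2 = (r2 == "second")
    if target_wins_r1 and target_wins_r2:
        return "target"
    if not target_wins_r1 and not target_wins_r2:
        return "baseline"
    return "tie"  # inconsistent verdicts suggest position bias

print(pairwise_with_swap("a longer answer", "short"))  # target
print(pairwise_with_swap("same", "same"))              # tie
```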
Using Local HF Models (Evaluating Training Results)
You can benchmark checkpoints generated after eulerforge train directly.
Automatic LoRA/MoE Checkpoint Handling: If the checkpoint contains LoRA or MoE keys, they are handled automatically. Loading approach by strategy:
| Strategy | Saved Structure | Bench Loading | Inference Model |
|---|---|---|---|
| dense_lora | base + LoRA (per-Linear) | `base + (B @ A) * (α/r)` | dense model |
| mixture_lora | base(1) + router + N LoRA experts | MixtureLoRA structure reconstruction → Attn LoRA merge | MixtureLoRA model (structure preserved) |
| moe_expert_lora | N expert FFNs + router | MoE structure reconstruction → LoRA merge → load | MoE model (structure preserved) |
| moe_expert_lora + handoff | N expert FFNs (no LoRA) + router | MoE structure reconstruction → direct load | MoE model (structure preserved) |
- moe_expert_lora: Reads injection settings (`num_experts`, `top_k`, etc.) from `resolved_config.json` to reconstruct the MoE architecture. Infers with the full N experts + router structure intact, without averaging experts.
- mixture_lora: Reads injection settings from `resolved_config.json` and reconstructs the MixtureLoRA structure using `build_mixture_lora_for_ffn_layers()`. Preserves the router + N LoRA experts structure to maintain routing diversity. Falls back to expert averaging → dense model conversion if `resolved_config.json` is missing.

Automatic Quantized Checkpoint Handling: Checkpoints quantized with `model.load_precision` are automatically dequantized by `_dequantize_bnb_state_dict()` before LoRA merging:
- int4/nf4: packed weight `(N, 1)` → `dequantize_4bit()` → full precision. bf16-cast packed data from handoff is detected automatically via numel comparison
- int8: `.SCB` companion key (row-wise scale factor) is detected → `int8 * (SCB / 127)` → bfloat16. The `.SCB` + `.weight_format` companion keys are automatically removed

Automatic Key Prefix Normalization: Some models, such as Qwen3.5, have checkpoint keys (`model.language_model.*`) that differ from the keys expected by `from_config()` (`model.*`). Prefixes are automatically mapped after merging.

Weight tying (normal behavior): Models with `tie_word_embeddings=True` (such as Qwen3) do not save `lm_head.weight` in the checkpoint (it is shared with `embed_tokens.weight`). `tie_weights()` is called automatically during loading to restore it, so no "missing key" warning is displayed.

Automatic Embedding Resize: When the tokenizer has additional special tokens (`<|im_start|>`, `<think>`, etc.) and `len(tokenizer) > config.vocab_size`, `resize_token_embeddings()` is called automatically. Without this handling, tokens generated by `apply_chat_template` would exceed the embedding range, causing `CUDA error: device-side assert triggered`.

Automatic float16 → bfloat16 Conversion: When `dtype: "float16"` is specified, it is automatically converted to bfloat16. float16 has a limited range of ~65504, and in architectures with linear attention (Mamba) such as Qwen3.5, internal states during autoregressive generation can exceed this range, causing NaN → `CUDA error: device-side assert triggered`. bfloat16 uses the same 16 bits of memory while offering a safe range of ~3.4e38.
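For reference, the dense_lora row in the table above uses the standard LoRA merge formula, W' = W + (B @ A) * (α / r). A tiny pure-Python sketch with made-up shapes and values:

```python
def matmul(X, Y):
    """Plain list-of-lists matrix multiply, enough for this toy example."""
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

# Toy base weight W (2x2), LoRA factors B (2x1) and A (1x2), rank r=1, alpha=2.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [0.25]]
A = [[2.0, 4.0]]
alpha, r = 2.0, 1

delta = matmul(B, A)   # (B @ A), shape 2x2
scale = alpha / r      # LoRA scaling factor α/r
W_merged = [[W[i][j] + delta[i][j] * scale for j in range(2)] for i in range(2)]
print(W_merged)  # [[3.0, 4.0], [1.0, 3.0]]
```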
Method A: Specify Training Output Directory
```yaml
bench:
  models:
    target:
      output_dir: "outputs/run_20260301_120000"  # Training output root
      checkpoint: "final"  # final | latest | best
      device: "auto"       # Default: auto
      dtype: "auto"        # Default: auto
```
Checkpoint auto-resolution order:
| checkpoint | Search Path |
|---|---|
| `final` (default) | `{output_dir}/final/` |
| `latest` | `{output_dir}/checkpoint-latest/` → latest `checkpoint-N/` |
| `best` | `{output_dir}/checkpoint-best/` |
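A sketch of this resolution order (hypothetical helper; the real resolver is internal to EulerForge):

```python
import os
import re

def resolve_checkpoint(output_dir, checkpoint="final"):
    """Map a checkpoint type to a directory, following the table above (sketch)."""
    if checkpoint == "final":
        return os.path.join(output_dir, "final")
    if checkpoint == "best":
        return os.path.join(output_dir, "checkpoint-best")
    if checkpoint == "latest":
        fixed = os.path.join(output_dir, "checkpoint-latest")
        if os.path.isdir(fixed):
            return fixed
        # Fall back to the highest-numbered checkpoint-N/ directory.
        nums = []
        for name in os.listdir(output_dir):
            m = re.fullmatch(r"checkpoint-(\d+)", name)
            if m and os.path.isdir(os.path.join(output_dir, name)):
                nums.append(int(m.group(1)))
        if nums:
            return os.path.join(output_dir, f"checkpoint-{max(nums)}")
    raise FileNotFoundError(f"No checkpoint found in {output_dir} (checkpoint={checkpoint})")
```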
Method B: Specify Model Directory Directly
```yaml
bench:
  models:
    target:
      model_dir: "outputs/run_20260301_120000/final"
      device: "cuda:0"
      dtype: "float16"
```
CLI Flag Overrides
```bash
# Benchmark immediately after training completes
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-output-dir outputs/run_20260301_120000

# Use latest checkpoint (mid-training evaluation)
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-output-dir outputs/run_20260301_120000 --checkpoint latest

# Specify path directly
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-model-dir outputs/run_20260301_120000/final

# Dry run to check data
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-output-dir outputs/run_20260301_120000 --dry-run

# Specify GPU (load target on GPU 1 when the judge uses GPU 0)
eulerforge bench --preset configs/bench/sft_local.yml \
  --target-output-dir outputs/run_20260301_120000 --target-device cuda:1
```
Error Messages (3-line format)
When checkpoint is missing:
```text
Bench Config: No checkpoint found in outputs/run_xxx (checkpoint=final): 'final/' directory missing
Fix: Run eulerforge train first, or use --checkpoint latest/best, or --target-model-dir
See: docs/tutorials/11_bench.md
```
Mutual exclusion error:
```text
Bench Config: bench.models.target: exactly one of 'model', 'output_dir', 'model_dir' is allowed — found multiple
Fix: Remove all but one: model (API), output_dir (run dir), or model_dir (HF dir)
See: docs/tutorials/11_bench.md
```
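The mutual-exclusion rule can be sketched as a small validator (hypothetical helper name; the error text mirrors the message above):

```python
def validate_target_spec(target):
    """Exactly one of model / output_dir / model_dir must be set (sketch)."""
    keys = [k for k in ("model", "output_dir", "model_dir") if target.get(k)]
    if len(keys) != 1:
        found = "multiple" if len(keys) > 1 else "none"
        raise ValueError(
            "bench.models.target: exactly one of 'model', 'output_dir', "
            f"'model_dir' is allowed — found {found}"
        )
    return keys[0]

print(validate_target_spec({"model": "qwen3:0.6b"}))  # model
```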
External APIs (OpenAI / Gemini)
You can use OpenAI or Gemini as the judge model:
```yaml
bench:
  models:
    judge:
      enabled: true
      provider: openai   # ollama | openai | gemini
      model: "gpt-4o"
      base_url: "https://api.openai.com/v1"
      api_key_env: "OPENAI_API_KEY"  # Read API key from environment variable
      mode: pointwise
```
```bash
export OPENAI_API_KEY="sk-..."
eulerforge bench --preset configs/bench/sft_with_openai_judge.yml
```
Output
Terminal Output
```text
============================================================
[Sample 0]
Prompt: Important factors to consider when choosing recording equipment...
Target (qwen3:0.6b):
Important factors to consider when choosing recording equipment...
Judge (pointwise): Target=7/10 Baseline=8/10
[Target] The response covers key factors clearly...
[Baseline] The baseline provides a more structured answer...
============================================================
[Bench Summary]
Task: sft
Samples: 10
Target: qwen3:0.6b
Baseline: qwen3.5:0.8b
Pointwise Target: avg=6.4 min=4 max=9
Pointwise Baseline: avg=7.8 min=6 max=9
============================================================
```
Saved Files
```text
outputs/bench/
├── bench_results.jsonl          # Individual results (prompts, responses, scores)
├── bench_summary.json           # Summary statistics
└── bench_resolved_config.json   # Snapshot of the configuration used
```
Provided Example YAMLs
| File | Description |
|---|---|
| `configs/bench/sft_target_only.yml` | SFT target only |
| `configs/bench/sft_with_judge.yml` | SFT + pointwise judge |
| `configs/bench/preference_pairwise.yml` | Preference + pairwise judge |
CLI Options
| Option | Description |
|---|---|
| `--preset PATH` | Bench YAML spec file (required) |
| `--set KEY=VALUE` | Configuration override (repeatable) |
| `--output-dir DIR` | Result output directory |
| `--validate-only` | Perform config validation only |
| `--dry-run` | Sample extraction only (no model calls) |
| `--target-output-dir PATH` | Target local model: training output root directory |
| `--checkpoint TYPE` | Checkpoint type: `final` (default) \| `latest` \| `best` |
| `--target-model-dir PATH` | Target local model: directly specified HF `save_pretrained` directory |
| `--target-device DEVICE` | Target local model device override (e.g., `cuda:0`, `cuda:1`, `cpu`). Default: `auto` |
Sequential Model Loading (OOM Prevention)
Bench uses up to 3 models (target, baseline, judge). To prevent GPU memory overflow (OOM), models are loaded one at a time, processing all data before unloading.
```text
Phase 1: Load target   → Infer all samples    → Unload
Phase 2: Load baseline → Infer all samples    → Unload
Phase 3: Load judge    → Evaluate all samples → Unload
Phase 4: Aggregate results → Same output as before
```
- No more than 2 models exist in memory simultaneously
- API clients (Ollama/OpenAI/Gemini) do not use GPU resources, so unload is a no-op for them
- Local HF models (`LocalHFClient`) delete the model/tokenizer and call `torch.cuda.empty_cache()` on unload
- Result output (JSONL, terminal, summary) is completely identical to the previous format
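The phased flow above can be sketched as a simple load → process → unload loop (illustrative only; the real unload for local HF models frees GPU memory as described above):

```python
def run_phase(name, samples, log):
    """Sketch of one phase: load a model, process every sample, then unload."""
    log.append(f"load {name}")
    outputs = [f"{name}:{s}" for s in samples]  # stand-in for generate/judge calls
    log.append(f"unload {name}")                # real code frees GPU memory here
    return outputs

log, samples = [], ["p0", "p1"]
for name in ("target", "baseline", "judge"):
    run_phase(name, samples, log)
print(log)
```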
Details: cli.md
Ollama Thinking Model Support
Thinking models like Qwen 3.5 return responses from Ollama with an empty `content` field and the thinking process in the `reasoning` field.
EulerForge bench calls the native API (`/api/chat`) directly when using the Ollama provider and sends `think: false` to disable thinking. This ensures that thinking models also return normal responses in the `content` field.
- Ollama provider: uses the `/api/chat` native API (supports disabling thinking)
- OpenAI/Gemini provider: uses the `/chat/completions` OpenAI-compatible API

Even if the config's `base_url` ends with `/v1`, the Ollama provider automatically strips `/v1` and calls the native API.
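That URL normalization can be sketched as follows (hypothetical helper name, not the actual function):

```python
def native_ollama_url(base_url):
    """Strip a trailing /v1 so the Ollama native /api/chat endpoint is called."""
    base = base_url.rstrip("/")
    if base.endswith("/v1"):
        base = base[:-len("/v1")]
    return base + "/api/chat"

print(native_ollama_url("http://localhost:11434/v1"))  # http://localhost:11434/api/chat
print(native_ollama_url("http://localhost:11434"))     # http://localhost:11434/api/chat
```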