# 15. Loading Models

## Overview

`eulerforge.load_model()` is a public API that loads EulerForge-trained checkpoints in a single line. It automatically detects the training strategy (dense LoRA, MoE expert LoRA, MixtureLoRA, or none) and selects the appropriate loading method.
- Suitable for: Post-training inference, model inspection, batch generation, quantized loading
- Compatible strategies: `dense_lora`, `moe_expert_lora`, `mixture_lora`, `none` (all auto-detected)
- Reference examples: `examples/load_and_chat.py`, `examples/load_quantized.py`
## Prerequisites
- EulerForge installed
- A trained checkpoint (`run_dir` or `checkpoint_dir`)
- (Optional) For int4/int8 quantization: `pip install bitsandbytes`
## 1. Basic Usage

```python
from eulerforge import load_model

result = load_model("outputs/run_20260311_163425")

# Return value
result.model      # nn.Module (eval mode)
result.tokenizer  # PreTrainedTokenizer
result.metadata   # ModelMetadata
```
`load_model()` auto-detects the path type:

| Path Type | Detection Criterion | Example |
|---|---|---|
| `run_dir` | `resolved_config.json` exists | `outputs/run_20260311_163425` |
| `checkpoint_dir` | `config.json` exists | `outputs/run_20260311_163425/final` |
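The detection rule can be sketched as a small helper. This is purely illustrative; `detect_path_type` is a hypothetical name, not part of the eulerforge API:

```python
from pathlib import Path

def detect_path_type(path: str) -> str:
    """Classify a path as run_dir or checkpoint_dir (illustrative sketch)."""
    p = Path(path)
    if (p / "resolved_config.json").is_file():
        return "run_dir"
    if (p / "config.json").is_file():
        return "checkpoint_dir"
    raise ValueError(f"Cannot identify EulerForge checkpoint: {path}")
```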
## 2. Checkpoint Selection

When specifying a `run_dir`, you can select a checkpoint with the `checkpoint` parameter.
```python
# final (default)
result = load_model("outputs/run_20260311_163425", checkpoint="final")

# best checkpoint
result = load_model("outputs/run_20260311_163425", checkpoint="best")

# latest checkpoint
result = load_model("outputs/run_20260311_163425", checkpoint="latest")
```
| checkpoint | Subdirectory |
|---|---|
| `"final"` | `{run_dir}/final/` |
| `"best"` | `{run_dir}/checkpoint-best/` |
| `"latest"` | `{run_dir}/checkpoint-latest/` |
If you specify a `checkpoint_dir` directly, the `checkpoint` parameter is ignored:

```python
result = load_model("outputs/run_20260311_163425/final")
```
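The name-to-subdirectory mapping in the table above can be sketched as follows. `resolve_checkpoint_dir` is a hypothetical helper for illustration, not eulerforge's internal code:

```python
from pathlib import Path

# Subdirectory for each checkpoint name (from the table above)
_CHECKPOINT_SUBDIRS = {
    "final": "final",
    "best": "checkpoint-best",
    "latest": "checkpoint-latest",
}

def resolve_checkpoint_dir(run_dir: str, checkpoint: str = "final") -> str:
    """Map a checkpoint name to its subdirectory under run_dir (sketch)."""
    if checkpoint not in _CHECKPOINT_SUBDIRS:
        raise ValueError(f"Unknown checkpoint: {checkpoint!r}")
    return str(Path(run_dir) / _CHECKPOINT_SUBDIRS[checkpoint])
```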
## 3. Load Precision (Quantization)

Use the `load_precision` parameter to specify the model's loading precision.
```python
# 4-bit quantization (NF4 + double quantization)
result = load_model("outputs/run_20260311_163425", load_precision="int4")

# 8-bit quantization
result = load_model("outputs/run_20260311_163425", load_precision="int8")

# bfloat16 (Ampere+ GPUs)
result = load_model("outputs/run_20260311_163425", load_precision="bf16")
```
### Supported Precisions

| Value | Description | Requirements |
|---|---|---|
| `None` | Model's original precision (default) | -- |
| `"fp32"` | float32 | -- |
| `"fp16"` | float16 | GPU recommended |
| `"bf16"` | bfloat16 | Ampere+ GPU |
| `"int8"` | 8-bit quantization | bitsandbytes |
| `"int4"` | 4-bit NF4 quantization | bitsandbytes |
### int4 Quantization Configuration

When `int4` is specified, the following `BitsAndBytesConfig` is applied automatically:

```python
BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
```
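One way the precision table might translate into loading keyword arguments is sketched below. `precision_kwargs` is an assumed name, not part of eulerforge; string dtype names stand in for the real torch dtypes to keep the sketch framework-agnostic:

```python
def precision_kwargs(load_precision=None):
    """Sketch: translate a load_precision string into from_pretrained-style
    keyword arguments (hypothetical helper; real loader may differ)."""
    dtype_map = {"fp32": "float32", "fp16": "float16", "bf16": "bfloat16"}
    if load_precision is None:
        return {}  # keep the model's original precision
    if load_precision in dtype_map:
        return {"torch_dtype": dtype_map[load_precision]}
    if load_precision in ("int8", "int4"):
        # both quantized paths require bitsandbytes
        return {"load_in_8bit": load_precision == "int8",
                "load_in_4bit": load_precision == "int4"}
    raise ValueError(f"Unsupported load_precision: {load_precision!r}")
```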
## 4. Automatic Loading by Training Strategy

`load_model()` auto-detects the checkpoint's training strategy and loads it using the appropriate method.
| Strategy | Loading Method | Resulting Model |
|---|---|---|
| `dense_lora` | LoRA merge -> dense | Dense model |
| `moe_expert_lora` | MoE structure reconstruction | MoE model (structure preserved) |
| `mixture_lora` | MixtureLoRA reconstruction | MixtureLoRA model (structure preserved) |
| `none` | Direct `from_pretrained()` load | HF model |
User code is identical regardless of strategy:

```python
# Same code for dense_lora, moe_expert_lora, and mixture_lora
result = load_model("outputs/run_20260311_163425")
response = generate(result.model, result.tokenizer, "Hello")  # generate: any generation helper
```
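The strategy table amounts to a dispatch on the detected strategy. A minimal sketch, with placeholder callables standing in for eulerforge's internal loaders:

```python
def load_by_strategy(strategy: str, checkpoint_dir: str):
    """Sketch of strategy dispatch. The lambdas are placeholders for the
    real (internal) loading routines summarized in the table above."""
    loaders = {
        "dense_lora": lambda d: f"dense model from {d}",     # LoRA merged into dense
        "moe_expert_lora": lambda d: f"moe model from {d}",  # MoE structure rebuilt
        "mixture_lora": lambda d: f"mixture model from {d}", # MixtureLoRA rebuilt
        "none": lambda d: f"hf model from {d}",              # plain from_pretrained()
    }
    if strategy not in loaders:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return loaders[strategy](checkpoint_dir)
```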
## 5. Inspecting Metadata

```python
result = load_model("outputs/run_20260311_163425")
m = result.metadata

print(f"Strategy: {m.strategy}")                       # "dense_lora"
print(f"Backbone: {m.backbone}")                       # "qwen3"
print(f"Path type: {m.path_type}")                     # "run_dir"
print(f"Checkpoint: {m.checkpoint_dir}")               # absolute path
print(f"Structure preserved: {m.structure_preserved}") # True/False
print(f"Load precision: {m.load_precision}")           # "int4" etc. or None

if m.lora_config:
    print(f"LoRA r: {m.lora_config['lora_r']}")
    print(f"LoRA alpha: {m.lora_config['lora_alpha']}")
```
### ModelMetadata Fields

| Field | Type | Description |
|---|---|---|
| `strategy` | `str` | `"dense_lora"` \| `"moe_expert_lora"` \| `"mixture_lora"` \| `"none"` |
| `backbone` | `str` | `"qwen3"` \| `"llama"` \| `"gemma3"` \| `""` |
| `path_type` | `str` | `"run_dir"` \| `"checkpoint_dir"` |
| `checkpoint_dir` | `str` | Absolute path to the actual checkpoint directory |
| `lora_config` | `dict \| None` | `{"lora_r": int, "lora_alpha": float}` |
| `structure_preserved` | `bool` | Whether MoE/MixtureLoRA structure is preserved |
| `load_precision` | `str \| None` | Load precision |
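The field table corresponds to a simple dataclass. This mirror is illustrative only (the real `ModelMetadata` lives in eulerforge; defaults here are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelMetadataSketch:
    """Illustrative mirror of the ModelMetadata fields above."""
    strategy: str                        # "dense_lora" | "moe_expert_lora" | ...
    backbone: str                        # "qwen3" | "llama" | "gemma3" | ""
    path_type: str                       # "run_dir" | "checkpoint_dir"
    checkpoint_dir: str                  # absolute path to the checkpoint
    lora_config: Optional[dict] = None   # {"lora_r": int, "lora_alpha": float}
    structure_preserved: bool = False
    load_precision: Optional[str] = None # "int4" etc., or None
```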
## 6. Inference Examples

### 6.1 Chat Generation
```python
import torch
from eulerforge import load_model

result = load_model("outputs/run_20260311_163425", load_precision="int4")

messages = [{"role": "user", "content": "What is the capital of South Korea?"}]
if getattr(result.tokenizer, "chat_template", None):
    text = result.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
else:
    text = "[USER]\nWhat is the capital of South Korea?\n\n[ASSISTANT]\n"

inputs = result.tokenizer(text, return_tensors="pt")
inputs = {k: v.to(result.model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = result.model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        pad_token_id=result.tokenizer.eos_token_id,
    )

input_len = inputs["input_ids"].shape[1]
response = result.tokenizer.decode(
    outputs[0][input_len:], skip_special_tokens=True
).strip()
print(response)
```
### 6.2 Memory Usage Comparison
```python
from eulerforge import load_model

# fp16 baseline
r1 = load_model("outputs/run_20260311_163425", load_precision="fp16")
mem1 = sum(p.nelement() * p.element_size() for p in r1.model.parameters())

# 4-bit quantization
r2 = load_model("outputs/run_20260311_163425", load_precision="int4")
mem2 = sum(p.nelement() * p.element_size() for p in r2.model.parameters())

# Parameter storage only; excludes activations and buffers
print(f"fp16: {mem1 / 1024**2:.1f} MB")
print(f"int4: {mem2 / 1024**2:.1f} MB")
print(f"Savings: {(1 - mem2/mem1) * 100:.0f}%")
```
## 7. Example Scripts

The `examples/` directory contains runnable examples:

| Example | Description | Command |
|---|---|---|
| `load_and_chat.py` | Basic chat | `python examples/load_and_chat.py <path>` |
| `load_quantized.py` | Quantized loading | `python examples/load_quantized.py <path> --precision int4` |
| `load_and_inspect.py` | Metadata inspection | `python examples/load_and_inspect.py <path>` |
| `load_batch_inference.py` | Batch inference | `python examples/load_batch_inference.py <path>` |
## 8. API Reference

```python
from eulerforge import load_model, LoadedModel, ModelMetadata

result: LoadedModel = load_model(
    path,                  # run_dir or checkpoint_dir path
    *,
    checkpoint="final",    # "final" | "best" | "latest"
    device="auto",         # "auto" | "cpu" | "cuda" | "cuda:0"
    dtype="auto",          # "auto" | "float32" | "bfloat16"
    load_precision=None,   # None | "fp32" | "fp16" | "bf16" | "int8" | "int4"
)
```
## Error Handling

All errors are raised as `ValueError` with a 3-line message format:

```text
Cannot identify EulerForge checkpoint: ...
Fix: Specify a valid run_dir or checkpoint_dir path.
See: docs/tutorials/11_bench.md
```
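The 3-line format can be produced by a small constructor. `loader_error` is a hypothetical helper used only to illustrate the message shape:

```python
def loader_error(problem: str, fix: str, see: str) -> ValueError:
    """Build a ValueError in the 3-line format: problem / Fix / See (sketch)."""
    return ValueError(f"{problem}\nFix: {fix}\nSee: {see}")

err = loader_error(
    "Cannot identify EulerForge checkpoint: outputs/typo",
    "Specify a valid run_dir or checkpoint_dir path.",
    "docs/tutorials/11_bench.md",
)
```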
## References

- Spec: `docs/fixtures/specs/loader_spec.md`
- CLI: `do../cli.md`, section "Python API"
- Tests: `tests/test_loader.py` (30 tests)