15. Loading Models

Overview

eulerforge.load_model() is a public API that lets you load EulerForge-trained checkpoints in a single line. It automatically detects the training strategy (plain LoRA, MoE, MixtureLoRA) and selects the appropriate loading method.


1. Basic Usage

```python
from eulerforge import load_model

result = load_model("outputs/run_20260311_163425")

# Return value
result.model       # nn.Module (eval mode)
result.tokenizer   # PreTrainedTokenizer
result.metadata    # ModelMetadata
```

load_model() auto-detects the path type:

| Path Type | Detection Criterion | Example |
| --- | --- | --- |
| run_dir | resolved_config.json exists | outputs/run_20260311_163425 |
| checkpoint_dir | config.json exists | outputs/run_20260311_163425/final |
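
The detection rule in the table can be sketched with plain pathlib. The function below is an illustration of the documented behavior, not the actual EulerForge implementation:

```python
# Illustrative sketch of the documented path-type detection rule;
# the real load_model() internals may differ.
from pathlib import Path

def detect_path_type(path: str) -> str:
    """Return "run_dir" or "checkpoint_dir" per the detection table."""
    p = Path(path)
    if (p / "resolved_config.json").is_file():
        return "run_dir"            # a training run directory
    if (p / "config.json").is_file():
        return "checkpoint_dir"     # a single checkpoint directory
    raise ValueError(f"Cannot identify EulerForge checkpoint: {p}")
```

Note that resolved_config.json is checked first, so a run directory that happens to contain a config.json is still classified as a run_dir.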

2. Checkpoint Selection

When specifying a run_dir, you can select a checkpoint with the checkpoint parameter.

```python
# final (default)
result = load_model("outputs/run_20260311_163425", checkpoint="final")

# best checkpoint
result = load_model("outputs/run_20260311_163425", checkpoint="best")

# latest checkpoint
result = load_model("outputs/run_20260311_163425", checkpoint="latest")
```

| checkpoint | Subdirectory |
| --- | --- |
| "final" | {run_dir}/final/ |
| "best" | {run_dir}/checkpoint-best/ |
| "latest" | {run_dir}/checkpoint-latest/ |

If you specify a checkpoint_dir directly, the checkpoint parameter is ignored:

```python
result = load_model("outputs/run_20260311_163425/final")
```
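
The name-to-subdirectory mapping above can be pictured as a small helper. This is a hypothetical sketch for illustration, not EulerForge's actual resolution code:

```python
# Hypothetical checkpoint-name resolution mirroring the table above;
# not EulerForge's actual internals.
from pathlib import Path

_CHECKPOINT_SUBDIRS = {
    "final": "final",
    "best": "checkpoint-best",
    "latest": "checkpoint-latest",
}

def resolve_checkpoint_dir(run_dir: str, checkpoint: str = "final") -> Path:
    """Map a checkpoint name to the subdirectory load_model() would use."""
    try:
        return Path(run_dir) / _CHECKPOINT_SUBDIRS[checkpoint]
    except KeyError:
        raise ValueError(f"Unknown checkpoint: {checkpoint!r}") from None
```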

3. Load Precision (Quantization)

Use the load_precision parameter to specify the model's loading precision.

```python
# 4-bit quantization (NF4 + double quantization)
result = load_model("outputs/run_20260311_163425", load_precision="int4")

# 8-bit quantization
result = load_model("outputs/run_20260311_163425", load_precision="int8")

# bfloat16 (Ampere+ GPUs)
result = load_model("outputs/run_20260311_163425", load_precision="bf16")
```

Supported Precisions

| Value | Description | Requirements |
| --- | --- | --- |
| None | Model's original precision (default) | -- |
| "fp32" | float32 | -- |
| "fp16" | float16 | GPU recommended |
| "bf16" | bfloat16 | Ampere+ GPU |
| "int8" | 8-bit quantization | bitsandbytes |
| "int4" | 4-bit NF4 quantization | bitsandbytes |
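
Conceptually, each load_precision value selects either a cast dtype or a quantized path. The sketch below illustrates that mapping; the option names are hypothetical and not EulerForge internals:

```python
# Hypothetical sketch of how a load_precision string could translate
# into loading options; names are illustrative, not EulerForge's source.

def precision_options(load_precision=None):
    """Map a load_precision value to a dict of loading options."""
    if load_precision is None:
        return {}                                # keep the original precision
    dtypes = {"fp32": "float32", "fp16": "float16", "bf16": "bfloat16"}
    if load_precision in dtypes:
        return {"torch_dtype": dtypes[load_precision]}
    if load_precision in ("int8", "int4"):
        # quantized paths require bitsandbytes
        return {"quantize_bits": int(load_precision[3:])}
    raise ValueError(f"Unsupported load_precision: {load_precision!r}")
```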

int4 Quantization Configuration

When int4 is specified, the following BitsAndBytesConfig is automatically applied:

```python
import torch
from transformers import BitsAndBytesConfig

BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
```

4. Automatic Loading by Training Strategy

load_model() auto-detects the checkpoint's training strategy and loads it using the appropriate method.

| Strategy | Loading Method | Resulting Model |
| --- | --- | --- |
| dense_lora | LoRA merge -> dense | Dense model |
| moe_expert_lora | MoE structure reconstruction | MoE model (structure preserved) |
| mixture_lora | MixtureLoRA reconstruction | MixtureLoRA model (structure preserved) |
| none | Direct from_pretrained() load | HF model |

User code is identical regardless of strategy:

```python
# Same code for dense_lora, moe_expert_lora, and mixture_lora;
# generate() stands for your own helper (see 6.1 for full generation code)
result = load_model("outputs/run_20260311_163425")
response = generate(result.model, result.tokenizer, "Hello")
```

5. Inspecting Metadata

```python
result = load_model("outputs/run_20260311_163425")
m = result.metadata

print(f"Strategy: {m.strategy}")              # "dense_lora"
print(f"Backbone: {m.backbone}")              # "qwen3"
print(f"Path type: {m.path_type}")            # "run_dir"
print(f"Checkpoint: {m.checkpoint_dir}")      # absolute path
print(f"Structure preserved: {m.structure_preserved}")  # True/False
print(f"Load precision: {m.load_precision}")  # "int4" etc. or None

if m.lora_config:
    print(f"LoRA r: {m.lora_config['lora_r']}")
    print(f"LoRA alpha: {m.lora_config['lora_alpha']}")
```

ModelMetadata Fields

| Field | Type | Description |
| --- | --- | --- |
| strategy | str | "dense_lora" \| "moe_expert_lora" \| "mixture_lora" \| "none" |
| backbone | str | "qwen3" \| "llama" \| "gemma3" \| "" |
| path_type | str | "run_dir" \| "checkpoint_dir" |
| checkpoint_dir | str | Absolute path to the actual checkpoint directory |
| lora_config | dict \| None | {"lora_r": int, "lora_alpha": float} |
| structure_preserved | bool | Whether MoE/MixtureLoRA structure is preserved |
| load_precision | str \| None | Load precision |
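
The fields map naturally onto a dataclass. The sketch below mirrors the table for reference; it is illustrative, not the actual ModelMetadata definition:

```python
# Illustrative mirror of the ModelMetadata fields table; the real class
# in eulerforge may differ in defaults and carry extra fields.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelMetadataSketch:
    strategy: str                        # "dense_lora" | "moe_expert_lora" | ...
    backbone: str                        # "qwen3" | "llama" | "gemma3" | ""
    path_type: str                       # "run_dir" | "checkpoint_dir"
    checkpoint_dir: str                  # absolute checkpoint path
    lora_config: Optional[dict] = None   # {"lora_r": int, "lora_alpha": float}
    structure_preserved: bool = False
    load_precision: Optional[str] = None
```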

6. Inference Examples

6.1 Chat Generation

```python
import torch
from eulerforge import load_model

result = load_model("outputs/run_20260311_163425", load_precision="int4")

messages = [{"role": "user", "content": "What is the capital of South Korea?"}]

if getattr(result.tokenizer, "chat_template", None):
    text = result.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
else:
    text = "[USER]\nWhat is the capital of South Korea?\n\n[ASSISTANT]\n"

inputs = result.tokenizer(text, return_tensors="pt")
inputs = {k: v.to(result.model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = result.model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        pad_token_id=result.tokenizer.eos_token_id,
    )

input_len = inputs["input_ids"].shape[1]
response = result.tokenizer.decode(
    outputs[0][input_len:], skip_special_tokens=True
).strip()
print(response)
```

6.2 Memory Usage Comparison

```python
from eulerforge import load_model

# Original precision (fp16)
r1 = load_model("outputs/run_20260311_163425", load_precision="fp16")
mem1 = sum(p.nelement() * p.element_size() for p in r1.model.parameters())

# 4-bit quantization; bitsandbytes stores weights packed, so
# nelement() * element_size() reflects the packed 4-bit storage
r2 = load_model("outputs/run_20260311_163425", load_precision="int4")
mem2 = sum(p.nelement() * p.element_size() for p in r2.model.parameters())

print(f"fp16: {mem1 / 1024**2:.1f} MB")
print(f"int4: {mem2 / 1024**2:.1f} MB")
print(f"Savings: {(1 - mem2/mem1) * 100:.0f}%")
```

7. Example Scripts

The examples/ directory contains runnable examples:

| Example | Description | Command |
| --- | --- | --- |
| load_and_chat.py | Basic chat | python examples/load_and_chat.py <path> |
| load_quantized.py | Quantized loading | python examples/load_quantized.py <path> --precision int4 |
| load_and_inspect.py | Metadata inspection | python examples/load_and_inspect.py <path> |
| load_batch_inference.py | Batch inference | python examples/load_batch_inference.py <path> |

8. API Reference

```python
from eulerforge import load_model, LoadedModel, ModelMetadata

result: LoadedModel = load_model(
    path,                      # run_dir or checkpoint_dir path
    *,
    checkpoint="final",        # "final" | "best" | "latest"
    device="auto",             # "auto" | "cpu" | "cuda" | "cuda:0"
    dtype="auto",              # "auto" | "float32" | "bfloat16"
    load_precision=None,       # None | "fp32" | "fp16" | "bf16" | "int8" | "int4"
)
```

Error Handling

All errors are raised as ValueError with a three-line message: the problem, a suggested fix, and a documentation pointer.

```
Cannot identify EulerForge checkpoint: ...
Fix: Specify a valid run_dir or checkpoint_dir path.
See: docs/tutorials/11_bench.md
```
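
For illustration, a hypothetical helper that produces a message in this shape (not part of the EulerForge API):

```python
# Hypothetical helper mirroring the documented three-line error format;
# EulerForge's real error construction may differ.

def three_line_error(problem: str, fix: str, see: str) -> ValueError:
    """Build a ValueError whose message reads:
    line 1 = the problem, line 2 = "Fix: ...", line 3 = "See: ..."."""
    return ValueError(f"{problem}\nFix: {fix}\nSee: {see}")
```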
