# 15. Loading Models

## Overview

`eulerforge.load_model()` is a public API that loads EulerForge-trained checkpoints in a single line. It automatically detects the training strategy (dense LoRA, MoE expert LoRA, MixtureLoRA, or none) and selects the appropriate loading method.
- Suitable for: Post-training inference, model inspection, batch generation, quantized loading
- Compatible strategies: `dense_lora`, `moe_expert_lora`, `mixture_lora`, `none` (all auto-detected)
- Reference examples: `examples/load_and_chat.py`, `examples/load_quantized.py`
## Prerequisites
- EulerForge installed
- A trained checkpoint (`run_dir` or `checkpoint_dir`)
- (Optional) For int4/int8 quantization: `pip install bitsandbytes`
## 1. Basic Usage

```python
from eulerforge import load_model

result = load_model("outputs/run_20260311_163425")

# Return value
result.model      # nn.Module (eval mode)
result.tokenizer  # PreTrainedTokenizer
result.metadata   # ModelMetadata
```
`load_model()` auto-detects the path type:

| Path Type | Detection Criterion | Example |
|---|---|---|
| `run_dir` | `resolved_config.json` exists | `outputs/run_20260311_163425` |
| `checkpoint_dir` | `config.json` exists | `outputs/run_20260311_163425/final` |
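The detection rule can be sketched as a small helper. This is purely illustrative; `detect_path_type` is a hypothetical name, not part of the eulerforge API:

```python
from pathlib import Path

def detect_path_type(path: str) -> str:
    """Classify a path as run_dir or checkpoint_dir (illustrative sketch)."""
    p = Path(path)
    if (p / "resolved_config.json").is_file():
        return "run_dir"
    if (p / "config.json").is_file():
        return "checkpoint_dir"
    raise ValueError(f"Cannot identify EulerForge checkpoint: {path}")
```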
## 2. Checkpoint Selection

When specifying a `run_dir`, you can select a checkpoint with the `checkpoint` parameter.
```python
# final (default)
result = load_model("outputs/run_20260311_163425", checkpoint="final")

# best checkpoint
result = load_model("outputs/run_20260311_163425", checkpoint="best")

# latest checkpoint
result = load_model("outputs/run_20260311_163425", checkpoint="latest")
```
| checkpoint | Subdirectory |
|---|---|
| `"final"` | `{run_dir}/final/` |
| `"best"` | `{run_dir}/checkpoint-best/` |
| `"latest"` | `{run_dir}/checkpoint-latest/` |
If you specify a `checkpoint_dir` directly, the `checkpoint` parameter is ignored:

```python
result = load_model("outputs/run_20260311_163425/final")
```
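The name-to-subdirectory mapping in the table above can be sketched as follows. `resolve_checkpoint_dir` is a hypothetical helper for illustration, not eulerforge's internal code:

```python
from pathlib import Path

# Subdirectory for each checkpoint name (from the table above)
_CHECKPOINT_SUBDIRS = {
    "final": "final",
    "best": "checkpoint-best",
    "latest": "checkpoint-latest",
}

def resolve_checkpoint_dir(run_dir: str, checkpoint: str = "final") -> str:
    """Map a checkpoint name to its subdirectory under run_dir (sketch)."""
    if checkpoint not in _CHECKPOINT_SUBDIRS:
        raise ValueError(f"Unknown checkpoint: {checkpoint!r}")
    return str(Path(run_dir) / _CHECKPOINT_SUBDIRS[checkpoint])
```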
## 3. Load Precision (Quantization)

Use the `load_precision` parameter to specify the model's loading precision.
```python
# 4-bit quantization (NF4 + double quantization)
result = load_model("outputs/run_20260311_163425", load_precision="int4")

# 8-bit quantization
result = load_model("outputs/run_20260311_163425", load_precision="int8")

# bfloat16 (Ampere+ GPUs)
result = load_model("outputs/run_20260311_163425", load_precision="bf16")
```
### Supported Precisions

| Value | Description | Requirements |
|---|---|---|
| `None` | Model's original precision (default) | -- |
| `"fp32"` | float32 | -- |
| `"fp16"` | float16 | GPU recommended |
| `"bf16"` | bfloat16 | Ampere+ GPU |
| `"int8"` | 8-bit quantization | bitsandbytes |
| `"int4"` | 4-bit NF4 quantization | bitsandbytes |
### int4 Quantization Configuration

When `int4` is specified, the following `BitsAndBytesConfig` is applied automatically:

```python
BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
```
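One way the precision table might translate into loading keyword arguments is sketched below. `precision_kwargs` is an assumed name, not part of eulerforge; string dtype names stand in for the real torch dtypes to keep the sketch framework-agnostic:

```python
def precision_kwargs(load_precision=None):
    """Sketch: translate a load_precision string into from_pretrained-style
    keyword arguments (hypothetical helper; real loader may differ)."""
    dtype_map = {"fp32": "float32", "fp16": "float16", "bf16": "bfloat16"}
    if load_precision is None:
        return {}  # keep the model's original precision
    if load_precision in dtype_map:
        return {"torch_dtype": dtype_map[load_precision]}
    if load_precision in ("int8", "int4"):
        # both quantized paths require bitsandbytes
        return {"load_in_8bit": load_precision == "int8",
                "load_in_4bit": load_precision == "int4"}
    raise ValueError(f"Unsupported load_precision: {load_precision!r}")
```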
## 4. Automatic Loading by Training Strategy

`load_model()` auto-detects the checkpoint's training strategy and loads it using the appropriate method.
| Strategy | Loading Method | Resulting Model |
|---|---|---|
| `dense_lora` | LoRA merge -> dense | Dense model |
| `moe_expert_lora` | MoE structure reconstruction | MoE model (structure preserved) |
| `mixture_lora` | MixtureLoRA reconstruction | MixtureLoRA model (structure preserved) |
| `none` | Direct `from_pretrained()` load | HF model |
User code is identical regardless of strategy:

```python
# Same code for dense_lora, moe_expert_lora, and mixture_lora
result = load_model("outputs/run_20260311_163425")
response = generate(result.model, result.tokenizer, "Hello")  # generate: any generation helper
```
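The strategy table amounts to a dispatch on the detected strategy. A minimal sketch, with placeholder callables standing in for eulerforge's internal loaders:

```python
def load_by_strategy(strategy: str, checkpoint_dir: str):
    """Sketch of strategy dispatch. The lambdas are placeholders for the
    real (internal) loading routines summarized in the table above."""
    loaders = {
        "dense_lora": lambda d: f"dense model from {d}",     # LoRA merged into dense
        "moe_expert_lora": lambda d: f"moe model from {d}",  # MoE structure rebuilt
        "mixture_lora": lambda d: f"mixture model from {d}", # MixtureLoRA rebuilt
        "none": lambda d: f"hf model from {d}",              # plain from_pretrained()
    }
    if strategy not in loaders:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return loaders[strategy](checkpoint_dir)
```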
## 5. Inspecting Metadata

```python
result = load_model("outputs/run_20260311_163425")
m = result.metadata

print(f"Strategy: {m.strategy}")                       # "dense_lora"
print(f"Backbone: {m.backbone}")                       # "qwen3"
print(f"Path type: {m.path_type}")                     # "run_dir"
print(f"Checkpoint: {m.checkpoint_dir}")               # absolute path
print(f"Structure preserved: {m.structure_preserved}") # True/False
print(f"Load precision: {m.load_precision}")           # "int4" etc. or None

if m.lora_config:
    print(f"LoRA r: {m.lora_config['lora_r']}")
    print(f"LoRA alpha: {m.lora_config['lora_alpha']}")
```
### ModelMetadata Fields

| Field | Type | Description |
|---|---|---|
| `strategy` | `str` | `"dense_lora"` \| `"moe_expert_lora"` \| `"mixture_lora"` \| `"none"` |
| `backbone` | `str` | `"qwen3"` \| `"llama"` \| `"gemma3"` \| `""` |
| `path_type` | `str` | `"run_dir"` \| `"checkpoint_dir"` |
| `checkpoint_dir` | `str` | Absolute path to the actual checkpoint directory |
| `lora_config` | `dict \| None` | `{"lora_r": int, "lora_alpha": float}` |
| `structure_preserved` | `bool` | Whether MoE/MixtureLoRA structure is preserved |
| `load_precision` | `str \| None` | Load precision |
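The field table corresponds to a simple dataclass. This mirror is illustrative only (the real `ModelMetadata` lives in eulerforge; defaults here are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelMetadataSketch:
    """Illustrative mirror of the ModelMetadata fields above."""
    strategy: str                        # "dense_lora" | "moe_expert_lora" | ...
    backbone: str                        # "qwen3" | "llama" | "gemma3" | ""
    path_type: str                       # "run_dir" | "checkpoint_dir"
    checkpoint_dir: str                  # absolute path to the checkpoint
    lora_config: Optional[dict] = None   # {"lora_r": int, "lora_alpha": float}
    structure_preserved: bool = False
    load_precision: Optional[str] = None # "int4" etc., or None
```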
## 6. Inference Examples

### 6.1 Chat Generation
```python
import torch
from eulerforge import load_model

result = load_model("outputs/run_20260311_163425", load_precision="int4")

messages = [{"role": "user", "content": "What is the capital of South Korea?"}]
if getattr(result.tokenizer, "chat_template", None):
    text = result.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
else:
    text = "[USER]\nWhat is the capital of South Korea?\n\n[ASSISTANT]\n"

inputs = result.tokenizer(text, return_tensors="pt")
inputs = {k: v.to(result.model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = result.model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        pad_token_id=result.tokenizer.eos_token_id,
    )

input_len = inputs["input_ids"].shape[1]
response = result.tokenizer.decode(
    outputs[0][input_len:], skip_special_tokens=True
).strip()
print(response)
```
### 6.2 Memory Usage Comparison
```python
from eulerforge import load_model

# fp16 baseline
r1 = load_model("outputs/run_20260311_163425", load_precision="fp16")
mem1 = sum(p.nelement() * p.element_size() for p in r1.model.parameters())

# 4-bit quantization
r2 = load_model("outputs/run_20260311_163425", load_precision="int4")
mem2 = sum(p.nelement() * p.element_size() for p in r2.model.parameters())

# Parameter storage only; excludes activations and buffers
print(f"fp16: {mem1 / 1024**2:.1f} MB")
print(f"int4: {mem2 / 1024**2:.1f} MB")
print(f"Savings: {(1 - mem2/mem1) * 100:.0f}%")
```
## 7. Example Scripts

The `examples/` directory contains runnable examples:

| Example | Description | Command |
|---|---|---|
| `load_and_chat.py` | Basic chat | `python examples/load_and_chat.py <path>` |
| `load_quantized.py` | Quantized loading | `python examples/load_quantized.py <path> --precision int4` |
| `load_and_inspect.py` | Metadata inspection | `python examples/load_and_inspect.py <path>` |
| `load_batch_inference.py` | Batch inference | `python examples/load_batch_inference.py <path>` |
## 8. API Reference

```python
from eulerforge import load_model, LoadedModel, ModelMetadata

result: LoadedModel = load_model(
    path,                  # run_dir or checkpoint_dir path
    *,
    checkpoint="final",    # "final" | "best" | "latest"
    device="auto",         # "auto" | "cpu" | "cuda" | "cuda:0"
    dtype="auto",          # "auto" | "float32" | "bfloat16"
    load_precision=None,   # None | "fp32" | "fp16" | "bf16" | "int8" | "int4"
)
```
## Error Handling

All errors are raised as `ValueError` with a 3-line message format:

```text
Cannot identify EulerForge checkpoint: ...
Fix: Specify a valid run_dir or checkpoint_dir path.
See: docs/tutorials/11_bench.md
```
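The 3-line format can be produced by a small constructor. `loader_error` is a hypothetical helper used only to illustrate the message shape:

```python
def loader_error(problem: str, fix: str, see: str) -> ValueError:
    """Build a ValueError in the 3-line format: problem / Fix / See (sketch)."""
    return ValueError(f"{problem}\nFix: {fix}\nSee: {see}")

err = loader_error(
    "Cannot identify EulerForge checkpoint: outputs/typo",
    "Specify a valid run_dir or checkpoint_dir path.",
    "docs/tutorials/11_bench.md",
)
```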
## References

- Spec: `docs/fixtures/specs/loader_spec.md`
- CLI: `do../cli.md`, section "Python API"
- Tests: `tests/test_loader.py` (30 tests)