# 6. Sanity Training Loop
## ⚠️ Scope warning — use other tools for real training
EulerStack is not a training framework. The "sanity training" described here is a regression smoke test — a 30-second CPU/GPU check that a new mixer or primitive hasn't structurally broken the architecture. That is its only purpose.
Never use this helper for real LLM training — hours-to-weeks pretraining, SFT, or RLHF — because of how much is missing here.
### What is missing from EulerStack sanity training
| Feature | Why it's missing |
|---|---|
| Distributed training (DDP / FSDP / TP / PP) | Needs a dedicated launcher → accelerate / Megatron |
| Gradient accumulation, a real mixed-precision scaler | No production GradScaler integration |
| LR schedulers (warmup + cosine, WSD, …) | Just a single AdamW at fixed LR |
| Gradient clipping, weight-decay tuning | No production recipe |
| Checkpointing, resume, best-model selection | 20 steps and it's done |
| Eval loop, validation loss, perplexity tracking | Minimal logging only |
| Sampling weights, curriculum, multi-dataset mixing | Single JSONL only |
| RLHF (PPO / DPO / GRPO), SFT-specific formatting | Entirely absent |
| WandB / TensorBoard / MLflow integration | None |
| HPO (Optuna, Ray Tune) | None |
| Model merging, pruning, quantization-aware training | None |
The point: "training runs" means two very different things.
- What EulerStack owns — "the structure declared by YAML actually forwards/backwards under PyTorch" = structural verification
- What dedicated training frameworks own — "this model achieves quality on real benchmarks" = real training
This tutorial is purely the former.
### Recommended tools for real training
An HF model exported from EulerStack YAML is a standard HF
PreTrainedModel, so it plugs into every tool below. Pick by use case:
| Use case | Tool | One-liner |
|---|---|---|
| Pretraining (single to a few nodes) | HF Trainer, Composer, Levanter | DDP / FSDP / mixed precision |
| Pretraining (HPC, 100s–1000s of GPUs) | TorchTitan, Megatron-LM, GPT-NeoX | 3D parallelism, activation checkpointing, ZeRO |
| SFT / fine-tune | Axolotl, LLaMA-Factory, torchtune | chat templates, packing, QLoRA |
| Parameter-efficient | PEFT, bitsandbytes, Unsloth | LoRA / QLoRA / IA³ / adapters |
| RLHF / preference | TRL, OpenRLHF, Axolotl-GRPO | PPO / DPO / IPO / GRPO / RLOO |
| Real HPO | Optuna, Ray Tune | Bayesian / multi-objective |
| Experiment tracking | Weights & Biases, MLflow, ClearML | metrics + artifacts + run compare |
Recommended workflow:

```
EulerStack YAML spec
  │ compile → save_pretrained
  ▼
HF model directory (standard)
  │ AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True)
  ▼
Axolotl / Megatron / HF Trainer / TRL / torchtune / … (pick one)
  │
  ▼
trained model
```
EulerStack owns only the top box.
See Tutorial 0: Where EulerStack Fits for the full positioning argument.
### What follows — the internal sanity regression helper
The remainder of this tutorial documents a smoke test used only by this repository's own CI. It exists to answer three questions quickly whenever you add a new mixer or primitive:
- Does the structure link together? — does forward run with consistent shapes?
- Are grads finite? — no NaN / Inf after backward?
- Does loss show a descent signal? — within 20 steps, is there a downward trend (need not be monotone)?
All three green means "this YAML is eligible to be fed to an HF-compatible trainer." Anything beyond that — real quality, convergence metrics, scale-up stability — is not what this loop answers.
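The three checks can be sketched in plain Python. This is a hypothetical illustration of the criteria, not EulerStack's actual implementation; the helper names `grads_are_finite` and `loss_shows_descent` are invented for this sketch:

```python
import math

def grads_are_finite(grads):
    """True if every gradient value is a finite number (no NaN / Inf)."""
    return all(math.isfinite(g) for g in grads)

def loss_shows_descent(losses, window=5):
    """Descent signal: the mean of the last `window` losses is below the
    mean of the first `window`. This tolerates non-monotone noise, unlike
    a strict step-by-step decrease check."""
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return tail < head

# A noisy but descending loss curve still passes:
losses = [5.5, 5.6, 5.2, 5.3, 4.9, 4.6, 4.7, 4.2, 3.9, 3.5]
print(grads_are_finite([0.1, -0.2, 0.0]))  # True
print(loss_shows_descent(losses))          # True
```

Comparing windowed means rather than requiring every step to decrease is what makes the descent check tolerant of normal SGD noise within 20 steps.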
## Two Sanity Modes
EulerStack provides two sanity modes. Pick based on context.
### CPU-Friendly Fast Sanity
For CI or machines without a GPU. It uses a sanity variant of the model — same topology but parameter count reduced to ~50K — so the run finishes in a few seconds on CPU.
```python
from eulerstack.training.sanity import build_sanity_model, run_sanity_training
from eulerstack.data.prepare import TokenizedDataset
import torch

# Same topology, small scale
model = build_sanity_model(
    vocab_size=256, d_model=64, n_heads=4, n_layers=4, max_seq_len=64,
)

# Simple pattern data (no real text needed)
data = TokenizedDataset(torch.arange(0, 32).unsqueeze(0).repeat(128, 1))

result = run_sanity_training(
    model, data, batch_size=16, max_steps=50, seed=42,
)
print(result.summary())
```
Healthy output looks like:
```
Sanity training: PASS
Steps: 50
Initial loss: 5.5432
Final loss: 3.1201
Loss decreased: True
```
This mode is a smoke test that confirms the architecture implementation is not fundamentally broken. Run it first whenever you add or modify a mixer or block implementation.
### GPU E2E Sanity (Real Presets)
Builds a real preset model (0.8B+) on GPU and trains briefly on dolma data. This is closest to a production setup.
```python
import yaml, torch

from eulerstack.ir.normalizer import normalize_to_ir
from eulerstack.training.sanity import build_model_from_ir, run_e2e_training
from eulerstack.data.prepare import tokenize_jsonl, TokenizedDataset

# Load preset
with open("configs/presets/arch_advanced_jamba.yml") as f:
    ir = normalize_to_ir(yaml.safe_load(f))

# Build model on GPU
model = build_model_from_ir(ir, device="cuda:1", dtype=torch.bfloat16)

# Prepare data (cached)
raw = tokenize_jsonl(
    jsonl_path="data/dolma_10k.jsonl",
    tokenizer_name="gpt2",
    max_seq_len=512,
    num_rows=1000,
)
dataset = TokenizedDataset(raw.input_ids, vocab_size=ir.model.vocab_size)

# Train 20 steps
result = run_e2e_training(
    model, dataset,
    batch_size=2, lr=1e-4, max_steps=20,
    device="cuda:1", use_amp=True, seed=42,
)
print(result.summary())
```
If this mode passes, the preset is ready for full training.
### Running via pytest (Gated E2E Tests)
Both sanity modes can also be invoked as pytest tests, with environment variables gating them. This is useful in CI and regression testing during development.
```bash
# All 8 llm_ presets ≤ 2B (simple / mistral / jamba / moe × 2 sizes)
RUN_LLM_E2E=1 python -m pytest tests/integration/llm_e2e/ --llm-presets=all -v -s

# All 17 arch_ presets
RUN_ARCH_E2E=1 python -m pytest tests/integration/arch_e2e/ --arch-presets=all -v -s

# All 6 arch_expert_*_mini presets
RUN_EXPERT_MINI_E2E=1 python -m pytest tests/integration/expert_mini_e2e/ -v -s

# Custom steps and learning rate
RUN_ARCH_E2E=1 python -m pytest tests/integration/arch_e2e/ \
    --arch-steps=50 --arch-lr=1e-4 -v -s

# Specific GPU (default is cuda:1)
RUN_ARCH_E2E=1 EULERSTACK_DEVICE=cuda:0 python -m pytest tests/integration/arch_e2e/ -v -s
```
Each gate variable (RUN_LLM_E2E, RUN_ARCH_E2E, RUN_EXPERT_MINI_E2E) must
be set for the corresponding tests to run. Without it, tests are marked
SKIPPED and the reason message tells you which variable to set.
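The gating behaviour can be sketched with the standard library alone. This is a hypothetical illustration of the pattern, not the repository's actual conftest code; `gate_reason` is an invented helper:

```python
import os

def gate_reason(var_name: str):
    """Return None when the gated suite should run, otherwise the
    skip-reason string a test runner would report."""
    if os.environ.get(var_name) == "1":
        return None
    return f"set {var_name}=1 to run this suite"

os.environ.pop("RUN_ARCH_E2E", None)
print(gate_reason("RUN_ARCH_E2E"))  # set RUN_ARCH_E2E=1 to run this suite

os.environ["RUN_ARCH_E2E"] = "1"
print(gate_reason("RUN_ARCH_E2E"))  # None, i.e. the suite runs
```

In a real suite a reason string like this would feed `pytest.mark.skipif`, which is why an unset variable shows up as SKIPPED with an actionable message.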
Detailed per-suite documentation:
- `tests/integration/arch_e2e/` — `docs/architectures/tutorials/arch_e2e_guide.md`
- `tests/integration/llm_e2e/` — `docs/architectures/tutorials/llm_e2e_guide.md`
- `tests/integration/expert_mini_e2e/` — Mini section in `arch_e2e_guide.md`
## Full Preset vs Sanity Variant
The two modes compared at a glance:
| Aspect | Full preset | Sanity variant |
|---|---|---|
| `d_model` | 1024–2048 | 64 |
| `vocab_size` | 32000 | 256 |
| Parameters | 0.8B–2B+ | ~50K |
| Hardware | GPU required | CPU, < 5 seconds |
| Purpose | Production verification | Architecture smoke test |
A reasonable development cycle: run CPU sanity after code changes, run GPU E2E sanity before committing, then launch full training.
## After sanity passes — the path to real training
Once sanity is green, EulerStack's job is done. From here on you are in the territory of external training tools.
```bash
# Step 1: export the EulerStack YAML as a standard HF directory
eulerstack compile \
    --preset configs/presets/arch_advanced_jamba.yml \
    --output-dir ./my_model

# Step 2: from here on, use an external trainer — a few examples:

# (A) Axolotl for SFT
accelerate launch -m axolotl.cli.train axolotl_config.yaml
# axolotl_config.yaml needs `base_model: ./my_model`

# (B) HF Trainer for a simple pretrain
python train_with_hf_trainer.py \
    --model_name_or_path ./my_model \
    --dataset_name wikitext \
    --dataset_config wikitext-2-raw-v1 \
    --do_train --per_device_train_batch_size 2

# (C) TRL for DPO
python train_dpo.py --model ./my_model --dataset anthropic/hh-rlhf

# (D) TorchTitan / Megatron-LM — follow each tool's official distributed recipe
```
The exported artifact is a standard HuggingFace PreTrainedModel so it
drops into PEFT, TRL, vLLM, TGI, SGLang, TensorRT-LLM — anything that
consumes the HF contract.
## Next steps
- Tutorial 0: Where EulerStack Fits — why training is not included and where the boundary to other tools sits
- Tutorial 2: Use Presets — revisit the catalogue
- Tutorial 4: Compile and Explain — HF export in detail
- Tutorial 7: arch walkthrough — beginner→expert presets
- Tutorial 10: Paper → YAML — port four papers into YAML
- `examples/01_compile_and_export.py` — full export pipeline script
- `examples/02_load_and_generate.py` — load and generate text