# 6. Sanity Training Loop
## ⚠️ Scope warning — use other tools for real training
EulerStack is not a training framework. The "sanity training" described here is a regression smoke test — a 30-second CPU/GPU check that a new mixer or primitive hasn't structurally broken the architecture. That is its only purpose.
Never use this helper for real LLM training — hours-to-weeks pretraining, SFT, or RLHF — because of how much is missing here.
### What is missing from EulerStack sanity training
| Feature | Why it's missing |
|---|---|
| Distributed training (DDP / FSDP / TP / PP) | Needs a dedicated launcher → accelerate / Megatron |
| Gradient accumulation, a real mixed-precision scaler | No production GradScaler integration |
| LR schedulers (warmup + cosine, WSD, …) | Just a single AdamW at fixed LR |
| Gradient clipping, weight-decay tuning | No production recipe |
| Checkpointing, resume, best-model selection | 20 steps and it's done |
| Eval loop, validation loss, perplexity tracking | Minimal logging only |
| Sampling weights, curriculum, multi-dataset mixing | Single JSONL only |
| RLHF (PPO / DPO / GRPO), SFT-specific formatting | Entirely absent |
| WandB / TensorBoard / MLflow integration | None |
| HPO (Optuna, Ray Tune) | None |
| Model merging, pruning, quantization-aware training | None |
The point: "training runs" means two very different things.
- What EulerStack owns — "the structure declared by YAML actually forwards/backwards under PyTorch" = structural verification
- What dedicated training frameworks own — "this model achieves quality on real benchmarks" = real training
This tutorial is purely the former.
### Recommended tools for real training
An HF model exported from EulerStack YAML is a standard HF
PreTrainedModel, so it plugs into every tool below. Pick by use case:
| Use case | Tool | One-liner |
|---|---|---|
| Pretraining (single to a few nodes) | HF Trainer, Composer, Levanter | DDP / FSDP / mixed precision |
| Pretraining (HPC, 100s–1000s of GPUs) | TorchTitan, Megatron-LM, GPT-NeoX | 3D parallelism, activation checkpointing, ZeRO |
| SFT / fine-tune | Axolotl, LLaMA-Factory, torchtune | chat templates, packing, QLoRA |
| Parameter-efficient | PEFT, bitsandbytes, Unsloth | LoRA / QLoRA / IA³ / adapters |
| RLHF / preference | TRL, OpenRLHF, Axolotl-GRPO | PPO / DPO / IPO / GRPO / RLOO |
| Real HPO | Optuna, Ray Tune | Bayesian / multi-objective |
| Experiment tracking | Weights & Biases, MLflow, ClearML | metrics + artifacts + run compare |
Recommended workflow:

```
EulerStack YAML spec
  │ compile → save_pretrained
  ▼
HF model directory (standard)
  │ AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True)
  ▼
Axolotl / Megatron / HF Trainer / TRL / torchtune / … (pick one)
  │
  ▼
trained model
```
EulerStack owns only the top box.
See Tutorial 0: Where EulerStack Fits for the full positioning argument.
### What follows — the internal sanity regression helper
The remainder of this tutorial documents a smoke test used only by this repository's own CI. It exists to answer three questions quickly whenever you add a new mixer or primitive:
- Does the structure link together? — does forward run with consistent shapes?
- Are grads finite? — no NaN / Inf after backward?
- Does loss show a descent signal? — within 20 steps, is there a downward trend (need not be monotone)?
All three green means "this YAML is eligible to be fed to an HF-compatible trainer." Anything beyond that — real quality, convergence metrics, scale-up stability — is not what this loop answers.
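The three checks can be sketched in plain Python. This is a hypothetical illustration of the criteria, not EulerStack's actual implementation; the helper names `grads_are_finite` and `loss_shows_descent` are invented for this sketch:

```python
import math

def grads_are_finite(grads):
    """True if every gradient value is a finite number (no NaN / Inf)."""
    return all(math.isfinite(g) for g in grads)

def loss_shows_descent(losses, window=5):
    """Descent signal: the mean of the last `window` losses is below the
    mean of the first `window`. This tolerates non-monotone noise, unlike
    a strict step-by-step decrease check."""
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return tail < head

# A noisy but descending loss curve still passes:
losses = [5.5, 5.6, 5.2, 5.3, 4.9, 4.6, 4.7, 4.2, 3.9, 3.5]
print(grads_are_finite([0.1, -0.2, 0.0]))  # True
print(loss_shows_descent(losses))          # True
```

Comparing windowed means rather than requiring every step to decrease is what makes the descent check tolerant of normal SGD noise within 20 steps.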
## Two Sanity Modes
EulerStack provides two sanity modes. Pick based on context.
### CPU-Friendly Fast Sanity
For CI or machines without a GPU. It uses a sanity variant of the model — same topology but parameter count reduced to ~50K — so the run finishes in a few seconds on CPU.
```python
from eulerstack.training.sanity import build_sanity_model, run_sanity_training
from eulerstack.data.prepare import TokenizedDataset
import torch

# Same topology, small scale
model = build_sanity_model(
    vocab_size=256, d_model=64, n_heads=4, n_layers=4, max_seq_len=64,
)

# Simple pattern data (no real text needed)
data = TokenizedDataset(torch.arange(0, 32).unsqueeze(0).repeat(128, 1))

result = run_sanity_training(
    model, data, batch_size=16, max_steps=50, seed=42,
)
print(result.summary())
```
Healthy output looks like:
```
Sanity training: PASS
Steps: 50
Initial loss: 5.5432
Final loss: 3.1201
Loss decreased: True
```
This mode is a smoke test that confirms the architecture implementation is not fundamentally broken. Run it first whenever you add or modify a mixer or block implementation.
### GPU E2E Sanity (Real Presets)
Builds a real preset model (0.8B+) on GPU and trains briefly on dolma data. This is closest to a production setup.
```python
import yaml, torch

from eulerstack.ir.normalizer import normalize_to_ir
from eulerstack.training.sanity import build_model_from_ir, run_e2e_training
from eulerstack.data.prepare import tokenize_jsonl, TokenizedDataset

# Load preset
with open("configs/presets/arch_advanced_jamba.yml") as f:
    ir = normalize_to_ir(yaml.safe_load(f))

# Build model on GPU
model = build_model_from_ir(ir, device="cuda:1", dtype=torch.bfloat16)

# Prepare data (cached)
raw = tokenize_jsonl(
    jsonl_path="data/dolma_10k.jsonl",
    tokenizer_name="gpt2",
    max_seq_len=512,
    num_rows=1000,
)
dataset = TokenizedDataset(raw.input_ids, vocab_size=ir.model.vocab_size)

# Train 20 steps
result = run_e2e_training(
    model, dataset,
    batch_size=2, lr=1e-4, max_steps=20,
    device="cuda:1", use_amp=True, seed=42,
)
print(result.summary())
```
If this mode passes, the preset is ready for full training.
### Running via pytest (Gated E2E Tests)
Both sanity modes can also be invoked as pytest tests, with environment variables gating them. This is useful in CI and regression testing during development.
```bash
# All 8 llm_ presets ≤ 2B (simple / mistral / jamba / moe × 2 sizes)
RUN_LLM_E2E=1 python -m pytest tests/integration/llm_e2e/ --llm-presets=all -v -s

# All 17 arch_ presets
RUN_ARCH_E2E=1 python -m pytest tests/integration/arch_e2e/ --arch-presets=all -v -s

# All 6 arch_expert_*_mini presets
RUN_EXPERT_MINI_E2E=1 python -m pytest tests/integration/expert_mini_e2e/ -v -s

# Custom steps and learning rate
RUN_ARCH_E2E=1 python -m pytest tests/integration/arch_e2e/ \
    --arch-steps=50 --arch-lr=1e-4 -v -s

# Specific GPU (default is cuda:1)
RUN_ARCH_E2E=1 EULERSTACK_DEVICE=cuda:0 python -m pytest tests/integration/arch_e2e/ -v -s
```
Each gate variable (RUN_LLM_E2E, RUN_ARCH_E2E, RUN_EXPERT_MINI_E2E) must
be set for the corresponding tests to run. Without it, tests are marked
SKIPPED and the reason message tells you which variable to set.
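The gating behaviour can be sketched with the standard library alone. This is a hypothetical illustration of the pattern, not the repository's actual conftest code; `gate_reason` is an invented helper:

```python
import os

def gate_reason(var_name: str):
    """Return None when the gated suite should run, otherwise the
    skip-reason string a test runner would report."""
    if os.environ.get(var_name) == "1":
        return None
    return f"set {var_name}=1 to run this suite"

os.environ.pop("RUN_ARCH_E2E", None)
print(gate_reason("RUN_ARCH_E2E"))  # set RUN_ARCH_E2E=1 to run this suite

os.environ["RUN_ARCH_E2E"] = "1"
print(gate_reason("RUN_ARCH_E2E"))  # None, i.e. the suite runs
```

In a real suite a reason string like this would feed `pytest.mark.skipif`, which is why an unset variable shows up as SKIPPED with an actionable message.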
Detailed per-suite documentation:
- `tests/integration/arch_e2e/` — `docs/architectures/tutorials/arch_e2e_guide.md`
- `tests/integration/llm_e2e/` — `docs/architectures/tutorials/llm_e2e_guide.md`
- `tests/integration/expert_mini_e2e/` — Mini section in `arch_e2e_guide.md`
## Full Preset vs Sanity Variant
The two modes compared at a glance:
| Aspect | Full preset | Sanity variant |
|---|---|---|
| `d_model` | 1024–2048 | 64 |
| `vocab_size` | 32000 | 256 |
| Parameters | 0.8B–2B+ | ~50K |
| Hardware | GPU required | CPU, < 5 seconds |
| Purpose | Production verification | Architecture smoke test |
A reasonable development cycle: run CPU sanity after code changes, run GPU E2E sanity before committing, then launch full training.
## After sanity passes — the path to real training
Once sanity is green, EulerStack's job is done. From here on you are in the territory of external training tools.
```bash
# Step 1: export the EulerStack YAML as a standard HF directory
eulerstack compile \
    --preset configs/presets/arch_advanced_jamba.yml \
    --output-dir ./my_model

# Step 2: from here on, use an external trainer — a few examples:

# (A) Axolotl for SFT
accelerate launch -m axolotl.cli.train axolotl_config.yaml
# axolotl_config.yaml needs `base_model: ./my_model`

# (B) HF Trainer for a simple pretrain
python train_with_hf_trainer.py \
    --model_name_or_path ./my_model \
    --dataset_name wikitext \
    --dataset_config wikitext-2-raw-v1 \
    --do_train --per_device_train_batch_size 2

# (C) TRL for DPO
python train_dpo.py --model ./my_model --dataset anthropic/hh-rlhf

# (D) TorchTitan / Megatron-LM — follow each tool's official distributed recipe
```
The exported artifact is a standard HuggingFace PreTrainedModel so it
drops into PEFT, TRL, vLLM, TGI, SGLang, TensorRT-LLM — anything that
consumes the HF contract.
## Next steps
- Tutorial 0: Where EulerStack Fits — why training is not included and where the boundary to other tools sits
- Tutorial 2: Use Presets — revisit the catalogue
- Tutorial 4: Compile and Explain — HF export in detail
- Tutorial 7: arch walkthrough — beginner→expert presets
- Tutorial 10: Paper → YAML — port four papers into YAML
- `examples/01_compile_and_export.py` — full export pipeline script
- `examples/02_load_and_generate.py` — load and generate text