
6. Sanity Training Loop

⚠️ Scope warning — use other tools for real training

EulerStack is not a training framework. The "sanity training" described here is a regression smoke test — a 30-second CPU/GPU check that a new mixer or primitive hasn't structurally broken the architecture. That is its only purpose.

Never use this helper for real LLM training — hours-to-weeks pretraining, SFT, or RLHF — because nearly everything such runs depend on is missing here.

What is missing from EulerStack sanity training

Feature                                             | Why it's missing
Distributed training (DDP / FSDP / TP / PP)         | Needs a dedicated launcher → accelerate / Megatron
Gradient accumulation, a real mixed-precision scaler | No production GradScaler integration
LR schedulers (warmup + cosine, WSD, …)             | Just a single AdamW at fixed LR
Gradient clipping, weight-decay tuning              | No production recipe
Checkpointing, resume, best-model selection         | 20 steps and it's done
Eval loop, validation loss, perplexity tracking     | Minimal logging only
Sampling weights, curriculum, multi-dataset mixing  | Single JSONL only
RLHF (PPO / DPO / GRPO), SFT-specific formatting    | Entirely absent
WandB / TensorBoard / MLflow integration            | None
HPO (Optuna, Ray Tune)                              | None
Model merging, pruning, quantization-aware training | None

The point: "training runs" means two very different things — a quick structural smoke test, and real optimization toward model quality.

This tutorial covers purely the former.

An HF model exported from EulerStack YAML is a standard HF PreTrainedModel, so it plugs into every tool below. Pick by use case:

Use case                                        | Tool                               | One-liner
Pretraining (single to a few nodes)             | HF Trainer, Composer, Levanter     | DDP / FSDP / mixed precision
Pretraining (HPC, hundreds to thousands of GPUs) | TorchTitan, Megatron-LM, GPT-NeoX  | 3D parallelism, activation checkpointing, ZeRO
SFT / fine-tune                                 | Axolotl, LLaMA-Factory, torchtune  | chat templates, packing, QLoRA
Parameter-efficient                             | PEFT, bitsandbytes, Unsloth        | LoRA / QLoRA / IA³ / adapters
RLHF / preference                               | TRL, OpenRLHF, Axolotl-GRPO        | PPO / DPO / IPO / GRPO / RLOO
Real HPO                                        | Optuna, Ray Tune                   | Bayesian / multi-objective
Experiment tracking                             | Weights & Biases, MLflow, ClearML  | metrics + artifacts + run compare

Recommended workflow:

EulerStack YAML spec
    │  compile → save_pretrained
    ▼
HF model directory (standard)
    │  AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True)
    ▼
Axolotl / Megatron / HF Trainer / TRL / torchtune / … (pick one)
    │
    ▼
trained model

EulerStack owns only the top box.

See Tutorial 0: Where EulerStack Fits for the full positioning argument.


What follows — the internal sanity regression helper

The remainder of this tutorial documents a smoke test used only by this repository's own CI. It exists to answer three questions quickly whenever you add a new mixer or primitive:

  1. Does the structure link together? — does forward run with consistent shapes?
  2. Are grads finite? — no NaN / Inf after backward?
  3. Does loss show a descent signal? — within 20 steps, is there a downward trend (need not be monotone)?

All three green means "this YAML is eligible to be fed to an HF-compatible trainer." Anything beyond that — real quality, convergence metrics, scale-up stability — is not what this loop answers.
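Question 2 boils down to a framework-free predicate. A minimal sketch (a hypothetical helper, not part of EulerStack's API) that treats the model's flattened gradients as a plain iterable of floats:

```python
import math

def grads_are_finite(grad_values):
    """Sanity check #2: no NaN/Inf anywhere in the gradients.

    `grad_values` stands in for a model's flattened gradient entries,
    e.g. every element of every `p.grad` after `loss.backward()`.
    """
    return all(math.isfinite(g) for g in grad_values)
```

A single NaN or Inf anywhere fails the check, which is exactly the behavior you want from a smoke test: loud and binary.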

Two Sanity Modes

EulerStack provides two sanity modes. Pick based on context.

CPU-Friendly Fast Sanity

For CI or machines without a GPU. It uses a sanity variant of the model — same topology but parameter count reduced to ~50K — so the run finishes in a few seconds on CPU.

from eulerstack.training.sanity import build_sanity_model, run_sanity_training
from eulerstack.data.prepare import TokenizedDataset
import torch

# Same topology, small scale
model = build_sanity_model(
    vocab_size=256, d_model=64, n_heads=4, n_layers=4, max_seq_len=64,
)

# Simple pattern data (no real text needed)
data = TokenizedDataset(torch.arange(0, 32).unsqueeze(0).repeat(128, 1))

result = run_sanity_training(
    model, data, batch_size=16, max_steps=50, seed=42,
)
print(result.summary())

Healthy output looks like:

Sanity training: PASS
  Steps: 50
  Initial loss: 5.5432
  Final loss: 3.1201
  Loss decreased: True

This mode is a smoke test that confirms the architecture implementation is not fundamentally broken. Run it first whenever you add or modify a mixer or block.
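The "Loss decreased" verdict need not mean monotone descent. One plausible way to detect a descent signal (an illustrative sketch, not EulerStack's actual criterion) is to compare early and late loss averages:

```python
def has_descent_signal(losses, window=5):
    """True when the mean of the last `window` losses is below the
    mean of the first `window` — a downward trend that tolerates
    step-to-step noise instead of demanding monotone decrease."""
    if len(losses) < 2 * window:
        raise ValueError("not enough steps to compare windows")
    early = sum(losses[:window]) / window
    late = sum(losses[-window:]) / window
    return late < early
```

Averaging over windows keeps a single noisy spike from flipping the verdict, which matters at only 20–50 steps.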

GPU E2E Sanity (Real Presets)

Builds a real preset model (0.8B+ parameters) on GPU and trains briefly on Dolma data. This is the closest mode to a production setup.

import yaml, torch
from eulerstack.ir.normalizer import normalize_to_ir
from eulerstack.training.sanity import build_model_from_ir, run_e2e_training
from eulerstack.data.prepare import tokenize_jsonl, TokenizedDataset

# Load preset
with open("configs/presets/arch_advanced_jamba.yml") as f:
    ir = normalize_to_ir(yaml.safe_load(f))

# Build model on GPU
model = build_model_from_ir(ir, device="cuda:1", dtype=torch.bfloat16)

# Prepare data (cached)
raw = tokenize_jsonl(
    jsonl_path="data/dolma_10k.jsonl",
    tokenizer_name="gpt2",
    max_seq_len=512,
    num_rows=1000,
)
dataset = TokenizedDataset(raw.input_ids, vocab_size=ir.model.vocab_size)

# Train 20 steps
result = run_e2e_training(
    model, dataset,
    batch_size=2, lr=1e-4, max_steps=20,
    device="cuda:1", use_amp=True, seed=42,
)
print(result.summary())

If this mode passes, the preset is ready for full training.

Running via pytest (Gated E2E Tests)

Both sanity modes can also be invoked as pytest tests, with environment variables gating them. This is useful in CI and regression testing during development.

# All 8 llm_ presets ≤ 2B (simple / mistral / jamba / moe × 2 sizes)
RUN_LLM_E2E=1 python -m pytest tests/integration/llm_e2e/ --llm-presets=all -v -s

# All 17 arch_ presets
RUN_ARCH_E2E=1 python -m pytest tests/integration/arch_e2e/ --arch-presets=all -v -s

# All 6 arch_expert_*_mini presets
RUN_EXPERT_MINI_E2E=1 python -m pytest tests/integration/expert_mini_e2e/ -v -s

# Custom steps and learning rate
RUN_ARCH_E2E=1 python -m pytest tests/integration/arch_e2e/ \
    --arch-steps=50 --arch-lr=1e-4 -v -s

# Specific GPU (default is cuda:1)
RUN_ARCH_E2E=1 EULERSTACK_DEVICE=cuda:0 python -m pytest tests/integration/arch_e2e/ -v -s

Each gate variable (RUN_LLM_E2E, RUN_ARCH_E2E, RUN_EXPERT_MINI_E2E) must be set for the corresponding tests to run. Without it, tests are marked SKIPPED and the reason message tells you which variable to set.
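The gating itself boils down to reading the environment variable. A sketch of the pattern (names are illustrative — the suites' actual conftest may differ):

```python
import os

def e2e_enabled(gate: str) -> bool:
    # Unset, empty, or "0" means skip; any other value enables the suite.
    return os.environ.get(gate, "") not in ("", "0")

# In a conftest.py this would typically feed pytest.mark.skipif, e.g.:
#   pytestmark = pytest.mark.skipif(
#       not e2e_enabled("RUN_ARCH_E2E"),
#       reason="set RUN_ARCH_E2E=1 to run the arch E2E suite",
#   )
```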


Full Preset vs Sanity Variant

The two modes compared at a glance:

Aspect      | Full preset             | Sanity variant
d_model     | 1024–2048               | 64
vocab_size  | 32000                   | 256
Parameters  | 0.8B–2B+                | ~50K
Hardware    | GPU required            | CPU, < 5 seconds
Purpose     | Production verification | Architecture smoke test

A reasonable development cycle: run CPU sanity after code changes, run GPU E2E sanity before committing, then launch full training.

After sanity passes — the path to real training

Once sanity is green, EulerStack's job is done. From here on you are in the territory of external training tools.

# Step 1: export the EulerStack YAML as a standard HF directory
eulerstack compile \
    --preset configs/presets/arch_advanced_jamba.yml \
    --output-dir ./my_model

# Step 2: from here on, use an external trainer — a few examples:

# (A) Axolotl for SFT
accelerate launch -m axolotl.cli.train axolotl_config.yaml
# axolotl_config.yaml needs `base_model: ./my_model`

# (B) HF Trainer for a simple pretrain
python train_with_hf_trainer.py \
    --model_name_or_path ./my_model \
    --dataset_name wikitext \
    --dataset_config wikitext-2-raw-v1 \
    --do_train --per_device_train_batch_size 2

# (C) TRL for DPO
python train_dpo.py --model ./my_model --dataset anthropic/hh-rlhf

# (D) TorchTitan / Megatron-LM — follow each tool's official distributed recipe

The exported artifact is a standard HuggingFace PreTrainedModel so it drops into PEFT, TRL, vLLM, TGI, SGLang, TensorRT-LLM — anything that consumes the HF contract.

Next steps