17. Scratch Pretraining
This document explains how to pretrain from scratch, on raw text data, models assembled and exported in HuggingFace format by external tools such as EulerStack.
1. pretrain vs train
| Item | eulerforge train (fine-tuning) | eulerforge pretrain (scratch) |
|---|---|---|
| Model | Existing HF model (with weights) | Newly assembled model (initialized weights) |
| Training method | LoRA/MoE injection + phase schedule | Full parameter causal LM training |
| Data | instruction/preference JSONL | Raw text (packed chunking) |
| Injection | dense_lora, mixture_lora, etc. | None (prohibited) |
| Purpose | Adapting existing models | Architecture validation + foundational language capability training |
2. End-to-End Flow
EulerStack (model assembly)     EulerForge (training)
───────────────────────────     ─────────────────────
YAML spec
    ↓ compile + export
HF model_dir ──────────────→    eulerforge pretrain
(config.json +                        ↓
 model.safetensors +            Pretraining complete
 modeling_*.py)                       ↓
                                final/ (HF save_pretrained)
                                      ↓
                                eulerforge train (LoRA fine-tuning)
3. Preparation
3.1 EulerStack Model
You need a model directory exported by EulerStack in HuggingFace format.
/path/to/eulerstack/outputs/full_hybrid_moe/
├── config.json # HF config (model_type: eulerstack)
├── configuration_eulerstack.py # Custom config class
├── modeling_eulerstack.py # Custom model class
├── model.safetensors # Model weights
└── generation_config.json
Note: Since this is a custom model with `model_type: eulerstack`, `trust_remote_code: true` is required. Any form of EulerStack hybrid model (pure attention, Mamba hybrid, tri-mixer, MoE, etc.) is supported.
3.2 Data
A raw text JSONL file is required. Each line must be a JSON object containing a `text` key.
{"text": "LABOR'S MARTYRS\n\nHaymarket\n1887\n\nSacco and Vanzetti..."}
{"text": "THROUGH THE WALL\n\nBY CLEVELAND MOFFETT..."}
This tutorial uses data/dolma_10k.jsonl (10,000 rows from the Dolma dataset).
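If you need to produce this format yourself, a few lines of stdlib Python suffice. A minimal sketch (the file name `raw_corpus.jsonl` is illustrative):

```python
import json

# Write a raw-text JSONL file: one JSON object per line with a "text" key.
documents = [
    "First document. Plain prose is fine.",
    "Second document.\nNewlines inside the text are preserved by JSON.",
]

with open("raw_corpus.jsonl", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")

# Quick sanity check: every line parses and has a non-empty "text" field.
with open("raw_corpus.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
assert all(row.get("text") for row in rows)
print(f"{len(rows)} rows OK")
```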
3.3 Tokenizer
If the model directory does not contain tokenizer files, specify one separately in the preset. The EulerStack model's config.json contains tokenizer information.
"tokenizer": {"type": "hf", "pretrained": "gpt2", "add_bos": true, "add_eos": true}
-> Specify tokenizer: "gpt2" in the preset.
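To see which tokenizer a given model directory expects, you can inspect it directly. A minimal stdlib sketch (the helper function and the `demo_model` directory are illustrative, not part of EulerForge):

```python
import json
import os

def resolve_tokenizer(model_dir):
    """Return a tokenizer name for the preset, or None if the model
    directory already ships its own tokenizer files."""
    if os.path.exists(os.path.join(model_dir, "tokenizer_config.json")):
        return None  # tokenizer is bundled with the model -- no preset key needed
    with open(os.path.join(model_dir, "config.json"), encoding="utf-8") as f:
        config = json.load(f)
    # Per the example above, the config records the intended tokenizer
    # under a "tokenizer" block with a "pretrained" entry.
    return config.get("tokenizer", {}).get("pretrained")

# Demo against a throwaway directory containing only a config.json.
os.makedirs("demo_model", exist_ok=True)
with open("demo_model/config.json", "w", encoding="utf-8") as f:
    json.dump({"model_type": "eulerstack",
               "tokenizer": {"type": "hf", "pretrained": "gpt2"}}, f)

print(resolve_tokenizer("demo_model"))  # -> gpt2
```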
4. Writing a Preset
configs/presets/pretrain/eulerstack_hybrid_moe.yml:
# -- Device --
device: "cuda:0"          # cuda:0, cuda:1, cpu
# -- Model --
model_dir: "outputs/full_hybrid_moe"
trust_remote_code: true
# -- Tokenizer --
tokenizer: "gpt2"
# -- Data --
data:
  path: "data/dolma_10k.jsonl"
  text_column: "text"
  max_length: 2048
  packing: true           # packed chunking
# -- Training --
training:
  max_steps: 500
  batch_size: 2
  grad_accum_steps: 4     # effective batch = 2 × 4 = 8
  lr: 3.0e-4
  weight_decay: 0.1
  warmup_steps: 50
  max_grad_norm: 1.0
  log_steps: 10
  save_steps: 250
  dtype: "float32"        # Hybrid models (Hyena/RetNet FFT) require float32
  amp: false              # FFT ops don't support bf16 -> disable AMP
  seed: 42
Preset Key Descriptions
| Key | Description | Required |
|---|---|---|
| `device` | Training device (cuda:0, cuda:1, cpu) | Default: cuda |
| `model_dir` | HF model directory path | Required |
| `trust_remote_code` | Allow custom model classes | true for EulerStack models |
| `tokenizer` | HF tokenizer name or path | Required if not in model_dir |
| `data.path` | Raw text JSONL path | Required |
| `data.text_column` | Text key in JSONL | Default: text |
| `data.max_length` | Maximum sequence length | Default: 2048 |
| `data.packing` | Use packed chunking | Default: true |
| `training.max_steps` | Maximum training steps | Required |
| `training.batch_size` | Batch size | Default: 2 |
| `training.grad_accum_steps` | Gradient accumulation | Default: 1 |
| `training.lr` | Learning rate | Default: 3e-4 |
| `training.dtype` | Model load + training precision | Default: bfloat16 |
| `training.amp` | Whether to enable AMP | Default: auto based on dtype |
dtype and amp Selection Guide
| Model Type | dtype | amp | Reason |
|---|---|---|---|
| Pure Attention (simple) | bfloat16 | true (default) | Memory/speed benefits with bf16 AMP |
| Hybrid (Mamba/Hyena/RetNet) | float32 | false | FFT ops don't support bf16 |
| MoE + Attention | bfloat16 | true (default) | MoE routing is bf16 compatible |
| MoE + Hybrid | float32 | false | Contains Hyena/RetNet FFT |
Key point: `torch.fft.rfft` does not support bfloat16. Models containing EulerStack's Hyena or RetNet blocks must be configured with `dtype: float32` + `amp: false`.
Prohibited Keys
Since pretrain performs full parameter training, the following keys will cause a validation error if present:
- `injection` -- LoRA/MoE injection is exclusive to `eulerforge train`
- `moe` -- MoE auxiliary loss settings are exclusive to `eulerforge train`
- `backbone` -- Backbone adapters are exclusive to `eulerforge train`
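The rule is easy to state in code. A hypothetical sketch of the check (the function name and error message are illustrative; the real validator's interface may differ):

```python
# Hypothetical sketch of the pretrain preset validation rule: the three
# fine-tuning-only keys must be absent from a pretrain preset.
PROHIBITED_KEYS = {"injection", "moe", "backbone"}

def validate_pretrain_preset(preset: dict) -> None:
    present = PROHIBITED_KEYS & preset.keys()
    if present:
        raise ValueError(
            f"Keys {sorted(present)} are exclusive to 'eulerforge train' "
            "and are not allowed in a pretrain preset."
        )

# A clean preset passes silently; one with an injection block is rejected.
validate_pretrain_preset({"model_dir": "outputs/full_hybrid_moe",
                          "training": {"max_steps": 500}})
try:
    validate_pretrain_preset({"model_dir": "m",
                              "injection": {"strategy": "dense_lora"}})
except ValueError as e:
    print(e)
```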
5. Execution
5.1 Configuration Validation
eulerforge pretrain --preset configs/presets/pretrain/eulerstack_hybrid_moe.yml --validate-only
Output:
[Pretrain] Config validation passed.
[Pretrain] model_dir: /path/to/eulerstack/outputs/full_hybrid_moe
[Pretrain] data.path: data/dolma_10k.jsonl
[Pretrain] training.max_steps: 500
5.2 Running Training
eulerforge pretrain --preset configs/presets/pretrain/eulerstack_hybrid_moe.yml
Example output:
[Pretrain] Device: cuda
[Pretrain] Loading model from /path/to/eulerstack/outputs/full_hybrid_moe
[Pretrain] Model loaded: 1,932,345,344 params, 1,932,345,344 trainable, dtype=torch.float32
[Pretrain] Tokenizer: gpt2 (vocab=50257)
[Pretrain] Loaded 10000 texts from data/dolma_10k.jsonl
[Pretrain] Packed: 4,521,600 tokens → 2,207 chunks (max_length=2048)
[Pretrain] Starting training: max_steps=500, batch_size=2, grad_accum=4, lr=0.0003, warmup=50
[Pretrain] step=10/500 | loss=10.4523 | lr=6.00e-05 | 0.42 steps/s
[Pretrain] step=20/500 | loss=9.8712 | lr=1.20e-04 | 0.43 steps/s
...
[Pretrain] step=250/500 | loss=6.1234 | lr=2.85e-04 | 0.44 steps/s
[Pretrain] Checkpoint saved: outputs/pretrain_20260327_143000/checkpoint-250
...
[Pretrain] step=500/500 | loss=5.4321 | lr=0.00e+00 | 0.44 steps/s
[Pretrain] Training complete. 500 steps in 1187.3s. Final model saved to outputs/pretrain_20260327_143000/final
5.3 Configuration Overrides
# Change learning rate
eulerforge pretrain --preset ... --set training.lr=1e-4
# More steps
eulerforge pretrain --preset ... --set training.max_steps=2000
# Specify output directory
eulerforge pretrain --preset ... --output-dir outputs/my_pretrain
# Combine multiple overrides
eulerforge pretrain --preset ... \
--set training.max_steps=1000 \
--set training.lr=1e-4 \
--set data.max_length=1024
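A dotted `--set KEY=VALUE` override can be applied to a nested config with a small helper. A minimal sketch of the idea (not EulerForge's actual implementation; value parsing here is simplified to Python literals):

```python
import ast

def apply_override(config: dict, assignment: str) -> None:
    """Apply one 'a.b.c=value' override to a nested dict in place."""
    dotted, raw = assignment.split("=", 1)
    *parents, leaf = dotted.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    try:
        value = ast.literal_eval(raw)  # numbers, booleans, quoted strings...
    except (ValueError, SyntaxError):
        value = raw                    # fall back to a plain string
    node[leaf] = value

cfg = {"training": {"max_steps": 500, "lr": 3.0e-4}}
apply_override(cfg, "training.max_steps=2000")
apply_override(cfg, "training.lr=1e-4")
print(cfg["training"])  # {'max_steps': 2000, 'lr': 0.0001}
```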
6. Output Structure
outputs/pretrain_20260327_143000/
├── pretrain_config.json # Snapshot of the configuration used
├── metrics.jsonl # Per-step loss, lr, elapsed time
├── checkpoint-250/ # Intermediate checkpoint
│ ├── config.json
│ ├── model.safetensors
│ ├── tokenizer.json
│ └── tokenizer_config.json
└── final/ # Final model
├── config.json
├── model.safetensors
├── tokenizer.json
└── tokenizer_config.json
metrics.jsonl Format
{"step": 10, "loss": 10.4523, "lr": 6e-05, "elapsed_sec": 23.8}
{"step": 20, "loss": 9.8712, "lr": 0.00012, "elapsed_sec": 47.1}
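Because metrics.jsonl is plain JSONL, post-processing needs nothing beyond the standard library. A minimal sketch that extracts the loss curve (the inline sample mirrors the two lines above; in practice, read the real file):

```python
import io
import json

# Two sample lines in the metrics.jsonl format shown above.
sample = """\
{"step": 10, "loss": 10.4523, "lr": 6e-05, "elapsed_sec": 23.8}
{"step": 20, "loss": 9.8712, "lr": 0.00012, "elapsed_sec": 47.1}
"""

# Parse each line and pull out the loss curve for plotting or inspection.
records = [json.loads(line) for line in io.StringIO(sample)]
steps = [r["step"] for r in records]
losses = [r["loss"] for r in records]
print(steps, losses)  # [10, 20] [10.4523, 9.8712]
```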
7. Packed Chunking
Data processing in pretrain differs from fine-tuning. Raw text is concatenated and split into fixed-length chunks.
Text 1: "Hello world. This is a test."
Text 2: "Another document with more content."
Text 3: "Third piece of text..."
↓ Tokenize + insert EOS + concatenate
[tok1, tok2, ..., EOS, tok1, tok2, ..., EOS, tok1, tok2, ..., EOS]
↓ Split into max_length=2048 chunks
Chunk 1: [tok1, tok2, ..., tok2048] ← labels = input_ids (causal LM)
Chunk 2: [tok2049, tok2050, ..., tok4096]
Chunk 3: ...
(Remaining tokens less than max_length are discarded)
- No padding: Every token is a valid training signal
- EOS insertion: Marks document boundaries
- Remainder discarded: The last incomplete chunk is not used
Setting packing: false uses per-text tokenization with padding/truncation instead.
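The packing steps above can be sketched in a few lines. A minimal illustration with a toy stand-in tokenizer (real runs use the HF tokenizer, and EulerForge's implementation may differ in detail):

```python
def pack_chunks(texts, tokenize, eos_id, max_length):
    """Concatenate tokenized texts with EOS separators, then split into
    fixed-length chunks; the incomplete trailing chunk is discarded."""
    stream = []
    for text in texts:
        stream.extend(tokenize(text))
        stream.append(eos_id)  # mark the document boundary
    n_full = len(stream) // max_length
    return [stream[i * max_length:(i + 1) * max_length] for i in range(n_full)]

# Toy tokenizer: one id per whitespace-separated word (illustrative only).
vocab = {}
def toy_tokenize(text):
    return [vocab.setdefault(w, len(vocab)) for w in text.split()]

texts = ["Hello world. This is a test.",
         "Another document with more content.",
         "Third piece of text"]

# 18 tokens total (including EOS markers) -> two full chunks of 8,
# with the 2-token remainder discarded.
chunks = pack_chunks(texts, toy_tokenize, eos_id=-1, max_length=8)
print(len(chunks), chunks[0])
```

In the real trainer, each chunk's labels are simply its input_ids (causal LM), so no padding or loss masking is needed.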
8. Fine-Tuning After Pretraining
After pretrain completes, you can use the final/ directory as the model for eulerforge train.
# configs/presets/finetune_after_pretrain.yml
device: cuda:0
backbone: qwen3            # or the appropriate backbone
model_name: "outputs/pretrain_20260327_143000/final"
injection:
  strategy: dense_lora
  lora_r: 32
  lora_alpha: 64
  target_keywords: ["gate_proj", "up_proj", "down_proj"]
  attn_lora:
    enabled: true
    keywords: ["q_proj", "v_proj"]
training:
  type: sft
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]
      lr: 1.0e-5
  max_train_steps: 5000
eulerforge train --preset configs/presets/finetune_after_pretrain.yml \
--set data.path=data/sft_10k_raw.jsonl \
--set data.format=raw \
--set data.task=sft
Note: Custom EulerStack models (`model_type: eulerstack`) may not be directly compatible with `eulerforge train`'s BackboneAdapter. In such cases, you need to convert the pretrained model to standard HF format or add a dedicated backbone adapter.
9. Important Notes
Vocab Size Mismatch (Handled Automatically)
The EulerStack model's vocab_size and the tokenizer's vocab_size may differ. For example:
- Model: vocab_size=32000 (defined in EulerStack spec)
- Tokenizer: vocab_size=50257 (GPT-2)
If the tokenizer vocab is larger than the model vocab, token IDs exceed the embedding table range, causing a CUDA index out of bounds error.
Pretrain handles this automatically: If the tokenizer vocab > model vocab, it expands the embeddings with model.resize_token_embeddings(). Since this is scratch training, newly added embedding rows are randomly initialized and train normally.
[Pretrain] Tokenizer vocab (50257) > model vocab (32000). Resizing model embeddings to 50257.
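Conceptually, the resize appends randomly initialized rows to the embedding table. A minimal sketch with plain Python lists (the actual code calls `model.resize_token_embeddings()` on torch tensors; the init scale here is illustrative):

```python
import random

def resize_embeddings(embedding, new_vocab, dim, seed=42):
    """Grow an embedding table (list of row vectors) to new_vocab rows.
    New rows are randomly initialized, as in scratch training."""
    rng = random.Random(seed)
    while len(embedding) < new_vocab:
        embedding.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return embedding

# Model vocab (32000) is smaller than the GPT-2 tokenizer vocab (50257),
# so the table is expanded before training starts.
model_vocab, tokenizer_vocab, dim = 32000, 50257, 8
table = [[0.0] * dim for _ in range(model_vocab)]
if tokenizer_vocab > model_vocab:
    table = resize_embeddings(table, tokenizer_vocab, dim)
print(len(table))  # 50257
```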
Position IDs (Handled Automatically)
Some custom models (such as EulerStack's RetNet) may encounter a RuntimeError: view size is not compatible error when internally calling .view() on position_ids if the tensor is non-contiguous.
Pretrain handles this automatically: It explicitly generates contiguous position_ids on each forward call and passes them to the model.
GPU Memory
Scratch pretraining trains all parameters, so memory usage is high. It requires significantly more GPU memory than LoRA fine-tuning.
| Model Size | Estimated GPU Memory (bf16, batch=2, accum=4) |
|---|---|
| ~0.8B | ~8 GB |
| ~2B | ~16 GB |
| ~4B+ | Difficult on a single GPU -- recommend batch_size=1, increase grad_accum |
Learning Rate
Scratch pretraining uses a higher learning rate than fine-tuning.
| Training Method | Typical Learning Rate Range |
|---|---|
| LoRA fine-tuning | 1e-5 ~ 5e-5 |
| Scratch pretrain | 1e-4 ~ 6e-4 |
Data Volume
10k rows are suitable for architecture validation, but actual pretraining requires hundreds of thousands to millions of rows.
10. CLI Options
| Option | Description |
|---|---|
| `--preset PATH` | Pretrain YAML preset path (required) |
| `--set KEY=VALUE` | Configuration override (repeatable) |
| `--output-dir DIR` | Output directory |
| `--validate-only` | Validate configuration only, no training |
Related Documents
- CLI Reference -- pretrain section
- EulerStack Documentation -- Model assembly tool