
17. Scratch Pretraining

This document explains how to run scratch pretraining on raw text data for models that external tools such as EulerStack have assembled and exported in HuggingFace format.


1. pretrain vs train

Item            | eulerforge train (fine-tuning)      | eulerforge pretrain (scratch)
Model           | Existing HF model (with weights)    | Newly assembled model (initialized weights)
Training method | LoRA/MoE injection + phase schedule | Full parameter causal LM training
Data            | Instruction/preference JSONL        | Raw text (packed chunking)
Injection       | dense_lora, mixture_lora, etc.      | None (prohibited)
Purpose         | Adapting existing models            | Architecture validation + foundational language capability training

2. End-to-End Flow

EulerStack (model assembly)     EulerForge (training)
───────────────────────────     ─────────────────────
YAML spec
  ↓ compile + export
HF model_dir ──────────────→ eulerforge pretrain
  (config.json +                 ↓
   model.safetensors +       Pretraining complete
   modeling_*.py)                ↓
                             final/ (HF save_pretrained)
                                 ↓
                             eulerforge train (LoRA fine-tuning)

3. Preparation

3.1 EulerStack Model

You need a model directory exported by EulerStack in HuggingFace format.

/path/to/eulerstack/outputs/full_hybrid_moe/
├── config.json                    # HF config (model_type: eulerstack)
├── configuration_eulerstack.py    # Custom config class
├── modeling_eulerstack.py         # Custom model class
├── model.safetensors              # Model weights
└── generation_config.json

Note: Since this is a custom model with model_type: eulerstack, trust_remote_code: true is required. Any form of EulerStack hybrid model (pure attention, mamba hybrid, tri-mixer, MoE, etc.) is supported.

3.2 Data

A raw text JSONL file is required. Each line must contain a text key.

{"text": "LABOR'S MARTYRS\n\nHaymarket\n1887\n\nSacco and Vanzetti..."}
{"text": "THROUGH THE WALL\n\nBY CLEVELAND MOFFETT..."}

This tutorial uses data/dolma_10k.jsonl (10,000 rows from the Dolma dataset).
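For reference, the JSONL shape can be produced and checked with a few lines of standard-library Python (the file name here is made up for the example):

```python
import json
import os
import tempfile

# Illustrative only: build a tiny raw-text JSONL in the shape pretrain
# expects -- one JSON object per line, each with a "text" key.
docs = [
    "LABOR'S MARTYRS\n\nHaymarket\n1887",
    "THROUGH THE WALL\n\nBY CLEVELAND MOFFETT",
]
path = os.path.join(tempfile.mkdtemp(), "sample_pretrain.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for text in docs:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

# Each line parses independently, so large corpora can be streamed.
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # -> 2
```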

3.3 Tokenizer

If the model directory does not contain tokenizer files, specify one separately in the preset. The EulerStack model's config.json contains tokenizer information.

"tokenizer": {"type": "hf", "pretrained": "gpt2", "add_bos": true, "add_eos": true}

-> Specify tokenizer: "gpt2" in the preset.
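Pulling the tokenizer name out of config.json can be scripted; the config dict below is a trimmed, hypothetical EulerStack config (the real file also carries the architecture definition):

```python
import json
import os
import tempfile

# Hypothetical, trimmed EulerStack config.json containing the tokenizer
# block shown above.
config = {
    "model_type": "eulerstack",
    "tokenizer": {"type": "hf", "pretrained": "gpt2", "add_bos": True, "add_eos": True},
}
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump(config, f)

# Extract the HF tokenizer name to copy into the preset's `tokenizer:` key.
with open(path) as f:
    tok = json.load(f).get("tokenizer", {})
tokenizer_name = tok.get("pretrained")
print(tokenizer_name)  # -> gpt2
```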


4. Writing a Preset

configs/presets/pretrain/eulerstack_hybrid_moe.yml:

# -- Device --
device: "cuda:0"             # cuda:0, cuda:1, cpu

# -- Model --
model_dir: "outputs/full_hybrid_moe"
trust_remote_code: true

# -- Tokenizer --
tokenizer: "gpt2"

# -- Data --
data:
  path: "data/dolma_10k.jsonl"
  text_column: "text"
  max_length: 1024
  packing: true              # packed chunking

# -- Training --
training:
  max_steps: 500
  batch_size: 2
  grad_accum_steps: 4        # effective batch = 2 × 4 = 8
  lr: 3.0e-4
  weight_decay: 0.1
  warmup_steps: 50
  max_grad_norm: 1.0
  log_steps: 10
  save_steps: 250
  dtype: "float32"           # Hybrid models (Hyena/RetNet FFT) require float32
  amp: false                 # FFT ops don't support bf16 -> disable AMP
  seed: 42

Preset Key Descriptions

Key                       | Description                            | Required
device                    | Training device (cuda:0, cuda:1, cpu)  | Default: cuda
model_dir                 | HF model directory path                | Required
trust_remote_code         | Allow custom model classes             | true for EulerStack models
tokenizer                 | HF tokenizer name or path              | Required if not in model_dir
data.path                 | Raw text JSONL path                    | Required
data.text_column          | Text key in JSONL                      | Required (default: text)
data.max_length           | Maximum sequence length                | Default: 2048
data.packing              | Use packed chunking                    | Default: true
training.max_steps        | Maximum training steps                 | Required
training.batch_size       | Batch size                             | Default: 2
training.grad_accum_steps | Gradient accumulation                  | Default: 1
training.lr               | Learning rate                          | Default: 3e-4
training.dtype            | Model load + training precision        | Default: bfloat16
training.amp              | Whether to enable AMP                  | Default: auto based on dtype

dtype and amp Selection Guide

Model Type                  | dtype    | amp            | Reason
Pure Attention (simple)     | bfloat16 | true (default) | Memory/speed benefits with bf16 AMP
Hybrid (Mamba/Hyena/RetNet) | float32  | false          | FFT ops don't support bf16
MoE + Attention             | bfloat16 | true (default) | MoE routing is bf16 compatible
MoE + Hybrid                | float32  | false          | Contains Hyena/RetNet FFT

Key point: torch.fft.rfft does not support bfloat16. Models containing EulerStack's Hyena or RetNet blocks must be configured with dtype: float32 + amp: false.
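The selection guide can be condensed into a small helper; this is a hypothetical sketch (not part of eulerforge) that picks precision settings from the block types a model contains:

```python
# Hypothetical helper encoding the selection guide above: block families on
# the FFT path (Hyena, RetNet; the guide also groups Mamba with them) force
# float32 with AMP off, because torch.fft.rfft does not support bfloat16.
FFT_BLOCKS = {"hyena", "retnet", "mamba"}

def choose_precision(block_types):
    """Return (dtype, amp) for a model built from the given block types."""
    if FFT_BLOCKS & {b.lower() for b in block_types}:
        return "float32", False
    return "bfloat16", True

print(choose_precision(["attention", "moe"]))    # -> ('bfloat16', True)
print(choose_precision(["attention", "hyena"]))  # -> ('float32', False)
```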

Prohibited Keys

Since pretrain performs full parameter training, injection-related keys (for example an injection block with dense_lora or mixture_lora strategies) will cause a validation error if present.
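A minimal sketch of how such validation could look; the REQUIRED/PROHIBITED lists and helper names here are illustrative, not eulerforge's actual implementation:

```python
# Illustrative validation sketch: pretrain requires a few keys and rejects
# injection-related ones (those belong to `eulerforge train`).
REQUIRED = ["model_dir", "data.path", "training.max_steps"]
PROHIBITED = ["injection"]  # e.g. dense_lora / mixture_lora strategies

def get_nested(cfg, dotted):
    """Walk a dotted key through nested dicts; None if any hop is missing."""
    node = cfg
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def validate(cfg):
    errors = [f"missing required key: {k}" for k in REQUIRED
              if get_nested(cfg, k) is None]
    errors += [f"prohibited key for pretrain: {k}" for k in PROHIBITED
               if get_nested(cfg, k) is not None]
    return errors

good = {"model_dir": "outputs/full_hybrid_moe",
        "data": {"path": "data/dolma_10k.jsonl"},
        "training": {"max_steps": 500}}
bad = dict(good, injection={"strategy": "dense_lora"})
print(validate(good))  # -> []
print(validate(bad))   # -> ['prohibited key for pretrain: injection']
```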


5. Execution

5.1 Configuration Validation

eulerforge pretrain --preset configs/presets/pretrain/eulerstack_hybrid_moe.yml --validate-only

Output:

[Pretrain] Config validation passed.
[Pretrain] model_dir: /path/to/eulerstack/outputs/full_hybrid_moe
[Pretrain] data.path: data/dolma_10k.jsonl
[Pretrain] training.max_steps: 500

5.2 Running Training

eulerforge pretrain --preset configs/presets/pretrain/eulerstack_hybrid_moe.yml

Example output:

[Pretrain] Device: cuda
[Pretrain] Loading model from /path/to/eulerstack/outputs/full_hybrid_moe
[Pretrain] Model loaded: 1,932,345,344 params, 1,932,345,344 trainable, dtype=torch.float32
[Pretrain] Tokenizer: gpt2 (vocab=50257)
[Pretrain] Loaded 10000 texts from data/dolma_10k.jsonl
[Pretrain] Packed: 4,521,600 tokens → 4,415 chunks (max_length=1024)
[Pretrain] Starting training: max_steps=500, batch_size=2, grad_accum=4, lr=0.0003, warmup=50
[Pretrain] step=10/500 | loss=10.4523 | lr=6.00e-05 | 0.42 steps/s
[Pretrain] step=20/500 | loss=9.8712 | lr=1.20e-04 | 0.43 steps/s
...
[Pretrain] step=250/500 | loss=6.1234 | lr=2.85e-04 | 0.44 steps/s
[Pretrain] Checkpoint saved: outputs/pretrain_20260327_143000/checkpoint-250
...
[Pretrain] step=500/500 | loss=5.4321 | lr=0.00e+00 | 0.44 steps/s
[Pretrain] Training complete. 500 steps in 1187.3s. Final model saved to outputs/pretrain_20260327_143000/final

5.3 Configuration Overrides

# Change learning rate
eulerforge pretrain --preset ... --set training.lr=1e-4

# More steps
eulerforge pretrain --preset ... --set training.max_steps=2000

# Specify output directory
eulerforge pretrain --preset ... --output-dir outputs/my_pretrain

# Combine multiple overrides
eulerforge pretrain --preset ... \
  --set training.max_steps=1000 \
  --set training.lr=1e-4 \
  --set data.max_length=1024
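A hedged sketch of how a --set override could be folded into the loaded preset: split on the first '=', walk the dotted key through nested dicts, and coerce the value. The value parsing here is a simplification of whatever eulerforge actually does:

```python
# Illustrative --set KEY=VALUE handling: dotted keys address nested dicts,
# and scalar values are coerced int -> float -> string.
def apply_override(cfg, assignment):
    key, _, raw = assignment.partition("=")
    try:
        value = int(raw)
    except ValueError:
        try:
            value = float(raw)
        except ValueError:
            value = raw
    node = cfg
    parts = key.split(".")
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value
    return cfg

cfg = {"training": {"lr": 3.0e-4, "max_steps": 500}}
apply_override(cfg, "training.lr=1e-4")
apply_override(cfg, "training.max_steps=2000")
print(cfg["training"])  # -> {'lr': 0.0001, 'max_steps': 2000}
```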

6. Output Structure

outputs/pretrain_20260327_143000/
├── pretrain_config.json         # Snapshot of the configuration used
├── metrics.jsonl                # Per-step loss, lr, elapsed time
├── checkpoint-250/              # Intermediate checkpoint
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer.json
│   └── tokenizer_config.json
└── final/                       # Final model
    ├── config.json
    ├── model.safetensors
    ├── tokenizer.json
    └── tokenizer_config.json

metrics.jsonl Format

{"step": 10, "loss": 10.4523, "lr": 6e-05, "elapsed_sec": 23.8}
{"step": 20, "loss": 9.8712, "lr": 0.00012, "elapsed_sec": 47.1}
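The file is plain JSONL, so quick post-hoc analysis needs nothing beyond the standard library. A sketch that recovers the steps/s figure the trainer logs, using the two sample records above:

```python
import json

# Parse metrics.jsonl records and derive throughput between two log points.
lines = [
    '{"step": 10, "loss": 10.4523, "lr": 6e-05, "elapsed_sec": 23.8}',
    '{"step": 20, "loss": 9.8712, "lr": 0.00012, "elapsed_sec": 47.1}',
]
records = [json.loads(line) for line in lines]
d_steps = records[-1]["step"] - records[0]["step"]
d_secs = records[-1]["elapsed_sec"] - records[0]["elapsed_sec"]
print(f"{d_steps / d_secs:.2f} steps/s")  # -> 0.43 steps/s
```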

7. Packed Chunking

Data processing in pretrain differs from fine-tuning. Raw text is concatenated and split into fixed-length chunks.

Text 1: "Hello world. This is a test."
Text 2: "Another document with more content."
Text 3: "Third piece of text..."
         ↓ Tokenize + insert EOS + concatenate
[tok1, tok2, ..., EOS, tok1, tok2, ..., EOS, tok1, tok2, ..., EOS]
         ↓ Split into max_length=2048 chunks
Chunk 1: [tok1, tok2, ..., tok2048]    ← labels = input_ids (causal LM)
Chunk 2: [tok2049, tok2050, ..., tok4096]
Chunk 3: ...
(Remaining tokens less than max_length are discarded)

Setting packing: false uses per-text tokenization with padding/truncation instead.
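The scheme can be sketched in a few lines. This uses a toy whitespace tokenizer with fake token ids; the real pipeline uses the HF tokenizer and sets labels = input_ids for the causal LM loss:

```python
EOS = 0  # toy end-of-sequence id

def pack(texts, max_length):
    """Tokenize (fake), append EOS per text, concatenate, chunk, drop tail."""
    stream = []
    for text in texts:
        stream.extend(hash(w) % 1000 + 1 for w in text.split())  # fake token ids
        stream.append(EOS)
    # Fixed-length chunks; the remainder shorter than max_length is discarded.
    n_full = len(stream) // max_length
    return [stream[i * max_length:(i + 1) * max_length] for i in range(n_full)]

chunks = pack(["Hello world. This is a test.",
               "Another document with more content."], max_length=4)
# 6+1 and 5+1 tokens -> a 13-token stream -> three 4-token chunks, 1 dropped.
print(len(chunks), all(len(c) == 4 for c in chunks))  # -> 3 True
```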


8. Fine-Tuning After Pretraining

After pretrain completes, you can use the final/ directory as the model for eulerforge train.

# configs/presets/finetune_after_pretrain.yml
device: cuda:0
backbone: qwen3              # or the appropriate backbone
model_name: "outputs/pretrain_20260327_143000/final"

injection:
  strategy: dense_lora
  lora_r: 32
  lora_alpha: 64
  target_keywords: ["gate_proj", "up_proj", "down_proj"]
  attn_lora:
    enabled: true
    keywords: ["q_proj", "v_proj"]

training:
  type: sft
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]
  lr: 1.0e-5
  max_train_steps: 5000

eulerforge train --preset configs/presets/finetune_after_pretrain.yml \
  --set data.path=data/sft_10k_raw.jsonl \
  --set data.format=raw \
  --set data.task=sft

Note: Custom EulerStack models (model_type: eulerstack) may not be directly compatible with eulerforge train's BackboneAdapter. In such cases, you need to convert the pretrained model to standard HF format or add a dedicated backbone adapter.


9. Important Notes

Vocab Size Mismatch (Handled Automatically)

The EulerStack model's vocab_size and the tokenizer's vocab_size may differ. For example:

- Model: vocab_size=32000 (defined in the EulerStack spec)
- Tokenizer: vocab_size=50257 (GPT-2)

If the tokenizer vocab is larger than the model vocab, token IDs exceed the embedding table range, causing a CUDA index out of bounds error.

Pretrain handles this automatically: If the tokenizer vocab > model vocab, it expands the embeddings with model.resize_token_embeddings(). Since this is scratch training, newly added embedding rows are randomly initialized and train normally.

[Pretrain] Tokenizer vocab (50257) > model vocab (32000). Resizing model embeddings to 50257.
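A toy, pure-Python illustration of the resize; the real code path goes through HF's model.resize_token_embeddings(), and the init scale here is an arbitrary assumption:

```python
import random

# Toy embedding table as a list of rows. Growing it from the model vocab to
# the tokenizer vocab adds randomly initialized rows -- harmless for scratch
# training, where every row is trained from random init anyway.
def resize_embeddings(table, new_vocab):
    dim = len(table[0])
    while len(table) < new_vocab:
        table.append([random.gauss(0.0, 0.02) for _ in range(dim)])
    return table

model_vocab, tokenizer_vocab, dim = 32000, 50257, 8
embeddings = [[0.0] * dim for _ in range(model_vocab)]
embeddings = resize_embeddings(embeddings, tokenizer_vocab)
print(len(embeddings))  # -> 50257
```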

Position IDs (Handled Automatically)

Some custom models (such as EulerStack's RetNet) may encounter a RuntimeError: view size is not compatible error when internally calling .view() on position_ids if the tensor is non-contiguous.

Pretrain handles this automatically: It explicitly generates contiguous position_ids on each forward call and passes them to the model.

GPU Memory

Scratch pretraining trains all parameters, so memory usage is high. It requires significantly more GPU memory than LoRA fine-tuning.

Model Size | Estimated GPU Memory (bf16, batch=2, accum=4)
~0.8B      | ~8 GB
~2B        | ~16 GB
~4B+       | Difficult on a single GPU -- recommend batch_size=1, increase grad_accum
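As a rough cross-check, one can count the persistent per-parameter training state. The byte counts below are generic assumptions (they vary with the optimizer configuration, e.g. fp32 vs bf16 Adam moments), not eulerforge internals, and activation memory comes on top:

```python
# Back-of-envelope only: full-parameter training keeps weights, gradients,
# and (for Adam) two optimizer moments per parameter. Byte counts are
# assumptions, and activations/overhead are ignored.
def train_state_gb(n_params, weight_bytes=2, grad_bytes=2, optim_bytes=8):
    return n_params * (weight_bytes + grad_bytes + optim_bytes) / 1e9

print(f"{train_state_gb(0.8e9):.1f} GB")  # persistent state for a ~0.8B model
```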

Learning Rate

Scratch pretraining uses a higher learning rate than fine-tuning.

Training Method  | Typical Learning Rate Range
LoRA fine-tuning | 1e-5 ~ 5e-5
Scratch pretrain | 1e-4 ~ 6e-4
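The training log shows a linear warmup to the peak lr and a decay to zero by max_steps; the exact decay shape is not documented here, so the sketch below assumes the common linear-warmup + cosine-decay pairing:

```python
import math

# Assumed schedule: linear warmup over warmup_steps, then cosine decay to 0.
# Matches the log's endpoints (step 10 -> 6e-05, final step -> 0), though
# eulerforge's actual decay curve may differ.
def lr_at(step, peak=3.0e-4, warmup=50, max_steps=500):
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (max_steps - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"{lr_at(10):.2e}")   # -> 6.00e-05 (warmup, as in the log)
print(f"{lr_at(50):.2e}")   # -> 3.00e-04 (peak)
print(f"{lr_at(500):.2e}")  # -> 0.00e+00
```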

Data Volume

10k rows are suitable for architecture validation, but actual pretraining requires hundreds of thousands to millions of rows.


10. CLI Options

Option           | Description
--preset PATH    | Pretrain YAML preset path (required)
--set KEY=VALUE  | Configuration override (repeatable)
--output-dir DIR | Output directory
--validate-only  | Validate configuration only, no training