17. Scratch Pretraining
This document explains how to pretrain from scratch, on raw text data, models assembled and exported in HuggingFace format by external tools such as EulerStack.
1. pretrain vs train
| Item | eulerforge train (fine-tuning) | eulerforge pretrain (scratch) |
|---|---|---|
| Model | Existing HF model (with weights) | Newly assembled model (initialized weights) |
| Training method | LoRA/MoE injection + phase schedule | Full parameter causal LM training |
| Data | instruction/preference JSONL | Raw text (packed chunking) |
| Injection | dense_lora, mixture_lora, etc. | None (prohibited) |
| Purpose | Adapting existing models | Architecture validation + foundational language capability training |
2. End-to-End Flow
EulerStack (model assembly)     EulerForge (training)
───────────────────────────     ─────────────────────
YAML spec
    ↓ compile + export
HF model_dir ──────────────→    eulerforge pretrain
(config.json +                        ↓
 model.safetensors +            Pretraining complete
 modeling_*.py)                       ↓
                                final/ (HF save_pretrained)
                                      ↓
                                eulerforge train (LoRA fine-tuning)
3. Preparation
3.1 EulerStack Model
You need a model directory exported by EulerStack in HuggingFace format.
/path/to/eulerstack/outputs/full_hybrid_moe/
├── config.json # HF config (model_type: eulerstack)
├── configuration_eulerstack.py # Custom config class
├── modeling_eulerstack.py # Custom model class
├── model.safetensors # Model weights
└── generation_config.json
Note: Since this is a custom model with `model_type: eulerstack`, `trust_remote_code: true` is required. Any form of EulerStack hybrid model (pure attention, Mamba hybrid, tri-mixer, MoE, etc.) is supported.
3.2 Data
A raw text JSONL file is required. Each line must be a JSON object containing a `text` key.
{"text": "LABOR'S MARTYRS\n\nHaymarket\n1887\n\nSacco and Vanzetti..."}
{"text": "THROUGH THE WALL\n\nBY CLEVELAND MOFFETT..."}
This tutorial uses data/dolma_10k.jsonl (10,000 rows from the Dolma dataset).
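If you need to produce this format yourself, a few lines of stdlib Python suffice. A minimal sketch (the file name `raw_corpus.jsonl` is illustrative):

```python
import json

# Write a raw-text JSONL file: one JSON object per line with a "text" key.
documents = [
    "First document. Plain prose is fine.",
    "Second document.\nNewlines inside the text are preserved by JSON.",
]

with open("raw_corpus.jsonl", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")

# Quick sanity check: every line parses and has a non-empty "text" field.
with open("raw_corpus.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
assert all(row.get("text") for row in rows)
print(f"{len(rows)} rows OK")
```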
3.3 Tokenizer
If the model directory does not contain tokenizer files, specify one separately in the preset. The EulerStack model's config.json contains tokenizer information.
"tokenizer": {"type": "hf", "pretrained": "gpt2", "add_bos": true, "add_eos": true}
-> Specify tokenizer: "gpt2" in the preset.
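To see which tokenizer a given model directory expects, you can inspect it directly. A minimal stdlib sketch (the helper function and the `demo_model` directory are illustrative, not part of EulerForge):

```python
import json
import os

def resolve_tokenizer(model_dir):
    """Return a tokenizer name for the preset, or None if the model
    directory already ships its own tokenizer files."""
    if os.path.exists(os.path.join(model_dir, "tokenizer_config.json")):
        return None  # tokenizer is bundled with the model -- no preset key needed
    with open(os.path.join(model_dir, "config.json"), encoding="utf-8") as f:
        config = json.load(f)
    # Per the example above, the config records the intended tokenizer
    # under a "tokenizer" block with a "pretrained" entry.
    return config.get("tokenizer", {}).get("pretrained")

# Demo against a throwaway directory containing only a config.json.
os.makedirs("demo_model", exist_ok=True)
with open("demo_model/config.json", "w", encoding="utf-8") as f:
    json.dump({"model_type": "eulerstack",
               "tokenizer": {"type": "hf", "pretrained": "gpt2"}}, f)

print(resolve_tokenizer("demo_model"))  # -> gpt2
```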
4. Writing a Preset
configs/presets/pretrain/eulerstack_hybrid_moe.yml:
# -- Device --
device: "cuda:0"          # cuda:0, cuda:1, cpu
# -- Model --
model_dir: "outputs/full_hybrid_moe"
trust_remote_code: true
# -- Tokenizer --
tokenizer: "gpt2"
# -- Data --
data:
  path: "data/dolma_10k.jsonl"
  text_column: "text"
  max_length: 2048
  packing: true           # packed chunking
# -- Training --
training:
  max_steps: 500
  batch_size: 2
  grad_accum_steps: 4     # effective batch = 2 × 4 = 8
  lr: 3.0e-4
  weight_decay: 0.1
  warmup_steps: 50
  max_grad_norm: 1.0
  log_steps: 10
  save_steps: 250
  dtype: "float32"        # Hybrid models (Hyena/RetNet FFT) require float32
  amp: false              # FFT ops don't support bf16 -> disable AMP
  seed: 42
Preset Key Descriptions
| Key | Description | Required |
|---|---|---|
| `device` | Training device (cuda:0, cuda:1, cpu) | Default: cuda |
| `model_dir` | HF model directory path | Required |
| `trust_remote_code` | Allow custom model classes | true for EulerStack models |
| `tokenizer` | HF tokenizer name or path | Required if not in model_dir |
| `data.path` | Raw text JSONL path | Required |
| `data.text_column` | Text key in JSONL | Default: text |
| `data.max_length` | Maximum sequence length | Default: 2048 |
| `data.packing` | Use packed chunking | Default: true |
| `training.max_steps` | Maximum training steps | Required |
| `training.batch_size` | Batch size | Default: 2 |
| `training.grad_accum_steps` | Gradient accumulation | Default: 1 |
| `training.lr` | Learning rate | Default: 3e-4 |
| `training.dtype` | Model load + training precision | Default: bfloat16 |
| `training.amp` | Whether to enable AMP | Default: auto based on dtype |
dtype and amp Selection Guide
| Model Type | dtype | amp | Reason |
|---|---|---|---|
| Pure Attention (simple) | bfloat16 | true (default) | Memory/speed benefits with bf16 AMP |
| Hybrid (Mamba/Hyena/RetNet) | float32 | false | FFT ops don't support bf16 |
| MoE + Attention | bfloat16 | true (default) | MoE routing is bf16 compatible |
| MoE + Hybrid | float32 | false | Contains Hyena/RetNet FFT |
Key point: `torch.fft.rfft` does not support bfloat16. Models containing EulerStack's Hyena or RetNet blocks must be configured with `dtype: float32` + `amp: false`.
Prohibited Keys
Since pretrain performs full parameter training, the following keys will cause a validation error if present:
- `injection` -- LoRA/MoE injection is exclusive to `eulerforge train`
- `moe` -- MoE auxiliary loss settings are exclusive to `eulerforge train`
- `backbone` -- Backbone adapters are exclusive to `eulerforge train`
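The rule is easy to state in code. A hypothetical sketch of the check (the function name and error message are illustrative; the real validator's interface may differ):

```python
# Hypothetical sketch of the pretrain preset validation rule: the three
# fine-tuning-only keys must be absent from a pretrain preset.
PROHIBITED_KEYS = {"injection", "moe", "backbone"}

def validate_pretrain_preset(preset: dict) -> None:
    present = PROHIBITED_KEYS & preset.keys()
    if present:
        raise ValueError(
            f"Keys {sorted(present)} are exclusive to 'eulerforge train' "
            "and are not allowed in a pretrain preset."
        )

# A clean preset passes silently; one with an injection block is rejected.
validate_pretrain_preset({"model_dir": "outputs/full_hybrid_moe",
                          "training": {"max_steps": 500}})
try:
    validate_pretrain_preset({"model_dir": "m",
                              "injection": {"strategy": "dense_lora"}})
except ValueError as e:
    print(e)
```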
5. Execution
5.1 Configuration Validation
eulerforge pretrain --preset configs/presets/pretrain/eulerstack_hybrid_moe.yml --validate-only
Output:
[Pretrain] Config validation passed.
[Pretrain] model_dir: /path/to/eulerstack/outputs/full_hybrid_moe
[Pretrain] data.path: data/dolma_10k.jsonl
[Pretrain] training.max_steps: 500
5.2 Running Training
eulerforge pretrain --preset configs/presets/pretrain/eulerstack_hybrid_moe.yml
Example output:
[Pretrain] Device: cuda
[Pretrain] Loading model from /path/to/eulerstack/outputs/full_hybrid_moe
[Pretrain] Model loaded: 1,932,345,344 params, 1,932,345,344 trainable, dtype=torch.float32
[Pretrain] Tokenizer: gpt2 (vocab=50257)
[Pretrain] Loaded 10000 texts from data/dolma_10k.jsonl
[Pretrain] Packed: 4,521,600 tokens → 2,207 chunks (max_length=2048)
[Pretrain] Starting training: max_steps=500, batch_size=2, grad_accum=4, lr=0.0003, warmup=50
[Pretrain] step=10/500 | loss=10.4523 | lr=6.00e-05 | 0.42 steps/s
[Pretrain] step=20/500 | loss=9.8712 | lr=1.20e-04 | 0.43 steps/s
...
[Pretrain] step=250/500 | loss=6.1234 | lr=2.85e-04 | 0.44 steps/s
[Pretrain] Checkpoint saved: outputs/pretrain_20260327_143000/checkpoint-250
...
[Pretrain] step=500/500 | loss=5.4321 | lr=0.00e+00 | 0.44 steps/s
[Pretrain] Training complete. 500 steps in 1187.3s. Final model saved to outputs/pretrain_20260327_143000/final
5.3 Configuration Overrides
# Change learning rate
eulerforge pretrain --preset ... --set training.lr=1e-4
# More steps
eulerforge pretrain --preset ... --set training.max_steps=2000
# Specify output directory
eulerforge pretrain --preset ... --output-dir outputs/my_pretrain
# Combine multiple overrides
eulerforge pretrain --preset ... \
--set training.max_steps=1000 \
--set training.lr=1e-4 \
--set data.max_length=1024
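A dotted `--set KEY=VALUE` override can be applied to a nested config with a small helper. A minimal sketch of the idea (not EulerForge's actual implementation; value parsing here is simplified to Python literals):

```python
import ast

def apply_override(config: dict, assignment: str) -> None:
    """Apply one 'a.b.c=value' override to a nested dict in place."""
    dotted, raw = assignment.split("=", 1)
    *parents, leaf = dotted.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    try:
        value = ast.literal_eval(raw)  # numbers, booleans, quoted strings...
    except (ValueError, SyntaxError):
        value = raw                    # fall back to a plain string
    node[leaf] = value

cfg = {"training": {"max_steps": 500, "lr": 3.0e-4}}
apply_override(cfg, "training.max_steps=2000")
apply_override(cfg, "training.lr=1e-4")
print(cfg["training"])  # {'max_steps': 2000, 'lr': 0.0001}
```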
6. Output Structure
outputs/pretrain_20260327_143000/
├── pretrain_config.json # Snapshot of the configuration used
├── metrics.jsonl # Per-step loss, lr, elapsed time
├── checkpoint-250/ # Intermediate checkpoint
│ ├── config.json
│ ├── model.safetensors
│ ├── tokenizer.json
│ └── tokenizer_config.json
└── final/ # Final model
├── config.json
├── model.safetensors
├── tokenizer.json
└── tokenizer_config.json
metrics.jsonl Format
{"step": 10, "loss": 10.4523, "lr": 6e-05, "elapsed_sec": 23.8}
{"step": 20, "loss": 9.8712, "lr": 0.00012, "elapsed_sec": 47.1}
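Because metrics.jsonl is plain JSONL, post-processing needs nothing beyond the standard library. A minimal sketch that extracts the loss curve (the inline sample mirrors the two lines above; in practice, read the real file):

```python
import io
import json

# Two sample lines in the metrics.jsonl format shown above.
sample = """\
{"step": 10, "loss": 10.4523, "lr": 6e-05, "elapsed_sec": 23.8}
{"step": 20, "loss": 9.8712, "lr": 0.00012, "elapsed_sec": 47.1}
"""

# Parse each line and pull out the loss curve for plotting or inspection.
records = [json.loads(line) for line in io.StringIO(sample)]
steps = [r["step"] for r in records]
losses = [r["loss"] for r in records]
print(steps, losses)  # [10, 20] [10.4523, 9.8712]
```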
7. Packed Chunking
Data processing in pretrain differs from fine-tuning. Raw text is concatenated and split into fixed-length chunks.
Text 1: "Hello world. This is a test."
Text 2: "Another document with more content."
Text 3: "Third piece of text..."
↓ Tokenize + insert EOS + concatenate
[tok1, tok2, ..., EOS, tok1, tok2, ..., EOS, tok1, tok2, ..., EOS]
↓ Split into max_length=2048 chunks
Chunk 1: [tok1, tok2, ..., tok2048] ← labels = input_ids (causal LM)
Chunk 2: [tok2049, tok2050, ..., tok4096]
Chunk 3: ...
(Remaining tokens less than max_length are discarded)
- No padding: Every token is a valid training signal
- EOS insertion: Marks document boundaries
- Remainder discarded: The last incomplete chunk is not used
Setting packing: false uses per-text tokenization with padding/truncation instead.
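The packing steps above can be sketched in a few lines. A minimal illustration with a toy stand-in tokenizer (real runs use the HF tokenizer, and EulerForge's implementation may differ in detail):

```python
def pack_chunks(texts, tokenize, eos_id, max_length):
    """Concatenate tokenized texts with EOS separators, then split into
    fixed-length chunks; the incomplete trailing chunk is discarded."""
    stream = []
    for text in texts:
        stream.extend(tokenize(text))
        stream.append(eos_id)  # mark the document boundary
    n_full = len(stream) // max_length
    return [stream[i * max_length:(i + 1) * max_length] for i in range(n_full)]

# Toy tokenizer: one id per whitespace-separated word (illustrative only).
vocab = {}
def toy_tokenize(text):
    return [vocab.setdefault(w, len(vocab)) for w in text.split()]

texts = ["Hello world. This is a test.",
         "Another document with more content.",
         "Third piece of text"]

# 18 tokens total (including EOS markers) -> two full chunks of 8,
# with the 2-token remainder discarded.
chunks = pack_chunks(texts, toy_tokenize, eos_id=-1, max_length=8)
print(len(chunks), chunks[0])
```

In the real trainer, each chunk's labels are simply its input_ids (causal LM), so no padding or loss masking is needed.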
8. Fine-Tuning After Pretraining
After pretrain completes, you can use the final/ directory as the model for eulerforge train.
# configs/presets/finetune_after_pretrain.yml
device: cuda:0
backbone: qwen3            # or the appropriate backbone
model_name: "outputs/pretrain_20260327_143000/final"
injection:
  strategy: dense_lora
  lora_r: 32
  lora_alpha: 64
  target_keywords: ["gate_proj", "up_proj", "down_proj"]
  attn_lora:
    enabled: true
    keywords: ["q_proj", "v_proj"]
training:
  type: sft
  phases:
    - step: 0
      trainable: ["lora", "attn_lora"]
      lr: 1.0e-5
  max_train_steps: 5000
eulerforge train --preset configs/presets/finetune_after_pretrain.yml \
--set data.path=data/sft_10k_raw.jsonl \
--set data.format=raw \
--set data.task=sft
Note: Custom EulerStack models (`model_type: eulerstack`) may not be directly compatible with `eulerforge train`'s BackboneAdapter. In such cases, you need to convert the pretrained model to standard HF format or add a dedicated backbone adapter.
9. Important Notes
Vocab Size Mismatch (Handled Automatically)
The EulerStack model's vocab_size and the tokenizer's vocab_size may differ. For example:
- Model: vocab_size=32000 (defined in EulerStack spec)
- Tokenizer: vocab_size=50257 (GPT-2)
If the tokenizer vocab is larger than the model vocab, token IDs exceed the embedding table range, causing a CUDA index out of bounds error.
Pretrain handles this automatically: If the tokenizer vocab > model vocab, it expands the embeddings with model.resize_token_embeddings(). Since this is scratch training, newly added embedding rows are randomly initialized and train normally.
[Pretrain] Tokenizer vocab (50257) > model vocab (32000). Resizing model embeddings to 50257.
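Conceptually, the resize appends randomly initialized rows to the embedding table. A minimal sketch with plain Python lists (the actual code calls `model.resize_token_embeddings()` on torch tensors; the init scale here is illustrative):

```python
import random

def resize_embeddings(embedding, new_vocab, dim, seed=42):
    """Grow an embedding table (list of row vectors) to new_vocab rows.
    New rows are randomly initialized, as in scratch training."""
    rng = random.Random(seed)
    while len(embedding) < new_vocab:
        embedding.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return embedding

# Model vocab (32000) is smaller than the GPT-2 tokenizer vocab (50257),
# so the table is expanded before training starts.
model_vocab, tokenizer_vocab, dim = 32000, 50257, 8
table = [[0.0] * dim for _ in range(model_vocab)]
if tokenizer_vocab > model_vocab:
    table = resize_embeddings(table, tokenizer_vocab, dim)
print(len(table))  # 50257
```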
Position IDs (Handled Automatically)
Some custom models (such as EulerStack's RetNet) may encounter a RuntimeError: view size is not compatible error when internally calling .view() on position_ids if the tensor is non-contiguous.
Pretrain handles this automatically: It explicitly generates contiguous position_ids on each forward call and passes them to the model.
GPU Memory
Scratch pretraining trains all parameters, so memory usage is high. It requires significantly more GPU memory than LoRA fine-tuning.
| Model Size | Estimated GPU Memory (bf16, batch=2, accum=4) |
|---|---|
| ~0.8B | ~8 GB |
| ~2B | ~16 GB |
| ~4B+ | Difficult on a single GPU -- recommend batch_size=1, increase grad_accum |
Learning Rate
Scratch pretraining uses a higher learning rate than fine-tuning.
| Training Method | Typical Learning Rate Range |
|---|---|
| LoRA fine-tuning | 1e-5 ~ 5e-5 |
| Scratch pretrain | 1e-4 ~ 6e-4 |
Data Volume
10k rows are suitable for architecture validation, but actual pretraining requires hundreds of thousands to millions of rows.
10. CLI Options
| Option | Description |
|---|---|
| `--preset PATH` | Pretrain YAML preset path (required) |
| `--set KEY=VALUE` | Configuration override (repeatable) |
| `--output-dir DIR` | Output directory |
| `--validate-only` | Validate configuration only, no training |
Related Documents
- CLI Reference -- pretrain section
- EulerStack Documentation -- Model assembly tool