
5. Prepare Training Data

⚠️ Scope warning — this is not EulerStack's main purpose

EulerStack is an Architecture Description Language (ADL) — it is not a training framework and it is not a data pipeline. The eulerstack.data.prepare module shown in this tutorial is a sanity helper for developers that verifies an architecture doesn't blow itself up. Nothing more.

Concretely, here is what this module does and does not do:

| EulerStack does | EulerStack does not do |
| --- | --- |
| Validate / compile YAML specs | Production data curation (dedup, quality filtering, PII removal, …) |
| Export HF PreTrainedModel directories | Multi-source mixing, sampling-weight tuning |
| Tokenize small samples for architecture regression checks | Streaming / sharding / distributed I/O |
| Cache tokenize_jsonl results on disk | Tokenizer training or vocabulary extension |

For real LLM projects, always use a dedicated data pipeline. This helper is fit only for "I added a new mixer and want a 20-step sanity run" scenarios.

HF models exported from EulerStack YAML slot cleanly into every tool below.

| Use case | Recommended tool | Why |
| --- | --- | --- |
| Large-scale pretraining corpus curation | dolma (AI2), datatrove (HF) | Production-grade dedup, quality filter, PII removal |
| General HF Datasets streaming / sharding | datasets + streaming=True | Standard multi-TB streaming, caching, mapping |
| Fine-tune / SFT data | Axolotl / LLaMA-Factory / torchtune | Chat templates, packing, multi-task sampling |
| RLHF / preference data | TRL, OpenRLHF | Formatting tailored to PPO / DPO / GRPO recipes |
| HPC distributed pretraining | MosaicML Composer, Megatron-LM, TorchTitan | Billions of tokens/GPU throughput, stream-from-S3 |

The output of any of these pipelines drops straight into a model loaded via AutoModelForCausalLM.from_pretrained("./my_eulerstack_export", trust_remote_code=True). EulerStack is the model-definition layer; it composes orthogonally with every tool above. See Tutorial 0: Where EulerStack Fits for the full positioning.

What this tutorial actually covers — an internal sanity helper

Everything below documents a minimal helper used only by this repository's own tests and sanity loops. It is not for production training. The reason it exists at all:

When you add a new mixer or primitive to EulerStack, you need a CPU-friendly regression test that checks "does this structure decrease loss at all in a few seconds?" — that is the only job of these helpers.

So the tokenize_jsonl / TokenizedDataset helpers described below:

- handle only small samples (thousands of documents, not production-scale corpora)
- favor simplicity and reproducibility over throughput
- cache results on disk so repeated test runs stay fast
- make no attempt at data quality: no dedup, no filtering, no PII removal

Read the rest of this tutorial with that constraint in mind.

Two Kinds of Data Source

The helpers accept two kinds of data source: local JSONL files and HuggingFace Datasets.

Local JSONL files

The simplest and most reproducible format. One JSON object per line, with a text field holding the raw document text.

{"text": "The quick brown fox jumps over the lazy dog.", "source": "example"}
{"text": "Another document with more text content.", "source": "example"}

The repository already bundles data/dolma_10k.jsonl, which contains 10,000 English documents extracted from the Dolma corpus (Soldaini et al., 2024). It needs no download and serves as a stable test dataset. Most of the sanity and integration tests use this file as the default data source.
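The read side of this format is trivial; a minimal sketch of the read-and-extract step in plain Python (read_texts is an illustrative name, not part of the eulerstack API):

```python
import json

def read_texts(jsonl_path, num_rows=None):
    """Return the "text" field of each line, stopping after num_rows documents."""
    texts = []
    with open(jsonl_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if num_rows is not None and i >= num_rows:
                break
            record = json.loads(line)
            texts.append(record["text"])
    return texts
```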

HuggingFace Datasets

When you want standard benchmarks (wikitext, c4, etc.), pull them from HF Datasets.

from eulerstack.data.prepare import download_and_tokenize

dataset = download_and_tokenize(
    tokenizer_name="gpt2",
    max_seq_len=512,
    dataset_name="wikitext",
    dataset_config="wikitext-2-raw-v1",
    num_rows=10000,
)

Requires internet access. Downloads go to the HF cache (~/.cache/huggingface/ by default).

Using the Tokenization Pipeline

The most common call to tokenize a local JSONL file is:

from eulerstack.data.prepare import tokenize_jsonl

dataset = tokenize_jsonl(
    jsonl_path="data/dolma_10k.jsonl",
    tokenizer_name="gpt2",
    max_seq_len=512,
    num_rows=1000,      # use only the first 1,000 documents
    cache_dir="data/.cache",
)
print(f"Chunks: {len(dataset)}, Seq len: 512")

Argument meanings:

- jsonl_path — path to the local JSONL file; each line must carry a text field
- tokenizer_name — HF tokenizer identifier used for tokenization (here gpt2)
- max_seq_len — chunk length after packing; every emitted sequence has exactly this length
- num_rows — read only the first N documents, keeping sanity runs fast
- cache_dir — directory where the tokenized .pt / .meta.json cache files are written

What happens inside

tokenize_jsonl runs these steps internally:

  1. Read the JSONL and extract the text field of each line
  2. Tokenize each document using the specified HF tokenizer. By default BOS/EOS special tokens are added at each document boundary
  3. Concatenate all tokens into one long stream
  4. Chunk into max_seq_len-sized pieces. No padding, no truncation — any leftover partial chunk at the end is dropped. This is the standard "packed chunking" technique that maximizes GPU utilization
  5. Save to cache as a .pt tensor plus a .meta.json file. A subsequent call with the same arguments skips tokenization entirely
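Steps 3–4 (concatenate, then chunk) are the heart of packed chunking and fit in a few lines; pack_chunks below is an illustrative sketch, not the module's actual function:

```python
def pack_chunks(token_streams, max_seq_len):
    """Concatenate per-document token lists into one stream, then cut it
    into max_seq_len-sized chunks; any leftover partial chunk is dropped."""
    stream = [tok for doc in token_streams for tok in doc]
    n_chunks = len(stream) // max_seq_len
    return [stream[i * max_seq_len:(i + 1) * max_seq_len]
            for i in range(n_chunks)]
```

Because documents are packed back to back, every batch row is completely full — no pad tokens waste compute.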

Cache key composition

Whether a cached dataset is reused depends on a hash of these inputs:

- jsonl_path
- tokenizer_name
- max_seq_len
- num_rows

If any input changes, the cache becomes invalid and tokenization re-runs. Cache files are located at:

data/.cache/tokenized_<hash>.pt
data/.cache/tokenized_<hash>.meta.json

The .meta.json file records the arguments used, so you can trace which parameters produced any given cache file later.
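The idea is easy to reproduce; a sketch of argument-hashing under the assumption that the key covers the four arguments above (the helper's exact key composition may differ):

```python
import hashlib
import json

def cache_key(jsonl_path, tokenizer_name, max_seq_len, num_rows):
    """Derive a stable hash from the tokenization arguments; changing
    any one of them yields a different key and invalidates the cache."""
    payload = json.dumps(
        {"jsonl_path": jsonl_path, "tokenizer_name": tokenizer_name,
         "max_seq_len": max_seq_len, "num_rows": num_rows},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```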

Vocab-Size Clamping

This step is required whenever the tokenizer's vocab size differs from the model's vocab size. For example, GPT-2's tokenizer has 50,257 token IDs, but EulerStack default presets use vocab_size=32000. Feeding raw tokenized output directly into the model causes index out of bounds errors.

The fix is to wrap the dataset with TokenizedDataset and clamp to the model's vocab size.

from eulerstack.data.prepare import TokenizedDataset

# Raw tokens from cache
raw_dataset = tokenize_jsonl(...)

# Clamp to the model's vocab_size
safe_dataset = TokenizedDataset(raw_dataset.input_ids, vocab_size=32000)

TokenizedDataset clamps input_ids and labels into [0, vocab_size) at __getitem__ time. Because the underlying data is not modified, caches remain valid and different models can share the same tokenized cache.
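The clamping rule itself is one operation per token; a torch-free sketch of what happens at __getitem__ time (the real class operates on tensors, and clamp_ids is a hypothetical name):

```python
def clamp_ids(input_ids, vocab_size):
    """Map every token id into [0, vocab_size); ids at or above the limit
    are clamped to vocab_size - 1, so no index can overflow the embedding."""
    return [max(0, min(tok, vocab_size - 1)) for tok in input_ids]
```

Note the trade-off: clamping collapses all out-of-range ids onto one token, which is fine for a structural smoke test but would corrupt real training data.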

Plugging into the sanity loop (internal test use only)

Tokenized samples plug into the sanity loop from the next tutorial — also internal-test only. Once more: this is not production training.

from eulerstack.training.sanity import run_e2e_training, build_model_from_ir
from eulerstack.ir.normalizer import normalize_to_ir
from eulerstack.data.prepare import tokenize_jsonl, TokenizedDataset
import yaml, torch

# Load preset and produce IR
with open("configs/presets/arch_beginner_llama.yml") as f:
    ir = normalize_to_ir(yaml.safe_load(f))

# Tokenize + vocab clamp — sample-scale for this repo's CI
raw = tokenize_jsonl(
    jsonl_path="data/dolma_10k.jsonl",
    tokenizer_name="gpt2",
    max_seq_len=512,
    num_rows=1000,
    cache_dir="data/.cache",
)
dataset = TokenizedDataset(raw.input_ids, vocab_size=ir.model.vocab_size)

# Build model and run a 20-step sanity train (structural smoke test only)
model = build_model_from_ir(ir, device="cuda:0", dtype=torch.bfloat16)
result = run_e2e_training(
    model, dataset, batch_size=2, max_steps=20, device="cuda:0",
)
print(result.summary())

A 20-step loss trace only tells you "the code didn't break." Actual training quality, downstream scores, reward shaping, and multi-GPU scaling are measured with the dedicated frameworks in the "Recommended tool" table above.

Pre-flight checklist (for real pipelines on external tools)

When graduating to a dedicated data pipeline, verify the following. Note that EulerStack is not the tool that handles them:

- Deduplication and quality filtering of the corpus
- PII removal
- Multi-source mixing and sampling-weight tuning
- Streaming / sharding / distributed I/O for large corpora
- Tokenizer training or vocabulary extension, if the default vocabulary is insufficient

All of these live outside EulerStack's scope — the ADL layer guarantees a consistent model no matter how the data arrives.

Next steps

Keep learning inside this repo

Graduating to real training