
5. Prepare Training Data

⚠️ Scope warning — this is not EulerStack's main purpose

EulerStack is an Architecture Description Language (ADL) — it is not a training framework and it is not a data pipeline. The eulerstack.data.prepare module shown in this tutorial is a sanity helper for developers that verifies an architecture doesn't blow itself up. Nothing more.

Concretely, here is what this module does and does not do:

| EulerStack does | EulerStack does not do |
| --- | --- |
| Validate / compile YAML specs | Production data curation (dedup, quality filtering, PII removal, …) |
| Export HF PreTrainedModel directories | Multi-source mixing, sampling-weight tuning |
| Tokenize small samples for architecture regression checks | Streaming / sharding / distributed I/O |
| Cache tokenize_jsonl results on disk | Tokenizer training or vocabulary extension |

For real LLM projects, always use a dedicated data pipeline. This helper is fit only for "I added a new mixer and want a 20-step sanity run" scenarios.

HF models exported from EulerStack YAML slot cleanly into every tool below.

| Use case | Recommended tool | Why |
| --- | --- | --- |
| Large-scale pretraining corpus curation | dolma (AI2), datatrove (HF) | Production-grade dedup, quality filter, PII removal |
| General HF Datasets streaming / sharding | datasets + streaming=True | Standard multi-TB streaming, caching, mapping |
| Fine-tune / SFT data | Axolotl / LLaMA-Factory / torchtune | Chat templates, packing, multi-task sampling |
| RLHF / preference data | TRL, OpenRLHF | Formatting tailored to PPO / DPO / GRPO recipes |
| HPC distributed pretraining | MosaicML Composer, Megatron-LM, TorchTitan | Billions of tokens/GPU throughput, stream-from-S3 |

The output of any of these pipelines drops straight into a model loaded via AutoModelForCausalLM.from_pretrained("./my_eulerstack_export", trust_remote_code=True). EulerStack is the model-definition layer; it composes orthogonally with every tool above. See Tutorial 0: Where EulerStack Fits for the full positioning.

What this tutorial actually covers — an internal sanity helper

Everything below documents a minimal helper used only by this repository's own tests and sanity loops. It is not for production training. The reason it exists at all:

When you add a new mixer or primitive to EulerStack, you need a CPU-friendly regression test that checks "does this structure decrease loss at all in a few seconds?" — that is the only job of these helpers.

So the tokenize_jsonl / TokenizedDataset helpers described below:

- handle only small samples (thousands of documents, not production-scale corpora)
- favor simplicity and reproducibility over throughput
- cache results on disk so repeated test runs stay fast
- make no attempt at data quality: no dedup, no filtering, no PII removal

Read the rest of this tutorial with that constraint in mind.

Two Kinds of Data Source

The helpers accept two kinds of data source: local JSONL files and HuggingFace Datasets.

Local JSONL files

The simplest and most reproducible format. One JSON object per line, with a text field holding the raw document text.

{"text": "The quick brown fox jumps over the lazy dog.", "source": "example"}
{"text": "Another document with more text content.", "source": "example"}

The repository already bundles data/dolma_10k.jsonl, which contains 10,000 English documents extracted from the Dolma corpus (Soldaini et al., 2024). It needs no download and serves as a stable test dataset. Most of the sanity and integration tests use this file as the default data source.
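The read side of this format is trivial; a minimal sketch of the read-and-extract step in plain Python (read_texts is an illustrative name, not part of the eulerstack API):

```python
import json

def read_texts(jsonl_path, num_rows=None):
    """Return the "text" field of each line, stopping after num_rows documents."""
    texts = []
    with open(jsonl_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if num_rows is not None and i >= num_rows:
                break
            record = json.loads(line)
            texts.append(record["text"])
    return texts
```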

HuggingFace Datasets

When you want standard benchmarks (wikitext, c4, etc.), pull them from HF Datasets.

from eulerstack.data.prepare import download_and_tokenize

dataset = download_and_tokenize(
    tokenizer_name="gpt2",
    max_seq_len=512,
    dataset_name="wikitext",
    dataset_config="wikitext-2-raw-v1",
    num_rows=10000,
)

Requires internet access. Downloads go to the HF cache (~/.cache/huggingface/ by default).

Using the Tokenization Pipeline

The most common call to tokenize a local JSONL file is:

from eulerstack.data.prepare import tokenize_jsonl

dataset = tokenize_jsonl(
    jsonl_path="data/dolma_10k.jsonl",
    tokenizer_name="gpt2",
    max_seq_len=512,
    num_rows=1000,      # use only the first 1,000 documents
    cache_dir="data/.cache",
)
print(f"Chunks: {len(dataset)}, Seq len: 512")

Argument meanings:

- jsonl_path — path to the local JSONL file; each line must carry a text field
- tokenizer_name — HF tokenizer identifier used for tokenization (here gpt2)
- max_seq_len — chunk length after packing; every emitted sequence has exactly this length
- num_rows — read only the first N documents, keeping sanity runs fast
- cache_dir — directory where the tokenized .pt / .meta.json cache files are written

What happens inside

tokenize_jsonl runs these steps internally:

  1. Read the JSONL and extract the text field of each line
  2. Tokenize each document using the specified HF tokenizer. By default BOS/EOS special tokens are added at each document boundary
  3. Concatenate all tokens into one long stream
  4. Chunk into max_seq_len-sized pieces. No padding, no truncation — any leftover partial chunk at the end is dropped. This is the standard "packed chunking" technique that maximizes GPU utilization
  5. Save to cache as a .pt tensor plus a .meta.json file. A subsequent call with the same arguments skips tokenization entirely
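Steps 3–4 (concatenate, then chunk) are the heart of packed chunking and fit in a few lines; pack_chunks below is an illustrative sketch, not the module's actual function:

```python
def pack_chunks(token_streams, max_seq_len):
    """Concatenate per-document token lists into one stream, then cut it
    into max_seq_len-sized chunks; any leftover partial chunk is dropped."""
    stream = [tok for doc in token_streams for tok in doc]
    n_chunks = len(stream) // max_seq_len
    return [stream[i * max_seq_len:(i + 1) * max_seq_len]
            for i in range(n_chunks)]
```

Because documents are packed back to back, every batch row is completely full — no pad tokens waste compute.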

Cache key composition

Whether a cached dataset is reused depends on a hash of these inputs:

- jsonl_path
- tokenizer_name
- max_seq_len
- num_rows

If any input changes, the cache becomes invalid and tokenization re-runs. Cache files are located at:

data/.cache/tokenized_<hash>.pt
data/.cache/tokenized_<hash>.meta.json

The .meta.json file records the arguments used, so you can trace which parameters produced any given cache file later.
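The idea is easy to reproduce; a sketch of argument-hashing under the assumption that the key covers the four arguments above (the helper's exact key composition may differ):

```python
import hashlib
import json

def cache_key(jsonl_path, tokenizer_name, max_seq_len, num_rows):
    """Derive a stable hash from the tokenization arguments; changing
    any one of them yields a different key and invalidates the cache."""
    payload = json.dumps(
        {"jsonl_path": jsonl_path, "tokenizer_name": tokenizer_name,
         "max_seq_len": max_seq_len, "num_rows": num_rows},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```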

Vocab-Size Clamping

This step is required whenever the tokenizer's vocab size differs from the model's vocab size. For example, GPT-2's tokenizer has 50,257 token IDs, but EulerStack default presets use vocab_size=32000. Feeding raw tokenized output directly into the model causes index out of bounds errors.

The fix is to wrap the dataset with TokenizedDataset and clamp to the model's vocab size.

from eulerstack.data.prepare import TokenizedDataset

# Raw tokens from cache
raw_dataset = tokenize_jsonl(...)

# Clamp to the model's vocab_size
safe_dataset = TokenizedDataset(raw_dataset.input_ids, vocab_size=32000)

TokenizedDataset clamps input_ids and labels into [0, vocab_size) at __getitem__ time. Because the underlying data is not modified, caches remain valid and different models can share the same tokenized cache.
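The clamping rule itself is one operation per token; a torch-free sketch of what happens at __getitem__ time (the real class operates on tensors, and clamp_ids is a hypothetical name):

```python
def clamp_ids(input_ids, vocab_size):
    """Map every token id into [0, vocab_size); ids at or above the limit
    are clamped to vocab_size - 1, so no index can overflow the embedding."""
    return [max(0, min(tok, vocab_size - 1)) for tok in input_ids]
```

Note the trade-off: clamping collapses all out-of-range ids onto one token, which is fine for a structural smoke test but would corrupt real training data.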

Plugging into the sanity loop (internal test use only)

Tokenized samples plug into the sanity loop from the next tutorial — also internal-test only. Once more: this is not production training.

from eulerstack.training.sanity import run_e2e_training, build_model_from_ir
from eulerstack.ir.normalizer import normalize_to_ir
from eulerstack.data.prepare import tokenize_jsonl, TokenizedDataset
import yaml, torch

# Load preset and produce IR
with open("configs/presets/arch_beginner_llama.yml") as f:
    ir = normalize_to_ir(yaml.safe_load(f))

# Tokenize + vocab clamp — sample-scale for this repo's CI
raw = tokenize_jsonl(
    jsonl_path="data/dolma_10k.jsonl",
    tokenizer_name="gpt2",
    max_seq_len=512,
    num_rows=1000,
    cache_dir="data/.cache",
)
dataset = TokenizedDataset(raw.input_ids, vocab_size=ir.model.vocab_size)

# Build model and run a 20-step sanity train (structural smoke test only)
model = build_model_from_ir(ir, device="cuda:0", dtype=torch.bfloat16)
result = run_e2e_training(
    model, dataset, batch_size=2, max_steps=20, device="cuda:0",
)
print(result.summary())

A 20-step loss trace only tells you "the code didn't break." Actual training quality, downstream scores, reward shaping, and multi-GPU scaling are measured with the dedicated frameworks in the "Recommended tool" table above.

Pre-flight checklist (for real pipelines on external tools)

When graduating to a dedicated data pipeline, verify the following. Note that EulerStack is not the tool that handles them:

- Deduplication and quality filtering of the corpus
- PII removal
- Multi-source mixing and sampling-weight tuning
- Streaming / sharding / distributed I/O for large corpora
- Tokenizer training or vocabulary extension, if the default vocabulary is insufficient

All of these live outside EulerStack's scope — the ADL layer guarantees a consistent model no matter how the data arrives.

Next steps

Keep learning inside this repo

Graduating to real training