0. Data Preprocessing
Run this tutorial first. Converts raw data in the
data/directory to EulerForge standard raw JSONL format. All subsequent tutorials use the converted files withdata.format=raw.
1. EulerForge Standard Raw Format
EulerForge supports 3 types of standard raw JSONL:
| task | Standard raw schema | Purpose |
|---|---|---|
sft |
{"prompt": "...", "response": "..."} or {"text": "..."} |
SFT, PPO |
preference |
{"chosen": "...", "rejected": "..."} |
Preference (no prompt) |
prompted_preference |
{"prompt": "...", "chosen": "...", "rejected": "..."} |
DPO, ORPO, RM |
When using data.format=raw, the data is automatically tokenized (converted to processed format) at the start of training.
2. Provided Data Files
The data/ directory contains original files and already-converted _raw files.
File List
| File | Description | Size |
|---|---|---|
sft_10k.jsonl |
SFT original (nested messages structure) | 10k |
sft_10k_raw.jsonl |
Standard SFT raw (converted) | 10k |
sft_1k.jsonl |
SFT original (for TDD/testing) | 1k |
sft_1k_bench_raw.jsonl |
SFT bench data (converted) | 1k |
dpo_10k.jsonl |
DPO original (nested structure) | 10k |
dpo_10k_raw.jsonl |
Standard prompted_preference raw (converted) | 10k |
dpo_1k.jsonl |
DPO original (for TDD/testing) | 1k |
dpo_1k_bench_raw.jsonl |
DPO bench data (converted) | 1k |
dpo_10k.jsonl |
DPO preference data (already close to standard) | 10k |
dpo_10k_raw.jsonl |
Standard prompted_preference raw (converted) | 10k |
dpo_1k.jsonl |
DPO preference data (for TDD/testing) | 1k |
dpo_1k_bench_raw.jsonl |
SHP bench data (converted) | 1k |
Naming Convention
- Original (
*.jsonl): Pre-conversion data. May have nested structures _raw(*_raw.jsonl): Converted to standard columns (prompt,response,chosen,rejected)_bench_raw(*_bench_raw.jsonl): Foreulerforge benchtesting (converted)1k: For TDD/development (1k rows). May cause errors during training due to insufficient data10k: Tutorial/production baseline (10k rows)
All tutorials assume 10k data. Since
_rawfiles are already converted, you can skip the convert step in most tutorials.
3. Data Conversion (eulerforge convert)
The core philosophy of eulerforge convert is --map-based mapping.
Built-in recipes (--recipe) are convenience features for commonly used structures.
Field Exploration (--print-sample-flat)
Inspect the field structure of the input file before conversion:
eulerforge convert --task sft --input data/sft_10k.jsonl --print-sample-flat
# Example output:
# --- row 0 flat keys ---
# 'json_record.messages' (list) [{'role': 'system', ...}]
(A) sft_10k.jsonl to Standard SFT Raw
sft_10k.jsonl has a messages array structure. Choose one of two methods:
Method 1: Recipe mode (messages array)
eulerforge convert \
--task sft \
--input data/sft_10k.jsonl \
--output data/sft_10k_raw.jsonl \
--recipe sft_messages \
--messages-expr json_record.messages
Method 2: --map mode (custom flat fields)
If your custom data has flat fields like instruction, output:
eulerforge convert \
--task sft \
--input data/custom.jsonl \
--output data/out_raw.jsonl \
--map prompt=instruction \
--map response=output
Validation:
head -1 data/sft_10k_raw.jsonl | python3 -c "import sys,json; row=json.loads(sys.stdin.readline()); print(list(row.keys()))"
# Output: ['prompt', 'response']
(B) dpo_10k.jsonl to Standard prompted_preference Raw
dpo_10k.jsonl has a nested {instruction.value, chosen.value, rejected.value} structure:
Method 1: --map mode (automatic dot-path flatten)
eulerforge convert \
--task prompted_preference \
--input data/dpo_10k.jsonl \
--output data/dpo_10k_raw.jsonl \
--map prompt=instruction.value \
--map chosen=chosen.value \
--map rejected=rejected.value
Method 2: Recipe mode
eulerforge convert \
--task prompted_preference \
--input data/dpo_10k.jsonl \
--output data/dpo_10k_raw.jsonl \
--recipe dpo_nested_v1
Validation:
head -1 data/dpo_10k_raw.jsonl | python3 -c "import sys,json; row=json.loads(sys.stdin.readline()); print(list(row.keys()))"
# Output: ['prompt', 'chosen', 'rejected']
(C) dpo_10k.jsonl to Standard prompted_preference Raw
dpo_10k.jsonl already has a near-standard {prompt, chosen, rejected} structure. The converted dpo_10k_raw.jsonl is already provided. Schema validation:
eulerforge convert --task prompted_preference --input data/dpo_10k_raw.jsonl --validate
# Output: [Validate] 10000/10000 valid rows (0 invalid)
Multi-CPU Conversion
For large datasets, use --num-proc for parallel processing:
eulerforge convert \
--task sft \
--input data/sft_10k.jsonl \
--output data/sft_10k_raw.jsonl \
--recipe sft_messages \
--messages-expr json_record.messages \
--num-proc 4 \
--overwrite
4. Conversion Results Summary
| Converted File | task | Used in Tutorials |
|---|---|---|
data/sft_10k_raw.jsonl |
sft |
01-04 (SFT strategies), 08 (PPO) |
data/dpo_10k_raw.jsonl |
prompted_preference |
05 (DPO) |
data/dpo_10k_raw.jsonl |
prompted_preference |
06 (ORPO), 07 (RM) |
5. Using Raw Data in Training
CLI --set Overrides
# SFT training (raw)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
--set data.format=raw \
--set data.task=sft \
--set data.path=data/sft_10k_raw.jsonl \
--set data.max_length=512
# DPO training (raw)
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
--set data.format=raw \
--set data.task=prompted_preference \
--set data.path=data/dpo_10k_raw.jsonl \
--set data.max_length=512
# ORPO/RM training (raw)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml \
--set data.format=raw \
--set data.task=prompted_preference \
--set data.path=data/dpo_10k_raw.jsonl \
--set data.max_length=256
YAML Configuration File
data:
format: raw
task: sft
path: data/sft_10k_raw.jsonl
max_length: 512
# cache_dir: defaults to outputs/.cache (shared across all runs, override by specifying explicitly)
# num_proc: defaults to 50% of CPU cores (override by specifying explicitly)
- Shared cache: The default cache location is
outputs/.cache/. Even with different run directories (outputs/run_YYYYMMDD_HHMMSS/), tokenization is reused if the data and settings are identical. - Cache filename convention: Generated in the format
{input_filename}_{task}_{model_name}_len{max_length}.jsonl. For example:sft_10k_raw_sft_qwen3.5-0.8b-base_len512.jsonl. You can immediately identify which data, model, and settings produced the preprocessed result from the filename alone. - Multi-core: If
data.num_procis omitted, 50% of CPU cores are automatically used. You can force single-core withnum_proc: 1. - Column name mapping is available via
schema(defaults:prompt,response,chosen,rejected)
6. Offline Preprocessing (eulerforge preprocess)
To tokenize in advance instead of using automatic preprocessing:
# SFT (omitting --num-proc automatically uses 50% of CPU cores)
eulerforge preprocess \
--task sft \
--input data/sft_10k_raw.jsonl \
--output data/sft_10k_processed.jsonl \
--model-name Qwen/Qwen3.5-0.8B-Base
# Prompted-Preference
eulerforge preprocess \
--task prompted_preference \
--input data/dpo_10k_raw.jsonl \
--output data/dpo_10k_processed.jsonl \
--model-name Qwen/Qwen3.5-0.8B-Base \
--max-length 256
# Explicitly specify the number of workers
eulerforge preprocess \
--task sft --input data/sft_10k_raw.jsonl --output data/sft_processed.jsonl \
--model-name Qwen/Qwen3.5-0.8B-Base --num-proc 8
When using --num-proc > 1, EulerForge automatically sets TOKENIZERS_PARALLELISM=false. The default is 50% of CPU cores; to force single-core, specify --num-proc 1.
7. Labels Masking Policy
| Mode | Labels Composition |
|---|---|
| SFT text-only | labels = input_ids (full causal LM) |
| SFT prompt/response | Prompt tokens -100, loss only on response |
| Preference | Full sequence labels for each chosen/rejected |
| Prompted-Preference | Prompt tokens -100, loss/logp only on completion |
8. Common Errors and Solutions
Data: processed dataset row 0 missing 'input_ids'
Data: processed dataset row 0 missing 'input_ids'
Fix: Run `eulerforge preprocess ...` or set data.format=raw with schema mapping
See: docs/tutorials/00_data_preprocessing.md
Cause: Raw JSONL was specified as processed, or the file has not been preprocessed
Solution: Change to data.format=raw or preprocess with eulerforge preprocess
Data Config: data.task is required for raw format
Cause: data.format=raw is set but data.task is not specified
Solution: Set data.task to one of sft, preference, or prompted_preference
Data Config: data.task='sft' is incompatible with training.type='dpo'
Cause: data.task and training.type are incompatible. For example: specifying SFT data (data.task=sft) with a DPO preset (training.type=dpo)
Solution: Override the training type as well with --set training.type=sft, or change data.task to match the training type
| data.task | Compatible training.type |
|---|---|
sft |
sft, ppo |
preference |
dpo, orpo, rm |
prompted_preference |
dpo, orpo, rm |
Data: raw sft data row 0 missing column 'text'
Cause: The data does not have a text column (when using prompt/response format)
Solution: Columns are auto-detected. If prompt+response columns exist, it works correctly