0. Data Preprocessing

Run this tutorial first. It converts the raw data in the data/ directory into EulerForge's standard raw JSONL format. All subsequent tutorials use the converted files with data.format=raw.


1. EulerForge Standard Raw Format

EulerForge supports three types of standard raw JSONL:

| task | Standard raw schema | Purpose |
| --- | --- | --- |
| sft | `{"prompt": "...", "response": "..."}` or `{"text": "..."}` | SFT, PPO |
| preference | `{"chosen": "...", "rejected": "..."}` | Preference (no prompt) |
| prompted_preference | `{"prompt": "...", "chosen": "...", "rejected": "..."}` | DPO, ORPO, RM |

When using data.format=raw, the data is automatically tokenized (converted to processed format) at the start of training.
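
For reference, here is an illustrative sketch (not EulerForge's actual validator) of what rows in each standard raw schema look like and how the schema could be checked:

```python
import json

# Required keys per task; the "sft" task accepts either schema variant.
SCHEMAS = {
    "sft": [{"prompt", "response"}, {"text"}],
    "preference": [{"chosen", "rejected"}],
    "prompted_preference": [{"prompt", "chosen", "rejected"}],
}

def is_valid_row(task: str, row: dict) -> bool:
    # A row is valid if it contains all keys of at least one schema variant.
    return any(keys <= row.keys() for keys in SCHEMAS[task])

rows = [
    '{"prompt": "2+2?", "response": "4"}',
    '{"text": "plain causal-LM text"}',
    '{"prompt": "2+2?", "chosen": "4", "rejected": "5"}',
]
print(is_valid_row("sft", json.loads(rows[0])))                  # True
print(is_valid_row("prompted_preference", json.loads(rows[2])))  # True
```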


2. Provided Data Files

The data/ directory contains original files and already-converted _raw files.

File List

| File | Description | Size |
| --- | --- | --- |
| sft_10k.jsonl | SFT original (nested messages structure) | 10k |
| sft_10k_raw.jsonl | Standard sft raw (converted) | 10k |
| sft_1k.jsonl | SFT original (for TDD/testing) | 1k |
| sft_1k_bench_raw.jsonl | SFT bench data (converted) | 1k |
| dpo_10k.jsonl | DPO original (nested structure) | 10k |
| dpo_10k_raw.jsonl | Standard prompted_preference raw (converted) | 10k |
| dpo_1k.jsonl | DPO original (for TDD/testing) | 1k |
| dpo_1k_bench_raw.jsonl | DPO bench data (converted) | 1k |

Naming Convention

Files with the `_raw` suffix have already been converted to the standard raw schema, and `1k` variants are small subsets for TDD/testing. All tutorials assume the 10k data; since the `_raw` files are provided, you can skip the convert step in most tutorials.


3. Data Conversion (eulerforge convert)

The core philosophy of eulerforge convert is explicit field mapping via --map. Built-in recipes (--recipe) are convenience shortcuts for commonly used input structures.

Field Exploration (--print-sample-flat)

Inspect the field structure of the input file before conversion:

eulerforge convert --task sft --input data/sft_10k.jsonl --print-sample-flat
# Example output:
# --- row 0 flat keys ---
#   'json_record.messages'                          (list)  [{'role': 'system', ...}]
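
Conceptually, the flat view is a recursive dot-path flatten of each record. A hypothetical sketch of the idea (not EulerForge's implementation):

```python
def flatten(obj, prefix=""):
    """Yield (dot_path, value) pairs for a nested dict.
    Lists are kept as leaf values, matching the '(list)' entries
    shown by --print-sample-flat."""
    if isinstance(obj, dict) and obj:
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix, obj

row = {"json_record": {"messages": [{"role": "system", "content": "hi"}]}}
for path, value in flatten(row):
    print(path, type(value).__name__)  # json_record.messages list
```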

(A) sft_10k.jsonl to Standard SFT Raw

sft_10k.jsonl nests a messages array under json_record. Choose one of two methods:

Method 1: Recipe mode (messages array)

eulerforge convert \
    --task sft \
    --input data/sft_10k.jsonl \
    --output data/sft_10k_raw.jsonl \
    --recipe sft_messages \
    --messages-expr json_record.messages

Method 2: --map mode (custom flat fields)

If your custom data has flat fields like instruction, output:

eulerforge convert \
    --task sft \
    --input data/custom.jsonl \
    --output data/out_raw.jsonl \
    --map prompt=instruction \
    --map response=output

Validation:

head -1 data/sft_10k_raw.jsonl | python3 -c "import sys,json; row=json.loads(sys.stdin.readline()); print(list(row.keys()))"
# Output: ['prompt', 'response']

(B) dpo_10k.jsonl to Standard prompted_preference Raw

dpo_10k.jsonl has a nested {instruction.value, chosen.value, rejected.value} structure:

Method 1: --map mode (automatic dot-path flatten)

eulerforge convert \
    --task prompted_preference \
    --input data/dpo_10k.jsonl \
    --output data/dpo_10k_raw.jsonl \
    --map prompt=instruction.value \
    --map chosen=chosen.value \
    --map rejected=rejected.value
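
Conceptually, each --map target=dot.path resolves the dot path against the nested record and assigns the result to the target column. A hedged Python sketch of that idea (not the CLI's internals):

```python
def get_path(row: dict, path: str):
    # Resolve a dot path like "instruction.value" against a nested dict.
    for key in path.split("."):
        row = row[key]
    return row

# The three --map arguments from the command above, as a dict.
mapping = {"prompt": "instruction.value",
           "chosen": "chosen.value",
           "rejected": "rejected.value"}

src = {"instruction": {"value": "Summarize the article."},
       "chosen": {"value": "Good answer"},
       "rejected": {"value": "Bad answer"}}

out = {target: get_path(src, path) for target, path in mapping.items()}
print(sorted(out))  # ['chosen', 'prompt', 'rejected']
```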

Method 2: Recipe mode

eulerforge convert \
    --task prompted_preference \
    --input data/dpo_10k.jsonl \
    --output data/dpo_10k_raw.jsonl \
    --recipe dpo_nested_v1

Validation:

head -1 data/dpo_10k_raw.jsonl | python3 -c "import sys,json; row=json.loads(sys.stdin.readline()); print(list(row.keys()))"
# Output: ['prompt', 'chosen', 'rejected']

(C) Validating a Converted File (--validate)

The provided dpo_10k_raw.jsonl already follows the standard {prompt, chosen, rejected} schema, so no further mapping is needed. Verify it with schema validation:

eulerforge convert --task prompted_preference --input data/dpo_10k_raw.jsonl --validate
# Output: [Validate] 10000/10000 valid rows (0 invalid)

Multi-CPU Conversion

For large datasets, use --num-proc for parallel processing:

eulerforge convert \
    --task sft \
    --input data/sft_10k.jsonl \
    --output data/sft_10k_raw.jsonl \
    --recipe sft_messages \
    --messages-expr json_record.messages \
    --num-proc 4 \
    --overwrite

4. Conversion Results Summary

| Converted File | task | Used in Tutorials |
| --- | --- | --- |
| data/sft_10k_raw.jsonl | sft | 01-04 (SFT strategies), 08 (PPO) |
| data/dpo_10k_raw.jsonl | prompted_preference | 05 (DPO), 06 (ORPO), 07 (RM) |

5. Using Raw Data in Training

CLI --set Overrides

# SFT training (raw)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
    --set data.format=raw \
    --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl \
    --set data.max_length=512

# DPO training (raw)
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
    --set data.format=raw \
    --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl \
    --set data.max_length=512

# ORPO/RM training (raw)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml \
    --set data.format=raw \
    --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl \
    --set data.max_length=256

YAML Configuration File

data:
  format: raw
  task: sft
  path: data/sft_10k_raw.jsonl
  max_length: 512
  # cache_dir: defaults to outputs/.cache (shared across all runs, override by specifying explicitly)
  # num_proc: defaults to 50% of CPU cores (override by specifying explicitly)

6. Offline Preprocessing (eulerforge preprocess)

To tokenize in advance instead of using automatic preprocessing:

# SFT (omitting --num-proc automatically uses 50% of CPU cores)
eulerforge preprocess \
  --task sft \
  --input data/sft_10k_raw.jsonl \
  --output data/sft_10k_processed.jsonl \
  --model-name Qwen/Qwen3.5-0.8B-Base

# Prompted-Preference
eulerforge preprocess \
  --task prompted_preference \
  --input data/dpo_10k_raw.jsonl \
  --output data/dpo_10k_processed.jsonl \
  --model-name Qwen/Qwen3.5-0.8B-Base \
  --max-length 256

# Explicitly specify the number of workers
eulerforge preprocess \
  --task sft --input data/sft_10k_raw.jsonl --output data/sft_processed.jsonl \
  --model-name Qwen/Qwen3.5-0.8B-Base --num-proc 8

When using --num-proc > 1, EulerForge automatically sets TOKENIZERS_PARALLELISM=false. The default is 50% of CPU cores; to force single-core, specify --num-proc 1.
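
That worker-count policy can be sketched as follows (an illustrative sketch of the documented behavior, not EulerForge's actual code; the function name is hypothetical):

```python
import os

def resolve_num_proc(requested=None):
    """Default to 50% of CPU cores (minimum 1); when more than one
    worker is used, disable tokenizer-internal parallelism to avoid
    contention with the worker processes."""
    if requested is None:
        requested = max(1, (os.cpu_count() or 2) // 2)
    if requested > 1:
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
    return requested

workers = resolve_num_proc()  # e.g. half the cores on this machine
```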


7. Labels Masking Policy

| Mode | Labels Composition |
| --- | --- |
| SFT text-only | labels = input_ids (full causal LM) |
| SFT prompt/response | Prompt tokens -100, loss only on response |
| Preference | Full-sequence labels for each of chosen/rejected |
| Prompted-Preference | Prompt tokens -100, loss/logp only on completion |
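
The prompt-masking rows above can be sketched with plain token-id lists (an illustrative sketch, not EulerForge's internal code):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_labels(prompt_ids, response_ids):
    """Prompt/response SFT and prompted-preference masking:
    labels equal input_ids on the completion, -100 on the prompt."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

inp, lab = build_labels([101, 7, 8], [9, 10, 102])
print(inp)  # [101, 7, 8, 9, 10, 102]
print(lab)  # [-100, -100, -100, 9, 10, 102]
```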

8. Common Errors and Solutions

Data: processed dataset row 0 missing 'input_ids'

Data: processed dataset row 0 missing 'input_ids'
Fix: Run `eulerforge preprocess ...` or set data.format=raw with schema mapping
See: docs/tutorials/00_data_preprocessing.md

Cause: Raw JSONL was specified as processed, or the file has not been preprocessed.
Solution: Change to data.format=raw, or preprocess the file with eulerforge preprocess.

Data Config: data.task is required for raw format

Cause: data.format=raw is set but data.task is not specified.
Solution: Set data.task to one of sft, preference, or prompted_preference.

Data Config: data.task='sft' is incompatible with training.type='dpo'

Cause: data.task and training.type are incompatible, e.g. SFT data (data.task=sft) with a DPO preset (training.type=dpo).
Solution: Override the training type as well with --set training.type=sft, or change data.task to match the training type.

| data.task | Compatible training.type |
| --- | --- |
| sft | sft, ppo |
| preference | dpo, orpo, rm |
| prompted_preference | dpo, orpo, rm |
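
The compatibility table amounts to a simple lookup. A hypothetical sketch of such a check (the function name and error text mirror the message above but are not EulerForge's actual code):

```python
COMPATIBLE = {
    "sft": {"sft", "ppo"},
    "preference": {"dpo", "orpo", "rm"},
    "prompted_preference": {"dpo", "orpo", "rm"},
}

def check_compat(data_task: str, training_type: str) -> None:
    # Raise early, before any data is loaded, if the pairing is invalid.
    if training_type not in COMPATIBLE[data_task]:
        raise ValueError(
            f"data.task='{data_task}' is incompatible with "
            f"training.type='{training_type}'"
        )

check_compat("prompted_preference", "dpo")  # passes silently
# check_compat("sft", "dpo")  # would raise ValueError
```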

Data: raw sft data row 0 missing column 'text'

Cause: The row has neither a text column nor both prompt and response columns.
Solution: Columns are auto-detected, so no configuration is needed; ensure every row contains either a text field or a prompt + response pair.