0. Data Preprocessing

Run this tutorial first. It converts the raw data in the data/ directory into EulerForge's standard raw JSONL format. All subsequent tutorials use the converted files with data.format=raw.


1. EulerForge Standard Raw Format

EulerForge supports three types of standard raw JSONL:

| task | Standard raw schema | Purpose |
| --- | --- | --- |
| sft | `{"prompt": "...", "response": "..."}` or `{"text": "..."}` | SFT, PPO |
| preference | `{"chosen": "...", "rejected": "..."}` | Preference (no prompt) |
| prompted_preference | `{"prompt": "...", "chosen": "...", "rejected": "..."}` | DPO, ORPO, RM |

When using data.format=raw, the data is automatically tokenized (converted to processed format) at the start of training.
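
For reference, here is an illustrative sketch (not EulerForge's actual validator) of what rows in each standard raw schema look like and how the schema could be checked:

```python
import json

# Required keys per task; the "sft" task accepts either schema variant.
SCHEMAS = {
    "sft": [{"prompt", "response"}, {"text"}],
    "preference": [{"chosen", "rejected"}],
    "prompted_preference": [{"prompt", "chosen", "rejected"}],
}

def is_valid_row(task: str, row: dict) -> bool:
    # A row is valid if it contains all keys of at least one schema variant.
    return any(keys <= row.keys() for keys in SCHEMAS[task])

rows = [
    '{"prompt": "2+2?", "response": "4"}',
    '{"text": "plain causal-LM text"}',
    '{"prompt": "2+2?", "chosen": "4", "rejected": "5"}',
]
print(is_valid_row("sft", json.loads(rows[0])))                  # True
print(is_valid_row("prompted_preference", json.loads(rows[2])))  # True
```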


2. Provided Data Files

The data/ directory contains original files and already-converted _raw files.

File List

| File | Description | Size |
| --- | --- | --- |
| sft_10k.jsonl | SFT original (nested messages structure) | 10k |
| sft_10k_raw.jsonl | Standard sft raw (converted) | 10k |
| sft_1k.jsonl | SFT original (for TDD/testing) | 1k |
| sft_1k_bench_raw.jsonl | SFT bench data (converted) | 1k |
| dpo_10k.jsonl | DPO original (nested structure) | 10k |
| dpo_10k_raw.jsonl | Standard prompted_preference raw (converted) | 10k |
| dpo_1k.jsonl | DPO original (for TDD/testing) | 1k |
| dpo_1k_bench_raw.jsonl | DPO bench data (converted) | 1k |

Naming Convention

Files with the `_raw` suffix have already been converted to the standard raw schema, and `1k` variants are small subsets for TDD/testing. All tutorials assume the 10k data; since the `_raw` files are provided, you can skip the convert step in most tutorials.


3. Data Conversion (eulerforge convert)

The core philosophy of eulerforge convert is explicit field mapping via --map. Built-in recipes (--recipe) are convenience shortcuts for commonly used input structures.

Field Exploration (--print-sample-flat)

Inspect the field structure of the input file before conversion:

eulerforge convert --task sft --input data/sft_10k.jsonl --print-sample-flat
# Example output:
# --- row 0 flat keys ---
#   'json_record.messages'                          (list)  [{'role': 'system', ...}]
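
Conceptually, the flat view is a recursive dot-path flatten of each record. A hypothetical sketch of the idea (not EulerForge's implementation):

```python
def flatten(obj, prefix=""):
    """Yield (dot_path, value) pairs for a nested dict.
    Lists are kept as leaf values, matching the '(list)' entries
    shown by --print-sample-flat."""
    if isinstance(obj, dict) and obj:
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix, obj

row = {"json_record": {"messages": [{"role": "system", "content": "hi"}]}}
for path, value in flatten(row):
    print(path, type(value).__name__)  # json_record.messages list
```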

(A) sft_10k.jsonl to Standard SFT Raw

sft_10k.jsonl nests a messages array under json_record. Choose one of two methods:

Method 1: Recipe mode (messages array)

eulerforge convert \
    --task sft \
    --input data/sft_10k.jsonl \
    --output data/sft_10k_raw.jsonl \
    --recipe sft_messages \
    --messages-expr json_record.messages

Method 2: --map mode (custom flat fields)

If your custom data has flat fields like instruction, output:

eulerforge convert \
    --task sft \
    --input data/custom.jsonl \
    --output data/out_raw.jsonl \
    --map prompt=instruction \
    --map response=output

Validation:

head -1 data/sft_10k_raw.jsonl | python3 -c "import sys,json; row=json.loads(sys.stdin.readline()); print(list(row.keys()))"
# Output: ['prompt', 'response']

(B) dpo_10k.jsonl to Standard prompted_preference Raw

dpo_10k.jsonl has a nested {instruction.value, chosen.value, rejected.value} structure:

Method 1: --map mode (automatic dot-path flatten)

eulerforge convert \
    --task prompted_preference \
    --input data/dpo_10k.jsonl \
    --output data/dpo_10k_raw.jsonl \
    --map prompt=instruction.value \
    --map chosen=chosen.value \
    --map rejected=rejected.value
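
Conceptually, each --map target=dot.path resolves the dot path against the nested record and assigns the result to the target column. A hedged Python sketch of that idea (not the CLI's internals):

```python
def get_path(row: dict, path: str):
    # Resolve a dot path like "instruction.value" against a nested dict.
    for key in path.split("."):
        row = row[key]
    return row

# The three --map arguments from the command above, as a dict.
mapping = {"prompt": "instruction.value",
           "chosen": "chosen.value",
           "rejected": "rejected.value"}

src = {"instruction": {"value": "Summarize the article."},
       "chosen": {"value": "Good answer"},
       "rejected": {"value": "Bad answer"}}

out = {target: get_path(src, path) for target, path in mapping.items()}
print(sorted(out))  # ['chosen', 'prompt', 'rejected']
```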

Method 2: Recipe mode

eulerforge convert \
    --task prompted_preference \
    --input data/dpo_10k.jsonl \
    --output data/dpo_10k_raw.jsonl \
    --recipe dpo_nested_v1

Validation:

head -1 data/dpo_10k_raw.jsonl | python3 -c "import sys,json; row=json.loads(sys.stdin.readline()); print(list(row.keys()))"
# Output: ['prompt', 'chosen', 'rejected']

(C) Validating a Converted File (--validate)

The provided dpo_10k_raw.jsonl already follows the standard {prompt, chosen, rejected} schema, so no further mapping is needed. Verify it with schema validation:

eulerforge convert --task prompted_preference --input data/dpo_10k_raw.jsonl --validate
# Output: [Validate] 10000/10000 valid rows (0 invalid)

Multi-CPU Conversion

For large datasets, use --num-proc for parallel processing:

eulerforge convert \
    --task sft \
    --input data/sft_10k.jsonl \
    --output data/sft_10k_raw.jsonl \
    --recipe sft_messages \
    --messages-expr json_record.messages \
    --num-proc 4 \
    --overwrite

4. Conversion Results Summary

| Converted File | task | Used in Tutorials |
| --- | --- | --- |
| data/sft_10k_raw.jsonl | sft | 01-04 (SFT strategies), 08 (PPO) |
| data/dpo_10k_raw.jsonl | prompted_preference | 05 (DPO), 06 (ORPO), 07 (RM) |

5. Using Raw Data in Training

CLI --set Overrides

# SFT training (raw)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_sft.yml \
    --set data.format=raw \
    --set data.task=sft \
    --set data.path=data/sft_10k_raw.jsonl \
    --set data.max_length=512

# DPO training (raw)
eulerforge train --preset configs/presets/qwen3.5_0.8b_moe_expert_lora_dpo.yml \
    --set data.format=raw \
    --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl \
    --set data.max_length=512

# ORPO/RM training (raw)
eulerforge train --preset configs/presets/qwen3.5_0.8b_dense_lora_orpo.yml \
    --set data.format=raw \
    --set data.task=prompted_preference \
    --set data.path=data/dpo_10k_raw.jsonl \
    --set data.max_length=256

YAML Configuration File

data:
  format: raw
  task: sft
  path: data/sft_10k_raw.jsonl
  max_length: 512
  # cache_dir: defaults to outputs/.cache (shared across all runs, override by specifying explicitly)
  # num_proc: defaults to 50% of CPU cores (override by specifying explicitly)

6. Offline Preprocessing (eulerforge preprocess)

To tokenize in advance instead of using automatic preprocessing:

# SFT (omitting --num-proc automatically uses 50% of CPU cores)
eulerforge preprocess \
  --task sft \
  --input data/sft_10k_raw.jsonl \
  --output data/sft_10k_processed.jsonl \
  --model-name Qwen/Qwen3.5-0.8B-Base

# Prompted-Preference
eulerforge preprocess \
  --task prompted_preference \
  --input data/dpo_10k_raw.jsonl \
  --output data/dpo_10k_processed.jsonl \
  --model-name Qwen/Qwen3.5-0.8B-Base \
  --max-length 256

# Explicitly specify the number of workers
eulerforge preprocess \
  --task sft --input data/sft_10k_raw.jsonl --output data/sft_processed.jsonl \
  --model-name Qwen/Qwen3.5-0.8B-Base --num-proc 8

When using --num-proc > 1, EulerForge automatically sets TOKENIZERS_PARALLELISM=false. The default is 50% of CPU cores; to force single-core, specify --num-proc 1.
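
That worker-count policy can be sketched as follows (an illustrative sketch of the documented behavior, not EulerForge's actual code; the function name is hypothetical):

```python
import os

def resolve_num_proc(requested=None):
    """Default to 50% of CPU cores (minimum 1); when more than one
    worker is used, disable tokenizer-internal parallelism to avoid
    contention with the worker processes."""
    if requested is None:
        requested = max(1, (os.cpu_count() or 2) // 2)
    if requested > 1:
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
    return requested

workers = resolve_num_proc()  # e.g. half the cores on this machine
```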


7. Labels Masking Policy

| Mode | Labels Composition |
| --- | --- |
| SFT text-only | labels = input_ids (full causal LM) |
| SFT prompt/response | Prompt tokens -100, loss only on response |
| Preference | Full-sequence labels for each of chosen/rejected |
| Prompted-Preference | Prompt tokens -100, loss/logp only on completion |
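
The prompt-masking rows above can be sketched with plain token-id lists (an illustrative sketch, not EulerForge's internal code):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_labels(prompt_ids, response_ids):
    """Prompt/response SFT and prompted-preference masking:
    labels equal input_ids on the completion, -100 on the prompt."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

inp, lab = build_labels([101, 7, 8], [9, 10, 102])
print(inp)  # [101, 7, 8, 9, 10, 102]
print(lab)  # [-100, -100, -100, 9, 10, 102]
```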

8. Common Errors and Solutions

Data: processed dataset row 0 missing 'input_ids'

Data: processed dataset row 0 missing 'input_ids'
Fix: Run `eulerforge preprocess ...` or set data.format=raw with schema mapping
See: docs/tutorials/00_data_preprocessing.md

Cause: Raw JSONL was specified as processed, or the file has not been preprocessed.
Solution: Change to data.format=raw, or preprocess the file with eulerforge preprocess.

Data Config: data.task is required for raw format

Cause: data.format=raw is set but data.task is not specified.
Solution: Set data.task to one of sft, preference, or prompted_preference.

Data Config: data.task='sft' is incompatible with training.type='dpo'

Cause: data.task and training.type are incompatible, e.g. SFT data (data.task=sft) with a DPO preset (training.type=dpo).
Solution: Override the training type as well with --set training.type=sft, or change data.task to match the training type.

| data.task | Compatible training.type |
| --- | --- |
| sft | sft, ppo |
| preference | dpo, orpo, rm |
| prompted_preference | dpo, orpo, rm |
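
The compatibility table amounts to a simple lookup. A hypothetical sketch of such a check (the function name and error text mirror the message above but are not EulerForge's actual code):

```python
COMPATIBLE = {
    "sft": {"sft", "ppo"},
    "preference": {"dpo", "orpo", "rm"},
    "prompted_preference": {"dpo", "orpo", "rm"},
}

def check_compat(data_task: str, training_type: str) -> None:
    # Raise early, before any data is loaded, if the pairing is invalid.
    if training_type not in COMPATIBLE[data_task]:
        raise ValueError(
            f"data.task='{data_task}' is incompatible with "
            f"training.type='{training_type}'"
        )

check_compat("prompted_preference", "dpo")  # passes silently
# check_compat("sft", "dpo")  # would raise ValueError
```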

Data: raw sft data row 0 missing column 'text'

Cause: The row has neither a text column nor both prompt and response columns.
Solution: Columns are auto-detected, so no configuration is needed; ensure every row contains either a text field or a prompt + response pair.