Tutorial 1: Quick Start — From new to export
This tutorial walks you through the full eulerweave lifecycle: create a manifest with `new`, validate it with `validate`, preview the execution plan with `plan`, and run the pipeline with `run` to generate a training JSONL file.
Prerequisites: eulerweave must be installed. See Tutorial 0: Installation.
Overall Flow
```
eulerweave new        Scaffold manifest YAML
        ↓
eulerweave validate   Pre-check for errors
        ↓
eulerweave plan       Preview execution plan
        ↓
eulerweave run        Run pipeline → JSONL output
```
Step 1: Prepare Sample Data
Create a JSONL file to use as input:
```shell
mkdir -p data
cat > data/train.jsonl << 'EOF'
{"instruction": "Explain photosynthesis.", "output": "Photosynthesis is the process by which green plants convert sunlight into chemical energy, producing glucose and oxygen from carbon dioxide and water."}
{"instruction": "What is the capital of France?", "output": "The capital of France is Paris."}
{"instruction": "Describe the water cycle.", "output": "The water cycle describes the continuous movement of water through evaporation, condensation, precipitation, and collection."}
{"instruction": "What is machine learning?", "output": "Machine learning is a subset of artificial intelligence where systems learn patterns from data to make predictions without being explicitly programmed."}
{"instruction": "Explain Newton's first law.", "output": "Newton's first law states that an object at rest stays at rest, and an object in motion stays in motion at constant velocity, unless acted upon by an external force."}
EOF
```
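Before wiring this file into a manifest, it is worth confirming that every line parses as JSON and carries the expected keys. A minimal stdlib-only check (the path and required keys mirror the sample above; this is an illustration, not part of eulerweave):

```python
import json

REQUIRED_KEYS = {"instruction", "output"}

def check_jsonl(path):
    """Return the number of valid records; raise on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)  # raises ValueError on malformed JSON
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
            count += 1
    return count
```

Running `check_jsonl("data/train.jsonl")` on the sample file should return 5; a malformed line fails fast with its line number.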
Step 2: eulerweave new — Scaffold the Manifest
Generate a manifest for the SFT (Supervised Fine-Tuning) track:
```shell
eulerweave new manifest.yaml --track sft
```

```
[eulerweave] Created manifest: manifest.yaml (track=sft)
[eulerweave] Edit the file to configure inputs, pipeline blocks, and exports.
```
Generated `manifest.yaml`:

```yaml
version: 1
track: sft

inputs:
  - type: jsonl
    uri: data/train.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument
  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 50
      max_length: 10000
  - id: sft1
    type: build_sft_messages
    slot: build_task
    input_type: TextDocument
    output_type: SFTMessages
  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTMessages
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/result.jsonl

profile:
  cpu_only: false
  allow_external_llm: true
```
Manifest Key Concepts
| Section | Purpose |
|---|---|
| `version` | Manifest schema version (currently 1) |
| `track` | Pipeline type: `pretrain`, `sft`, `dpo` |
| `inputs` | Data sources (JSONL, CSV, PDF, etc.) |
| `pipeline` | Ordered list of typed processing blocks |
| `exports` | Final output location and format |
| `profile` | Runtime policies (e.g., LLM usage permission) |
Step 3: eulerweave validate — Error Checking
```shell
eulerweave validate manifest.yaml
```

```
[eulerweave] Validating manifest: manifest.yaml
[eulerweave] Parsing YAML ... OK
[eulerweave] Checking version field ... OK
[eulerweave] Checking track field ... OK (track=sft)
[eulerweave] Checking inputs ... OK (1 input)
[eulerweave] Checking pipeline blocks ... OK (4 blocks)
[eulerweave]   - norm1: normalize_text OK
[eulerweave]   - filter1: heuristic_filter OK
[eulerweave]   - sft1: build_sft_messages OK
[eulerweave]   - exp1: export_jsonl OK
[eulerweave] Checking type chain ... OK
[eulerweave] Checking slot ordering ... OK
[eulerweave] Checking exports ... OK (1 export)
✓ Manifest is valid.
```
If there are errors, specific messages and fix suggestions will be shown. See Tutorial 6: Validation for details.
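The "type chain" and "slot ordering" checks are worth internalizing: each block's `input_type` must match the previous block's `output_type`, and slots must appear in their canonical order. A rough sketch of that logic, with plain dicts standing in for parsed manifest blocks (the slot order and error wording are illustrative assumptions, not eulerweave internals):

```python
# Canonical slot order assumed for illustration; the real tool may define more slots.
SLOT_ORDER = ["normalize", "filter", "build_task", "export"]

def check_pipeline(blocks):
    """Raise ValueError on a broken type chain or out-of-order slots."""
    errors = []
    for prev, cur in zip(blocks, blocks[1:]):
        if prev["output_type"] != cur["input_type"]:
            errors.append(
                f'{cur["id"]}: expects {cur["input_type"]}, '
                f'but {prev["id"]} emits {prev["output_type"]}'
            )
    slot_positions = [SLOT_ORDER.index(b["slot"]) for b in blocks]
    if slot_positions != sorted(slot_positions):
        errors.append("slots are not in canonical order")
    if errors:
        raise ValueError("; ".join(errors))

# The four blocks from the generated manifest pass cleanly.
blocks = [
    {"id": "norm1", "slot": "normalize", "input_type": "TextDocument", "output_type": "TextDocument"},
    {"id": "filter1", "slot": "filter", "input_type": "TextDocument", "output_type": "TextDocument"},
    {"id": "sft1", "slot": "build_task", "input_type": "TextDocument", "output_type": "SFTMessages"},
    {"id": "exp1", "slot": "export", "input_type": "SFTMessages", "output_type": "ExportedDataset"},
]
check_pipeline(blocks)
```

Deleting `sft1` from this list breaks the chain (`exp1` would receive `TextDocument` instead of `SFTMessages`), which is exactly the class of mistake `validate` catches before any data is touched.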
Step 4: eulerweave plan — Execution Preview
```shell
eulerweave plan manifest.yaml --records 10000
```

```
[eulerweave] Planning execution for: manifest.yaml
[eulerweave] Track: sft
[eulerweave] Estimated input records: 10,000

Execution Plan
+------+---------+------------+--------------+-----------------+
| Step | Block   | Slot       | Input Type   | Output Type     |
+------+---------+------------+--------------+-----------------+
| 1    | norm1   | normalize  | TextDocument | TextDocument    |
| 2    | filter1 | filter     | TextDocument | TextDocument    |
| 3    | sft1    | build_task | TextDocument | SFTMessages     |
| 4    | exp1    | export     | SFTMessages  | ExportedDataset |
+------+---------+------------+--------------+-----------------+

[eulerweave] No policy violations detected.
[eulerweave] Plan looks good. Run with: eulerweave run manifest.yaml
```
Step 5: eulerweave run — Run the Pipeline
```shell
eulerweave run manifest.yaml \
  --input data/train.jsonl \
  --artifacts ./artifacts
```

```
[eulerweave] Loading manifest: manifest.yaml
[eulerweave] Track: sft
[eulerweave] Input: data/train.jsonl (5 records)
[eulerweave] Starting pipeline execution ...
[eulerweave] [1/4] norm1 (normalize_text) ...
[eulerweave]       Processed 5 documents, 0 dropped
[eulerweave] [2/4] filter1 (heuristic_filter) ...
[eulerweave]       Processed 5 documents, 0 filtered
[eulerweave] [3/4] sft1 (build_sft_messages) ...
[eulerweave]       Built 5 SFT message sets
[eulerweave] [4/4] exp1 (export_jsonl) ...
[eulerweave]       Exported 5 records to out/result.jsonl
[eulerweave] Pipeline completed successfully.
```
What Happened
- `norm1` — Cleaned whitespace in each document's text and removed empty documents.
- `filter1` — Applied length constraints (removed documents under 50 or over 10,000 characters).
- `sft1` — Converted each document to SFT message format (`instruction` → `output` mapping).
- `exp1` — Wrote the final SFT records to `out/result.jsonl`.
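To make the middle two steps concrete, here is a rough stand-in for what `heuristic_filter` and `build_sft_messages` do, written as plain functions. The field names follow the sample data and the exported schema shown below; the real blocks are configurable and more thorough:

```python
def heuristic_filter(records, min_length=50, max_length=10_000):
    """Drop records whose output text falls outside [min_length, max_length]."""
    return [r for r in records if min_length <= len(r["output"]) <= max_length]

def build_sft_record(record):
    """Shape a raw record into the exported SFT schema (instruction/input/output)."""
    return {
        "instruction": record["instruction"],
        "input": record.get("input", ""),
        "output": record["output"],
    }

raw = [
    {"instruction": "What is the capital of France?",
     "output": "The capital of France is Paris."},
    {"instruction": "Hi", "output": "Hello!"},  # under min_length: dropped
]
kept = heuristic_filter(raw, min_length=20)
sft_records = [build_sft_record(r) for r in kept]
```

With the sample manifest's defaults (`min_length: 50`), very short answers like the second record would never reach the export step.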
Check Output
```shell
head -2 out/result.jsonl
```

```
{"instruction": "Explain photosynthesis.", "input": "", "output": "Photosynthesis is the process by which green plants convert sunlight into chemical energy, producing glucose and oxygen from carbon dioxide and water."}
{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}
```
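If you prefer a programmatic check over `head`, a short script can confirm the export is intact. The paths and field names follow the run above; the record-count assertion holds here only because this run dropped nothing:

```python
import json

def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def check_export(in_path="data/train.jsonl", out_path="out/result.jsonl"):
    """Verify every exported record carries the SFT fields and a non-empty output."""
    records = load_jsonl(out_path)
    assert len(records) == len(load_jsonl(in_path)), "record count changed"
    for r in records:
        assert {"instruction", "input", "output"} <= r.keys(), "missing SFT field"
        assert r["output"].strip(), "empty output"
    return len(records)
```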
Step 6: Using Other Input Formats
eulerweave supports various input formats beyond JSONL. Simply change `inputs[].type`:
CSV Input
```yaml
inputs:
  - type: csv
    uri: data/train.csv
    options:
      text_column: content
      delimiter: ","
```
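For intuition, a `csv` input with `text_column: content` behaves roughly like the following stdlib read. This is a sketch of the concept, not eulerweave's actual loader:

```python
import csv
import io

def read_text_column(csv_text, text_column="content", delimiter=","):
    """Yield the configured text column from each CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        yield row[text_column]

sample = "id,content\n1,First document\n2,Second document\n"
texts = list(read_text_column(sample))  # ["First document", "Second document"]
```

Each extracted string would then enter the pipeline as a `TextDocument`, just like a line of JSONL.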
PDF Input
```yaml
inputs:
  - type: pdf
    uri: data/document.pdf
    options:
      strategy: auto       # auto, text, ocr
      page_range: "1-10"   # optional: restrict the page range
```
A full pipeline using PDF input is covered in detail in Tutorial 2: PDF to Training Data.
Parquet Input
```yaml
inputs:
  - type: parquet
    uri: data/train.parquet
    options:
      text_column: text
```
Full Workflow Summary
```shell
# 1. Scaffold the manifest
eulerweave new manifest.yaml --track sft

# 2. Edit manifest.yaml

# 3. Validate
eulerweave validate manifest.yaml

# 4. Preview the execution plan
eulerweave plan manifest.yaml --records 10000

# 5. Run
eulerweave run manifest.yaml --input data/train.jsonl --artifacts ./artifacts

# 6. Inspect the output
head -5 out/result.jsonl
```
Next Steps
- Tutorial 2: PDF to Training Data — Generate SFT training data from local PDFs
- Tutorial 3: HuggingFace to Training Data — Download HF datasets and convert to training data
- Tutorial 4: SFT Track Deep Dive — Compare three SFT builder blocks
- Tutorial 6: Validation — Resolve manifest validation errors