Tutorial 1: Quick Start — From new to export
This tutorial walks you through the full eulerweave lifecycle: create a manifest with `new`, validate it with `validate`, preview the execution plan with `plan`, and run the pipeline with `run` to generate a training JSONL file.
Prerequisites: eulerweave must be installed. See Tutorial 0: Installation.
Overall Flow
```
eulerweave new        Scaffold manifest YAML
        ↓
eulerweave validate   Pre-check for errors
        ↓
eulerweave plan       Preview execution plan
        ↓
eulerweave run        Run pipeline → JSONL output
```
Step 1: Prepare Sample Data
Create a JSONL file to use as input:
```shell
mkdir -p data
cat > data/train.jsonl << 'EOF'
{"instruction": "Explain photosynthesis.", "output": "Photosynthesis is the process by which green plants convert sunlight into chemical energy, producing glucose and oxygen from carbon dioxide and water."}
{"instruction": "What is the capital of France?", "output": "The capital of France is Paris."}
{"instruction": "Describe the water cycle.", "output": "The water cycle describes the continuous movement of water through evaporation, condensation, precipitation, and collection."}
{"instruction": "What is machine learning?", "output": "Machine learning is a subset of artificial intelligence where systems learn patterns from data to make predictions without being explicitly programmed."}
{"instruction": "Explain Newton's first law.", "output": "Newton's first law states that an object at rest stays at rest, and an object in motion stays in motion at constant velocity, unless acted upon by an external force."}
EOF
```
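Before wiring this file into a manifest, it is worth confirming that every line parses as JSON and carries the expected keys. A minimal stdlib-only check (the path and required keys mirror the sample above; this is an illustration, not part of eulerweave):

```python
import json

REQUIRED_KEYS = {"instruction", "output"}

def check_jsonl(path):
    """Return the number of valid records; raise on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)  # raises ValueError on malformed JSON
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
            count += 1
    return count
```

Running `check_jsonl("data/train.jsonl")` on the sample file should return 5; a malformed line fails fast with its line number.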
Step 2: eulerweave new — Scaffold the Manifest
Generate a manifest for the SFT (Supervised Fine-Tuning) track:
```shell
eulerweave new manifest.yaml --track sft
```

```
[eulerweave] Created manifest: manifest.yaml (track=sft)
[eulerweave] Edit the file to configure inputs, pipeline blocks, and exports.
```
Generated `manifest.yaml`:

```yaml
version: 1
track: sft

inputs:
  - type: jsonl
    uri: data/train.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument
  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 50
      max_length: 10000
  - id: sft1
    type: build_sft_messages
    slot: build_task
    input_type: TextDocument
    output_type: SFTMessages
  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTMessages
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/result.jsonl

profile:
  cpu_only: false
  allow_external_llm: true
```
Manifest Key Concepts
| Section | Purpose |
|---|---|
| `version` | Manifest schema version (currently 1) |
| `track` | Pipeline type: `pretrain`, `sft`, `dpo` |
| `inputs` | Data sources (JSONL, CSV, PDF, etc.) |
| `pipeline` | Ordered list of typed processing blocks |
| `exports` | Final output location and format |
| `profile` | Runtime policies (e.g., LLM usage permission) |
Step 3: eulerweave validate — Error Checking
```shell
eulerweave validate manifest.yaml
```

```
[eulerweave] Validating manifest: manifest.yaml
[eulerweave] Parsing YAML ... OK
[eulerweave] Checking version field ... OK
[eulerweave] Checking track field ... OK (track=sft)
[eulerweave] Checking inputs ... OK (1 input)
[eulerweave] Checking pipeline blocks ... OK (4 blocks)
[eulerweave]   - norm1: normalize_text OK
[eulerweave]   - filter1: heuristic_filter OK
[eulerweave]   - sft1: build_sft_messages OK
[eulerweave]   - exp1: export_jsonl OK
[eulerweave] Checking type chain ... OK
[eulerweave] Checking slot ordering ... OK
[eulerweave] Checking exports ... OK (1 export)
✓ Manifest is valid.
```
If there are errors, specific messages and fix suggestions will be shown. See Tutorial 6: Validation for details.
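The "type chain" and "slot ordering" checks are worth internalizing: each block's `input_type` must match the previous block's `output_type`, and slots must appear in their canonical order. A rough sketch of that logic, with plain dicts standing in for parsed manifest blocks (the slot order and error wording are illustrative assumptions, not eulerweave internals):

```python
# Canonical slot order assumed for illustration; the real tool may define more slots.
SLOT_ORDER = ["normalize", "filter", "build_task", "export"]

def check_pipeline(blocks):
    """Raise ValueError on a broken type chain or out-of-order slots."""
    errors = []
    for prev, cur in zip(blocks, blocks[1:]):
        if prev["output_type"] != cur["input_type"]:
            errors.append(
                f'{cur["id"]}: expects {cur["input_type"]}, '
                f'but {prev["id"]} emits {prev["output_type"]}'
            )
    slot_positions = [SLOT_ORDER.index(b["slot"]) for b in blocks]
    if slot_positions != sorted(slot_positions):
        errors.append("slots are not in canonical order")
    if errors:
        raise ValueError("; ".join(errors))

# The four blocks from the generated manifest pass cleanly.
blocks = [
    {"id": "norm1", "slot": "normalize", "input_type": "TextDocument", "output_type": "TextDocument"},
    {"id": "filter1", "slot": "filter", "input_type": "TextDocument", "output_type": "TextDocument"},
    {"id": "sft1", "slot": "build_task", "input_type": "TextDocument", "output_type": "SFTMessages"},
    {"id": "exp1", "slot": "export", "input_type": "SFTMessages", "output_type": "ExportedDataset"},
]
check_pipeline(blocks)
```

Deleting `sft1` from this list breaks the chain (`exp1` would receive `TextDocument` instead of `SFTMessages`), which is exactly the class of mistake `validate` catches before any data is touched.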
Step 4: eulerweave plan — Execution Preview
```shell
eulerweave plan manifest.yaml --records 10000
```

```
[eulerweave] Planning execution for: manifest.yaml
[eulerweave] Track: sft
[eulerweave] Estimated input records: 10,000

Execution Plan
+------+---------+------------+--------------+-----------------+
| Step | Block   | Slot       | Input Type   | Output Type     |
+------+---------+------------+--------------+-----------------+
| 1    | norm1   | normalize  | TextDocument | TextDocument    |
| 2    | filter1 | filter     | TextDocument | TextDocument    |
| 3    | sft1    | build_task | TextDocument | SFTMessages     |
| 4    | exp1    | export     | SFTMessages  | ExportedDataset |
+------+---------+------------+--------------+-----------------+

[eulerweave] No policy violations detected.
[eulerweave] Plan looks good. Run with: eulerweave run manifest.yaml
```
Step 5: eulerweave run — Run the Pipeline
```shell
eulerweave run manifest.yaml \
  --input data/train.jsonl \
  --artifacts ./artifacts
```

```
[eulerweave] Loading manifest: manifest.yaml
[eulerweave] Track: sft
[eulerweave] Input: data/train.jsonl (5 records)
[eulerweave] Starting pipeline execution ...
[eulerweave] [1/4] norm1 (normalize_text) ...
[eulerweave]       Processed 5 documents, 0 dropped
[eulerweave] [2/4] filter1 (heuristic_filter) ...
[eulerweave]       Processed 5 documents, 0 filtered
[eulerweave] [3/4] sft1 (build_sft_messages) ...
[eulerweave]       Built 5 SFT message sets
[eulerweave] [4/4] exp1 (export_jsonl) ...
[eulerweave]       Exported 5 records to out/result.jsonl
[eulerweave] Pipeline completed successfully.
```
What Happened
- `norm1` — Cleaned whitespace in each document's text and removed empty documents.
- `filter1` — Applied length constraints (removed documents under 50 or over 10,000 characters).
- `sft1` — Converted each document to SFT message format (`instruction` → `output` mapping).
- `exp1` — Wrote the final SFT records to `out/result.jsonl`.
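To make the middle two steps concrete, here is a rough stand-in for what `heuristic_filter` and `build_sft_messages` do, written as plain functions. The field names follow the sample data and the exported schema shown below; the real blocks are configurable and more thorough:

```python
def heuristic_filter(records, min_length=50, max_length=10_000):
    """Drop records whose output text falls outside [min_length, max_length]."""
    return [r for r in records if min_length <= len(r["output"]) <= max_length]

def build_sft_record(record):
    """Shape a raw record into the exported SFT schema (instruction/input/output)."""
    return {
        "instruction": record["instruction"],
        "input": record.get("input", ""),
        "output": record["output"],
    }

raw = [
    {"instruction": "What is the capital of France?",
     "output": "The capital of France is Paris."},
    {"instruction": "Hi", "output": "Hello!"},  # under min_length: dropped
]
kept = heuristic_filter(raw, min_length=20)
sft_records = [build_sft_record(r) for r in kept]
```

With the sample manifest's defaults (`min_length: 50`), very short answers like the second record would never reach the export step.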
Check Output
```shell
head -2 out/result.jsonl
```

```
{"instruction": "Explain photosynthesis.", "input": "", "output": "Photosynthesis is the process by which green plants convert sunlight into chemical energy, producing glucose and oxygen from carbon dioxide and water."}
{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}
```
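If you prefer a programmatic check over `head`, a short script can confirm the export is intact. The paths and field names follow the run above; the record-count assertion holds here only because this run dropped nothing:

```python
import json

def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def check_export(in_path="data/train.jsonl", out_path="out/result.jsonl"):
    """Verify every exported record carries the SFT fields and a non-empty output."""
    records = load_jsonl(out_path)
    assert len(records) == len(load_jsonl(in_path)), "record count changed"
    for r in records:
        assert {"instruction", "input", "output"} <= r.keys(), "missing SFT field"
        assert r["output"].strip(), "empty output"
    return len(records)
```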
Step 6: Using Other Input Formats
eulerweave supports various input formats beyond JSONL. Simply change `inputs[].type`:
CSV Input
```yaml
inputs:
  - type: csv
    uri: data/train.csv
    options:
      text_column: content
      delimiter: ","
```
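For intuition, a `csv` input with `text_column: content` behaves roughly like the following stdlib read. This is a sketch of the concept, not eulerweave's actual loader:

```python
import csv
import io

def read_text_column(csv_text, text_column="content", delimiter=","):
    """Yield the configured text column from each CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        yield row[text_column]

sample = "id,content\n1,First document\n2,Second document\n"
texts = list(read_text_column(sample))  # ["First document", "Second document"]
```

Each extracted string would then enter the pipeline as a `TextDocument`, just like a line of JSONL.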
PDF Input
```yaml
inputs:
  - type: pdf
    uri: data/document.pdf
    options:
      strategy: auto       # auto, text, ocr
      page_range: "1-10"   # optional: restrict the page range
```
A full pipeline using PDF input is covered in detail in Tutorial 2: PDF to Training Data.
Parquet Input
```yaml
inputs:
  - type: parquet
    uri: data/train.parquet
    options:
      text_column: text
```
Full Workflow Summary
```shell
# 1. Scaffold the manifest
eulerweave new manifest.yaml --track sft

# 2. Edit manifest.yaml

# 3. Validate
eulerweave validate manifest.yaml

# 4. Preview the execution plan
eulerweave plan manifest.yaml --records 10000

# 5. Run
eulerweave run manifest.yaml --input data/train.jsonl --artifacts ./artifacts

# 6. Inspect the output
head -5 out/result.jsonl
```
Next Steps
- Tutorial 2: PDF to Training Data — Generate SFT training data from local PDFs
- Tutorial 3: HuggingFace to Training Data — Download HF datasets and convert to training data
- Tutorial 4: SFT Track Deep Dive — Compare three SFT builder blocks
- Tutorial 6: Validation — Resolve manifest validation errors