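Tutorial 1: Quick Start (from `new` to `export`)

This tutorial walks through the full eulerweave lifecycle: scaffold a manifest with `new`, check it with `validate`, preview the execution plan with `plan`, and execute the pipeline with `run` to produce a JSONL file for training.

Prerequisite: eulerweave must be installed. See Tutorial 0: Installation.

Overall flow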

eulerweave new       Scaffold the manifest YAML
       ↓
eulerweave validate  Check for errors up front
       ↓
eulerweave plan      Preview the execution plan
       ↓
eulerweave run       Run the pipeline → JSONL output

Step 1: Prepare sample data

Create a JSONL file to use as input:

mkdir -p data
cat > data/train.jsonl << 'EOF'
{"instruction": "Explain photosynthesis.", "output": "Photosynthesis is the process by which green plants convert sunlight into chemical energy, producing glucose and oxygen from carbon dioxide and water."}
{"instruction": "What is the capital of France?", "output": "The capital of France is Paris."}
{"instruction": "Describe the water cycle.", "output": "The water cycle describes the continuous movement of water through evaporation, condensation, precipitation, and collection."}
{"instruction": "What is machine learning?", "output": "Machine learning is a subset of artificial intelligence where systems learn patterns from data to make predictions without being explicitly programmed."}
{"instruction": "Explain Newton's first law.", "output": "Newton's first law states that an object at rest stays at rest, and an object in motion stays in motion at constant velocity, unless acted upon by an external force."}
EOF
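
Before handing the file to eulerweave, it can help to confirm that every line parses as JSON and carries the fields used in the sample above. The following is a minimal standalone sketch in Python; it is not part of eulerweave, and the function name `check_jsonl_lines` is made up for this illustration.

```python
import json

# Keys taken from the sample records above.
REQUIRED_KEYS = ("instruction", "output")


def check_jsonl_lines(lines):
    """Return (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are skipped rather than reported
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value:
                problems.append((number, f"missing or empty '{key}'"))
    return problems


sample = [
    '{"instruction": "What is the capital of France?", "output": "The capital of France is Paris."}',
    '{"instruction": "Describe the water cycle."}',
]
for number, problem in check_jsonl_lines(sample):
    print(f"line {number}: {problem}")  # prints: line 2: missing or empty 'output'
```

To check the real file, pass an open file handle instead of the list, for example `check_jsonl_lines(open("data/train.jsonl", encoding="utf-8"))`.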

Step 2: `eulerweave new` (manifest scaffolding)

Generate a manifest for the SFT (Supervised Fine-Tuning) track:

eulerweave new manifest.yaml --track sft

[eulerweave] Created manifest: manifest.yaml (track=sft)
[eulerweave] Edit the file to configure inputs, pipeline blocks, and exports.

The generated manifest.yaml:

version: 1
track: sft

inputs:
  - type: jsonl
    uri: data/train.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 50
      max_length: 10000

  - id: sft1
    type: build_sft_messages
    slot: build_task
    input_type: TextDocument
    output_type: SFTMessages

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTMessages
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/result.jsonl

profile:
  cpu_only: false
  allow_external_llm: true

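Key manifest concepts

| Section | Purpose |
|---|---|
| `version` | Manifest schema version (currently `1`) |
| `track` | Pipeline type: `pretrain`, `sft`, `dpo` |
| `inputs` | Data sources (JSONL, CSV, PDF, and so on) |
| `pipeline` | Ordered list of typed processing blocks |
| `exports` | Final output location and format |
| `profile` | Runtime policy (for example, whether LLM use is allowed) |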

Step 3: `eulerweave validate` (error checking)

eulerweave validate manifest.yaml

[eulerweave] Validating manifest: manifest.yaml
[eulerweave] Parsing YAML ...                              OK
[eulerweave] Checking version field ...                     OK
[eulerweave] Checking track field ...                       OK  (track=sft)
[eulerweave] Checking inputs ...                            OK  (1 input)
[eulerweave] Checking pipeline blocks ...                   OK  (4 blocks)
[eulerweave]   - norm1: normalize_text                      OK
[eulerweave]   - filter1: heuristic_filter                  OK
[eulerweave]   - sft1: build_sft_messages                   OK
[eulerweave]   - exp1: export_jsonl                         OK
[eulerweave] Checking type chain ...                        OK
[eulerweave] Checking slot ordering ...                     OK
[eulerweave] Checking exports ...                           OK  (1 export)

✓ Manifest is valid.

If the manifest has errors, the command reports a specific message and how to fix it. See Tutorial 6: Validation for details.
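
The "type chain" and "slot ordering" checks in the log above can be pictured with a small Python sketch. This is only an interpretation of those log lines, not eulerweave's actual validator: it assumes the type chain means each block's `input_type` equals the previous block's `output_type`, and it assumes a slot order (`ASSUMED_SLOT_ORDER`) inferred solely from the sample manifest. The function names are hypothetical.

```python
# Hypothetical slot order, inferred only from the sample manifest in this tutorial.
ASSUMED_SLOT_ORDER = ["normalize", "filter", "build_task", "export"]

# The pipeline section of the generated manifest, written out as plain dicts.
PIPELINE = [
    {"id": "norm1", "slot": "normalize", "input_type": "TextDocument", "output_type": "TextDocument"},
    {"id": "filter1", "slot": "filter", "input_type": "TextDocument", "output_type": "TextDocument"},
    {"id": "sft1", "slot": "build_task", "input_type": "TextDocument", "output_type": "SFTMessages"},
    {"id": "exp1", "slot": "export", "input_type": "SFTMessages", "output_type": "ExportedDataset"},
]


def check_type_chain(blocks):
    """Each block's input_type should equal the previous block's output_type."""
    errors = []
    for prev, cur in zip(blocks, blocks[1:]):
        if prev["output_type"] != cur["input_type"]:
            errors.append(
                f"{prev['id']} -> {cur['id']}: "
                f"{prev['output_type']} does not match {cur['input_type']}"
            )
    return errors


def check_slot_order(blocks, slot_order=ASSUMED_SLOT_ORDER):
    """Slots should never move backwards relative to the assumed order."""
    errors = []
    highest = -1
    for block in blocks:
        if block["slot"] not in slot_order:
            errors.append(f"{block['id']}: unknown slot {block['slot']!r}")
            continue
        rank = slot_order.index(block["slot"])
        if rank < highest:
            errors.append(f"{block['id']}: slot {block['slot']!r} appears out of order")
        highest = max(highest, rank)
    return errors


print(check_type_chain(PIPELINE))  # [] for the sample manifest
print(check_slot_order(PIPELINE))  # [] for the sample manifest
```

Removing `sft1` from the list, for instance, makes `check_type_chain` report that `filter1` produces `TextDocument` while `exp1` expects `SFTMessages`.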

Step 4: `eulerweave plan` (execution preview)

eulerweave plan manifest.yaml --records 10000

[eulerweave] Planning execution for: manifest.yaml
[eulerweave] Track: sft
[eulerweave] Estimated input records: 10,000

 Execution Plan
+---------+-------------------+--------+-------------------+-------------------+
| Step    | Block             | Slot   | Input Type        | Output Type       |
+---------+-------------------+--------+-------------------+-------------------+
| 1       | norm1             | norm.. | TextDocument      | TextDocument      |
| 2       | filter1           | filter | TextDocument      | TextDocument      |
| 3       | sft1              | build..| TextDocument      | SFTMessages       |
| 4       | exp1              | export | SFTMessages       | ExportedDataset   |
+---------+-------------------+--------+-------------------+-------------------+

[eulerweave] No policy violations detected.
[eulerweave] Plan looks good. Run with: eulerweave run manifest.yaml

Step 5: `eulerweave run` (pipeline execution)

eulerweave run manifest.yaml \
  --input data/train.jsonl \
  --artifacts ./artifacts

[eulerweave] Loading manifest: manifest.yaml
[eulerweave] Track: sft
[eulerweave] Input: data/train.jsonl (5 records)

[eulerweave] Starting pipeline execution ...
[eulerweave] [1/4] norm1 (normalize_text) ...
[eulerweave]   Processed 5 documents, 0 dropped
[eulerweave] [2/4] filter1 (heuristic_filter) ...
[eulerweave]   Processed 5 documents, 0 filtered
[eulerweave] [3/4] sft1 (build_sft_messages) ...
[eulerweave]   Built 5 SFT message sets
[eulerweave] [4/4] exp1 (export_jsonl) ...
[eulerweave]   Exported 5 records to out/result.jsonl

[eulerweave] Pipeline completed successfully.

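What happened

norm1: cleans up whitespace in each document's text and removes empty documents.
filter1: applies the length constraints (removes documents shorter than 50 characters or longer than 10,000 characters).
sft1: converts each document into the SFT message format (instruction → output mapping).
exp1: writes the final SFT records to out/result.jsonl.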
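
The per-block behavior listed above can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not eulerweave's implementation: the function names are hypothetical, whitespace cleanup is assumed to mean collapsing runs of whitespace, and the tutorial does not say which text field the length limits are measured on, so the filter here simply takes a string.

```python
def normalize_text(text):
    """Collapse runs of whitespace and trim both ends (assumed meaning of cleanup)."""
    return " ".join(text.split())


def passes_length_filter(text, min_length=50, max_length=10000):
    """Keep text whose character count is within [min_length, max_length]."""
    return min_length <= len(text) <= max_length


def to_sft_record(instruction, output):
    """Build a record with the three keys seen in the tutorial's exported JSONL."""
    return {"instruction": instruction, "input": "", "output": output}


raw = "  Photosynthesis   converts sunlight\ninto chemical energy.  "
clean = normalize_text(raw)
print(clean)
print(passes_length_filter(clean))
print(to_sft_record("Explain photosynthesis.", clean))
```

The default limits mirror `min_length: 50` and `max_length: 10000` from the manifest's `filter1` params.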

Check the output

head -2 out/result.jsonl

{"instruction": "Explain photosynthesis.", "input": "", "output": "Photosynthesis is the process by which green plants convert sunlight into chemical energy, producing glucose and oxygen from carbon dioxide and water."}
{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}
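
Beyond eyeballing the first lines, one way to confirm that nothing was silently dropped is to compare the instructions in the input file with those in the exported file. The sketch below is a standalone helper with hypothetical names; the inline records use placeholder outputs and stand in for the lines of data/train.jsonl and out/result.jsonl.

```python
import json


def load_records(lines):
    """Parse non-empty JSONL lines into dicts."""
    return [json.loads(line) for line in lines if line.strip()]


def missing_instructions(input_lines, output_lines):
    """Return instructions present in the input but absent from the output."""
    exported = {record["instruction"] for record in load_records(output_lines)}
    return [
        record["instruction"]
        for record in load_records(input_lines)
        if record["instruction"] not in exported
    ]


# Placeholder records for illustration only.
source = [
    '{"instruction": "Explain photosynthesis.", "output": "placeholder"}',
    '{"instruction": "What is the capital of France?", "output": "placeholder"}',
]
exported = [
    '{"instruction": "Explain photosynthesis.", "input": "", "output": "placeholder"}',
]
print(missing_instructions(source, exported))  # ['What is the capital of France?']
```

With the real files, pass two open file handles; an empty list means every input instruction reached the output.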

Step 6: Using other input formats

eulerweave supports several input formats besides JSONL. Change inputs[].type, and adjust uri and options to match the new source:

CSV input

inputs:
  - type: csv
    uri: data/train.csv
    options:
      text_column: content
      delimiter: ","
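
Since `text_column` must name a column that exists in the file, a quick pre-flight check of the CSV header can catch typos before running the pipeline. This is a minimal sketch using Python's standard `csv` module; the helper `csv_has_column` is hypothetical and the sample CSV content is made up for illustration.

```python
import csv
import io


def csv_has_column(csv_text, column, delimiter=","):
    """Return True if the CSV header row contains the given column name."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader, [])
    return column in header


# Made-up CSV content; a real check would read data/train.csv instead.
sample = "id,content\n1,Photosynthesis converts sunlight into chemical energy.\n"
print(csv_has_column(sample, "content"))  # True
print(csv_has_column(sample, "text"))     # False
```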

PDF input

inputs:
  - type: pdf
    uri: data/document.pdf
    options:
      strategy: auto       # auto, text, ocr
      page_range: "1-10"   # optional: restrict to a page range

A full pipeline with PDF input is covered in detail in Tutorial 2: PDF → Training Data.

Parquet input

inputs:
  - type: parquet
    uri: data/train.parquet
    options:
      text_column: text

Full workflow summary

# 1. Scaffold
eulerweave new manifest.yaml --track sft

# 2. Edit manifest.yaml

# 3. Validate
eulerweave validate manifest.yaml

# 4. Review the execution plan
eulerweave plan manifest.yaml --records 10000

# 5. Run
eulerweave run manifest.yaml --input data/train.jsonl --artifacts ./artifacts

# 6. Check the output
head -5 out/result.jsonl

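Next steps

Tutorial 2: PDF → Training Data (generate SFT training data from local PDFs)
Tutorial 3: HuggingFace → Training Data (download an HF dataset and convert it into training data)
Tutorial 4: SFT Track Deep Dive (compare the three SFT builder blocks)
Tutorial 6: Validation (resolve manifest validation errors)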
