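Tutorial 4: SFT Track Deep Dive (Three Builder Blocks)

This tutorial compares in detail the three task-building blocks available in the SFT (Supervised Fine-Tuning) track.

Prerequisite: complete Tutorial 1: Quick Start first.

What is the SFT track?

The SFT track is designed for the training paradigm in which a language model learns from (instruction, output) pairs. Setting track: sft in the manifest enables the SFT-specific validation rules and the SFT-related blocks.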

Comparing the three build_task blocks

Block	Purpose	Requires LLM	Output type
`build_sft_messages`	Maps existing fields to the SFT format	No	SFTMessages
`build_sft_qna`	Generates multiple QnA pairs with an LLM	Yes	SFTMessages
`build_langextract_qna`	LangExtract-style QnA generation	Yes*	SFTQnA

* build_langextract_qna can run without an LLM in FakeProvider mode.

Which block should you use?

Does the source data already contain instruction-output pairs?
  ├── Yes → build_sft_messages (no LLM needed, fast)
  └── No (raw text only)
       ├── Need multiple QnA pairs per document? → build_sft_qna
       └── One QnA pair per document → build_langextract_qna
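
The decision tree above can be written as a small helper. This is only an illustrative sketch: the function name and its two boolean parameters are hypothetical and are not part of the tool itself.

```python
def choose_build_task_block(has_instruction_output_pairs: bool,
                            multiple_qna_per_doc: bool = False) -> str:
    """Return the build_task block type suggested by the decision tree."""
    if has_instruction_output_pairs:
        # Structured data: plain field mapping, no LLM call needed.
        return "build_sft_messages"
    if multiple_qna_per_doc:
        # Raw text, several QnA pairs per document.
        return "build_sft_qna"
    # Raw text, a single QnA pair per document.
    return "build_langextract_qna"
```

For example, an Alpaca-style file would lead to build_sft_messages, while a set of raw articles that should each yield several questions would lead to build_sft_qna.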

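Block 1: `build_sft_messages` (field mapping)

Use this block when the source data already contains instruction-output pairs. It maps field names to the SFT schema automatically, without any LLM calls.

Automatic field mapping rules

instruction candidates (in priority order): 1. instruction → 2. prompt → 3. question → 4. user → 5. input

If none of these fields is present, Document.text is used as the instruction.

output candidates (in priority order): 1. output → 2. answer → 3. response → 4. completion → 5. assistant

input/context (optional): 1. input (only when it is not the field already used as the instruction) → 2. context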
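
The priority rules above can be sketched in a few lines of Python. This is not the block's actual implementation. The names map_to_sft and first_present are hypothetical, and the sketch assumes that a field counts as present whenever its key exists in the record and that a missing output becomes an empty string.

```python
from typing import Any, Dict, Optional, Sequence

# Candidate field names in priority order, as listed in the rules above.
INSTRUCTION_KEYS = ("instruction", "prompt", "question", "user", "input")
OUTPUT_KEYS = ("output", "answer", "response", "completion", "assistant")


def first_present(record: Dict[str, Any], keys: Sequence[str]) -> Optional[str]:
    """Return the first key from `keys` that exists in `record`, if any."""
    for key in keys:
        if key in record:
            return key
    return None


def map_to_sft(record: Dict[str, Any], document_text: str = "") -> Dict[str, str]:
    """Map a source record onto instruction/input/output fields."""
    instr_key = first_present(record, INSTRUCTION_KEYS)
    # Fall back to the document text when no instruction candidate exists.
    instruction = str(record[instr_key]) if instr_key else document_text

    out_key = first_present(record, OUTPUT_KEYS)
    output = str(record[out_key]) if out_key else ""  # assumption: empty if absent

    # `input` is used as context only if it was not consumed as the instruction.
    if "input" in record and instr_key != "input":
        context = str(record["input"])
    elif "context" in record:
        context = str(record["context"])
    else:
        context = ""
    return {"instruction": instruction, "input": context, "output": output}
```

With this sketch, a record such as {"input": "q", "answer": "r"} uses input as the instruction (rule 5) and therefore leaves the input/context field empty.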

Manifest example

version: 1
track: sft

inputs:
  - type: jsonl
    uri: data/alpaca.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: sft1
    type: build_sft_messages
    slot: build_task
    input_type: TextDocument
    output_type: SFTMessages

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTMessages
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/sft_result.jsonl

profile:
  cpu_only: true
  allow_external_llm: false

Supported input formats

Format	Example	Mapping result
Alpaca	`{"instruction": "...", "output": "..."}`	instruction → output
ShareGPT-style (flat user/assistant fields)	`{"user": "...", "assistant": "..."}`	user → assistant
Q&A	`{"question": "...", "answer": "..."}`	question → answer
Prompt-Completion	`{"prompt": "...", "completion": "..."}`	prompt → completion
Text only	`{"text": "..."}`	text → instruction, output is an empty string

Block 2: `build_sft_qna` (multi-QnA generation with an LLM)

This block uses an LLM to generate several question-answer pairs per document from raw text.

Manifest example

version: 1
track: sft

inputs:
  - type: jsonl
    uri: data/articles.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 200

  - id: qna1
    type: build_sft_qna
    slot: build_task
    input_type: TextDocument
    output_type: SFTMessages
    params:
      model: "ollama:qwen3:14b"
      questions_per_doc: 3
      reply_lang: "en"

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTMessages
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/sft_qna.jsonl

profile:
  cpu_only: false
  allow_external_llm: true

Key parameters

Parameter	Default	Description
`model`	(required)	`provider:model` format (e.g. `ollama:qwen3:14b`)
`questions_per_doc`	`3`	Number of QnA pairs to generate per document
`reply_lang`	`"en"`	Language of the generated text
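
Because Ollama model names can themselves contain a colon (as in qwen3:14b), a `provider:model` string presumably has to be split on the first colon only. The sketch below illustrates that reading. split_model_spec is a hypothetical helper, and the tool's real parsing logic may differ.

```python
from typing import Tuple


def split_model_spec(spec: str) -> Tuple[str, str]:
    """Split 'provider:model' on the first colon only (assumed convention)."""
    provider, sep, model = spec.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider:model', got {spec!r}")
    return provider, model


provider, model = split_model_spec("ollama:qwen3:14b")
print(provider, model)  # prints: ollama qwen3:14b
```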

Requirements

A running Ollama instance (ollama serve)
allow_external_llm: true in the profile
The [llm] extra must be installed

Output example

Input document:

{"text": "The Great Wall of China is a series of fortifications..."}

Output (3 QnA pairs):

{"instruction": "What is the Great Wall of China?", "input": "", "output": "The Great Wall of China is a series of fortifications..."}
{"instruction": "How long is the Great Wall?", "input": "", "output": "The total length is over 21,000 kilometers."}
{"instruction": "Where was it built?", "input": "", "output": "It was built across the historical northern borders..."}

Block 3: `build_langextract_qna` (LangExtract-style QnA generation)

build_langextract_qna generates one question-answer pair per document from the text. With FakeProvider it runs without an LLM, which is convenient for testing and development.

FakeProvider mode (no LLM required)

version: 1
track: sft

inputs:
  - type: pdf
    uri: data/paper.pdf

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 50

  - id: qna1
    type: build_langextract_qna
    slot: build_task
    input_type: TextDocument
    output_type: SFTQnA

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTQnA
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/langextract_qna.jsonl

profile:
  cpu_only: true
  allow_external_llm: false     # uses FakeProvider

FakeProvider output:

{
  "messages": [
    {"role": "user", "content": "What is the main topic of: 'Introduction...'?"},
    {"role": "assistant", "content": "The text discusses: Introduction... [fake-a1b2c3]"}
  ]
}

Ollama mode (using a real LLM)

  - id: qna1
    type: build_langextract_qna
    slot: build_task
    input_type: TextDocument
    output_type: SFTQnA
    params:
      model: "gpt-oss:20b"
      base_url: "http://localhost:11434"

The profile must set allow_external_llm: true.

Key parameters

Parameter	Default	Description
`model`	`"gpt-oss:20b"`	Ollama model name
`base_url`	`"http://localhost:11434"`	Ollama server address

Output format

The output of build_langextract_qna has type SFTQnA and uses the messages array format:

{
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}

Each record also carries a qna_source: "langextract" annotation.
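
Since the other two blocks emit instruction/output records, a downstream script may want to bring messages-array records into that shape. The following is a minimal sketch, not part of the tool: messages_to_instruction_record is a hypothetical helper that assumes each record holds one user and one assistant message, and it ignores any annotations because their exact location in the record is not shown here.

```python
from typing import Any, Dict


def messages_to_instruction_record(record: Dict[str, Any]) -> Dict[str, str]:
    """Convert a messages-array record into instruction/input/output fields."""
    messages = record.get("messages", [])
    # Take the first message found for each role.
    user = next((m["content"] for m in messages if m.get("role") == "user"), None)
    assistant = next(
        (m["content"] for m in messages if m.get("role") == "assistant"), None
    )
    if user is None or assistant is None:
        raise ValueError("expected one user and one assistant message")
    return {"instruction": user, "input": "", "output": assistant}
```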

Detailed comparison of the three blocks

Feature	`build_sft_messages`	`build_sft_qna`	`build_langextract_qna`
Input data	Structured (instruction+output)	Raw text	Raw text
Requires LLM	No	Yes	Yes (FakeProvider available)
QnA pairs per document	1 (mapped)	N (configurable)	1
Output type	SFTMessages	SFTMessages	SFTQnA
Output format	instruction/output	instruction/output	messages array
Speed	Fast	Slow (LLM calls)	Slow (LLM calls)
Suitable data	Alpaca, ShareGPT	Wiki, news, textbooks	PDFs, papers, manuals
Ease of testing	Easy	Requires an LLM	FakeProvider available

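Slot ordering rules

Pipeline slots in the SFT track must follow this order:

normalize → filter → dedup → enrich → metrics → build_task → export

Slot	Required	Purpose
`normalize`	Recommended	Text cleanup
`filter`	Optional	Quality/length filtering
`dedup`	Optional	Deduplication
`enrich`	Optional	Metadata enrichment
`metrics`	Optional	Collects quality statistics (pass-through)
`build_task`	Required	Conversion to the SFT format
`export`	Required	Writes the output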
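
The ordering rule and the required slots can be checked with a short script before running a pipeline. The sketch below is an illustration only. check_slot_order is a hypothetical helper, not the tool's validator, and it assumes that a slot may appear more than once in a row, as with the two metrics steps in the combined pipeline later in this tutorial.

```python
from typing import List

SLOT_ORDER = ["normalize", "filter", "dedup", "enrich", "metrics", "build_task", "export"]
REQUIRED_SLOTS = {"build_task", "export"}


def check_slot_order(slots: List[str]) -> List[str]:
    """Return a list of problems found in a pipeline's slot sequence."""
    errors = []
    rank = {name: i for i, name in enumerate(SLOT_ORDER)}
    last = -1
    for slot in slots:
        if slot not in rank:
            errors.append(f"unknown slot: {slot}")
            continue
        if rank[slot] < last:
            # This slot belongs earlier than one that was already seen.
            errors.append(f"slot out of order: {slot}")
        last = max(last, rank[slot])
    for missing in sorted(REQUIRED_SLOTS - set(slots)):
        errors.append(f"missing required slot: {missing}")
    return errors
```

An empty list means the sequence respects the order and contains both required slots.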

Type chain

TextDocument → [normalize] → TextDocument → [filter] → TextDocument → [build_task] → SFTMessages/SFTQnA → [export] → ExportedDataset
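
In the manifests in this tutorial, each step's input_type matches the output_type of the step before it. That property can be checked with a few lines of Python. check_type_chain is a hypothetical helper operating on step dictionaries shaped like the pipeline entries, and whether the tool enforces exactly this rule is an assumption.

```python
from typing import Dict, List


def check_type_chain(steps: List[Dict[str, str]]) -> List[str]:
    """Report adjacent steps whose output and input types do not match."""
    errors = []
    for prev, cur in zip(steps, steps[1:]):
        if prev["output_type"] != cur["input_type"]:
            errors.append(
                f"{prev['id']} outputs {prev['output_type']} "
                f"but {cur['id']} expects {cur['input_type']}"
            )
    return errors


pipeline = [
    {"id": "norm1", "input_type": "TextDocument", "output_type": "TextDocument"},
    {"id": "qna1", "input_type": "TextDocument", "output_type": "SFTQnA"},
    {"id": "exp1", "input_type": "SFTMessages", "output_type": "ExportedDataset"},
]
print(check_type_chain(pipeline))
# prints: ['qna1 outputs SFTQnA but exp1 expects SFTMessages']
```

The example deliberately pairs an SFTQnA builder with an exporter declared for SFTMessages, a mismatch that is easy to introduce when switching between builder blocks.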

Practical pipeline combinations

PDF + LangExtract QnA + metrics

version: 1
track: sft

inputs:
  - type: pdf
    uri: data/manual.pdf

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 100

  - id: dedup1
    type: dedup_exact
    slot: dedup
    input_type: TextDocument
    output_type: TextDocument

  - id: m1
    type: metrics_text_basic
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument

  - id: m2
    type: metrics_quality_heuristic
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument

  - id: qna1
    type: build_langextract_qna
    slot: build_task
    input_type: TextDocument
    output_type: SFTQnA
    params:
      model: "gpt-oss:20b"

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTQnA
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/manual_qna.jsonl

profile:
  cpu_only: false
  allow_external_llm: true

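Merging the results of two manifests

Training data generated from different sources can be merged. Note that build_langextract_qna writes messages-array records while the other two blocks write instruction/output records, so make sure both files use the same record format before combining them:

# Process the structured data
eulerweave run manifest_structured.yaml --artifacts ./art1

# Process the raw text
eulerweave run manifest_raw.yaml --artifacts ./art2

# Merge the results
cat out/structured.jsonl out/raw_qna.jsonl > out/combined.jsonl

# Check the record count
wc -l out/combined.jsonl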
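
Plain cat can glue two records onto one line if the first file lacks a trailing newline. As an alternative, the merge can be done in Python while also verifying that every line is valid JSON. This is a sketch using only the standard library, and merge_jsonl is a hypothetical helper name.

```python
import json
from typing import Iterable


def merge_jsonl(sources: Iterable[str], destination: str) -> int:
    """Concatenate JSONL files into `destination` and return the record count."""
    count = 0
    with open(destination, "w", encoding="utf-8") as out:
        for source in sources:
            with open(source, encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    json.loads(line)  # raises ValueError on malformed JSON
                    out.write(line + "\n")
                    count += 1
    return count
```

Calling merge_jsonl(["out/structured.jsonl", "out/raw_qna.jsonl"], "out/combined.jsonl") would then return the number of records written, which plays the same role as the wc -l check.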

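Next steps

Tutorial 2: PDF → Training Data (the PDF pipeline in detail)
Tutorial 3: HuggingFace → Training Data (processing HF datasets)
Tutorial 5: Plugin Development (building a custom extractor)
Tutorial 6: Validation (fixing manifest errors)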
