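Tutorial 3: HuggingFace Datasets → SFT Training Data

This tutorial walks through the full process of downloading a public dataset from HuggingFace, processing it with an eulerweave pipeline, and producing training data for LLM fine-tuning.

What you will learn:

Downloading data with the HuggingFace datasets library
Saving the downloaded data as JSONL/CSV
Configuring an SFT pipeline with an eulerweave manifest
Processing structured data (build_sft_messages) and raw text (build_langextract_qna)

Prerequisites: eulerweave must be installed. See Tutorial 0: Installation.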

Scenario A: Structured SFT dataset (Alpaca style)

Many SFT datasets on HuggingFace already contain instruction-output pairs. Such data can be processed directly with build_sft_messages.

Step 1: Download the data

pip install datasets

# download_alpaca.py
from datasets import load_dataset
import json

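# Download the tatsu-lab/alpaca dataset
ds = load_dataset("tatsu-lab/alpaca", split="train")

# Save only the first 1000 records as JSONL.
# UTF-8 is set explicitly because ensure_ascii=False writes non-ASCII characters as-is.
with open("data/alpaca_1k.jsonl", "w", encoding="utf-8") as f: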
    for row in ds.select(range(1000)):
        json.dump({
            "instruction": row["instruction"],
            "input": row["input"],
            "output": row["output"],
        }, f, ensure_ascii=False)
        f.write("\n")

print(f"Saved {min(1000, len(ds))} records to data/alpaca_1k.jsonl")

mkdir -p data
python download_alpaca.py
# Saved 1000 records to data/alpaca_1k.jsonl

Step 2: Inspect the data

head -1 data/alpaca_1k.jsonl | python -m json.tool

{
  "instruction": "Give three tips for staying healthy.",
  "input": "",
  "output": "1. Eat a balanced diet... 2. Exercise regularly... 3. Get enough sleep..."
}

Because this data already has an instruction-output structure, use build_sft_messages.
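Before building the manifest, it can help to confirm that every record really has the three fields shown above. The following is a minimal sketch using only the standard library; the helper name find_incomplete_records is hypothetical and is not part of eulerweave.

```python
import json

# Field names taken from the sample record shown above.
REQUIRED_KEYS = {"instruction", "input", "output"}


def find_incomplete_records(path):
    """Return (line_number, missing_keys) for every record lacking a required key."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - set(record)
            if missing:
                problems.append((line_number, sorted(missing)))
    return problems
```

An empty list from find_incomplete_records("data/alpaca_1k.jsonl") would mean every record carries all three fields.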

Step 3: Write the manifest

manifest_alpaca.yaml:

version: 1
track: sft

inputs:
  - type: jsonl
    uri: data/alpaca_1k.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 20
      max_length: 10000

  - id: sft1
    type: build_sft_messages
    slot: build_task
    input_type: TextDocument
    output_type: SFTMessages

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTMessages
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/alpaca_sft.jsonl

profile:
  cpu_only: true
  allow_external_llm: false

Key points:

build_sft_messages: automatically maps the existing fields to the SFT format (no LLM required)
cpu_only: true: uses the CPU only
allow_external_llm: false: no external LLM calls
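To make the idea of "field mapping" concrete, here is a rough sketch of how an Alpaca-style record could be turned into a chat-style record without any LLM. This is an illustration only: the function alpaca_to_messages is hypothetical, the messages layout is borrowed from the Scenario B output examples in this tutorial, and what build_sft_messages actually emits is determined by eulerweave and may differ.

```python
def alpaca_to_messages(record):
    """Map one Alpaca-style record to a chat-style record (hypothetical mapping)."""
    prompt = record["instruction"].strip()
    extra = (record.get("input") or "").strip()
    if extra:
        # Assumption: the optional input is appended to the instruction.
        prompt = f"{prompt}\n\n{extra}"
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": record["output"].strip()},
        ]
    }
```

The mapping is a pure function of the record's own fields, which is why a step of this kind needs neither a GPU nor an external model.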

Step 4: Validate and run

# Validate
eulerweave validate manifest_alpaca.yaml

# Run
eulerweave run manifest_alpaca.yaml \
  --input data/alpaca_1k.jsonl \
  --artifacts ./artifacts

Step 5: Check the results

wc -l out/alpaca_sft.jsonl
# 998 out/alpaca_sft.jsonl  (2 records filtered out)

head -1 out/alpaca_sft.jsonl | python -m json.tool

{
  "instruction": "Give three tips for staying healthy.",
  "input": "",
  "output": "1. Eat a balanced diet... 2. Exercise regularly... 3. Get enough sleep..."
}
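The number of filtered records can also be computed instead of read off the wc output. The sketch below simply compares record counts between the input and output files; the helper names are hypothetical, and it assumes one JSON record per line in both files.

```python
def count_jsonl_records(path):
    """Count non-empty lines (one JSON record per line) in a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def filtered_out(input_path, output_path):
    """Number of records present in the input but absent from the output."""
    return count_jsonl_records(input_path) - count_jsonl_records(output_path)
```

For the run above, filtered_out("data/alpaca_1k.jsonl", "out/alpaca_sft.jsonl") would report how many records the pipeline dropped.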

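Scenario B: Raw-text dataset → QnA generation

HuggingFace also hosts many datasets that contain only raw text (Wikipedia, news, papers, and so on). For such data, build_langextract_qna generates QnA pairs automatically.

Step 1: Download Wikipedia data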

# download_wiki.py
from datasets import load_dataset
import json

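# Korean Wikipedia (or any language you want).
# Assumption: the legacy "wikipedia" loading script may not load on recent versions of
# datasets, so this uses the Parquet-based "wikimedia/wikipedia" repository instead.
ds = load_dataset("wikimedia/wikipedia", "20231101.ko", split="train", streaming=True)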

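# Save the first 500 sufficiently long documents as JSONL.
# Only documents that are actually written are counted, so skipped short
# documents do not reduce the number of saved records.
saved = 0
with open("data/wiki_ko_500.jsonl", "w", encoding="utf-8") as f:
    for row in ds:
        if saved >= 500:
            break
        if len(row["text"]) < 200:
            continue  # skip documents that are too short
        json.dump({
            "text": row["text"][:5000],  # truncate to 5000 characters
            "title": row["title"],
        }, f, ensure_ascii=False)
        f.write("\n")
        saved += 1

print(f"Saved {saved} records to data/wiki_ko_500.jsonl")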

python download_wiki.py

Step 2: Inspect the data

head -1 data/wiki_ko_500.jsonl | python -m json.tool

{
  "text": "지미 카터(James Earl Carter Jr., 1924년 10월 1일 ~ )는 미국의 제39대 대통령이다...",
  "title": "지미 카터"
}

Because only raw text is available, the QnA pairs must be generated with an LLM.

Step 3: Test the pipeline with FakeProvider

First, test the pipeline structure without an LLM.

manifest_wiki_test.yaml:

version: 1
track: sft

inputs:
  - type: jsonl
    uri: data/wiki_ko_500.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 200

  - id: dedup1
    type: dedup_exact
    slot: dedup
    input_type: TextDocument
    output_type: TextDocument

  - id: qna1
    type: build_langextract_qna
    slot: build_task
    input_type: TextDocument
    output_type: SFTQnA

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTQnA
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/wiki_qna.jsonl

profile:
  cpu_only: true
  allow_external_llm: false

eulerweave validate manifest_wiki_test.yaml
eulerweave run manifest_wiki_test.yaml \
  --input data/wiki_ko_500.jsonl \
  --artifacts ./artifacts

Example FakeProvider output:

{
  "messages": [
    {"role": "user", "content": "What is the main topic of: '지미 카터는 미국의...'?"},
    {"role": "assistant", "content": "The text discusses: 지미 카터는 미국의... [fake-d4e5f6]"}
  ]
}
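Since the point of the FakeProvider run is to check structure rather than content, a small structural check on the output can be useful. The sketch below assumes each output line looks like the example above (a messages list of role/content entries); the function name is hypothetical, and the exact schema eulerweave guarantees may be stricter or looser.

```python
import json


def is_valid_chat_record(line):
    """Check that a JSONL line holds a user-first, assistant-last messages list."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) < 2:
        return False
    for message in messages:
        if not isinstance(message, dict):
            return False
        if message.get("role") not in {"user", "assistant"}:
            return False
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            return False
    return messages[0]["role"] == "user" and messages[-1]["role"] == "assistant"
```

Running this over every line of out/wiki_qna.jsonl after the test run gives a quick pass/fail signal before any time is spent on real LLM calls.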

Step 4: Generate real QnA with Ollama

Once the pipeline works correctly, use Ollama to generate real QnA pairs.

manifest_wiki_ollama.yaml:

version: 1
track: sft

inputs:
  - type: jsonl
    uri: data/wiki_ko_500.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 200
      max_length: 50000

  - id: dedup1
    type: dedup_exact
    slot: dedup
    input_type: TextDocument
    output_type: TextDocument

  - id: m1
    type: metrics_text_basic
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument

  - id: qna1
    type: build_langextract_qna
    slot: build_task
    input_type: TextDocument
    output_type: SFTQnA
    params:
      model: "gpt-oss:20b"
      base_url: "http://localhost:11434"

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTQnA
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/wiki_qna.jsonl

profile:
  cpu_only: false
  allow_external_llm: true

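# Start Ollama (ollama serve stays in the foreground, so run it in a separate terminal)
ollama serve
# Pull the model (from another terminal while the server is running)
ollama pull gpt-oss:20b

# Run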
eulerweave run manifest_wiki_ollama.yaml \
  --input data/wiki_ko_500.jsonl \
  --artifacts ./artifacts
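A long run that fails because the server is down or the model was never pulled is frustrating, so a quick pre-flight check may be worthwhile. The sketch below queries Ollama's /api/tags endpoint, which lists locally available models; the function name is hypothetical, and the default base URL is the one from the manifest above.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=3.0):
    """Return the names of locally available Ollama models, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None  # server not running, timed out, or returned non-JSON
    if not isinstance(payload, dict):
        return None
    return [model.get("name") for model in payload.get("models", [])]
```

If list_ollama_models() returns None, the server is not reachable; if it returns a list that does not contain "gpt-oss:20b", the model still needs to be pulled.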

Example of real QnA output:

{
  "messages": [
    {"role": "user", "content": "지미 카터 대통령은 몇 대 대통령인가요?"},
    {"role": "assistant", "content": "지미 카터는 미국의 제39대 대통령입니다. 1977년부터 1981년까지 재임했습니다."}
  ]
}

Scenario C: CSV dataset

Datasets loaded from HuggingFace can also be saved as CSV.

Download as CSV

# download_csv.py
from datasets import load_dataset

ds = load_dataset("tatsu-lab/alpaca", split="train")
ds.select(range(1000)).to_csv("data/alpaca_1k.csv")
print("Saved to data/alpaca_1k.csv")

CSV manifest

version: 1
track: sft

inputs:
  - type: csv
    uri: data/alpaca_1k.csv
    options:
      text_column: instruction    # column to use as the text
      delimiter: ","

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: sft1
    type: build_sft_messages
    slot: build_task
    input_type: TextDocument
    output_type: SFTMessages

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTMessages
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/alpaca_csv_sft.jsonl

profile:
  cpu_only: true
  allow_external_llm: false
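Because the manifest above names a specific text_column, a typo in the column name is an easy mistake to make. The following minimal sketch checks the CSV header with the standard csv module before running the pipeline; the helper name csv_has_column is hypothetical.

```python
import csv


def csv_has_column(path, column, delimiter=","):
    """Return True if the CSV header row contains the given column name."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        return column in (reader.fieldnames or [])
```

For example, csv_has_column("data/alpaca_1k.csv", "instruction") should hold before the manifest is validated.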

Practical example: Code dataset → coding QnA

This example generates coding QnA from a code-related dataset.

Download the data

# download_code.py
from datasets import load_dataset
import json

ds = load_dataset("sahil2801/CodeAlpaca-20k", split="train")

with open("data/code_alpaca.jsonl", "w", encoding="utf-8") as f:
    for row in ds.select(range(500)):
        json.dump({
            "instruction": row["instruction"],
            "input": row.get("input", ""),
            "output": row["output"],
        }, f, ensure_ascii=False)
        f.write("\n")

print("Saved 500 records to data/code_alpaca.jsonl")

manifest_code.yaml:

version: 1
track: sft

inputs:
  - type: jsonl
    uri: data/code_alpaca.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 30

  - id: m1
    type: metrics_text_basic
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument

  - id: sft1
    type: build_sft_messages
    slot: build_task
    input_type: TextDocument
    output_type: SFTMessages

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTMessages
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/code_sft.jsonl

profile:
  cpu_only: true
  allow_external_llm: false

python download_code.py
eulerweave validate manifest_code.yaml
eulerweave run manifest_code.yaml --input data/code_alpaca.jsonl --artifacts ./artifacts

# Check the result
head -1 out/code_sft.jsonl | python -m json.tool

build_sft_messages vs build_langextract_qna

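Criteria for deciding which block to use:

Criterion	`build_sft_messages`	`build_langextract_qna`
Source data	instruction-output pairs already exist	only raw text exists
LLM required	No	Yes (Ollama or FakeProvider)
Speed	Fast (field mapping only)	Slow (LLM calls)
Quality	Depends on the source data	Depends on the LLM's capability
Suitable data	Alpaca, ShareGPT, Dolly, etc.	Wikipedia, papers, manuals, news
Profile setting	`allow_external_llm: false` works	`allow_external_llm: true` required for a real LLM such as Ollama (the FakeProvider test in Scenario B runs with `false`)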

For a detailed comparison, see Tutorial 4: SFT Track Deep Dive.
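The first row of the table can be expressed as a small decision helper. This is only a sketch of the rule of thumb, not an eulerweave feature: the function name is hypothetical, and the field names (instruction, output, text) are the ones used in this tutorial's sample records.

```python
def choose_builder(record):
    """Suggest a builder block for one input record, following the table's first row."""
    has_pair = bool(record.get("instruction")) and bool(record.get("output"))
    if has_pair:
        return "build_sft_messages"  # existing pair: field mapping only
    if record.get("text"):
        return "build_langextract_qna"  # raw text only: QnA must be generated
    raise ValueError("record has neither an instruction/output pair nor a text field")
```

Checking the first record of a new dataset this way gives a quick hint about which manifest in this tutorial to start from.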

Summary of the full process

# 1. Install HuggingFace datasets
pip install datasets

# 2. Download the data (Python script)
python download_alpaca.py   # or download_wiki.py

# 3. Write the manifest

# 4. Validate
eulerweave validate manifest.yaml

# 5. Run
eulerweave run manifest.yaml --input data/... --artifacts ./artifacts

# 6. Check the results
wc -l out/result.jsonl
# --json-lines (Python 3.8+) lets json.tool parse one JSON object per line
head -3 out/result.jsonl | python -m json.tool --json-lines

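Next steps

Tutorial 2: PDF → Training Data (processing local PDF documents)
Tutorial 4: SFT Track Deep Dive (detailed comparison of the three builder blocks)
Tutorial 7: MDS Export (exporting to the MDS format for streaming training)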