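Tutorial 10: Perplexity Metric - Text Quality Scoring with HuggingFace Transformers

The metrics_perplexity block uses HuggingFace Transformers to compute a perplexity score for each text. Perplexity is used as a core quality filter when cleaning datasets such as C4 and FineWeb.

Prerequisite: you should have completed Tutorial 8: Metric Blocks.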

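What is perplexity?

Perplexity (PPL) indicates how hard a text is for a language model to predict. The ranges below are rough guides; actual values depend on the model and tokenizer:

Low PPL (10-100): natural, coherent text
Medium PPL (100-500): domain-specific terminology, code, or mixed languages
High PPL (500+): gibberish, noise, or text with broken formatting

Formula: PPL = exp(cross_entropy_loss), where cross_entropy_loss is the mean per-token negative log-likelihood.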
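
As a minimal sketch of the formula above, perplexity can be computed directly from the probabilities a model assigns to each token of a text. The function name `perplexity_from_probs` is illustrative and is not part of eulerweave.

```python
import math

def perplexity_from_probs(token_probs):
    """Return exp(mean negative log-likelihood) for per-token probabilities."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Cross-entropy loss is the mean negative log-probability per token.
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)

# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among 4 options, so its perplexity is (up to rounding) 4.
uniform_ppl = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
```

Read this way, a perplexity of N means the model is on average as uncertain as if it were choosing uniformly among N tokens at each position.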

Step 1: Install dependencies

# Install PyTorch + Transformers
pip install torch transformers

The perplexity block uses AutoModelForCausalLM and AutoTokenizer to load the model locally. No separate server is required.
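
A rough sketch of what local scoring with these two classes can look like is shown below. It is not the eulerweave implementation: the function names `resolve_device` and `score_perplexity` are hypothetical, the `auto` fallback rule is an assumption, and the heavy imports are deferred so the module can still be imported on machines without torch.

```python
import math

def resolve_device(device="auto"):
    """Map 'auto' to 'cuda' when a GPU is visible, else 'cpu' (assumed behavior)."""
    if device != "auto":
        return device
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

def score_perplexity(text, model_id="gpt2", device="auto", max_chars=2048):
    """Compute PPL = exp(loss) for one text with a locally loaded causal LM."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    target = resolve_device(device)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).to(target)
    model.eval()

    # Truncate by characters first, mirroring the max_chars_per_doc idea,
    # then let the tokenizer truncate to the model's maximum length.
    inputs = tokenizer(text[:max_chars], return_tensors="pt", truncation=True).to(target)
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return math.exp(outputs.loss.item())
```

Loading the model on every call is wasteful; a real scorer would load it once and reuse it. Texts that tokenize to a single token leave nothing to predict and may produce a NaN loss, so very short inputs are best skipped.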

Recommended models

Model	HuggingFace ID	Size	Use
Qwen3-0.6B	`Qwen/Qwen3-0.6B`	~1.2GB	Default; lightweight PPL computation
GPT-2	`gpt2`	~500MB	Quick tests; English only
Qwen3-1.7B	`Qwen/Qwen3-1.7B`	~3.4GB	More accurate PPL (GPU recommended)

Step 2: Add the perplexity block to the manifest

version: 1
track: sft
inputs:
  - type: jsonl
    uri: data/train.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: m_ppl
    type: metrics_perplexity
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument
    params:
      model: Qwen/Qwen3-0.6B
      device: auto
      max_chars_per_doc: 2048
      sample_rate: 0.1       # sample 10% of records on large datasets
      sample_max_records: 1000

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: TextDocument
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/result.jsonl

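Parameter reference

Parameter	Default	Description
`model`	`Qwen/Qwen3-0.6B`	HuggingFace model ID
`device`	`auto`	Compute device: `cpu`, `cuda`, or `auto`
`max_chars_per_doc`	`2048`	Maximum number of characters processed per document
`sample_rate`	`1.0`	Sampling rate, in the interval (0.0, 1.0]
`sample_max_records`	(none)	Maximum number of sampled records
`seed`	(none)	Seed for reproducible sampling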
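
How the three sampling parameters might combine can be sketched as follows. This is an assumption about the semantics, not the block's actual code: each record is kept with probability `sample_rate`, the kept set is capped at `sample_max_records`, and `seed` makes the selection reproducible. The helper name `sample_records` is hypothetical.

```python
import random

def sample_records(records, sample_rate=1.0, sample_max_records=None, seed=None):
    """Pick a reproducible subset of records for perplexity scoring."""
    if not 0.0 < sample_rate <= 1.0:
        raise ValueError("sample_rate must be in (0.0, 1.0]")
    rng = random.Random(seed)  # seed=None falls back to system entropy
    # Keep everything at rate 1.0; otherwise keep each record independently.
    kept = [r for r in records if sample_rate >= 1.0 or rng.random() < sample_rate]
    if sample_max_records is not None:
        # Simple cap; note that it favors records that appear earlier.
        kept = kept[:sample_max_records]
    return kept

docs = [f"doc-{i}" for i in range(10_000)]
subset = sample_records(docs, sample_rate=0.1, sample_max_records=1000, seed=42)
```

With a fixed seed, repeated runs select the same records, which keeps perplexity summaries comparable across runs of the same dataset.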

Step 3: Run and check the results

eulerweave run manifest.yaml --input data/train.jsonl

Result artifacts:

artifacts/{run_id}/
  metrics/
    metrics_perplexity.json

Example result

{
  "metric_name": "metrics_perplexity",
  "version": "1.0",
  "computed_at": "2026-02-26T12:00:00Z",
  "summary": {
    "record_count": 10000,
    "analyzed_count": 1000,
    "avg_perplexity": 45.2,
    "p50_perplexity": 38.1,
    "p95_perplexity": 120.5,
    "max_perplexity": 350.0,
    "avg_loss": 3.81,
    "model_used": "Qwen/Qwen3-0.6B"
  },
  "warnings": [],
  "params": {
    "sample_rate": 0.1,
    "seed": null,
    "sampled_count": 1000
  }
}
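
A small script can read this report and turn it into a pass/fail signal. The sketch below assumes the JSON layout shown above; the `load_report` and `check_report` helpers and the p95 threshold of 500 are illustrative choices, not eulerweave defaults.

```python
import json
from pathlib import Path

def load_report(path):
    """Load a metrics_perplexity.json artifact from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

def check_report(report, max_p95=500.0):
    """Return a list of human-readable problems found in the report summary."""
    summary = report["summary"]
    problems = []
    if summary["analyzed_count"] == 0:
        problems.append("no records were analyzed")
    if summary["p95_perplexity"] > max_p95:
        problems.append(f"p95 perplexity {summary['p95_perplexity']} exceeds {max_p95}")
    if summary["model_used"] == "fake-scorer":
        problems.append("scores come from the fake scorer, not a real model")
    return problems

# Values copied from the example report above.
example = {
    "summary": {
        "analyzed_count": 1000,
        "p95_perplexity": 120.5,
        "model_used": "Qwen/Qwen3-0.6B",
    }
}
issues = check_report(example)  # empty list for these values
```

In practice, `load_report` would be pointed at `artifacts/{run_id}/metrics/metrics_perplexity.json` for the run being inspected.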

Step 4: Use in CI tests

torch and transformers may not be installed in CI, so tests inject FakePerplexityScorer instead. This scorer produces deterministic perplexity values derived from a hash of the text:

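from eulerweave.providers.perplexity import FakePerplexityScorer
from eulerweave.blocks_builtin.metrics_perplexity import MetricsPerplexityBlock

# CI test: inject the fake scorer so no model download is needed.
# `records` is assumed to be the list of input records prepared by the test.
block = MetricsPerplexityBlock(scorer=FakePerplexityScorer())
result = block.compute(records)

# The fake scorer reports itself as the model and yields positive scores.
assert result.summary["model_used"] == "fake-scorer"
assert result.summary["avg_perplexity"] > 0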
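
To illustrate what "deterministic, hash-based" can mean, here is a toy scorer. It is not the real `FakePerplexityScorer`; the class name `ToyHashScorer`, its `score` method, and the 10 to 500 output range are all assumptions made for this sketch.

```python
import hashlib

class ToyHashScorer:
    """Deterministic stand-in that maps text to a pseudo-perplexity."""

    model_name = "toy-hash-scorer"

    def score(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Interpret the first 4 bytes as an integer and scale it into [10, 500).
        fraction = int.from_bytes(digest[:4], "big") / 2**32
        return 10.0 + fraction * 490.0

scorer = ToyHashScorer()
first = scorer.score("hello world")
second = scorer.score("hello world")  # identical to `first` on every run
```

Because the value depends only on the text, tests built on such a scorer give the same result on every machine, without a GPU or a model download.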

Optional E2E test (requires a real Transformers model):

EULERFLOW_E2E_OLLAMA=1 pytest tests/integration/test_e2e_metrics_expansion.py -v -k ollama

Full metrics pipeline example

A comprehensive data-quality pipeline that includes all four new metrics:

version: 1
track: sft
inputs:
  - type: jsonl
    uri: data/train.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument

  - id: dedup1
    type: dedup_exact
    slot: dedup
    input_type: TextDocument
    output_type: TextDocument

  - id: m_basic
    type: metrics_text_basic
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument

  - id: m_quality
    type: metrics_quality_heuristic
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument

  - id: m_rep
    type: metrics_text_repetition
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument

  - id: m_gib
    type: metrics_text_gibberish
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument

  - id: m_bp
    type: metrics_text_boilerplate
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument

  - id: m_ppl
    type: metrics_perplexity
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument
    params:
      model: Qwen/Qwen3-0.6B
      device: auto
      sample_rate: 0.1

  - id: sft1
    type: build_sft_messages
    slot: build_task
    input_type: TextDocument
    output_type: SFTMessages

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTMessages
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/result.jsonl

Troubleshooting

Symptom	Cause	Fix
`Compilation error: empty model name`	`model: ""` is set	Specify a valid HuggingFace model ID (e.g. `Qwen/Qwen3-0.6B`)
`Compilation error: invalid device`	`device` is not `cpu`, `cuda`, or `auto`	Use one of the allowed device values
`Compilation error: unknown params`	An unrecognized parameter is used	Use only the allowed parameters (see the parameter table above)
`ImportError: torch`	PyTorch is not installed	`pip install torch transformers`
`model_used: "fake-scorer"`	The fake scorer is in use	Run `pip install torch transformers`, then use the real scorer
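
The compile-time checks listed in the table can be approximated with a few lines of validation. This sketch only mirrors the error messages above; the function `validate_params` and the exact rule set are assumptions, not the eulerweave compiler.

```python
ALLOWED_PARAMS = {
    "model", "device", "max_chars_per_doc",
    "sample_rate", "sample_max_records", "seed",
}
ALLOWED_DEVICES = {"cpu", "cuda", "auto"}

def validate_params(params):
    """Return the list of compilation errors a params mapping would trigger."""
    errors = []
    unknown = sorted(set(params) - ALLOWED_PARAMS)
    if unknown:
        errors.append(f"unknown params: {', '.join(unknown)}")
    # An explicitly empty or whitespace-only model ID is rejected.
    if "model" in params and not str(params["model"]).strip():
        errors.append("empty model name")
    if params.get("device", "auto") not in ALLOWED_DEVICES:
        errors.append("invalid device")
    return errors

errors = validate_params({"model": "", "device": "gpu", "batch": 8})
```

Running a check like this before launching a long job surfaces manifest typos early, before any model is loaded.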

Next steps

Tutorial 8: Metric Blocks - full list of metric blocks
Tutorial 9: Remote Inputs - HuggingFace/HTTPS data sources
Architecture: Engine - PerplexityScorer internals