Tutorial 12: Debugging Invalid Records

Empty text, null fields, and corrupted records in a dataset degrade training quality. The metrics_record_schema_validate block detects these invalid records automatically and helps you debug them.

Prerequisite: you should have completed Tutorial 8: Metrics Blocks.

Step 1: Add a schema validation block

version: 1
track: sft
inputs:
  - type: jsonl
    uri: data/train.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: m_schema
    type: metrics_record_schema_validate
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument
    params:
      fail_threshold_ratio: 0.2   # warn when more than 20% of records are invalid

  - id: sft1
    type: build_sft_messages
    slot: build_task
    input_type: TextDocument
    output_type: SFTMessages

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTMessages
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/result.jsonl

Step 2: Run the pipeline and check the results

eulerweave run manifest.yaml --input data/train.jsonl

Metric results:

cat artifacts/{run_id}/metrics/metrics_record_schema_validate.json

{
  "metric_name": "metrics_record_schema_validate",
  "summary": {
    "record_count": 1000,
    "valid_count": 985,
    "invalid_count": 15,
    "valid_ratio": 0.985,
    "invalid_sample_count": 15
  },
  "warnings": []
}
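As a rough sketch of how this file could be checked from a script, the snippet below recomputes the invalid ratio from the summary and compares it with a threshold. The function name summarize_metrics is hypothetical and not part of eulerweave; only the JSON keys shown above are taken from the tutorial, and the strict "greater than" comparison is an assumption based on the threshold table in Step 4.

```python
import json


def summarize_metrics(metrics: dict, fail_threshold_ratio: float = 0.2) -> dict:
    """Recompute the invalid ratio from a metrics dict (hypothetical helper)."""
    summary = metrics["summary"]
    total = summary["record_count"]
    invalid = summary["invalid_count"]
    # Guard against an empty dataset to avoid dividing by zero.
    invalid_ratio = invalid / total if total else 0.0
    return {
        "invalid_ratio": invalid_ratio,
        # Assumption: a warning corresponds to the ratio strictly exceeding the threshold.
        "exceeds_threshold": invalid_ratio > fail_threshold_ratio,
        "has_warnings": bool(metrics.get("warnings")),
    }


# The example output from this step, embedded inline so the sketch runs on its own.
example = json.loads("""
{
  "metric_name": "metrics_record_schema_validate",
  "summary": {
    "record_count": 1000,
    "valid_count": 985,
    "invalid_count": 15,
    "valid_ratio": 0.985,
    "invalid_sample_count": 15
  },
  "warnings": []
}
""")

report = summarize_metrics(example)
print(report)
```

For a real run, the dict would come from json.load on the metrics file under artifacts/{run_id}/metrics/ instead of the inline string.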

Step 3: Inspect the invalid record samples

If invalid_count > 0, samples of the invalid records are saved as an artifact:

cat artifacts/{run_id}/debug/invalid_records.jsonl

Each line is one invalid record and contains the record_id and reason fields:

{"record_id": "r042", "reason": "text is empty"}
{"record_id": "r187", "reason": "text is None"}

At most 100 invalid records are sampled.
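When the sample file has many lines, grouping by reason is often the quickest way to see what is wrong. The following is a minimal sketch that assumes only the two fields shown above (record_id and reason); count_reasons is a hypothetical helper, not an eulerweave function.

```python
import json
from collections import Counter


def count_reasons(lines) -> Counter:
    """Count invalid records per reason from an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline at end of file
        record = json.loads(line)
        counts[record["reason"]] += 1
    return counts


# The two sample lines from this step, inline so the sketch runs on its own.
sample = [
    '{"record_id": "r042", "reason": "text is empty"}',
    '{"record_id": "r187", "reason": "text is None"}',
]

reason_counts = count_reasons(sample)
for reason, count in reason_counts.most_common():
    print(f"{count:4d}  {reason}")
```

Because the function accepts any iterable of lines, an open file handle on invalid_records.jsonl can be passed in place of the sample list.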

Step 4: Adjust fail_threshold_ratio

fail_threshold_ratio is the invalid-record ratio above which a warning is raised:

Value	Meaning
`0.0`	Warn if there is even one invalid record
`0.2` (default)	Warn when more than 20% of records are invalid
`1.0`	Warnings disabled

params:
  fail_threshold_ratio: 0.05   # warn when more than 5% of records are invalid (strict)
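The threshold table implies a strict comparison: 0.0 warns on any invalid record, and 1.0 can never be exceeded. A minimal sketch of that rule follows. It is an illustration of the semantics described here, not the block's actual implementation, and should_warn is a hypothetical name.

```python
def should_warn(invalid_count: int, record_count: int,
                fail_threshold_ratio: float = 0.2) -> bool:
    """Return True when the invalid ratio strictly exceeds the threshold."""
    if record_count == 0:
        return False  # assumption: an empty dataset produces no warning
    return invalid_count / record_count > fail_threshold_ratio


# 15 invalid out of 1000 is 1.5%, which is below the default 20% threshold.
print(should_warn(15, 1000))
# With a threshold of 0.0, a single invalid record is enough to warn.
print(should_warn(1, 1000, fail_threshold_ratio=0.0))
# With a threshold of 1.0, even a fully invalid dataset does not warn.
print(should_warn(1000, 1000, fail_threshold_ratio=1.0))
```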

What is validated

metrics_record_schema_validate checks the following:

The text field is not None
The text field is not an empty string

Planned extensions: metadata type validation and required annotation key checks.
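The two current checks can be sketched as a small function. This is an illustration under assumptions, not the block's source: validate_text is a hypothetical name, the reason strings are copied from the sample output in Step 3, and whether whitespace-only text counts as empty is not stated in this tutorial, so the sketch treats only the exact empty string as invalid.

```python
from typing import Optional


def validate_text(text: Optional[str]) -> Optional[str]:
    """Return a reason string if the text is invalid, otherwise None."""
    if text is None:
        return "text is None"
    if text == "":
        return "text is empty"
    return None


# Hypothetical records for illustration only.
records = [
    {"record_id": "r001", "text": "hello"},
    {"record_id": "r042", "text": ""},
    {"record_id": "r187", "text": None},
]

invalid = []
for rec in records:
    reason = validate_text(rec["text"])
    if reason is not None:
        invalid.append({"record_id": rec["record_id"], "reason": reason})

print(invalid)
```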

Next steps

Tutorial 8: Metrics Blocks - full guide to metrics blocks
Tutorial 11: PII Safety - PII detection and masking
Architecture: Engine - internals of block execution